SOFAR: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Introduction

Abstract

spatial understanding: understanding object orientations

semantic orientation: described in natural language, reference-frame-free

OrienText300K dataset

positional constraints + orientational constraints

language-grounded orientation that bridges spatial reasoning and object manipulation

From Position Awareness to Orientation Awareness

Position awareness alone is not enough; more complex manipulation requires adjusting the object's orientation.

From Orientation to Semantic Orientation

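The instructions a robot receives are most likely in natural language, so an object's direction vectors should be tied to open-vocabulary language descriptions.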

bridge geometric reasoning with functional semantics

e.g. the "handle" of a fork and "plug in" for a USB drive both carry an implicit direction

Two key questions for spatial awareness:

Give the model knowledge of semantic orientation

Uses PointSO, a generalizable cross-modal 3D Transformer

a robust and versatile framework for open-world spatial orientation understanding
Trained on the OrienText300K dataset

about the dataset

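Integrate semantic orientation into a VLM

SoFar, an integrated reasoning system

SoFar is a big patchwork of existing models. Florence-2 (image feature extraction and semantic understanding) and SAM (general-purpose image segmentation) handle object position; PointSO handles understanding, especially orientation.

The input is an RGB-D image, from which SoFar produces an orientation-aware 3D scene graph. The RGB-D image and the scene graph are fed together into a VLM, which outputs chain-of-thought spatial reasoning and generates instructions covering both position and orientation; these instructions guide the robot actions.

The benchmark used is Open6DOR V2.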

About this benchmark

Semantic Orientation

connecting language and object orientation

Definition of Semantic Orientation

Semantic Orientation: Given an object and a description, the corresponding semantic orientation is an object-centric direction, represented as a unit vector, that semantically matches the description.

Robotic Manipulation via Semantic Orientation

Extract the task-related semantic orientation description from the instruction

Extract the current semantic orientation from the initial observation and compute the rotation needed

Instance-level/Category-level/Cross-category Orientation open-world & open-ended
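
The "compute the rotation needed" step above can be illustrated with a minimal sketch. It assumes both the current and the target semantic orientation are available as 3D direction vectors; the function names and the example vectors are hypothetical and not taken from the paper.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return tuple(c / n for c in v)

def rotation_angle(current, target):
    """Angle in radians between the current and the target direction."""
    a, b = normalize(current), normalize(target)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    return math.acos(cos_theta)

# Hypothetical case: a direction that currently points along +x should point up (+z).
angle = rotation_angle((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(f"rotate by about {math.degrees(angle):.1f} degrees")
```

The angle alone does not fix the rotation axis; a full rotation would also need the axis (for example the cross product of the two directions).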

OrienText300K: Orientation-Text Paired Data at Scale

a newly curated 3D model dataset with diverse and extensive semantic orientation labels for training our language-conditional orientation model

Skipped for now.

PointSO: A Cross-Modal 3D Transformer for Semantic Orientation Prediction

PointSO, a plain Transformer-based architecture with cross-modal 3D-language fusion as orientation model.

As the figure shows, PointSO takes an object point cloud and a language description as input and outputs the semantic orientation.

3D and Language Embeddings

Given a point cloud:

Sample seed points with farthest point sampling (FPS)
Group the neighboring points around each seed with KNN (K-Nearest Neighbors), and embed point features with a local geometric extraction network (e.g. a lightweight PointNet)
Use an MLP head to map the [CLS] tokens to a predicted direction

For language inputs, OpenAI's CLIP converts the text into global tokens, which serve as input to the cross-modal fusion module
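
The FPS and KNN grouping steps listed above can be sketched in plain Python. This is a toy illustration on a six-point cloud, not PointSO's implementation; the function names, the starting index, and the group size are assumptions made for the example.

```python
import math

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick indices of points that are far from those already picked."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def knn_group(points, center_idx, k):
    """Indices of the k points nearest to points[center_idx], itself included."""
    center = points[center_idx]
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]

# Toy cloud made of three tight pairs of points.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
         (1.0, 0.1, 0.0), (0.0, 1.0, 0.0), (0.0, 1.1, 0.0)]
centers = farthest_point_sampling(cloud, 3)
groups = [knn_group(cloud, c, 2) for c in centers]
print(centers, groups)
```

Each group would then go through the local feature extractor (the notes mention a lightweight PointNet) to become one point token.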

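Cross-Modal Fusion

The goal is to inject the global text features into the 3D Transformer (PointSO) so that the 3D point cloud data and the text semantics are fused
Possible implementations include: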

Cross-Attention
adapter
concatenation

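The paper finds that the best approach is to simply add the text token features to the point cloud tokens (doesn't that just show that current models still have a huge amount of room to improve?)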
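
A minimal sketch of fusion by addition, assuming the global text feature has already been projected to the same width as the point tokens. The function name and toy values are hypothetical, and the notes do not say at which layers the addition happens.

```python
def fuse_by_addition(point_tokens, text_token):
    """Add one global text feature to every point token, element-wise."""
    dim = len(text_token)
    if any(len(tok) != dim for tok in point_tokens):
        raise ValueError("point tokens and text token must have the same width")
    return [[p + t for p, t in zip(tok, text_token)] for tok in point_tokens]

point_tokens = [[1.0, 2.0], [3.0, 4.0]]   # two toy point tokens of width 2
text_token = [0.5, -0.5]                  # stands in for a projected CLIP text feature
fused = fuse_by_addition(point_tokens, text_token)
print(fused)
```

Compared with cross-attention or an adapter, this adds no new parameters beyond whatever projection brings the text feature to the token width.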

Optimization

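PointSO is written as a parameterized model (CLIP stays frozen and is not counted among the parameters)
OrienText300K consists of:

Point clouds of common objects
Language labels for each object
The ground truth semantic orientation set

Loss function (negative cosine similarity):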
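
The formula itself is missing from these notes, so the following is only a sketch of a standard negative cosine similarity loss between predicted and ground-truth directions. The averaging over a batch and the function names are assumptions, not the paper's exact formulation.

```python
import math

def negative_cosine_loss(pred, target):
    """-cos(angle) between two vectors: -1 when aligned, +1 when opposite."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    return -dot / norm

def batch_loss(preds, targets):
    """Mean negative cosine similarity over paired predictions and targets."""
    losses = [negative_cosine_loss(p, t) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)

# One perfectly aligned pair and one perpendicular pair.
loss = batch_loss([(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)],
                  [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)])
print(loss)
```

Because the cosine ignores vector length, the prediction head does not need to output an exactly unit-length vector for the loss to be meaningful.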

SoFar

semantic orientation bridges spatial reasoning and object manipulation

PointSO provides object-centric understanding of spatial orientation.

But PointSO alone still struggles to scale up to scene-level spatial reasoning (e.g. orientation-aware visual question answering (VQA) in the digital world, robot manipulation in the physical world)

This is why a VLM is brought in:

To enable such applications, we build an integrated reasoning system where a powerful VLM acts as an agent and reasons about the scene while communicating with off-the-shelf models including PointSO and SAM.

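Similarly, the VLM gives the architecture strong vision- and language-based reasoning and the ability to handle open-world scenes. SoFar as a whole acts like an agent that calls foundation models such as PointSO, SAM, and the VLM to solve spatial reasoning problems.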

pipeline:

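Given the language instruction, a VLM extracts the task-oriented object phrases and semantic orientation descriptions; given the RGB-D image, Florence-2 + SAM (better options exist) produce the segmented object point clouds and depth estimates, and PointSO then yields the semantic orientations. This is what the three numbered arrows in the middle of the figure above do. The final result is an orientation-aware 3D scene graph (JSON format).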

Orientation-Aware Scene Graph from RGB-D

The overall flow was introduced above; this section goes into more detail.

Task-Oriented Object Segmentation

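Input:

language query
RGB-D image

Use a VLM (e.g. Florence-2) to extract the task-oriented object phrase set
Use the extracted task-relevant objects as the language condition for SAM to segment the image and generate point clouds; each object point cloud is assigned a unique ID, used later for Set-of-Mark (SoM) prompting of the VLM
Run the VLM once more to generate task-related language descriptions for each object
Use PointSO to generate a semantic orientation set for each object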

Orientation-Aware 3D Scene Graph

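From the object point clouds obtained above,
build an orientation-aware scene graph.
Each node has the following attributes:

Object name and ID
Object 3D coordinates
Object 3D bounding box size
The language descriptions and semantic orientations associated with the object

Each edge records the spatial relation between objects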
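
As a rough illustration of what such a JSON scene graph might look like, the sketch below builds nodes with the attributes listed above and one edge per object pair. All field names, the example objects, and the edge encoding (relative offset and distance) are hypothetical; the paper's actual schema may differ.

```python
import json
import math

def build_scene_graph(objects):
    """Assemble nodes and pairwise edges into a JSON-serializable dict."""
    nodes = [dict(obj) for obj in objects]
    edges = []
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            # Hypothetical edge encoding: offset from a to b plus their distance.
            offset = [round(q - p, 3) for p, q in zip(a["center"], b["center"])]
            distance = round(math.dist(a["center"], b["center"]), 3)
            edges.append({"from": a["id"], "to": b["id"],
                          "offset": offset, "distance": distance})
    return {"nodes": nodes, "edges": edges}

objects = [
    {"id": 1, "name": "mug", "center": [0.2, 0.0, 0.0],
     "bbox_size": [0.08, 0.08, 0.1],
     "orientations": {"handle": [1.0, 0.0, 0.0], "opening": [0.0, 0.0, 1.0]}},
    {"id": 2, "name": "plate", "center": [0.5, 0.4, 0.0],
     "bbox_size": [0.2, 0.2, 0.02],
     "orientations": {"top": [0.0, 0.0, 1.0]}},
]
graph = build_scene_graph(objects)
print(json.dumps(graph, indent=2))
```

Keeping the graph as plain JSON makes it easy to paste into the VLM prompt alongside the marked image.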

Spatial-Aware Task Reasoning

Chain-of-Thought Spatial Reasoning

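The task is abstracted simply as changing an object's position and orientation; the VLM is expected to output the goal transformation
CoT reasoning

Analyze the scene from the language query and the object nodes
Infer the desired goal pose (coordinates and orientation)
Output the task-desired object position and semantic orientation
Compute a 6-DoF transformation matrix from the initial and final states (consisting of a translation and a rotation)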
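
The last step can be sketched as follows, under a simplifying assumption: only one semantic orientation is constrained, so the rotation is taken to be the smallest one aligning the initial direction with the target direction (Rodrigues' formula). With several orientation constraints a different solver would be needed. The helper names and example values are hypothetical.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return [c / n for c in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def rotation_between(a, b):
    """3x3 matrix of the smallest rotation taking direction a onto direction b."""
    a, b = normalize(a), normalize(b)
    v, c = cross(a, b), dot(a, b)
    s2 = dot(v, v)  # squared sine of the angle
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if s2 < 1e-12:
        if c > 0.0:
            return eye  # already aligned
        # Opposite directions: turn 180 degrees about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = normalize(cross(a, helper))
        return [[2.0 * u[i] * u[j] - eye[i][j] for j in range(3)] for i in range(3)]
    k = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = (1.0 - c) / s2
    return [[eye[i][j] + k[i][j] + f * k2[i][j] for j in range(3)] for i in range(3)]

def pose_transform(pos0, dir0, pos1, dir1):
    """4x4 homogeneous matrix mapping the initial pose onto the goal pose."""
    r = rotation_between(dir0, dir1)
    rotated = matvec(r, pos0)
    t = [goal - cur for goal, cur in zip(pos1, rotated)]
    return [r[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

# Hypothetical goal: move 0.2 along y and turn a +x direction to point up (+z).
T = pose_transform([0.3, 0.0, 0.1], [1.0, 0.0, 0.0],
                   [0.3, 0.2, 0.1], [0.0, 0.0, 1.0])
for row in T:
    print(row)
```

The translation column is chosen so that the object's initial position lands exactly on the goal position after the rotation is applied.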

Low-Level Motion Execution

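Consists of task-oriented grasping and task-aware motion planning

Generate grasp pose candidates from the object point cloud with GSNet
Select the best grasp pose based on the GSNet score and the angle between the grasp direction and the z-axis
The translation and rotation above define the transform from the grasp pose to the placement pose
Plan the motion with an open-source motion planning module (e.g. OMPL)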
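
The grasp selection step can be illustrated with a small sketch. The notes only say that the score and the angle to the z-axis are both considered; combining them as a weighted difference, the weight value, and the candidate format are assumptions made here, and this code does not call GSNet.

```python
import math

def angle_to_axis(direction, axis=(0.0, 0.0, 1.0)):
    """Angle in radians between a grasp direction and a reference axis."""
    dot = sum(d * a for d, a in zip(direction, axis))
    norm = math.sqrt(sum(d * d for d in direction)) * math.sqrt(sum(a * a for a in axis))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def select_grasp(candidates, angle_weight=0.5):
    """Pick the candidate with the best score after an angle penalty.

    Assumed rule: utility = score - angle_weight * angle_to_axis(direction).
    """
    return max(candidates,
               key=lambda g: g["score"] - angle_weight * angle_to_axis(g["direction"]))

# Hypothetical candidates, as if returned by a grasp detector.
candidates = [
    {"name": "side", "score": 0.9, "direction": (1.0, 0.0, 0.0)},
    {"name": "top", "score": 0.8, "direction": (0.0, 0.0, 1.0)},
]
best = select_grasp(candidates)
print(best["name"])
```

With the angle penalty turned off (`angle_weight=0`), the raw score alone decides, which in this toy case picks the side grasp instead.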

set the initial joint position as the midpoint to achieve smooth motion while reducing collisions between the manipulated object and the environment

Demo results
