pi0: Our First Generalist Policy (Part 1)

Part 1: Paper

Introduction

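This figure shows the overall flow of the policy. The core VLA model is built by joining a pre-trained VLM (which, as I understand it, supplies strong reasoning ability) with an action expert. It is trained on cross-embodiment data and then fine-tuned on high-quality data.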

Why build on a VLM

By basing our model on a VLM, we inherit the general knowledge, semantic reasoning, and problem-solving abilities of language and vision-language models.

The paper identifies three bottlenecks for today's generalist robot policies:

training at very large scale

the right model architecture, one that can make use of diverse data sources and represent intricate and subtle behaviors

the right training recipe

The paper proposes a solution to each:

Problem 1: start from an already pre-trained VLM

Problem 2:

diverse data: VLM, cross-embodiment training
intricate and subtle behaviors: an action chunking architecture with flow matching

Problem 3: pre-training/post-training separation

pre-training/post-training separation

Intuitively, training only on high-quality data does not teach the model how to recover from mistakes, since mistakes are rarely seen in such data. Training on only lower-quality pretraining data does not teach the model to act efficiently and robustly. Combining both provides the desired behavior: the model attempts insofar as possible to act in a manner similar to the high-quality data, but still has a repertoire of recoveries and corrections that it can deploy in the case of a mistake.

Related Work

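Earlier models predicted actions through autoregressive discretization. This paper instead uses a variant of diffusion models called flow matching (its main selling point). Flow matching offers high precision and can model multimodal action distributions.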

Flow Matching for Generative Modeling

We introduce a new paradigm for generative modeling built on Continuous Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale. Specifically, we present the notion of Flow...

https://arxiv.org/abs/2210.02747

Diffusion Meets Flow Matching

Flow matching and diffusion models are two popular frameworks in generative modeling. Despite seeming similar, there is some confusion in the community about their exact connection. In this post, we aim to clear up this confusion and show that diffusion models and Gaussian flow matching are the same, although different model specifications can lead to different network outputs and sampling schedules. This is great news, it means you can use the two frameworks interchangeably.

https://diffusionflow.github.io/

diffusion model for action generation

The paper cites two examples:

RoboAgent: Generalization and Efficiency in Robot Manipulation via...

The grand aim of having a single robot that can manipulate arbitrary objects in diverse settings is at odds with the paucity of robotics datasets. Acquiring and growing such datasets is strenuous...

https://arxiv.org/abs/2309.01918

Scaling Diffusion Policy in Transformer to 1 Billion Parameters...

Diffusion Policy is a powerful technique tool for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural...

https://arxiv.org/abs/2409.14411

How the training setup differs from earlier work

… we train our model via a diffusion-style (flow matching) loss applied on individual sequence elements, in lieu of the standard cross-entropy loss for decoder-only transformers.

… we use a separate set of weights for the tokens corresponding to diffusion.

Overview

Overview of pi0’s framework

We start with a pre-training mixture, which consists of both our own dexterous manipulation datasets and open-source data. We use this mixture to train our flow matching VLA model, which consists of a larger VLM backbone and a smaller action expert for processing robot states and actions. The VLM backbone weights are initialized from PaliGemma, providing representations learned from large-scale Internet pre-training. The resulting π0 model can be used to control multiple robot embodiments with differing action spaces to accomplish a wide variety of tasks.

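Training has two stages overall: pre-training aims for a model with a broad range of capabilities and good generalization, while post-training aims to make its actions more refined.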

The model

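The backbone of the whole model is a VLM, and underneath the VLM is a language model transformer. On the robot, the image encoder embeds the images the robot observes into the same semantic space as the language tokens.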

late fusion VLM recipe

VLMs are something I need to understand properly. The paper cites three:

Flamingo: a Visual Language Model for Few-Shot Learning

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family...

https://arxiv.org/abs/2204.14198

PaLM-E: An Embodied Multimodal Language Model

Project page for PaLM-E: An Embodied Multimodal Language Model.

https://palm-e.github.io/

Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal...

https://arxiv.org/abs/2304.08485

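To extend this backbone for robots, we need to add a robot-specific input, the proprioceptive state, and a robot-specific output, the robot action.

Conditional flow matching is used to model the continuous distribution of actions. The basic framework follows Transfusion: a flow matching loss supervises the “tokens corresponding to continuous outputs”, and a cross-entropy loss supervises the “tokens corresponding to discrete outputs”.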

Transfusion

trains a single transformer using multiple objectives

Transfusion: Predict the Next Token and Diffuse Images with One...

We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with...

https://arxiv.org/abs/2408.11039

💡

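Using a separate set of parameters for the robot's state and action tokens (which I take to mean splitting the action prediction off) gives better performance. This is similar to a mixture of experts with two mixture elements: the first handles image and text inputs, the second handles robotics-specific inputs and outputs. The paper calls this second set of parameters the action expert.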

Mathematical representation

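Training

At its core, the problem is to model a probability density over actions, $p(\mathbf{A}_t \mid \mathbf{o}_t)$, and to draw random samples from it to predict the next actions.

Here the action chunk $\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+H-1}]$ is a sequence of future actions;

the observation is $\mathbf{o}_t = [\mathbf{I}_t^1, \ldots, \mathbf{I}_t^n, \ell_t, \mathbf{q}_t]$, where $\mathbf{I}_t^i$ is the $i$-th RGB image input, $\ell_t$ is a sequence of language tokens, and $\mathbf{q}_t$ is a vector of joint angles.

The images and the state are each passed through their own encoder plus a linear projection layer, which maps them into the same embedding space as the language tokens.

The math that follows is a bit involved.

For each action $\mathbf{a}_{t'}$ in $\mathbf{A}_t$, there is a corresponding action token that is fed to the action expert. Training is supervised with a conditional flow matching loss (this thing is just wild):

$$L^\tau(\theta) = \mathbb{E}_{p(\mathbf{A}_t \mid \mathbf{o}_t),\, q(\mathbf{A}_t^\tau \mid \mathbf{A}_t)} \left\| \mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t) - \mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t) \right\|^2$$

Here subscripts are robot timesteps and superscripts are flow matching timesteps, with $\tau \in [0, 1]$.

Rather than calling the generation process conditional flow matching, I think it is really just a diffusion model.

We sample random noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ from a standard Gaussian and combine it linearly with the original actions to get the noisy actions $\mathbf{A}_t^\tau = \tau \mathbf{A}_t + (1 - \tau)\epsilon$. Following flow matching, we train a network that outputs $\mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t)$, and this velocity should match the target vector field $\mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t)$ as closely as possible, where the paper writes $\mathbf{u}(\mathbf{A}_t^\tau \mid \mathbf{A}_t) = \epsilon - \mathbf{A}_t$ (I do not fully understand this part).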
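The noising step and the regression target can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, the chunk is a toy list of vectors, and no network is involved. It also assumes a self-consistent convention in which $\tau = 0$ is pure noise, $\tau = 1$ is clean data, and the target velocity is the derivative of the interpolation with respect to $\tau$ (data minus noise). The sign and time direction in the paper's own notation should be checked against its text.

```python
import random


def make_training_pair(actions, tau, rng):
    """Build (noisy_actions, target_velocity) for one action chunk.

    actions: list of action vectors (lists of floats), the clean chunk.
    tau: flow matching timestep in [0, 1]; 0 is pure noise, 1 is clean data
         (an assumption of this sketch).
    """
    noisy, target = [], []
    for a in actions:
        eps = [rng.gauss(0.0, 1.0) for _ in a]
        # Linear interpolation between noise (tau=0) and data (tau=1).
        noisy.append([tau * ai + (1.0 - tau) * ei for ai, ei in zip(a, eps)])
        # Derivative of the interpolation w.r.t. tau: data minus noise.
        target.append([ai - ei for ai, ei in zip(a, eps)])
    return noisy, target


def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    total, count = 0.0, 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
chunk = [[0.1, -0.2], [0.3, 0.0], [0.5, 0.4]]  # toy chunk: H=3, action dim 2
noisy, target = make_training_pair(chunk, tau=0.25, rng=rng)
# A network that predicted the target exactly would have zero loss.
perfect_loss = flow_matching_loss(target, target)
```

In a real training loop the `predicted` argument would come from the action expert evaluated on the noisy chunk and the observation; here it is replaced by the target itself just to exercise the loss.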

The action expert uses a full bidirectional attention mask, so every action token can attend to every other action token (similar to text models in which a token can see context on both sides).
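A boolean mask makes the bidirectional pattern easier to see. This is a simplified sketch, not the paper's exact mask: it lumps images, language, and state into one "prefix" group, and it assumes prefix tokens do not attend to action tokens. The function name and the grouping are hypothetical.

```python
def build_attention_mask(num_prefix, num_action):
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens 0..num_prefix-1 form the prefix (image, language, state);
    the remaining num_action tokens are the noisy action tokens.
    """
    n = num_prefix + num_action
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            i_is_action = i >= num_prefix
            j_is_action = j >= num_prefix
            if i_is_action:
                # Action tokens see the whole prefix and every action token,
                # earlier or later: full bidirectional attention.
                mask[i][j] = True
            else:
                # Assumption of this sketch: prefix tokens attend only to
                # other prefix tokens, never to action tokens.
                mask[i][j] = not j_is_action
    return mask


mask = build_attention_mask(num_prefix=3, num_action=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The contrast with a causal mask is in the action rows: a causal mask would stop action token 3 from attending to action token 4, whereas here both directions are allowed.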

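The flow matching timestep $\tau$ is sampled from a beta distribution that puts more weight on noisy timesteps (more samples are drawn from the relatively noisy region), which helps the model cope with noise.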
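A skewed timestep sampler of this kind could look like the sketch below. The shape parameters (1.5, 1) and the cutoff `s = 0.999` reflect my reading of the paper's appendix and should be treated as assumptions, as should the convention that small $\tau$ is the noisy end. `sample_timestep` is a hypothetical helper name.

```python
import random


def sample_timestep(rng, s=0.999, alpha=1.5, beta=1.0):
    """Draw a flow matching timestep in [0, s], skewed toward 0 (noisy end)."""
    # Beta(1.5, 1) puts more mass near 1 ...
    x = rng.betavariate(alpha, beta)
    # ... so flipping it puts more mass near 0, and scaling by s caps tau at s.
    return s * (1.0 - x)


rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(10000)]
mean_tau = sum(samples) / len(samples)
share_noisy = sum(1 for t in samples if t < 0.5) / len(samples)
print(f"mean tau = {mean_tau:.3f}, share below 0.5 = {share_noisy:.3f}")
```

A uniform sampler would give a mean of 0.5 and put half the samples below 0.5. The skewed sampler pulls both numbers toward the noisy end, so the network gets more practice on heavily noised inputs.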

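Inference

Predicting an action means transforming noise into the target action. In terms of the flow matching timestep, this runs from $\tau = 0$ to $\tau = 1$. The starting point is $\mathbf{A}_t^0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the velocity predicted by the model gives the direction of change along the way, and the total change is the integral of that velocity.

Numerically, the forward Euler integration rule is used. The paper takes $\delta = 0.1$, which splits the integral into small steps, each of the form $\mathbf{A}_t^{\tau + \delta} = \mathbf{A}_t^\tau + \delta\, \mathbf{v}_\theta(\mathbf{A}_t^\tau, \mathbf{o}_t)$.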
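The Euler loop can be sketched on a single toy action vector. The trained network is replaced by a hand-written velocity field that already knows the target, purely to show the mechanics of the integration. `make_toy_velocity` and `euler_sample` are hypothetical names, and the observation argument is left out for brevity.

```python
import random


def euler_sample(velocity_fn, dim, num_steps=10, rng=None):
    """Integrate a velocity field from tau=0 (noise) to tau=1 with forward Euler."""
    rng = rng or random.Random()
    delta = 1.0 / num_steps  # 10 steps gives delta = 0.1
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    for i in range(num_steps):
        tau = i * delta
        v = velocity_fn(x, tau)
        # One Euler step: x <- x + delta * v
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for the trained network: the exact velocity of the straight
    path from the current point to a known target (not a learned model)."""
    def velocity(x, tau):
        return [(ti - xi) / (1.0 - tau) for ti, xi in zip(target, x)]
    return velocity


target_action = [0.5, -1.0, 0.25]
sample = euler_sample(make_toy_velocity(target_action), dim=3,
                      num_steps=10, rng=random.Random(0))
print(sample)
```

Because the toy field describes a straight line, the ten Euler steps land on the target up to floating-point error. A learned field is only approximately straight, so the number of steps trades accuracy against inference cost.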

Model Architecture Details

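In Appendix B of the paper, the authors describe each part of the model in more detail, including what is new in π0. It also works as a summary of the paper.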

Additional inputs and outputs

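On top of PaliGemma's original inputs, a proprioceptive state input $\mathbf{q}_t$ is added to adapt the model to robots. Besides that, there is the noisy action chunk $\mathbf{A}_t^\tau$.

Inputs

The image sequence $[\mathbf{I}_t^1, \ldots, \mathbf{I}_t^n]$, the language prompt $\ell_t$, and the proprioceptive state $\mathbf{q}_t$, which together describe the current state.

The noisy action chunk $\mathbf{A}_t^\tau = [\mathbf{a}_t^\tau, \ldots, \mathbf{a}_{t+H-1}^\tau]$: the planned actions over a future window ($H = 50$, the action horizon). At this point the chunk is still noisy and waits for flow matching to denoise it.

Outputs

All of the inputs are fed together into a single transformer. The transformer's job is to extract, from the information about the current state, whatever is useful for future actions, paying particular attention to the noisy action chunk.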

Incorporating the flow matching timestep

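To feed the action chunk into the transformer, it has to be mapped into a shared embedding space. In addition, the action chunk carries its own flow matching timestep $\tau$, which marks how far along the flow matching process it is.

The mapping is implemented with an MLP (multilayer perceptron) and can be written as

$$W_3 \cdot \text{swish}\left(W_2 \cdot \text{concat}\left(W_1 \cdot \mathbf{a}_{t'}^\tau,\ \phi(\tau)\right)\right)$$

Here $\phi$ applies a positional encoding to $\tau$, turning it into a vector of dimension $w$ ($w$ is the embedding dimension). $W_1$ is a linear transformation matrix that maps the noisy action to a space of dimension $w$. The concat operation joins the two vectors into one of dimension $2w$, and $W_2$ then maps that vector back to $w$ dimensions. swish is an activation function that gives a smoother nonlinear transformation. The final $W_3$ maps the result once more (what is the point of that???).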
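The pieces of this mapping can be traced in a small pure-Python sketch. The weights are random, the dimensions are toy values, and the frequencies in the sinusoidal encoding are an assumption (the source does not give them). All helper names are hypothetical.

```python
import math
import random


def sinusoidal_encoding(tau, width):
    """Map a scalar timestep to a vector of length `width` (width must be even)."""
    half = width // 2
    out = []
    for k in range(half):
        freq = 1.0 / (10000.0 ** (k / half))  # assumed frequency schedule
        out.append(math.sin(tau * freq))
        out.append(math.cos(tau * freq))
    return out


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swish(x):
    """Elementwise swish: x * sigmoid(x)."""
    return [xi / (1.0 + math.exp(-xi)) for xi in x]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed_noisy_action(action, tau, W1, W2, W3):
    """Compute W3 * swish(W2 * concat(W1 * action, phi(tau)))."""
    width = len(W1)
    projected = matvec(W1, action)                     # d  -> w
    time_code = sinusoidal_encoding(tau, width)        # scalar -> w
    hidden = swish(matvec(W2, projected + time_code))  # 2w -> w
    return matvec(W3, hidden)                          # w  -> w


rng = random.Random(0)
d, w = 4, 8  # toy action dimension and embedding width
W1 = random_matrix(w, d, rng)
W2 = random_matrix(w, 2 * w, rng)
W3 = random_matrix(w, w, rng)
token = embed_noisy_action([0.1, -0.3, 0.2, 0.0], tau=0.5, W1=W1, W2=W2, W3=W3)
print(len(token))
```

One way to read the structure is that $W_2$, swish, and $W_3$ together form an ordinary two-layer MLP applied to the concatenated vector, with $W_3$ as its output layer. That is my interpretation rather than something stated in the source.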

MLP in a nutshell

This video is so good

What are MLPs (Multilayer Perceptrons)?

Learn about watsonx: https://ibm.biz/BdvxRg Ever wondered how AI is able to mimic human thought in order to perform complex tasks? In this video David Adeyemi explains how the MLP, or multilayer perceptron, provides a method for modeling our own ability to learn within artificial intelligence. #AI #Software #ITModernization #DeepLearning #MachineLearning #watsonx