by Huang Huang*, Fangchen Liu*, Letian Fu*, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel at UC Berkeley and Meta (*equal contribution).
[Paper] | [Project Page]
This repo contains the official re-implementation of Otter: A Vision-Language-Action Model with Text-Aware Feature Extraction. The experiments in the paper are based on the original repo here, implemented in JAX.
For further information, please contact Huang Huang, Fangchen Liu, or Letian Fu, or post an issue on GitHub!
- OpenCLIP integration to allow training and inference with more powerful CLIP models.
- 2025-03-05: Initial release.
To install, run:

```bash
# create conda env
conda create -n otter python=3.10 -y
conda activate otter
# install torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# download repo
git clone https://github.com/Max-Fu/otter.git
cd otter
pip install -e .
```
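After installation, a quick sanity check is to confirm that PyTorch can see the GPU and that the package imports (this snippet assumes nothing beyond the `otter` package name used by the interface imports below):

```python
# Post-install sanity check: CUDA visibility and package import.
import torch
import otter  # installed above via `pip install -e .`

print(torch.__version__, torch.cuda.is_available())
```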
We provide a simple interface for the OTTER model. For more details, please refer to the otter interface.

```python
from otter.policy.otter_interface import OtterInference
from PIL import Image
import numpy as np

# load a trained policy from a checkpoint folder and checkpoint id
policy = OtterInference(
    model_ckpt_folder="path/to/model/checkpoint",
    ckpt_id=60000,
)

image_primary: Image.Image = ...  # primary (third-person) camera image
image_wrist: Image.Image = ...    # wrist camera image

# action is a numpy array of shape (action_horizon, action_dim)
action = policy(
    images={
        "image_primary": image_primary,
        "image_wrist": image_wrist,
    },
    text=...,     # language prompt (str)
    proprio=...,  # proprioception, np.ndarray of shape (6,)
    gripper=...,  # gripper position, np.ndarray of shape (1,)
)
...

# reset the policy's cache upon finishing a rollout
policy.reset()
```
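For context, a typical control loop on top of this interface might look like the sketch below. The `env` object and its `get_obs`/`step` methods are placeholders for your own robot or simulation stack, and the prompt is just an example; only `OtterInference` itself comes from this repo.

```python
# Hypothetical rollout loop around OtterInference. `env`, `get_obs`, and `step`
# are placeholders for your own robot / simulation interface, not part of this repo.
num_episodes = 10
max_steps = 200

for episode in range(num_episodes):
    obs = env.get_obs()  # expected to return images, proprio, and gripper state
    for t in range(max_steps):
        actions = policy(
            images={
                "image_primary": obs["image_primary"],
                "image_wrist": obs["image_wrist"],
            },
            text="pour from the orange cup into the pink bowl",  # example prompt
            proprio=obs["proprio"],
            gripper=obs["gripper"],
        )
        # actions has shape (action_horizon, action_dim); here we execute only
        # the first action of the predicted chunk before re-querying the policy.
        obs = env.step(actions[0])
    policy.reset()  # clear the policy's cache before the next rollout
```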
We additionally provide a script for rolling out the OTTER model in the DROID environment. For more details, please refer to the droid inference script.

```bash
python script/droid_inference.py \
    --model-ckpt-folder path/to/model/checkpoint \
    --ckpt-id 60000
```

We host the OTTER dataset on Hugging Face. The datasets are stored in TFDS format to support pre-training on Open X-Embodiment. Alternatively, a converted LeRobot version of the dataset is available here for fine-tuning Pi0; it uses joint positions for proprioception and joint velocities for actions. The fine-tuning scripts are provided here.

```bash
# first install huggingface-cli
pip install -U "huggingface_hub[cli]"
# download the datasets
mkdir -p dataset
pushd dataset
huggingface-cli download mlfu7/icrt_pour --repo-type dataset --local-dir .
huggingface-cli download mlfu7/icrt_drawer --repo-type dataset --local-dir .
huggingface-cli download mlfu7/icrt_poke --repo-type dataset --local-dir .
huggingface-cli download mlfu7/icrt_pickplace_1 --repo-type dataset --local-dir .
huggingface-cli download mlfu7/icrt_stack_mul_tfds --repo-type dataset --local-dir .
huggingface-cli download mlfu7/icrt_pickplace --repo-type dataset --local-dir .
popd
```
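Since the datasets are in TFDS format, a minimal sketch for inspecting one of them is shown below. The builder directory `dataset/icrt_pour` and the RLDS-style `steps` structure are assumptions based on common Open X-Embodiment conventions; adjust the path to wherever `dataset_info.json` lands after the download above.

```python
# Minimal sketch for peeking at a downloaded TFDS dataset.
# The directory below is an assumption; point it at the folder that contains
# dataset_info.json for the dataset you want to inspect.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("dataset/icrt_pour")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # RLDS-style datasets nest a per-episode "steps" dataset.
    for step in episode["steps"].take(1):
        print(list(step.keys()))
```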
We use the following command to train the OTTER model. We support multi-GPU training on a single node.

```bash
TF_FORCE_GPU_ALLOW_GROWTH=true torchrun --nproc_per_node=2 --master_port=1255 script/train.py --logging-cfg.log-name <log_name> --logging-cfg.output-dir <output_dir> --shared-cfg.batch-size 128
```

To change the dataset paths and their subsampling ratios, please refer to the training args. To see all the available options, run:
```bash
python script/train.py --help
```

We provide a script for visualizing the cosine similarity between CLIP's visual patch features and the text features. For more details, please refer to the script.
```bash
python script/clip_visualization.py --text "pour from the orange cup into the pink bowl" --image asset/droid_image.png
```

You can also visualize with more powerful CLIP models via OpenCLIP. A good combination we find is ViT-L-14 with the datacomp_xl_s13b_b90k weights.
```bash
python script/open_clip_visualization.py --text "pour from the orange cup into the pink bowl" --image asset/droid_image.png
```
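For reference, a rough sketch of what the patch-text similarity computation can look like with OpenCLIP is shown below. This is not the repo's script (see script/open_clip_visualization.py for the actual implementation); the `output_tokens` behavior and the manual projection of patch tokens are assumptions based on recent open_clip releases.

```python
# Rough sketch: cosine similarity between OpenCLIP visual patch tokens and a text prompt.
# Assumes a recent open_clip release where VisionTransformer supports output_tokens=True;
# older versions may require a forward hook to grab the patch tokens instead.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="datacomp_xl_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

image = preprocess(Image.open("asset/droid_image.png")).unsqueeze(0)
text = tokenizer(["pour from the orange cup into the pink bowl"])

with torch.no_grad():
    model.visual.output_tokens = True                # ask the vision tower for per-patch tokens
    _, patch_tokens = model.visual(image)            # (1, num_patches, width)
    patch_tokens = patch_tokens @ model.visual.proj  # project into the joint CLIP space
    text_feat = model.encode_text(text)              # (1, embed_dim)

    patch_tokens = patch_tokens / patch_tokens.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sim = (patch_tokens @ text_feat.T).squeeze(-1)   # (1, num_patches)

# Reshape to the patch grid (16x16 for ViT-L-14 at 224x224 input) to visualize as a heatmap.
grid = int(sim.shape[-1] ** 0.5)
print(sim.reshape(grid, grid))
```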
This project is under the Apache 2.0 license. See LICENSE for details.

The code is based on Octo, MAE, CrossMAE, ICRT, OpenPi, and DROID.
Please give us a star 🌟 on GitHub to support us!
Please cite us if you find our work inspiring or use our code in your research:
```
@article{huang2025otter,
    title={Otter: A Vision-Language-Action Model with Text-Aware Feature Extraction},
    author={Huang Huang and Fangchen Liu and Letian Fu and Tingfan Wu and Mustafa Mukadam and Jitendra Malik and Ken Goldberg and Pieter Abbeel},
    journal={arXiv preprint arXiv:2503.03734},
    year={2025}
}
```