Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation

University of California, Berkeley


Abstract

Learned language-conditioned robot policies often struggle to effectively adapt to new real-world tasks even when pre-trained across a diverse set of instructions. We propose a novel approach for few-shot adaptation to unseen tasks that exploits the semantic understanding of task decomposition provided by vision-language models (VLMs). Our method, Policy Adaptation via Language Optimization (PALO), combines a handful of demonstrations of a task with proposed language decompositions sampled from a VLM to enable rapid nonparametric adaptation, avoiding the need for a larger fine-tuning dataset. We evaluate PALO on extensive real-world experiments consisting of challenging unseen, long-horizon robot manipulation tasks. We find that PALO is able to consistently complete long-horizon, multi-tier tasks in the real world, outperforming state-of-the-art pre-trained generalist policies as well as methods that have access to the same demonstrations.

Summary

Policy Adaptation via Language Optimization (PALO)


An overview of the PALO algorithm. (Left) We build on a pre-trained policy that has learned to follow low-level language instructions from a large dataset of expert demonstrations. (Middle) Given a new task and a few expert demonstrations, we use a VLM to propose candidate decompositions into subtasks. We optimize over these candidate decompositions, selecting the one that minimizes the validation error of the learned policy on the demonstrations. (Right) At test time, we condition the pre-trained policy on the selected decomposition to solve the task.
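The selection step can be sketched roughly as follows. This is an illustrative sketch rather than the released implementation: the policy's per-step imitation loss, the VLM proposer, and the equal-length subtask partition (PALO additionally optimizes the subtask boundaries) are simplifying assumptions, and all names (Demo, select_decomposition, etc.) are hypothetical.

```python
# Minimal sketch of PALO-style decomposition selection (illustrative, not the authors' code).
# Assumes a pre-trained language-conditioned policy exposes a per-step imitation loss and a
# VLM client can propose candidate subtask decompositions for a task instruction.

from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple
import numpy as np


@dataclass
class Demo:
    """One expert demonstration: per-step observations and actions, plus the task instruction."""
    observations: List[np.ndarray]
    actions: List[np.ndarray]
    instruction: str


# External components this sketch assumes (placeholders):
# policy_loss(obs, action, high_level, low_level) -> imitation loss for one step.
PolicyLoss = Callable[[np.ndarray, np.ndarray, str, str], float]
# propose(instruction, n) -> n candidate decompositions, each a list of (high, low) subtask strings.
Proposer = Callable[[str, int], List[List[Tuple[str, str]]]]


def assign_subtasks(num_steps: int, num_subtasks: int) -> List[int]:
    """Simplification: split the trajectory into equal-length chunks, one per subtask."""
    return [min(t * num_subtasks // num_steps, num_subtasks - 1) for t in range(num_steps)]


def select_decomposition(
    demos: Sequence[Demo],
    policy_loss: PolicyLoss,
    propose: Proposer,
    num_candidates: int = 16,
) -> List[Tuple[str, str]]:
    """Pick the VLM-proposed decomposition with the lowest validation error on the demos."""
    candidates = propose(demos[0].instruction, num_candidates)
    best, best_err = None, float("inf")
    for decomposition in candidates:
        err, count = 0.0, 0
        for demo in demos:
            subtask_ids = assign_subtasks(len(demo.actions), len(decomposition))
            for t, (obs, act) in enumerate(zip(demo.observations, demo.actions)):
                high, low = decomposition[subtask_ids[t]]
                err += policy_loss(obs, act, high, low)
                count += 1
        err /= max(count, 1)
        if err < best_err:
            best, best_err = decomposition, err
    return best
```

At test time, the selected (high-level, low-level) instruction pair for the current subtask would simply be fed to the pre-trained policy in place of a raw task instruction.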


Examples

We provide example rollouts of PALO performing 8 different tasks. Each frame of the video is annotated with the low-level language instructions inferred by PALO.


Sweep the mints to the right after putting the mushroom into the pot

Sweep the skittles into the bin using the swiffer after putting the mushroom in the container

Put the beet toy/purple thing into the drawer

Pry out the pot in the drawer using the ladle

Pour the contents in the scoop into the bowl

Make a salad bowl with corn and mushroom

Put the spoon in the cleaner while aligning it

Put the marker in the white box while aligning it


Evaluation

Comparison against baselines

On long-horizon tasks and tasks requiring unseen skills, PALO outperforms all conventional zero-shot generalization methods by 3x in success rate, and improves upon conventional fine-tuning methods despite using only 5 expert demonstrations per task.

PALO is also computationally efficient: whereas previous fine-tuning methods take up to 5 hours to train given the dataset size, PALO's optimization finishes in only 470 seconds.

Comparison against ablations

Ablation studies show that all components of PALO are necessary for its success. We ablate the following (a configuration sketch follows the list):

  1. No high-level \(c_H\) conditioning for the learned policy (via masking).
  2. No low-level \(c_L\) instruction conditioning (via masking).
  3. A fixed subtask partition \(u\) in each trajectory during evaluation.
  4. Decompositions \(c\) generated zero-shot, without expert demonstrations.
  5. No VLM decomposition proposals, conditioning the policy only on the task instruction \(\ell\).
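For concreteness, the ablations above could be expressed as configuration flags over the same components used in the earlier sketch. The flags and helper below are hypothetical, not the authors' implementation.

```python
# Hypothetical configuration of the ablation variants (illustrative only).
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AblationConfig:
    mask_high_level: bool = False         # (1) drop high-level c_H conditioning
    mask_low_level: bool = False          # (2) drop low-level c_L conditioning
    fixed_time_split: bool = False        # (3) keep a fixed u instead of optimizing it
    zero_shot_decomposition: bool = False # (4) sample c from the VLM without expert demos
    no_vlm: bool = False                  # (5) condition only on the raw instruction ell


def conditioning(high: str, low: str, instruction: str, cfg: AblationConfig) -> Tuple[str, str]:
    """Return the (high-level, low-level) strings actually fed to the policy."""
    if cfg.no_vlm:
        # No decomposition at all: fall back to the raw task instruction.
        return instruction, instruction
    return ("" if cfg.mask_high_level else high,
            "" if cfg.mask_low_level else low)
```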

BibTeX

@inproceedings{myers2024policy,
    title     = {Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation},
    author    = {Vivek Myers and Bill Chunyuan Zheng and Oier Mees and Sergey Levine and Kuan Fang},
    booktitle = {8th Annual Conference on Robot Learning},
    year      = {2024},
    url       = {https://openreview.net/forum?id=qUSa3F79am}
}