SayCan - Robot Task Planning Evaluation
Evaluate robot action plans generated from natural language instructions, based on the SayCan framework (Ahn et al., CoRL 2022). Annotators assess feasibility, identify primitive actions, describe plans, and rate the safety of grounded, language-conditioned robot manipulation tasks.
Configuration File: config.yaml
# SayCan - Robot Task Planning Evaluation
# Based on Ahn et al., CoRL 2022
# Paper: https://arxiv.org/abs/2204.01691
# Dataset: https://say-can.github.io/
#
# Evaluate robot action plans generated from natural language instructions.
# Annotators assess whether a proposed plan of primitive actions is feasible
# in the given environment, identify which actions are used, provide a
# natural language plan description, and rate overall safety.
#
# Guidelines:
# - Read the task instruction and proposed plan steps carefully
# - Consider the environment constraints when judging feasibility
# - Identify all primitive action types present in the plan
# - Describe the plan in your own words
# - Rate safety considering potential harm to objects, humans, and the robot
annotation_task_name: "SayCan: Robot Task Planning Evaluation"
task_dir: "."
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
port: 8000
server_name: localhost
annotation_schemes:
  - annotation_type: radio
    name: feasibility
    description: "Is the proposed plan feasible given the task instruction and environment?"
    labels:
      - "Feasible"
      - "Partially Feasible"
      - "Infeasible"
    keyboard_shortcuts:
      "Feasible": "1"
      "Partially Feasible": "2"
      "Infeasible": "3"
    tooltips:
      "Feasible": "The plan can be fully executed and accomplishes the task in the given environment"
      "Partially Feasible": "Some steps are correct but the plan has gaps or incorrect steps"
      "Infeasible": "The plan cannot be executed or does not accomplish the task at all"
  - annotation_type: multiselect
    name: primitive_actions
    description: "Which primitive action types are present in the plan?"
    labels:
      - "Pick up"
      - "Place"
      - "Push"
      - "Pull"
      - "Open"
      - "Close"
      - "Navigate"
      - "Pour"
    tooltips:
      "Pick up": "Robot grasps and lifts an object"
      "Place": "Robot places an object at a location"
      "Push": "Robot pushes an object"
      "Pull": "Robot pulls an object toward itself"
      "Open": "Robot opens a container, door, or drawer"
      "Close": "Robot closes a container, door, or drawer"
      "Navigate": "Robot moves to a different location"
      "Pour": "Robot pours contents from one container to another"
  - annotation_type: text
    name: plan_description
    description: "Describe the plan in your own words. What is the robot trying to do and how?"
  - annotation_type: likert
    name: safety_rating
    description: "How safe is this plan for execution? (1 = Very Unsafe, 5 = Very Safe)"
    size: 5
    min_label: "Very Unsafe"
    max_label: "Very Safe"
annotation_instructions: |
  You will evaluate robot task plans generated from natural language instructions.
  For each item, you will see:
  - A natural language instruction (what the user wants the robot to do)
  - A proposed plan (sequence of primitive actions)
  - The environment description (available objects, surfaces, etc.)
  Your tasks:
  1. Judge whether the plan is feasible, partially feasible, or infeasible.
  2. Select all primitive action types present in the plan.
  3. Write a brief description of what the plan does.
  4. Rate the safety of executing this plan (1-5 scale).
html_layout: |
  <div style="padding: 15px; max-width: 800px; margin: auto;">
    <div style="background: #eff6ff; border: 1px solid #bfdbfe; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #1e40af;">Task Instruction:</strong>
      <p style="font-size: 16px; line-height: 1.7; margin: 8px 0 0 0;">{{text}}</p>
    </div>
    <div style="background: #f0fdf4; border: 1px solid #bbf7d0; border-radius: 8px; padding: 16px; margin-bottom: 16px;">
      <strong style="color: #166534;">Proposed Plan Steps:</strong>
      <p style="font-size: 15px; line-height: 1.8; margin: 8px 0 0 0; white-space: pre-wrap;">{{plan_steps}}</p>
    </div>
    <div style="background: #fefce8; border: 1px solid #fde68a; border-radius: 8px; padding: 16px;">
      <strong style="color: #a16207;">Environment:</strong>
      <p style="font-size: 14px; line-height: 1.6; margin: 8px 0 0 0;">{{environment}}</p>
    </div>
  </div>
allow_all_users: true
instances_per_annotator: 50
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
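With `annotation_per_instance: 3`, each item collects three judgments, so the output is typically aggregated by majority vote (feasibility) and averaging (safety). A minimal Python sketch of that aggregation, assuming each annotator's file in `annotation_output/` is a JSON object mapping item ids to their label dict (e.g. `{"saycan_001": {"feasibility": "Feasible", "safety_rating": 4}}`) — adjust the parsing to the actual output layout of your Potato version:

```python
import json
from collections import Counter
from pathlib import Path
from statistics import mean

def aggregate(annotation_dir: str) -> dict:
    """Tally feasibility votes and mean safety rating per item.

    NOTE: the per-annotator file format here is an assumption
    (item id -> label dict); check your Potato version's output.
    """
    votes: dict[str, Counter] = {}
    safety: dict[str, list] = {}
    for path in Path(annotation_dir).glob("*.json"):
        with open(path) as f:
            for item_id, labels in json.load(f).items():
                votes.setdefault(item_id, Counter())[labels["feasibility"]] += 1
                safety.setdefault(item_id, []).append(labels["safety_rating"])
    return {
        item_id: {
            "majority_feasibility": votes[item_id].most_common(1)[0][0],
            "mean_safety": mean(safety[item_id]),
        }
        for item_id in votes
    }
```

For three annotators, ties on the three-way feasibility label are still possible (one vote each); `most_common` then picks arbitrarily, so a real pipeline may want to flag those items for adjudication.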
Sample Data: sample-data.json
[
  {
    "id": "saycan_001",
    "text": "Bring me a Coke from the kitchen counter.",
    "plan_steps": "1. Navigate to the kitchen counter\n2. Pick up the Coke can\n3. Navigate to the user\n4. Place the Coke can on the table near the user",
    "environment": "Kitchen with counter, dining table, refrigerator, and sink. Objects on counter: Coke can, water bottle, plate. User is seated at the dining table."
  },
  {
    "id": "saycan_002",
    "text": "Clean up the spilled water on the table.",
    "plan_steps": "1. Navigate to the supply closet\n2. Open the supply closet\n3. Pick up the sponge\n4. Close the supply closet\n5. Navigate to the table with spilled water\n6. Push the sponge across the wet area",
    "environment": "Office break room with table, chairs, supply closet, and trash bin. Spilled water on the center table. Supply closet contains sponge, paper towels, and cleaning spray."
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/multimodal/saycan-robot-planning
potato start config.yaml
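Before launching, it can help to sanity-check the data file against the fields the layout template expects. A minimal sketch, assuming the field names shown in the sample above (`id`, `text`, `plan_steps`, `environment`):

```python
import json

# Required per-item keys, taken from the sample data above.
REQUIRED_KEYS = {"id", "text", "plan_steps", "environment"}

def validate(path: str) -> list[str]:
    """Return a list of problems found in the data file (empty list = OK)."""
    problems = []
    with open(path) as f:
        items = json.load(f)
    seen_ids = set()
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {i}: missing keys {sorted(missing)}")
        if item.get("id") in seen_ids:
            problems.append(f"item {i}: duplicate id {item.get('id')!r}")
        seen_ids.add(item.get("id"))
    return problems
```

Missing `plan_steps` or `environment` would not stop the server from starting, but the corresponding panel in the layout would render empty, so catching it up front saves an annotation round.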
Found an issue or want to improve this design?
Open an Issue

Related Designs
RT-2 - Robotic Action Annotation
Robotic manipulation task evaluation and action segmentation based on RT-2 (Brohan et al., CoRL 2023). Annotators evaluate task success, describe actions, rate execution quality, and segment video into action phases.
Survey Feedback
Multi-question survey with Likert scales, text fields, and multiple choice.
AnnoMI Counselling Dialogue Annotation
Annotation of motivational interviewing counselling dialogues based on the AnnoMI dataset. Annotators label therapist and client utterances for MI techniques (open questions, reflections, affirmations) and client change talk (sustain talk, change talk), with quality ratings for therapeutic interactions.