
Live Coding Agent Observation

Watch coding agents work in real time with pause, rollback, and branching. Three backends are supported: Ollama for local models, the Anthropic API, and the Claude Agent SDK.


New in v2.4.0

Static trace annotation tells you what an agent did. Live observation tells you what an agent does in response to human guidance. Potato's live coding agent mode lets annotators watch a coding agent work in real time -- reading files, editing code, running tests -- and intervene at any point. Pause the agent, send new instructions, roll back to a previous checkpoint, or branch the trajectory to explore alternative approaches.

This produces richer annotation data than static traces alone. You get the full trajectory with timestamps, the annotator's interventions, branching decision points, and comparative data from alternative paths. This data is directly useful for training process reward models, preference models, and instruction-following evaluators.

Requirements

  • Python 3.10+
  • Git (the checkpoint system uses git commits)
  • One of the following agent backends:
    • Ollama for local model inference (no API key required)
    • ANTHROPIC_API_KEY for Anthropic API access
    • Claude Agent SDK for the full Claude Code agent experience

Backends

Potato supports three backends for running coding agents. Each backend runs the agent in a subprocess and streams its actions to the annotation interface in real time.
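The subprocess-and-stream pattern described above can be sketched in a few lines: the backend process emits one JSON event per line on stdout, and the interface consumes events as they arrive. The event fields here are illustrative, not Potato's actual schema.

```python
import json
import subprocess
import sys

# A child process standing in for an agent backend: it prints one JSON
# event per line, which the parent reads as a stream.
producer = (
    "import json;"
    "print(json.dumps({'step': 0, 'tool': 'read_file', 'file': 'src/parser.py'}));"
    "print(json.dumps({'step': 1, 'tool': 'edit_file', 'file': 'src/parser.py'}))"
)
child = subprocess.Popen(
    [sys.executable, "-c", producer],
    stdout=subprocess.PIPE,
    text=True,
)
# Iterating over stdout yields lines as the child produces them,
# so events can be forwarded to the UI without waiting for exit.
events = [json.loads(line) for line in child.stdout]
child.wait()
```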

1. Ollama (Local Models)

Run coding agents locally with no API key required. Ollama provides fast inference for open-weight models. Best for development, testing, and situations where data cannot leave the local machine.

Setup:

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
 
# Pull a coding-capable model
ollama pull qwen2.5-coder:7b
 
# Or a larger model for better performance
ollama pull deepseek-coder-v2:16b
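Before pointing Potato at Ollama, you can sanity-check the server from Python. This is a minimal sketch against Ollama's `/api/generate` endpoint; the host, model name, and option values are assumptions that should match your configuration.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default Ollama server URL

# Request payload mirroring the options used in the configuration below.
payload = {
    "model": "qwen2.5-coder:7b",
    "prompt": "Review this function for bugs:\ndef div(a, b): return a / b",
    "stream": False,
    "options": {"temperature": 0.2, "num_ctx": 8192},
}

def generate(host=OLLAMA_HOST, timeout=60):
    """POST to /api/generate and return the model's response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Call `generate()` with the server running; a connection error here means Ollama is not up (see Troubleshooting).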

Configuration:

yaml
agentic:
  enabled: true
  display_type: coding_trace
  live_agent:
    enabled: true
    backend: ollama
    model: qwen2.5-coder:7b
 
    ollama:
      host: "http://localhost:11434"    # Ollama server URL
      temperature: 0.2
      num_ctx: 8192                     # context window size
      num_predict: 2048                 # max tokens per response
      keep_alive: "5m"                  # keep model loaded in memory
 
    # Agent capabilities
    tools:
      - read_file
      - edit_file
      - write_file
      - bash
      - glob
      - grep
    max_steps: 50
    step_timeout_seconds: 60

2. Anthropic API

Use Claude models via the Anthropic API. This backend offers strong coding performance with tool use and requires an API key.

Setup:

bash
# Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."
 
# Or add to .env file
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

Configuration:

yaml
agentic:
  enabled: true
  display_type: coding_trace
  live_agent:
    enabled: true
    backend: anthropic
    model: claude-sonnet-4-20250514
 
    anthropic:
      api_key: ${ANTHROPIC_API_KEY}
      max_tokens: 4096
      temperature: 0.2
      system_prompt: |
        You are a coding assistant working on a software project.
        Read files before editing them. Run tests after making changes.
        Explain your reasoning before each action.
 
    # Agent capabilities
    tools:
      - read_file
      - edit_file
      - write_file
      - bash
      - glob
      - grep
    max_steps: 100
    step_timeout_seconds: 120
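A single tool-use turn against this backend looks roughly like the sketch below. The `read_file` schema is illustrative, not Potato's actual tool definition, and the API call only runs when a key is present.

```python
import os

# Illustrative tool definition in the Messages API tool-use format.
read_file_tool = {
    "name": "read_file",
    "description": "Read a file from the project directory.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# Request mirroring the model and max_tokens from the configuration above.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 4096,
    "tools": [read_file_tool],
    "messages": [{"role": "user", "content": "Fix the bug in src/parser.py"}],
}

if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic  # pip install anthropic

    # The response may contain tool_use blocks that the harness executes
    # before sending results back in a follow-up message.
    response = anthropic.Anthropic().messages.create(**request)
```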

3. Claude Agent SDK

The Claude Agent SDK provides the full Claude Code agent experience, including automatic tool orchestration, context management, and multi-file reasoning. This is the most capable backend but requires the SDK to be installed.

Setup:

bash
# Install the Claude Agent SDK
pip install claude-agent-sdk
 
# Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."

Configuration:

yaml
agentic:
  enabled: true
  display_type: coding_trace
  live_agent:
    enabled: true
    backend: claude_agent_sdk
 
    claude_agent_sdk:
      api_key: ${ANTHROPIC_API_KEY}
      model: claude-sonnet-4-20250514
      max_turns: 100
      permission_mode: auto           # auto-approve tool use
      enable_thinking: true           # show extended thinking
 
    max_steps: 100
    step_timeout_seconds: 180

Controls

The annotation interface provides four control actions that let annotators guide the agent's behavior.

Pause / Resume

Click Pause to halt the agent between steps. The agent finishes its current step and waits. The annotator can review the current state, examine files, and decide whether to let the agent continue or intervene. Click Resume to let the agent proceed.

yaml
live_agent:
  controls:
    pause_resume:
      enabled: true
      auto_pause_on_error: true      # pause when a command fails
      auto_pause_after_steps: 0      # pause after N steps (0 = disabled)
      keyboard_shortcut: "Space"
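The between-step pause semantics above can be sketched with a simple event flag: the agent loop waits on the flag before each step, so Pause takes effect at the next step boundary rather than mid-step. This is a minimal illustration, not Potato's implementation.

```python
import threading

resume = threading.Event()
resume.set()  # agent starts unpaused

log = []

def run_step(tool):
    resume.wait()      # blocks here while paused
    log.append(tool)   # stand-in for executing one agent step

def pause():
    resume.clear()     # next run_step() call will block

def resume_agent():
    resume.set()       # unblocks any waiting step

for tool in ["read_file", "edit_file", "bash"]:
    run_step(tool)
```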

Send Instructions

While the agent is paused, annotators can send new instructions that redirect the agent. This is useful when the agent is going down the wrong path or when the annotator wants to test how the agent responds to guidance.

yaml
live_agent:
  controls:
    send_instructions:
      enabled: true
      placeholder: "Type instructions for the agent..."
      inject_as: system_message      # "system_message" or "user_message"
      keyboard_shortcut: "Enter"
      presets:
        - "Try a different approach"
        - "Read the error message more carefully"
        - "Check the test file for expected behavior"
        - "Revert your last change and try again"

Instructions are injected into the agent's conversation context. The inject_as option controls whether they appear as a system message (authoritative instruction) or a user message (conversational guidance).
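Conceptually, injection is an append to the conversation's message list, with the role chosen by inject_as. A minimal sketch (message field names are illustrative):

```python
# Conversation context as a list of role/content messages.
conversation = [
    {"role": "user", "content": "Fix the bug in src/parser.py"},
    {"role": "assistant", "content": "I'll start by reading the file."},
]

def inject(conversation, text, inject_as="system_message"):
    """Append an annotator instruction with the role chosen by inject_as."""
    role = "system" if inject_as == "system_message" else "user"
    conversation.append({"role": role, "content": text})

inject(conversation, "Try a different approach")
```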

Rollback

Rollback reverts the project to a previous git checkpoint. Every file change the agent makes is automatically committed, so the annotator can click any previous step in the timeline and roll back to that exact state. The agent's conversation context is also truncated to match.

yaml
live_agent:
  controls:
    rollback:
      enabled: true
      show_checkpoint_diff: true     # show what will be undone
      require_confirmation: true     # "Are you sure?" dialog
      keyboard_shortcut: "Ctrl+Z"

Branch and Replay

Branch and replay combines rollback with instruction sending. The annotator rolls back to a checkpoint and sends different instructions, creating a branching trajectory. This is powerful for collecting preference data: you can explore two different approaches from the same starting point and compare outcomes.

yaml
live_agent:
  controls:
    branch:
      enabled: true
      max_branches: 5                # maximum branches from any checkpoint
      branch_naming: auto            # "auto" or "manual"
      compare_view: true             # side-by-side branch comparison
      keyboard_shortcut: "Ctrl+B"

The branch comparison view shows two branches side by side, highlighting where they diverge. Annotators can rate which branch produced better results, generating preference pairs for DPO training.

Git Checkpoint System

The live agent mode uses git to track every file change. This provides reliable rollback, branching, and full change history.

How It Works

  1. Before the agent starts, Potato creates a new git branch named potato-session-{session_id}
  2. After every file change (edit, write, create, delete), Potato automatically commits with a descriptive message
  3. Each commit is tagged as a checkpoint that appears in the timeline
  4. Rollback uses git checkout to restore the working directory to any checkpoint
  5. Branching creates a new git branch from the checkpoint commit
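The flow above maps onto plain git commands. Here is a scratch-repo sketch of steps 1, 2, and 4; the branch name and commit messages are illustrative, and Potato's actual formats may differ.

```python
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

repo = Path(tempfile.mkdtemp())
git("init", "-q", cwd=repo)
git("config", "user.email", "annotator@example.com", cwd=repo)
git("config", "user.name", "Potato Demo", cwd=repo)
git("checkout", "-q", "-b", "potato-session-demo", cwd=repo)   # 1. session branch

# 2. auto-commit after a file change
(repo / "parser.py").write_text("def parse(s):\n    return s.split()\n")
git("add", "-A", cwd=repo)
git("commit", "-q", "-m", "Step 1: edit_file parser.py", cwd=repo)
checkpoint = subprocess.run(
    ["git", "rev-parse", "HEAD"], cwd=repo, capture_output=True, text=True
).stdout.strip()

# A later (bad) change, also auto-committed
(repo / "parser.py").write_text("broken edit\n")
git("add", "-A", cwd=repo)
git("commit", "-q", "-m", "Step 2: edit_file parser.py", cwd=repo)

# 4. rollback: restore the working tree from the checkpoint commit
git("checkout", "-q", checkpoint, "--", ".", cwd=repo)
```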

Configuration

yaml
live_agent:
  git_checkpoints:
    enabled: true
    branch_prefix: "potato-session"
    commit_message_format: "Step {step}: {tool} {file_path}"
    auto_commit: true
    cleanup_on_complete: false       # delete session branches when done
    require_clean_working_dir: true  # fail if there are uncommitted changes

Manual Checkpoint Management

bash
# List all Potato session branches
git branch | grep potato-session
 
# View checkpoints for a session
git log potato-session-abc123 --oneline
 
# Clean up old session branches
python -m potato.cleanup_sessions --older-than 7d

Data Format

Input data for live coding agent tasks specifies the task description and optionally a starting file or directory:

json
{
  "id": "task_001",
  "task_description": "Fix the bug in src/parser.py where empty input causes a crash",
  "project_dir": "/path/to/project",
  "start_file": "src/parser.py",
  "test_command": "python -m pytest tests/test_parser.py -v",
  "context_files": [
    "src/parser.py",
    "tests/test_parser.py"
  ]
}
Field             Required  Description
id                Yes       Unique task identifier
task_description  Yes       What the agent should do
project_dir       Yes       Path to the project directory
start_file        No        File to show the agent initially
test_command      No        Command to verify the fix
context_files     No        Files to pre-load into the agent's context
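Validating a record against this field table before starting a session catches malformed input early. A minimal sketch (the checks simply mirror the Required column):

```python
REQUIRED = {"id", "task_description", "project_dir"}
OPTIONAL = {"start_file", "test_command", "context_files"}

def validate_task(record):
    """Raise ValueError if required fields are missing or unknown fields appear."""
    missing = REQUIRED - record.keys()
    unknown = record.keys() - REQUIRED - OPTIONAL
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return record

task = validate_task({
    "id": "task_001",
    "task_description": "Fix the bug in src/parser.py",
    "project_dir": "/path/to/project",
})
```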

Configuration Reference

Complete configuration for a live coding agent observation task:

yaml
task_name: "Live Coding Agent Observation"
task_dir: "."
 
data_files:
  - "data/coding_tasks.jsonl"
 
item_properties:
  id_key: id
  text_key: task_description
 
agentic:
  enabled: true
  display_type: coding_trace
 
  coding_trace_display:
    diff_style: unified
    diff_context_lines: 3
    syntax_highlight: true
    show_line_numbers: true
    terminal_theme: dark
    file_tree:
      enabled: true
      position: left
      click_to_navigate: true
 
  live_agent:
    enabled: true
    backend: anthropic
    model: claude-sonnet-4-20250514
 
    anthropic:
      api_key: ${ANTHROPIC_API_KEY}
      max_tokens: 4096
      temperature: 0.2
 
    tools:
      - read_file
      - edit_file
      - write_file
      - bash
      - glob
      - grep
 
    max_steps: 100
    step_timeout_seconds: 120
 
    controls:
      pause_resume:
        enabled: true
        auto_pause_on_error: true
        keyboard_shortcut: "Space"
      send_instructions:
        enabled: true
        inject_as: system_message
        presets:
          - "Try a different approach"
          - "Read the error message carefully"
          - "Run the tests first"
      rollback:
        enabled: true
        require_confirmation: true
      branch:
        enabled: true
        max_branches: 5
        compare_view: true
 
    git_checkpoints:
      enabled: true
      branch_prefix: "potato-session"
      auto_commit: true
      cleanup_on_complete: false
 
annotation_schemes:
  # Per-step ratings during observation
  - annotation_type: per_turn_rating
    name: step_quality
    description: "Rate each agent step as you observe it"
    target: agentic_steps
    rating_type: radio
    labels:
      - "Good"
      - "Acceptable"
      - "Unnecessary"
      - "Incorrect"
 
  # Overall task completion after agent finishes
  - annotation_type: radio
    name: task_completion
    description: "Did the agent complete the task?"
    labels:
      - "Fully Complete"
      - "Partially Complete"
      - "Failed"
 
  # Branch comparison (when branching is used)
  - annotation_type: radio
    name: branch_preference
    description: "Which branch produced a better result?"
    labels:
      - "Branch A"
      - "Branch B"
      - "Both Equal"
      - "Both Failed"
 
  # Notes on the observation
  - annotation_type: text
    name: observation_notes
    description: "Describe what you observed and any interventions you made"
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Branching Trajectory Export

When annotators use branch and replay, the output includes the full branching tree. This format is designed for training preference models and process reward models from comparative trajectories.

json
{
  "id": "task_001",
  "annotator": "observer_01",
  "root_branch": {
    "branch_id": "main",
    "steps": [
      {"step": 0, "type": "file_read", "file": "src/parser.py", "rating": "Good"},
      {"step": 1, "type": "edit", "file": "src/parser.py", "rating": "Incorrect"}
    ],
    "children": [
      {
        "branch_id": "branch_1",
        "branch_point": 1,
        "instruction": "Try a different approach -- use a try/except block instead",
        "steps": [
          {"step": 2, "type": "edit", "file": "src/parser.py", "rating": "Good"},
          {"step": 3, "type": "terminal", "command": "pytest", "rating": "Good"}
        ],
        "outcome": "Fully Complete",
        "children": []
      },
      {
        "branch_id": "branch_2",
        "branch_point": 1,
        "instruction": "Read the test file first to understand expected behavior",
        "steps": [
          {"step": 2, "type": "file_read", "file": "tests/test_parser.py", "rating": "Good"},
          {"step": 3, "type": "edit", "file": "src/parser.py", "rating": "Good"},
          {"step": 4, "type": "terminal", "command": "pytest", "rating": "Good"}
        ],
        "outcome": "Fully Complete",
        "children": []
      }
    ]
  },
  "branch_preference": "Branch B",
  "observation_notes": "Both branches solved the problem, but branch B produced cleaner code by reading the tests first."
}
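One way to consume this format is to walk the tree and pair up sibling branches: branches that share a branch point started from the same state, so each sibling pair is a candidate preference example. A sketch (field names follow the export format above):

```python
def sibling_pairs(node):
    """Collect (branch_id, branch_id) pairs for all siblings in the tree."""
    pairs = []
    children = node.get("children", [])
    for i, left in enumerate(children):
        for right in children[i + 1:]:
            pairs.append((left["branch_id"], right["branch_id"]))
        pairs.extend(sibling_pairs(left))  # recurse into deeper branches
    return pairs

# Trimmed-down version of the export example above.
tree = {
    "branch_id": "main",
    "children": [
        {"branch_id": "branch_1", "children": []},
        {"branch_id": "branch_2", "children": []},
    ],
}
pairs = sibling_pairs(tree)
```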

Export branching trajectories for preference learning:

bash
# Export as DPO preference pairs from branch comparisons
python -m potato.export \
  -i output/ \
  -f branching_dpo \
  -o results/branch_preferences.jsonl
 
# Export full trajectory trees
python -m potato.export \
  -i output/ \
  -f trajectory_tree \
  -o results/trajectory_trees.jsonl

Security

The live agent runs in the project directory specified in the task data. It has access to read, write, and execute files within that directory. Consider the following security practices:

  • Sandboxing: For untrusted code or untrusted agent models, run Potato inside a Docker container or VM. The agent can execute arbitrary shell commands, so isolation is important.
  • Read-only mode: Disable the bash and write_file tools if you only want the agent to analyze code without modifying it.
  • Network restrictions: Use Docker's --network none flag to prevent the agent from making network requests.
  • Resource limits: Set max_steps and step_timeout_seconds to prevent runaway agents.
yaml
# Restricted tool set for analysis-only tasks
live_agent:
  tools:
    - read_file
    - glob
    - grep
  # No edit_file, write_file, or bash

Troubleshooting

Ollama Not Running

text
Error: Connection refused at http://localhost:11434

Start the Ollama server:

bash
ollama serve

Verify it is running:

bash
ollama list

API Key Missing

text
Error: ANTHROPIC_API_KEY environment variable not set

Set the environment variable:

bash
export ANTHROPIC_API_KEY="sk-ant-..."

Or add it to your project's .env file. Potato loads .env files automatically.

Git Not Initialized

text
Error: Project directory is not a git repository

The checkpoint system requires git. Initialize a repository in the project directory:

bash
cd /path/to/project
git init
git add -A
git commit -m "Initial commit"

Agent Stuck in a Loop

If the agent repeats the same action over and over, it may be stuck. Potato detects loops when the same tool call with the same arguments occurs three times in a row and automatically pauses the agent. You can configure this threshold:

yaml
live_agent:
  loop_detection:
    enabled: true
    threshold: 3                     # pause after N identical consecutive steps
    action: pause                    # "pause" or "terminate"
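The detection logic amounts to counting identical consecutive steps. A minimal sketch, not Potato's implementation:

```python
def detect_loop(steps, threshold=3):
    """Return True once the same step repeats `threshold` times in a row."""
    run = 1
    for prev, cur in zip(steps, steps[1:]):
        run = run + 1 if cur == prev else 1
        if run >= threshold:
            return True
    return False

# Each step modeled as a (tool, arguments) tuple.
calls = [("bash", "pytest"), ("bash", "pytest"), ("bash", "pytest")]
```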

Session Branch Cleanup

Over time, session branches accumulate. Clean them up periodically:

bash
# Remove branches older than 7 days
python -m potato.cleanup_sessions --older-than 7d
 
# Remove all session branches
python -m potato.cleanup_sessions --all
 
# Dry run (show what would be deleted)
python -m potato.cleanup_sessions --older-than 7d --dry-run

See Also

For implementation details, see the source documentation.