Live Coding Agent Observation
Watch coding agents work in real time with pause, rollback, and branching. Three backends are supported: Ollama for local models, the Anthropic API, and the Claude Agent SDK.
New in v2.4.0
Static trace annotation tells you what an agent did. Live observation tells you what an agent does in response to human guidance. Potato's live coding agent mode lets annotators watch a coding agent work in real time -- reading files, editing code, running tests -- and intervene at any point. Pause the agent, send new instructions, roll back to a previous checkpoint, or branch the trajectory to explore alternative approaches.
This produces richer annotation data than static traces alone. You get the full trajectory with timestamps, the annotator's interventions, branching decision points, and comparative data from alternative paths. This data is directly useful for training process reward models, preference models, and instruction-following evaluators.
Requirements
- Python 3.10+
- Git (the checkpoint system uses git commits)
- One of the following agent backends:
  - Ollama for local model inference (no API key required)
  - ANTHROPIC_API_KEY for Anthropic API access
  - Claude Agent SDK for the full Claude Code agent experience
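Before launching a session, a quick preflight check can confirm that a backend is actually available. This helper is illustrative and not part of Potato; the default Ollama port and the `ANTHROPIC_API_KEY` variable name come from this page, everything else is an assumption.

```python
import os
import urllib.error
import urllib.request


def anthropic_key_present() -> bool:
    """True if ANTHROPIC_API_KEY is set to a non-empty value."""
    return bool(os.environ.get("ANTHROPIC_API_KEY"))


def ollama_reachable(host: str = "http://localhost:11434",
                     timeout: float = 2.0) -> bool:
    """True if an Ollama server answers at `host`.

    Ollama's root endpoint responds to a plain GET when the server is up,
    so a successful request is enough for a liveness check.
    """
    try:
        with urllib.request.urlopen(host, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False
```

Running both checks before starting a session gives annotators a clear error up front instead of a mid-session failure.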
Backends
Potato supports three backends for running coding agents. Each backend runs the agent in a subprocess and streams its actions to the annotation interface in real time.
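The subprocess-streaming pattern can be sketched as reading structured events from the agent process line by line. This is a simplification for illustration; the JSON-lines protocol here is an assumption, not Potato's actual wire format.

```python
import json
import subprocess
from typing import Iterator


def stream_agent_events(cmd: list[str]) -> Iterator[dict]:
    """Launch an agent subprocess and yield one event dict per JSON line.

    Each line the agent prints is parsed as a standalone JSON object,
    e.g. {"type": "edit_file", "file": "src/parser.py"} (hypothetical shape).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    try:
        for line in proc.stdout:
            line = line.strip()
            if line:
                yield json.loads(line)
    finally:
        proc.wait()
```

Because events are yielded as they arrive, the annotation interface can render each step immediately rather than waiting for the agent to finish.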
1. Ollama (Local Models)
Run coding agents locally with no API key required. Ollama provides fast inference for open-weight models. Best for development, testing, and situations where data cannot leave the local machine.
Setup:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a coding-capable model
ollama pull qwen2.5-coder:7b
# Or a larger model for better performance
ollama pull deepseek-coder-v2:16b

Configuration:
agentic:
  enabled: true
  display_type: coding_trace

live_agent:
  enabled: true
  backend: ollama
  model: qwen2.5-coder:7b

  ollama:
    host: "http://localhost:11434"  # Ollama server URL
    temperature: 0.2
    num_ctx: 8192       # context window size
    num_predict: 2048   # max tokens per response
    keep_alive: "5m"    # keep model loaded in memory

  # Agent capabilities
  tools:
    - read_file
    - edit_file
    - write_file
    - bash
    - glob
    - grep
  max_steps: 50
  step_timeout_seconds: 60

2. Anthropic API
Use Claude models via the Anthropic API. Provides strong coding performance with tool use capabilities. Requires an API key.
Setup:
# Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."
# Or add to .env file
echo "ANTHROPIC_API_KEY=sk-ant-..." >> .env

Configuration:

agentic:
  enabled: true
  display_type: coding_trace

live_agent:
  enabled: true
  backend: anthropic
  model: claude-sonnet-4-20250514

  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.2
    system_prompt: |
      You are a coding assistant working on a software project.
      Read files before editing them. Run tests after making changes.
      Explain your reasoning before each action.

  # Agent capabilities
  tools:
    - read_file
    - edit_file
    - write_file
    - bash
    - glob
    - grep
  max_steps: 100
  step_timeout_seconds: 120

3. Claude Agent SDK
The Claude Agent SDK provides the full Claude Code agent experience, including automatic tool orchestration, context management, and multi-file reasoning. This is the most capable backend but requires the SDK to be installed.
Setup:
# Install the Claude Agent SDK
pip install claude-agent-sdk
# Set your API key
export ANTHROPIC_API_KEY="sk-ant-..."

Configuration:

agentic:
  enabled: true
  display_type: coding_trace

live_agent:
  enabled: true
  backend: claude_agent_sdk

  claude_agent_sdk:
    api_key: ${ANTHROPIC_API_KEY}
    model: claude-sonnet-4-20250514
    max_turns: 100
    permission_mode: auto   # auto-approve tool use
    enable_thinking: true   # show extended thinking

  max_steps: 100
  step_timeout_seconds: 180

Controls
The annotation interface provides four control actions that let annotators guide the agent's behavior.
Pause / Resume
Click Pause to halt the agent between steps. The agent finishes its current step and waits. The annotator can review the current state, examine files, and decide whether to let the agent continue or intervene. Click Resume to let the agent proceed.
live_agent:
  controls:
    pause_resume:
      enabled: true
      auto_pause_on_error: true    # pause when a command fails
      auto_pause_after_steps: 0    # pause after N steps (0 = disabled)
      keyboard_shortcut: "Space"

Send Instructions
While the agent is paused, annotators can send new instructions that redirect the agent. This is useful when the agent is going down the wrong path or when the annotator wants to test how the agent responds to guidance.
live_agent:
  controls:
    send_instructions:
      enabled: true
      placeholder: "Type instructions for the agent..."
      inject_as: system_message    # "system_message" or "user_message"
      keyboard_shortcut: "Enter"
      presets:
        - "Try a different approach"
        - "Read the error message more carefully"
        - "Check the test file for expected behavior"
        - "Revert your last change and try again"

Instructions are injected into the agent's conversation context. The inject_as option controls whether they appear as a system message (authoritative instruction) or a user message (conversational guidance).
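In message-list terms, inject_as amounts to appending the instruction with a different role. A minimal sketch, assuming a chat-API-style message format rather than Potato's internals:

```python
def inject_instruction(messages: list[dict], instruction: str,
                       inject_as: str = "system_message") -> list[dict]:
    """Append an annotator instruction to a chat-style message list.

    "system_message" maps to the authoritative "system" role;
    anything else falls back to the conversational "user" role.
    """
    role = "system" if inject_as == "system_message" else "user"
    return messages + [{"role": role, "content": instruction}]
```

The choice matters because most models weight system messages more heavily, so a system injection redirects the agent more forcefully than a user one.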
Rollback
Rollback reverts the project to a previous git checkpoint. Every file change the agent makes is automatically committed, so the annotator can click any previous step in the timeline and roll back to that exact state. The agent's conversation context is also truncated to match.
live_agent:
  controls:
    rollback:
      enabled: true
      show_checkpoint_diff: true    # show what will be undone
      require_confirmation: true    # "Are you sure?" dialog
      keyboard_shortcut: "Ctrl+Z"

Branch and Replay
Branch and replay combines rollback with instruction sending. The annotator rolls back to a checkpoint and sends different instructions, creating a branching trajectory. This is powerful for collecting preference data: you can explore two different approaches from the same starting point and compare outcomes.
live_agent:
  controls:
    branch:
      enabled: true
      max_branches: 5       # maximum branches from any checkpoint
      branch_naming: auto   # "auto" or "manual"
      compare_view: true    # side-by-side branch comparison
      keyboard_shortcut: "Ctrl+B"

The branch comparison view shows two branches side by side, highlighting where they diverge. Annotators can rate which branch produced better results, generating preference pairs for DPO training.
Git Checkpoint System
The live agent mode uses git to track every file change. This provides reliable rollback, branching, and full change history.
How It Works
- Before the agent starts, Potato creates a new git branch named potato-session-{session_id}
- After every file change (edit, write, create, delete), Potato automatically commits with a descriptive message
- Each commit is tagged as a checkpoint that appears in the timeline
- Rollback uses git checkout to restore the working directory to any checkpoint
- Branching creates a new git branch from the checkpoint commit
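The flow above can be sketched with plain git commands via subprocess. This is a simplification of whatever Potato does internally; the branch and commit-message formats follow the configuration documented on this page, and the rollback shown only restores tracked files.

```python
import subprocess


def git(repo: str, *args: str) -> str:
    """Run a git command in `repo` and return its stdout."""
    out = subprocess.run(["git", "-C", repo, *args],
                         check=True, capture_output=True, text=True)
    return out.stdout.strip()


def start_session(repo: str, session_id: str) -> str:
    """Create the per-session branch before the agent starts."""
    branch = f"potato-session-{session_id}"
    git(repo, "checkout", "-b", branch)
    return branch


def checkpoint(repo: str, step: int, tool: str, file_path: str) -> str:
    """Commit all working-tree changes after a file change; return the hash."""
    git(repo, "add", "-A")
    git(repo, "commit", "-m", f"Step {step}: {tool} {file_path}")
    return git(repo, "rev-parse", "HEAD")


def rollback(repo: str, commit: str) -> None:
    """Restore tracked files to the state of a checkpoint commit."""
    git(repo, "checkout", commit, "--", ".")
```

Committing after every change keeps checkpoints cheap: git deduplicates unchanged content, so even long sessions add little storage overhead.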
Configuration
live_agent:
  git_checkpoints:
    enabled: true
    branch_prefix: "potato-session"
    commit_message_format: "Step {step}: {tool} {file_path}"
    auto_commit: true
    cleanup_on_complete: false        # delete session branches when done
    require_clean_working_dir: true   # fail if there are uncommitted changes

Manual Checkpoint Management
# List all Potato session branches
git branch | grep potato-session
# View checkpoints for a session
git log potato-session-abc123 --oneline
# Clean up old session branches
python -m potato.cleanup_sessions --older-than 7d

Data Format
Input data for live coding agent tasks specifies the task description and optionally a starting file or directory:
{
  "id": "task_001",
  "task_description": "Fix the bug in src/parser.py where empty input causes a crash",
  "project_dir": "/path/to/project",
  "start_file": "src/parser.py",
  "test_command": "python -m pytest tests/test_parser.py -v",
  "context_files": [
    "src/parser.py",
    "tests/test_parser.py"
  ]
}

| Field | Required | Description |
|---|---|---|
| id | Yes | Unique task identifier |
| task_description | Yes | What the agent should do |
| project_dir | Yes | Path to the project directory |
| start_file | No | File to show the agent initially |
| test_command | No | Command to verify the fix |
| context_files | No | Files to pre-load into the agent's context |
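Task files can be generated programmatically as JSON lines. A minimal helper, using the field names from the table above; the validation logic is illustrative, not Potato's own loader.

```python
import json

# Required fields per the data-format table above.
REQUIRED_FIELDS = ("id", "task_description", "project_dir")


def write_tasks(tasks: list[dict], path: str) -> None:
    """Write coding tasks as one JSON object per line, validating first."""
    for task in tasks:
        missing = [k for k in REQUIRED_FIELDS if k not in task]
        if missing:
            raise ValueError(f"task {task.get('id', '?')} missing: {missing}")
    with open(path, "w") as f:
        for task in tasks:
            f.write(json.dumps(task) + "\n")
```

Validating before writing means a malformed task fails at dataset-build time rather than when an annotator opens the session.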
Configuration Reference
Complete configuration for a live coding agent observation task:
task_name: "Live Coding Agent Observation"
task_dir: "."
data_files:
  - "data/coding_tasks.jsonl"

item_properties:
  id_key: id
  text_key: task_description

agentic:
  enabled: true
  display_type: coding_trace

  coding_trace_display:
    diff_style: unified
    diff_context_lines: 3
    syntax_highlight: true
    show_line_numbers: true
    terminal_theme: dark
    file_tree:
      enabled: true
      position: left
      click_to_navigate: true

live_agent:
  enabled: true
  backend: anthropic
  model: claude-sonnet-4-20250514

  anthropic:
    api_key: ${ANTHROPIC_API_KEY}
    max_tokens: 4096
    temperature: 0.2

  tools:
    - read_file
    - edit_file
    - write_file
    - bash
    - glob
    - grep
  max_steps: 100
  step_timeout_seconds: 120

  controls:
    pause_resume:
      enabled: true
      auto_pause_on_error: true
      keyboard_shortcut: "Space"
    send_instructions:
      enabled: true
      inject_as: system_message
      presets:
        - "Try a different approach"
        - "Read the error message carefully"
        - "Run the tests first"
    rollback:
      enabled: true
      require_confirmation: true
    branch:
      enabled: true
      max_branches: 5
      compare_view: true

  git_checkpoints:
    enabled: true
    branch_prefix: "potato-session"
    auto_commit: true
    cleanup_on_complete: false

annotation_schemes:
  # Per-step ratings during observation
  - annotation_type: per_turn_rating
    name: step_quality
    description: "Rate each agent step as you observe it"
    target: agentic_steps
    rating_type: radio
    labels:
      - "Good"
      - "Acceptable"
      - "Unnecessary"
      - "Incorrect"

  # Overall task completion after agent finishes
  - annotation_type: radio
    name: task_completion
    description: "Did the agent complete the task?"
    labels:
      - "Fully Complete"
      - "Partially Complete"
      - "Failed"

  # Branch comparison (when branching is used)
  - annotation_type: radio
    name: branch_preference
    description: "Which branch produced a better result?"
    labels:
      - "Branch A"
      - "Branch B"
      - "Both Equal"
      - "Both Failed"

  # Notes on the observation
  - annotation_type: text
    name: observation_notes
    description: "Describe what you observed and any interventions you made"
    label_requirement:
      required: false

output_annotation_dir: "output/"
output_annotation_format: "jsonl"

Branching Trajectory Export
When annotators use branch and replay, the output includes the full branching tree. This format is designed for training preference models and process reward models from comparative trajectories.
{
  "id": "task_001",
  "annotator": "observer_01",
  "root_branch": {
    "branch_id": "main",
    "steps": [
      {"step": 0, "type": "file_read", "file": "src/parser.py", "rating": "Good"},
      {"step": 1, "type": "edit", "file": "src/parser.py", "rating": "Incorrect"}
    ],
    "children": [
      {
        "branch_id": "branch_1",
        "branch_point": 1,
        "instruction": "Try a different approach -- use a try/except block instead",
        "steps": [
          {"step": 2, "type": "edit", "file": "src/parser.py", "rating": "Good"},
          {"step": 3, "type": "terminal", "command": "pytest", "rating": "Good"}
        ],
        "outcome": "Fully Complete",
        "children": []
      },
      {
        "branch_id": "branch_2",
        "branch_point": 1,
        "instruction": "Read the test file first to understand expected behavior",
        "steps": [
          {"step": 2, "type": "file_read", "file": "tests/test_parser.py", "rating": "Good"},
          {"step": 3, "type": "edit", "file": "src/parser.py", "rating": "Good"},
          {"step": 4, "type": "terminal", "command": "pytest", "rating": "Good"}
        ],
        "outcome": "Fully Complete",
        "children": []
      }
    ]
  },
  "branch_preference": "Branch B",
  "observation_notes": "Both branches solved the problem, but branch B produced cleaner code by reading the tests first."
}

Export branching trajectories for preference learning:
# Export as DPO preference pairs from branch comparisons
python -m potato.export \
-i output/ \
-f branching_dpo \
-o results/branch_preferences.jsonl
# Export full trajectory trees
python -m potato.export \
-i output/ \
-f trajectory_tree \
-o results/trajectory_trees.jsonl

Security
The live agent runs in the project directory specified in the task data. It has access to read, write, and execute files within that directory. Consider the following security practices:
- Sandboxing: For untrusted code or untrusted agent models, run Potato inside a Docker container or VM. The agent can execute arbitrary shell commands, so isolation is important.
- Read-only mode: Disable the bash and write_file tools if you only want the agent to analyze code without modifying it.
- Network restrictions: Use Docker's --network none flag to prevent the agent from making network requests.
- Resource limits: Set max_steps and step_timeout_seconds to prevent runaway agents.
# Restricted tool set for analysis-only tasks
live_agent:
  tools:
    - read_file
    - glob
    - grep
  # No edit_file, write_file, or bash

Troubleshooting
Ollama Not Running
Error: Connection refused at http://localhost:11434
Start the Ollama server:
ollama serve

Verify it is running:
ollama list

API Key Missing
Error: ANTHROPIC_API_KEY environment variable not set
Set the environment variable:
export ANTHROPIC_API_KEY="sk-ant-..."

Or add it to your project's .env file. Potato loads .env files automatically.
Git Not Initialized
Error: Project directory is not a git repository
The checkpoint system requires git. Initialize a repository in the project directory:
cd /path/to/project
git init
git add -A
git commit -m "Initial commit"

Agent Stuck in a Loop
If the agent repeats the same action over and over, it may be stuck. Potato detects a loop when the same tool call with the same arguments is repeated three times in a row, and automatically pauses the agent. You can configure this threshold:
live_agent:
  loop_detection:
    enabled: true
    threshold: 3    # pause after N identical consecutive steps
    action: pause   # "pause" or "terminate"

Session Branch Cleanup
Over time, session branches accumulate. Clean them up periodically:
# Remove branches older than 7 days
python -m potato.cleanup_sessions --older-than 7d
# Remove all session branches
python -m potato.cleanup_sessions --all
# Dry run (show what would be deleted)
python -m potato.cleanup_sessions --older-than 7d --dry-run

See Also
- Coding Agent Annotation -- annotate static coding agent traces
- Process Reward Annotation -- collect per-step reward signals for PRM training
- Code Review Annotation -- GitHub PR-style inline review for code changes
- Agentic Annotation -- general-purpose agent trace annotation
For implementation details, see the source documentation.