Guides · 4 min read

Evaluating Voice and Video Agents

A walkthrough of human evaluation for spoken, video, and document agents in Potato: scoring turn-taking on a dual-track timeline, grounding video events with live IoU, tagging speech errors, and marking table structure.

Potato Team

Agents that talk, watch video, and read documents fail in ways a text box cannot show. A voice agent's mistakes live at the seams between turns; a video agent's answer is a time interval, not a sentence; a document agent's error is a misread table cell. Each of these needs a review surface shaped to the modality. Potato adds four such surfaces — voice, video, speech, and document — alongside its existing image and audio displays. The full reference is Multimodal-Agent Evaluation.

[Figure: Each modality gets its own review surface: voice, video, speech, and document. A plain text widget cannot express a barge-in, an event interval, or a table cell.]

How do I evaluate a voice agent's turn-taking?

Spoken agents break at the boundaries: cutting the user off, talking over them, or pausing so long the user gives up. The voice_interaction schema lays the conversation out as a dual-track timeline — a user lane and an agent lane — and highlights the overlap regions where both speak at once (Full-Duplex-Bench, 2025). You classify each overlap and rate the overall turn-taking; the audio plays inline when provided.

[Figure: Dual-track voice timeline with barge-in detection and turn-taking scoring, with a barge-in region highlighted]

yaml
annotation_schemes:
  - annotation_type: voice_interaction
    name: turn_taking
    description: "Classify each barge-in/overlap and rate the overall turn-taking."
    turns_key: turns
    speaker_key: speaker
    user_speakers: [user, human, caller]
    overlap_labels: [agent_should_respond, agent_should_resume, backchannel, uncertain]
    rating_scale: 5

The overlaps are computed from the turn timings at render time, so a full-duplex conversation that a flat transcript would reduce to "they both said things" becomes a set of concrete, labelable moments.
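To make the overlap idea concrete, here is a minimal Python sketch of how barge-in regions could be derived from turn timings. It is not Potato's implementation: the `Turn` class, the `find_overlaps` helper, and the `start`/`end` fields (in seconds) are assumptions for illustration. Only the `speaker` key and the user-speaker list mirror the config above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Turn:
    speaker: str
    start: float  # seconds; field name is an assumption, not Potato's schema
    end: float    # seconds; field name is an assumption, not Potato's schema


# Mirrors user_speakers in the config; any other speaker counts as the agent.
USER_SPEAKERS = {"user", "human", "caller"}


def find_overlaps(turns: List[Turn]) -> List[Tuple[float, float]]:
    """Return (start, end) spans where a user turn and an agent turn overlap."""
    users = [t for t in turns if t.speaker in USER_SPEAKERS]
    agents = [t for t in turns if t.speaker not in USER_SPEAKERS]
    overlaps = []
    for u in users:
        for a in agents:
            start = max(u.start, a.start)
            end = min(u.end, a.end)
            if start < end:  # touching boundaries are not an overlap
                overlaps.append((start, end))
    return sorted(overlaps)


turns = [
    Turn("user", 0.0, 2.5),
    Turn("agent", 2.0, 5.0),   # agent starts before the user finishes
    Turn("user", 4.2, 6.0),    # user barges in on the agent
]
print(find_overlaps(turns))  # [(2.0, 2.5), (4.2, 5.0)]
```

Each returned span corresponds to one moment an annotator would classify with a label such as `backchannel` or `agent_should_resume`.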

How do I score a video agent's temporal grounding?

A video agent's answer to "when does the goal happen?" is an interval, so you score it as one. The temporal_grounding schema gives you a scrubber where you mark the gold [start, end] interval for each event prompt, either by capturing the playhead or by typing seconds. When the data carries the model's predicted interval, a live intersection-over-union (IoU) score and a two-bar mini-timeline update as you adjust (TimeScope, 2025).

[Figure: Video scrubber for marking gold event intervals, with a live IoU readout against the model's prediction]

yaml
annotation_schemes:
  - annotation_type: temporal_grounding
    name: grounding
    description: "Mark the gold start/end interval for each event. IoU vs prediction updates live."
    video_key: video
    events_key: events

This is built for predicted-versus-gold localization, which is a different job from general segment labeling: you are scoring how close the model's span is to the truth, and seeing the IoU move as you drag the boundary makes that immediate.
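For reference, the IoU of two time intervals is the length of their intersection divided by the length of their union. The sketch below shows that standard calculation in Python. The function name `temporal_iou` and the tuple representation are illustrative assumptions, not Potato's internal code.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds), start <= end


def temporal_iou(gold: Interval, pred: Interval) -> float:
    """Intersection over union of two 1-D time intervals."""
    g0, g1 = gold
    p0, p1 = pred
    # Overlap length is zero when the intervals are disjoint.
    inter = max(0.0, min(g1, p1) - max(g0, p0))
    union = (g1 - g0) + (p1 - p0) - inter
    # Guard against two zero-length intervals.
    return inter / union if union > 0 else 0.0


# Gold event at 10-20 s, model predicts 15-25 s: 5 s overlap over a 15 s union.
print(round(temporal_iou((10.0, 20.0), (15.0, 25.0)), 3))  # 0.333
```

Dragging the gold boundary toward the prediction grows the intersection and shrinks the union, which is why the score moves as you adjust.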

What about speech transcripts, reasoning, and tables?

Three more surfaces cover the rest of the multimodal spread:

  • Speech transcripts (speech_transcript): each time-aligned segment is a card; you tag ASR/TTS errors, mispronunciations, and disfluencies and correct the text inline (Speak & Improve, 2025). This is the segment-level complement to the turn-taking view.
  • Interleaved reasoning (multimodal_reasoning): a text-image-tool trace rendered as typed blocks; you rate each step's coherence and flag visual hallucinations where the reasoning does not follow from the image (Multimodal RewardBench 2, 2025).
  • Document tables (table_grid): you set the grid dimensions and click cells to mark their role — data, column header, row header, empty — capturing the structure that bounding boxes cannot.

[Figure: Speech-transcript segments with per-segment ASR/TTS/pronunciation error tags and inline transcript correction]

yaml
annotation_schemes:
  - annotation_type: speech_transcript
    name: speech_errors
    description: "Tag speech errors on each segment and correct the transcript where needed."
    segments_key: segments
    error_types: [asr_error, tts_artifact, mispronunciation, disfluency]
    allow_correction: true

[Figure: Interleaved text-image-tool reasoning trace, rated per step for coherence, with a flagged visual hallucination]

Several of these schemas can run on the same task, so a single document-agent run can be scored for table structure and reasoning coherence at once.

[Figure: Document-table image with cells marked as column headers, row headers, data, and empty]

How do I set this up?

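Each surface ships a runnable example under examples/agent-traces/. The launch command below uses repository-relative paths, so it assumes you run it from the root of a Potato checkout: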

bash
pip install --upgrade potato-annotation
python potato/flask_server.py start examples/agent-traces/temporal-grounding/config.yaml -p 8000

Your data drops in as turns, segments, or events with timestamps; the surface derives its timeline from them at render time. For GUI and OS agents, the companion piece is Evaluating Computer-Use Agents.

Further reading