# Evaluating Voice and Video Agents

Source: https://www.potatoannotator.com/blog/evaluating-voice-and-video-agents

**Agents that talk, watch video, and read documents fail in ways a text box cannot show. A voice agent's mistakes live at the seams between turns; a video agent's answer is a time interval, not a sentence; a document agent's error is a misread table cell. Each of these needs a review surface shaped to the modality.** Potato adds four such surfaces — voice, video, speech, and document — alongside its existing [image](/docs/features/image-annotation) and [audio](/docs/features/audio-annotation) displays. The full reference is [Multimodal-Agent Evaluation](/docs/agent-evaluation/multimodal-agent-evaluation).

![Each modality gets its own review surface: voice, video, speech, and document](/images/blog/multimodal-surface-map.svg "A plain text widget cannot express a barge-in, an event interval, or a table cell")

## How do I evaluate a voice agent's turn-taking?

Spoken agents break at the boundaries: cutting the user off, talking over them, or pausing so long the user gives up. The `voice_interaction` schema lays the conversation out as a **dual-track timeline** — a user lane and an agent lane — and highlights the overlap regions where both speak at once ([Full-Duplex-Bench, 2025](https://arxiv.org/abs/2503.04721)). You classify each overlap and rate the overall turn-taking; the audio plays inline when provided.

![A dual-track voice timeline with a highlighted barge-in region](/images/docs/agent-voice-interaction.png "Dual-track voice timeline with barge-in detection and turn-taking scoring")

```yaml
annotation_schemes:
  - annotation_type: voice_interaction
    name: turn_taking
    description: "Classify each barge-in/overlap and rate the overall turn-taking."
    turns_key: turns
    speaker_key: speaker
    user_speakers: [user, human, caller]
    overlap_labels: [agent_should_respond, agent_should_resume, backchannel, uncertain]
    rating_scale: 5
```

The overlaps are computed from the turn timings at render time, so a [full-duplex](https://en.wikipedia.org/wiki/Duplex_(telecommunications)) conversation that a flat transcript would reduce to "they both said things" becomes a set of concrete, labelable moments.
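
To make "computed from the turn timings" concrete, here is a minimal sketch of overlap detection. It is not Potato's implementation: the `start` and `end` field names (seconds) and the `find_overlaps` helper are assumptions made for illustration, while the speaker names mirror `user_speakers` in the config above.

```python
from typing import Dict, List, Tuple

# Mirrors user_speakers in the config above; any other speaker counts as the agent.
USER_SPEAKERS = {"user", "human", "caller"}


def find_overlaps(turns: List[Dict]) -> List[Tuple[float, float]]:
    """Return (start, end) spans where a user turn and an agent turn overlap.

    Assumes each turn carries "speaker", "start", and "end" (seconds).
    The "start"/"end" names are hypothetical, not a documented Potato format.
    """
    user = [t for t in turns if t["speaker"] in USER_SPEAKERS]
    agent = [t for t in turns if t["speaker"] not in USER_SPEAKERS]
    overlaps = []
    for u in user:
        for a in agent:
            start = max(u["start"], a["start"])
            end = min(u["end"], a["end"])
            if start < end:  # keep only spans with positive duration
                overlaps.append((start, end))
    return sorted(overlaps)


turns = [
    {"speaker": "user", "start": 0.0, "end": 2.5},
    {"speaker": "agent", "start": 2.0, "end": 5.0},  # starts before the user finishes
    {"speaker": "user", "start": 6.0, "end": 7.0},
]
print(find_overlaps(turns))  # [(2.0, 2.5)]
```

Each returned span is one moment an annotator would label, for example as `backchannel` or `agent_should_resume`.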

## How do I score a video agent's temporal grounding?

A video agent's answer to "when does the goal happen?" is an interval, so you score it as one. The `temporal_grounding` schema gives you a scrubber where you mark the gold `[start, end]` for each event prompt, either by capturing the playhead or by typing seconds. When the data carries the model's predicted interval, a live **[IoU](https://en.wikipedia.org/wiki/Jaccard_index)** (intersection over union) and a two-bar mini-timeline update as you adjust ([TimeScope, 2025](https://arxiv.org/abs/2509.26360)).

![A video scrubber with a gold interval and a live IoU readout](/images/docs/agent-temporal-grounding.png "Mark gold event intervals on video with a live IoU vs. the model's prediction")

```yaml
annotation_schemes:
  - annotation_type: temporal_grounding
    name: grounding
    description: "Mark the gold start/end interval for each event. IoU vs prediction updates live."
    video_key: video
    events_key: events
```

This is built for predicted-versus-gold localization, which is a different job from general segment labeling: you are scoring how close the model's span is to the truth, and seeing the IoU move as you drag the boundary makes that immediate.
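
For reference, the IoU of two time intervals is the length of their intersection divided by the length of their union. The sketch below is the standard one-dimensional formula, not code taken from Potato; the `temporal_iou` name and the `(start, end)` tuple layout are assumptions for illustration.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(gold: Interval, pred: Interval) -> float:
    """Intersection over union of two time intervals; 0.0 if they are disjoint."""
    intersection = max(0.0, min(gold[1], pred[1]) - max(gold[0], pred[0]))
    union = (gold[1] - gold[0]) + (pred[1] - pred[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Gold event spans 10s to 20s; the model predicted 12s to 20s.
print(temporal_iou((10.0, 20.0), (12.0, 20.0)))  # 0.8
```

Dragging the gold boundary changes both the intersection and the union, which is why the score can move sharply when the two spans barely overlap.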

## What about speech transcripts, reasoning, and tables?

Three more surfaces cover the rest of the multimodal spread:

- **Speech transcripts** (`speech_transcript`): each time-aligned segment is a card; you tag ASR/TTS errors, mispronunciations, and disfluencies, then correct the text inline ([Speak & Improve, 2025](https://arxiv.org/abs/2412.11986)). This is the segment-level complement to the turn-taking view.
- **Interleaved reasoning** (`multimodal_reasoning`): a text-image-tool trace rendered as typed blocks; you rate each step's coherence and flag visual hallucinations where the reasoning does not follow from the image ([Multimodal RewardBench 2, 2025](https://arxiv.org/abs/2512.16899)).
- **Document tables** (`table_grid`): you set the grid dimensions and click cells to mark their role — data, column header, row header, empty — capturing the structure that bounding boxes cannot.

![Speech-transcript segments with per-segment error tags and inline correction](/images/docs/agent-speech-transcript.png "Tag ASR/TTS/pronunciation errors per segment and correct the transcript inline")

```yaml
annotation_schemes:
  - annotation_type: speech_transcript
    name: speech_errors
    description: "Tag speech errors on each segment and correct the transcript where needed."
    segments_key: segments
    error_types: [asr_error, tts_artifact, mispronunciation, disfluency]
    allow_correction: true
```

![Interleaved reasoning trace with a flagged visual hallucination](/images/docs/agent-multimodal-reasoning.png "Rate each step of a text-image-tool reasoning trace for coherence and visual hallucination")

Several of these schemas can run on the same task, so a single document-agent run can be scored for table structure and reasoning coherence at once.

![A table image with cells marked as headers, data, and empty](/images/docs/agent-table-grid.png "Annotate document-table cell structure: column and row headers, data, and empty cells")

## How do I set this up?

Each surface ships a runnable example under `examples/agent-traces/`:

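```bash
# Install or upgrade the Potato package
pip install --upgrade potato-annotation
# Start the temporal-grounding example on port 8000.
# The relative paths assume you run this from a directory that contains potato/ and examples/.
python potato/flask_server.py start examples/agent-traces/temporal-grounding/config.yaml -p 8000
```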

Your data drops in as turns, segments, or events with timestamps; the surface derives its timeline from them at render time. For GUI and OS agents, the companion piece is [Evaluating Computer-Use Agents](/blog/computer-use-agent-evaluation).
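
Because every surface derives its timeline from timestamps, it can help to sanity-check them before loading a file. The sketch below is one way to do that under stated assumptions: the `start`/`end` field names, the record layout, and the `timestamp_problems` helper are all hypothetical and are not part of Potato's API.

```python
from typing import Dict, List


def timestamp_problems(record: Dict, list_key: str) -> List[str]:
    """List readable problems with the timed items stored under record[list_key].

    Assumes each item has numeric "start" and "end" fields in seconds
    (hypothetical names, chosen only for this illustration).
    """
    problems = []
    for i, item in enumerate(record.get(list_key, [])):
        start, end = item.get("start"), item.get("end")
        if start is None or end is None:
            problems.append(f"{list_key}[{i}]: missing start or end")
        elif start < 0 or end <= start:
            problems.append(f"{list_key}[{i}]: bad interval [{start}, {end}]")
    return problems


record = {
    "id": "call-001",  # hypothetical record id
    "turns": [
        {"speaker": "user", "start": 0.0, "end": 2.5},
        {"speaker": "agent", "start": 2.0, "end": 1.0},  # ends before it starts
    ],
}
print(timestamp_problems(record, "turns"))
# ['turns[1]: bad interval [2.0, 1.0]']
```

The same check applies to segments or events by passing a different `list_key`.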

## Further reading

- [Multimodal-Agent Evaluation](/docs/agent-evaluation/multimodal-agent-evaluation) — the full schema reference
- [Evaluating Computer-Use and Multimodal Agents](/docs/guides/evaluating-computer-use-agents) — the guide, with a schema-selection table
- [Evaluating Computer-Use Agents, Step by Step](/blog/computer-use-agent-evaluation) — the GUI and OS half of the multimodal surfaces
- [Potato 2.6.2: A Complete Open-Source Agent-Evaluation Suite](/blog/potato-2-6-2-agent-evaluation-suite) — everything in the 2.6.x line