# Evaluating Computer-Use and Multimodal Agents

Source: https://www.potatoannotator.com/docs/guides/evaluating-computer-use-agents

**A computer-use agent controls a graphical interface the way a person would: it reads a screenshot, decides on an action (click, type, scroll), and acts. Evaluating one means checking, step by step, whether each action was right and whether the click actually landed on the intended element, not just whether the task eventually succeeded.** Potato is an open-source tool for human evaluation of computer-use, GUI, voice, video, and document agents, with [annotation surfaces purpose-built for each modality](/docs/agent-evaluation/multimodal-agent-evaluation).

A [computer-use agent](https://en.wikipedia.org/wiki/Software_agent) (also called a GUI or OS agent) perceives the screen as pixels or a DOM and acts through the same controls a user has. Benchmarks like [OSWorld](https://arxiv.org/abs/2404.07972), ScreenSpot, and AndroidWorld score task success or element grounding automatically. Human review adds what automation misses: the action that produced the right outcome by luck, or the click that hit the wrong button but still advanced the task.

## What do you judge in a computer-use trajectory?

Each step pairs a **screenshot** (what the agent saw) with an **action** (what it did). The annotator judges the action and, when the step carries click coordinates, checks the grounding marker on the screenshot:

- **Action correctness**: correct, wrong element, wrong action, or hallucinated.
- **Click grounding**: did the coordinates land on the element the action named?
- **Outcome**: did the run complete the task, and at which step did it first go wrong?

```yaml
annotation_schemes:
  - annotation_type: gui_trajectory
    name: gui_review
    description: "For each step: was the action correct and did the click land right?"
    steps_key: steps
    screenshot_key: screenshot
    action_key: action
    coord_space: normalized
    verdict_options: [correct, wrong_element, wrong_action, hallucinated]
```
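When a step also records the target element's bounding box, the click-grounding judgment can be pre-checked in code before a human looks at it. The following is a minimal sketch, not part of Potato: it assumes that `coord_space: normalized` means coordinates in `[0, 1]` relative to the screenshot, and the `click_in_box` and `to_pixels` helpers and the `(left, top, right, bottom)` box layout are hypothetical names chosen for illustration.

```python
from typing import Tuple

# Hypothetical box layout: (left, top, right, bottom), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def click_in_box(x: float, y: float, box: Box) -> bool:
    """Return True if the normalized click (x, y) lies inside box (edges inclusive)."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def to_pixels(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized click to pixel coordinates on a width x height screenshot."""
    return round(x * width), round(y * height)


if __name__ == "__main__":
    button = (0.45, 0.28, 0.60, 0.34)  # assumed target element
    print(click_in_box(0.52, 0.31, button))   # click inside the button
    print(click_in_box(0.70, 0.31, button))   # click to the right of it
    print(to_pixels(0.5, 0.25, 1920, 1080))
```

A check like this only covers the geometric part of grounding. Whether the named element was the right one to click is still the annotator's call.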

Catching the first wrong step matters more than a single pass/fail, because that step is what you would fix or train against; see [Process Reward Models](/docs/guides/process-reward-models).
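Locating the first wrong step from per-step verdicts is a short scan. This sketch assumes the verdicts for one trajectory are available as a list of strings in step order, using the labels from `verdict_options`; the export shape and the `first_wrong_step` function name are assumptions for illustration, not Potato's API.

```python
from typing import Optional, Sequence


def first_wrong_step(verdicts: Sequence[str]) -> Optional[int]:
    """Return the 0-based index of the first verdict that is not "correct".

    Returns None when every step was judged correct.
    """
    for index, verdict in enumerate(verdicts):
        if verdict != "correct":
            return index
    return None


if __name__ == "__main__":
    run = ["correct", "correct", "wrong_element", "correct", "hallucinated"]
    print(first_wrong_step(run))
```

Steps after the first error are often judged against a state the agent should never have reached, so this index tends to be a cleaner training signal than a count of all wrong steps.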

## How do I evaluate a voice agent's turn-taking?

Spoken agents fail at the seams between turns: cutting the user off, talking over them, or pausing too long. The [`voice_interaction`](/docs/agent-evaluation/multimodal-agent-evaluation#voice--full-duplex-interaction-voice_interaction) schema lays the conversation out as a dual-track timeline and highlights overlap regions where both speakers talk at once. The annotator classifies each overlap (the agent should respond, the agent should resume, it was a backchannel, or it is unclear) and then rates the overall turn-taking. This is the [full-duplex](https://en.wikipedia.org/wiki/Duplex_(telecommunications)) view that a flat transcript cannot express.
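To make "overlap region" concrete, here is one way such regions could be derived from two tracks of speech segments. It is a sketch under stated assumptions: each track is a list of `(start_seconds, end_seconds)` tuples, and `overlap_regions` is a hypothetical helper, not how Potato computes its highlights.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def overlap_regions(user: List[Segment], agent: List[Segment]) -> List[Segment]:
    """Return the intervals where a user segment and an agent segment overlap."""
    overlaps: List[Segment] = []
    for u_start, u_end in user:
        for a_start, a_end in agent:
            start = max(u_start, a_start)
            end = min(u_end, a_end)
            if start < end:  # keep only overlaps with positive duration
                overlaps.append((start, end))
    return sorted(overlaps)


if __name__ == "__main__":
    user_track = [(0.0, 2.0), (5.0, 7.0)]
    agent_track = [(1.5, 4.0), (6.5, 8.0)]
    print(overlap_regions(user_track, agent_track))
```

The pairwise loop is quadratic in the number of segments, which is fine for a single conversation. A sweep over sorted segments would scale better on long recordings.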

## How do I score video and document agents?

- **Video temporal grounding**: for each event prompt, mark the gold `[start, end]` interval; when the data includes a model's predicted interval, a live [IoU](https://en.wikipedia.org/wiki/Jaccard_index) score updates as you adjust the gold interval, so you score localization directly.
- **Speech transcripts**: tag ASR/TTS errors segment by segment and correct the text inline.
- **Document tables**: mark the cell structure (column headers, row headers, data, empty) that bounding boxes cannot capture.
- **Interleaved reasoning**: rate each step of a text-image-tool trace for coherence and flag visual hallucinations.

Each is a separate schema in the [multimodal-agent reference](/docs/agent-evaluation/multimodal-agent-evaluation), and several can run on the same task.
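The IoU mentioned for video temporal grounding is the standard intersection-over-union of two intervals: the length of their overlap divided by the length of their union. A minimal sketch follows, assuming intervals are `(start, end)` pairs in seconds with `start <= end`; the `temporal_iou` name is illustrative rather than taken from Potato.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds), start <= end


def temporal_iou(gold: Interval, pred: Interval) -> float:
    """Intersection-over-union of two time intervals, in [0, 1]."""
    gold_start, gold_end = gold
    pred_start, pred_end = pred
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(gold_end, pred_end) - max(gold_start, pred_start))
    union = (gold_end - gold_start) + (pred_end - pred_start) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # 2 s overlap over a 6 s union
    print(temporal_iou((0.0, 1.0), (3.0, 4.0)))  # disjoint intervals
```

In the first call the intervals share 2 seconds out of a 6-second union, so the score is one third. Identical intervals score 1.0 and disjoint ones score 0.0.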

## Which schema should I use?

| Agent type | Schema | What you label |
|---|---|---|
| Computer-use / GUI | `gui_trajectory` | Action correctness + click grounding |
| Voice / spoken | `voice_interaction` | Barge-in handling and turn-taking |
| Video | `temporal_grounding` | Gold event intervals vs. prediction (IoU) |
| Speech transcript | `speech_transcript` | ASR/TTS errors per segment |
| Document / table | `table_grid` | Cell-structure roles |
| Multimodal reasoning | `multimodal_reasoning` | Step coherence and visual hallucination |

## Further reading

- [Multimodal-Agent Evaluation](/docs/agent-evaluation/multimodal-agent-evaluation) — the full schema reference
- [Web-Agent Evaluation](/docs/guides/web-agent-evaluation) — screenshot-and-action web agents
- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents) — the levels of agent evaluation
- [How to Evaluate Multi-Agent Systems](/docs/guides/evaluating-multi-agent-systems)
