# Web-Agent Evaluation

Source: https://www.potatoannotator.com/docs/guides/web-agent-evaluation

**A web agent completes tasks by browsing, clicking, typing, and scrolling across pages. Evaluating one means looking at what it saw (the screenshot) and what it did (the action) at each step, and judging whether that action was right.** Potato renders the screenshots with visual overlays of each action so annotators can review a browsing session like a filmstrip.

This is the human-evaluation counterpart to benchmarks like [WebArena](https://webarena.dev/) and Mind2Web. See [Web Agent Annotation](/docs/features/web-agent-annotation).

## What the annotator sees

Potato's web agent display shows, for each step:

- the **screenshot** of the page at that moment,
- an **overlay** marking the action: a circle where it clicked, a box on the field it typed into, or an arrow for a scroll,
- the **action description** and any target element,
- a **filmstrip** to move between steps.

## What to judge per step

- **Right target?** Did it click or type on the correct element?
- **Right action type?** Click vs. type vs. scroll vs. navigate.
- **Progress?** Did the step move the task forward or waste a turn?

```yaml
annotation_schemes:
  - annotation_type: per_turn_rating
    name: web_action_correctness
    description: "Judge each browsing action against the task."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Wrong target", "Wrong action", "No progress"]
```

## Setting up the display

Point Potato at a web-agent trace (screenshots plus actions) and enable the web agent display. Traces can come from WebArena/VisualWebArena exports or your own runs in HAR-plus-screenshot form. See [Web Agent Annotation](/docs/features/web-agent-annotation) for the trace schema.
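
Before loading your own runs, it can help to sanity-check that every step has a screenshot and a recognizable action. The sketch below is illustrative only: `TraceStep`, its field names, and `check_trace` are hypothetical and do not reflect Potato's actual trace schema, which is documented in the feature reference. The action types are the four named earlier in this guide.

```python
from dataclasses import dataclass
from typing import List, Optional

# Action types taken from the per-step criteria in this guide.
ACTION_TYPES = {"click", "type", "scroll", "navigate"}


@dataclass
class TraceStep:
    # All field names here are hypothetical, not Potato's trace schema.
    screenshot: str               # path or URL of the screenshot for this step
    action_type: str              # expected to be one of ACTION_TYPES
    description: str = ""         # human-readable action description
    target: Optional[str] = None  # target element, if any


def check_trace(steps: List[TraceStep]) -> List[str]:
    """Return human-readable problems; an empty list means no issues found."""
    problems = []
    if not steps:
        problems.append("trace has no steps")
    for i, step in enumerate(steps, start=1):
        if not step.screenshot:
            problems.append(f"step {i}: missing screenshot")
        if step.action_type not in ACTION_TYPES:
            problems.append(f"step {i}: unknown action type {step.action_type!r}")
    return problems


trace = [
    TraceStep("shots/001.png", "click", "Click the search box", "input#q"),
    TraceStep("", "hover", "Hover over the menu"),
]
problems = check_trace(trace)
print(problems)
```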

## Quality considerations

- Screenshots must be legible: set a sensible max width and keep overlays from hiding the target.
- Long sessions fatigue annotators; the filmstrip and step numbers help them keep place.
- For overall task success, add a trajectory-level label on top of the per-step ratings. See [Annotating Agent Trajectories](/docs/guides/agent-trajectory-annotation).
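
On the first point above: if you resize screenshots yourself before annotation, any action coordinates have to be scaled by the same factor, or the overlay will land on the wrong element. The following is a minimal arithmetic sketch under that assumption. The function names are hypothetical, and Potato may handle scaling on its own when it renders the display.

```python
def fit_to_max_width(width, height, max_width):
    """Return (new_width, new_height, scale), shrinking only, never upscaling."""
    if width <= 0 or height <= 0 or max_width <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_width / width)
    return round(width * scale), round(height * scale), scale


def scale_point(x, y, scale):
    """Map an action coordinate from the original screenshot to the resized one."""
    return round(x * scale), round(y * scale)


# A 1920x1080 screenshot capped at 960 px wide, with a click at (1200, 300).
new_w, new_h, scale = fit_to_max_width(1920, 1080, max_width=960)
click = scale_point(1200, 300, scale)
print(new_w, new_h, scale, click)
```

Capping the scale at 1.0 keeps small screenshots at their native size, since upscaling adds blur without adding detail.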

## Further reading

- [Web Agent Annotation feature reference](/docs/features/web-agent-annotation)
- [Live Agent Evaluation](/docs/guides/live-agent-evaluation)
- [How to Evaluate AI Agents](/docs/guides/evaluating-ai-agents)