Web-Agent Evaluation

How to evaluate web-browsing agents with Potato's web agent display: review screenshots with action overlays and rate the correctness of each web action, step by step.

A web agent completes tasks by browsing, clicking, typing, and scrolling across pages. Evaluating one means looking at what it saw (the screenshot) and what it did (the action) at each step, and judging whether that action was right. Potato renders the screenshots with visual overlays of each action so annotators can review a browsing session like a filmstrip.

This is the human-evaluation counterpart to benchmarks like WebArena and Mind2Web. See Web Agent Annotation.

What the annotator sees

Potato's web agent display shows, for each step:

  • the screenshot of the page at that moment,
  • an overlay marking the action: a circle where it clicked, a box on the field it typed into, or an arrow for a scroll,
  • the action description and any target element,
  • a filmstrip to move between steps.
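
The pairing of action and overlay described above can be sketched in a few lines of Python. The `Step` fields and the shape names here are hypothetical, chosen only for illustration; they are not Potato's trace schema (see Web Agent Annotation for that).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical step record; field names are illustrative, not Potato's schema.
@dataclass
class Step:
    screenshot: str                                   # path to the page image at this step
    action_type: str                                  # "click", "type", "scroll", or "navigate"
    description: str                                  # human-readable action description
    point: Optional[Tuple[int, int]] = None           # click position (x, y)
    box: Optional[Tuple[int, int, int, int]] = None   # typed-into field (x, y, w, h)

def overlay_for(step: Step) -> Optional[dict]:
    """Return a drawable overlay for a step, or None if there is nothing to mark."""
    if step.action_type == "click" and step.point is not None:
        return {"shape": "circle", "center": step.point}
    if step.action_type == "type" and step.box is not None:
        return {"shape": "box", "rect": step.box}
    if step.action_type == "scroll":
        return {"shape": "arrow"}
    return None  # e.g. a navigation has no on-page target to mark

step = Step("step_03.png", "click", "Click the 'Add to cart' button", point=(412, 230))
print(overlay_for(step))
```

A step with no markable target (here, anything that is not a click, type, or scroll) simply gets no overlay, so the annotator still sees the screenshot and the action description.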

What to judge per step

  • Right target? Did it click/type on the correct element?
  • Right action type? Click vs. type vs. scroll vs. navigate.
  • Progress? Did the step move the task forward or waste a turn?

```yaml
annotation_schemes:
  - annotation_type: per_turn_rating
    name: web_action_correctness
    description: "Judge each browsing action against the task."
    target: agentic_steps
    rating_type: radio
    labels: ["Correct", "Wrong target", "Wrong action", "No progress"]
```
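
Once a session has been rated, the per-step labels can be tallied into a simple summary. The sketch below assumes the ratings for one session are already available as a flat list of label strings, one per step; how Potato exports them is not covered here, and `summarize` is a hypothetical helper rather than part of Potato.

```python
from collections import Counter

# The four labels from the web_action_correctness scheme above.
LABELS = ["Correct", "Wrong target", "Wrong action", "No progress"]

def summarize(step_labels):
    """Count each label and compute the share of steps rated "Correct"."""
    unknown = set(step_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(step_labels)
    total = len(step_labels)
    return {
        "counts": {label: counts.get(label, 0) for label in LABELS},
        "step_accuracy": counts["Correct"] / total if total else 0.0,
    }

# Illustrative ratings for a five-step session (made-up data).
ratings = ["Correct", "Correct", "Wrong target", "Correct", "No progress"]
summary = summarize(ratings)
print(summary["counts"])
print(summary["step_accuracy"])
```

Keeping the per-label counts alongside the accuracy shows whether errors come mostly from wrong targets, wrong action types, or wasted turns, which a single number would hide.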

Setting up the display

Point Potato at a web-agent trace (screenshots plus actions) and enable the web agent display. Traces can come from WebArena/VisualWebArena exports or your own runs in HAR-plus-screenshot form. See Web Agent Annotation for the trace schema.
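
Before loading a trace, it can help to confirm that every step actually has a screenshot and an action. The following is a minimal sketch under stated assumptions: it treats a trace as a list of dicts with hypothetical `screenshot` and `action` keys, which may not match the real trace schema documented in Web Agent Annotation.

```python
from pathlib import Path

def check_trace(steps, root="."):
    """Return a list of human-readable problems found in a trace.

    `steps` is assumed to be a list of dicts with "screenshot" (a path
    relative to `root`) and "action" keys; both key names are hypothetical.
    """
    problems = []
    for i, step in enumerate(steps, start=1):
        shot = step.get("screenshot")
        if not shot:
            problems.append(f"step {i}: no screenshot")
        elif not (Path(root) / shot).is_file():
            problems.append(f"step {i}: screenshot not found: {shot}")
        if not step.get("action"):
            problems.append(f"step {i}: no action")
    return problems

# Illustrative trace with two deliberately broken steps.
trace = [
    {"action": "click"},
    {"screenshot": "no_such_screenshot_0002.png"},
]
for problem in check_trace(trace):
    print(problem)
```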

Quality considerations

  • Screenshots must be legible: set a sensible max width and keep overlays from hiding the target.
  • Long sessions fatigue annotators; the filmstrip and step numbers help them keep place.
  • For overall task success, add a trajectory-level label on top of the per-step ratings. See Annotating Agent Trajectories.
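
Capping the screenshot width only works if the overlay coordinates are scaled by the same factor, otherwise the marker drifts off the target. A minimal sketch of that arithmetic, with hypothetical helper names and made-up dimensions:

```python
def scale_to_max_width(width, height, max_width):
    """Return (new_width, new_height, factor), never upscaling the image."""
    factor = min(1.0, max_width / width)
    return round(width * factor), round(height * factor), factor

def scale_point(point, factor):
    """Scale an overlay coordinate by the same factor as the screenshot."""
    x, y = point
    return round(x * factor), round(y * factor)

# A 1920x1080 screenshot shown at a max width of 960 pixels.
new_w, new_h, factor = scale_to_max_width(1920, 1080, 960)
print(new_w, new_h)                      # 960 540
print(scale_point((412, 230), factor))   # (206, 115)
```

Screenshots already narrower than the limit keep a factor of 1.0, so their overlays are left where they are.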

Further reading