एजेंटिक एनोटेशन

13 trace format converters, 5 display types, और tool-use, web-browsing, coding व chat agents के लिए pre-built schemas के साथ Potato में AI agents का मूल्यांकन करें। PRM और rubric मूल्यांकन शामिल है।

v2.3.0 में नया

Agentic annotation: a raw trace is converted, shown as thought, action, and observation cards, annotated at the step and trace level, and exported How agentic annotation works

AI agents को तेजी से जटिल बहु-चरणीय कार्यों के लिए तैनात किया जा रहा है: web browsing, code लिखना, APIs कॉल करना, और sub-agents को orchestrate करना। लेकिन यह मूल्यांकन करना कि agent ने वास्तव में सही काम किया या नहीं, इसके लिए उस स्तर की मानवीय निर्णय शक्ति की आवश्यकता होती है जिसे पारंपरिक annotation tools समर्थित नहीं कर सकते। एक single agent trace में दर्जनों steps, tool calls, intermediate reasoning, screenshots, और branching decisions हो सकते हैं। Annotators को यह सब context देखना होगा, इसे कुशलतापूर्वक नेविगेट करना होगा, और trace स्तर और individual step स्तर दोनों पर structured evaluations प्रदान करनी होगी।

Potato की agentic annotation प्रणाली इसे चार क्षमताओं के साथ संबोधित करती है:

13 trace format converters जो किसी भी प्रमुख framework से agent logs को एकीकृत format में normalize करते हैं
5 specialized display types विभिन्न agent modalities (tool-use, web browsing, coding, chat, live observation) के लिए अनुकूलित
9 pre-built annotation schemas सबसे सामान्य agent evaluation dimensions को कवर करते हुए
4 purpose-built annotation types उन्नत मूल्यांकन के लिए: trajectory evaluation, rubric evaluation, pairwise comparison, और process reward annotation

Trace Format Converters

Agent traces framework के आधार पर बहुत अलग-अलग formats में आती हैं। Potato 13 converters के साथ आता है जो इन्हें एकीकृत internal representation में normalize करते हैं। आप config में converter निर्दिष्ट करते हैं, या Potato को format auto-detect करने देते हैं।

Converter Reference

Converter	Source Format	निकाले गए प्रमुख Fields
`openai`	OpenAI Assistants API / function calling logs	messages, tool_calls, function results
`anthropic`	Anthropic Claude tool_use / Messages API	content blocks, tool_use, tool_result
`swebench`	SWE-bench task traces	patch, test results, trajectory
`opentelemetry`	OpenTelemetry span exports (JSON)	spans, attributes, events, parent-child
`mcp`	Model Context Protocol sessions	tool definitions, call/response pairs
`multi_agent`	CrewAI / AutoGen / LangGraph multi-agent logs	agent roles, delegation, message passing
`langchain`	LangChain callback traces	chain runs, LLM calls, tool invocations
`langfuse`	LangFuse observation exports	generations, spans, scores
`react`	ReAct-style Thought/Action/Observation logs	thought, action, action_input, observation
`webarena`	WebArena / VisualWebArena trace JSON	actions, screenshots, DOM snapshots, URLs
`atif`	Agent Trace Interchange Format (ATIF)	steps, observations, metadata
`raw_web`	Raw browser recordings (HAR + screenshots)	requests, responses, screenshots, timings
`claude_code`	Claude Code / Aider / coding agents	tool_use blocks, diffs, terminal output

कॉन्फ़िगरेशन

अपने project config में converter निर्दिष्ट करें:

yaml

agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"

Trace file में प्रत्येक पंक्ति एक JSON object होनी चाहिए जिसमें raw agent trace हो। Converter बाकी सब संभालता है।

Multi-agent traces के लिए जहाँ विभिन्न agents अलग-अलग frameworks का उपयोग करते हैं, आप per-agent converters निर्दिष्ट कर सकते हैं:

yaml

agentic:
  enabled: true
  trace_converter: multi_agent
  trace_file: "data/multi_agent_traces.jsonl"
  multi_agent:
    agent_converters:
      planner: react
      coder: anthropic
      reviewer: openai

Auto-Detection

यदि आप अनिश्चित हैं कि कौन सा converter उपयोग करें, तो trace_converter: auto सेट करें:

yaml

agentic:
  enabled: true
  trace_converter: auto
  trace_file: "data/traces.jsonl"

Potato पहले 10 traces का निरीक्षण करता है और field signatures के आधार पर सबसे उपयुक्त converter चुनता है। यदि confidence 80% से कम है तो warning log किया जाता है, जिस स्थिति में आपको converter स्पष्ट रूप से निर्दिष्ट करना चाहिए।

Custom Converters

यदि आपका agent framework सूचीबद्ध नहीं है, तो आप Python converter लिख सकते हैं:

python

# converters/my_converter.py
from potato.agentic.base_converter import BaseTraceConverter
 
class MyConverter(BaseTraceConverter):
    name = "my_framework"
 
    def convert(self, raw_trace: dict) -> dict:
        steps = []
        for entry in raw_trace["log"]:
            steps.append({
                "type": entry.get("kind", "action"),
                "content": entry["text"],
                "timestamp": entry.get("ts"),
                "metadata": entry.get("extra", {}),
            })
        return {"steps": steps}

Config में register करें:

yaml

agentic:
  trace_converter: custom
  custom_converter: "converters/my_converter.py:MyConverter"

Display Types

Traces convert होने के बाद, Potato उन्हें पाँच specialized display types में से एक का उपयोग करके render करता है। प्रत्येक एक अलग agent modality के लिए अनुकूलित है।

1. Agent Trace Display

Tool-using agents (OpenAI function calling, Anthropic tool_use, ReAct, LangChain, आदि) के लिए default display। यह प्रत्येक step को step type के अनुसार color-coding के साथ card के रूप में render करता है।

yaml

agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace
 
  agent_trace_display:
    # Color coding for step types
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
      system: "#6b7280"
 
    # Collapsible sections
    collapse_observations: true
    collapse_threshold: 500    # characters before auto-collapsing
 
    # Step numbering
    show_step_numbers: true
    show_timestamps: true
 
    # Tool call rendering
    render_json: true          # pretty-print JSON arguments
    syntax_highlight: true     # highlight code in observations

विशेषताएँ:

Step cards बायीं ओर रंगीन बॉर्डर के साथ जो type (thought, action, observation, error) दर्शाता है
Collapsible sections लंबी observations या tool outputs के लिए (configurable threshold)
JSON pretty-printing tool call arguments और structured responses के लिए
Syntax highlighting observations में code blocks के लिए
Step timeline sidebar जो पूरे trace को एक नज़र में दिखाता है
Jump-to-step navigation लंबी traces के लिए

2. Web Agent Trace Display

Web browsing agents (WebArena, VisualWebArena, raw browser recordings) के लिए purpose-built। Screenshots को SVG overlays के साथ render करता है जो दिखाते हैं कि agent ने कहाँ क्लिक किया, टाइप किया, या scroll किया।

yaml

agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
 
  web_agent_display:
    # Screenshot rendering
    screenshot_max_width: 900
    screenshot_quality: 85
 
    # SVG overlay for agent actions
    overlay:
      enabled: true
      click_marker: "circle"       # circle, crosshair, or arrow
      click_color: "#ef4444"
      click_radius: 20
      type_highlight: "#3b82f6"    # highlight for text input fields
      scroll_indicator: true
 
    # Filmstrip view
    filmstrip:
      enabled: true
      thumbnail_width: 150
      show_action_labels: true
 
    # DOM snapshot display
    show_dom_snapshot: false        # optional raw DOM view
    show_url_bar: true
    show_action_description: true

विशेषताएँ:

Screenshot gallery पूर्ण-आकार में देखने और zoom के साथ
SVG overlays click targets (लाल वृत्त), text input regions (नीले highlights), और scroll directions दिखाते हुए
Filmstrip view सबसे नीचे जो सभी screenshots को त्वरित navigation के लिए thumbnails के रूप में दिखाती है
Action description प्रत्येक screenshot के नीचे text (उदा., "Click on 'Add to Cart' button")
URL bar जो प्रत्येक step पर वर्तमान page का URL दिखाती है
Before/after comparison उन steps के लिए जो page content को संशोधित करते हैं

3. Interactive Chat Display

Conversational agents और chatbots के मूल्यांकन के लिए। दो sub-modes का समर्थन करता है: live chat जहाँ annotators agent के साथ रियल टाइम में बातचीत करते हैं, और trace review जहाँ annotators एक रिकॉर्डेड conversation का मूल्यांकन करते हैं।

yaml

agentic:
  enabled: true
  display_type: interactive_chat
 
  interactive_chat_display:
    mode: trace_review         # or "live_chat"
 
    # Trace review settings
    trace_review:
      show_system_prompt: false
      show_token_counts: true
      show_latency: true
      message_grouping: turn    # "turn" or "message"
 
    # Live chat settings (when mode: live_chat)
    live_chat:
      proxy: openai             # agent proxy to use
      max_turns: 20
      timeout_seconds: 60
      show_typing_indicator: true
      allow_regenerate: true
 
    # Common settings
    show_role_labels: true
    role_colors:
      user: "#3b82f6"
      assistant: "#6E56CF"
      system: "#6b7280"
      tool: "#22c55e"

Trace review mode एक रिकॉर्डेड conversation को प्रति message वैकल्पिक token counts और latency के साथ render करता है। Annotators व्यक्तिगत turns या पूरी conversation को rate कर सकते हैं।

Live chat mode annotators को Agent Proxy System (नीचे देखें) के माध्यम से एक चल रहे agent से जोड़ता है। Annotators agent के साथ बातचीत करते हैं, फिर परिणामी conversation को annotate करते हैं।

4. Coding Trace Display

Coding agent sessions (Claude Code, Aider, SWE-Agent) के लिए purpose-built। Code diffs को syntax highlighting के साथ, terminal output को dark blocks में, और file reads को line numbers के साथ render करता है।

yaml

agentic:
  enabled: true
  trace_converter: claude_code
  display_type: coding_trace
 
  coding_trace_display:
    diff_style: unified           # unified or split
    terminal_theme: dark
    show_file_tree: true
    collapse_long_output: true
    collapse_threshold: 50        # lines
    show_line_numbers: true
    syntax_highlight: true

विशेषताएँ:

Unified diff view edit operations के लिए लाल/हरे highlighting के साथ
Dark terminal blocks bash/shell command output के लिए
Line-numbered code blocks file read operations के लिए
File tree sidebar session के दौरान छुई गई सभी files दिखाते हुए
Collapsible long outputs verbose terminal या file content के लिए

पूर्ण reference के लिए Coding Agent Annotation देखें।

5. Live Agent Display

AI agents का रियल-टाइम observation, मानवीय हस्तक्षेप के लिए नियंत्रणों के साथ। Web browsing agents और coding agents का समर्थन करता है।

yaml

agentic:
  enabled: true
  display_type: live_agent

विशेषताएँ:

Real-time streaming agent actions का Server-Sent Events के माध्यम से
Pause/Resume steps के बीच agent को रोकना/फिर शुरू करना
Send instructions task के बीच में agent को पुनर्निर्देशित करने के लिए
Take over मैन्युअल नियंत्रण लेना
Rollback किसी भी पिछले checkpoint पर (coding agents git-आधारित checkpoints का उपयोग करते हैं)
Branch and replay किसी भी checkpoint से अलग instructions के साथ

configuration विवरण के लिए Live Agent Evaluation और Live Coding Agent देखें।

Advanced Annotation Types

प्रति-turn ratings और pre-built schemas से आगे, Potato में structured agent evaluation के लिए चार purpose-built annotation types शामिल हैं।

Trajectory Evaluation (`trajectory_eval`)

पदानुक्रमित error taxonomies और severity scoring के साथ प्रति-step error localization। प्रत्येक step को एक correctness rating, error type, severity level, और वैकल्पिक rationale मिलता है। एक running score tracker severity के आधार पर घटता है।

yaml

annotation_schemes:
  - annotation_type: trajectory_eval
    name: step_eval
    error_taxonomy:
      reasoning:
        - logical_error
        - incorrect_assumption
      action:
        - wrong_tool
        - wrong_arguments
        - premature_termination
    severity_weights:
      minor: -1
      major: -5
      critical: -10

पूर्ण guide के लिए Trajectory Evaluation blog post देखें।

Rubric Evaluation (`rubric_eval`)

MT-Bench-शैली का multi-criteria grid मूल्यांकन। custom criteria और एक rating scale परिभाषित करें। Annotators प्रत्येक criterion को स्वतंत्र रूप से rate करते हैं।

yaml

annotation_schemes:
  - annotation_type: rubric_eval
    name: agent_rubric
    criteria:
      - name: correctness
        description: "Did the agent produce the correct result?"
      - name: efficiency
        description: "Did the agent take an efficient path?"
      - name: safety
        description: "Did the agent avoid unsafe actions?"
    scale: 5
    scale_labels:
      1: "Very Poor"
      3: "Acceptable"
      5: "Excellent"

setup निर्देशों के लिए Rubric Evaluation tutorial देखें।

Pairwise Comparison

दो agent traces की तीन modes में साथ-साथ तुलना करें:

Binary: A या B चुनने के लिए क्लिक करें (वैकल्पिक tie के साथ)
Scale: "A much better" से "B much better" तक slider
Multi-dimension: प्रति dimension स्वतंत्र A/B/tie, आवश्यक justification के साथ

yaml

annotation_schemes:
  - annotation_type: pairwise
    name: agent_comparison
    mode: multi_dimension
    dimensions:
      - correctness
      - efficiency
      - safety
    require_justification: true
    allow_tie: true

तीनों modes के लिए Pairwise Comparison guide देखें।

Process Reward Annotation

प्रति-step binary correctness annotation, process reward models को train करने के लिए अनुकूलित। दो modes: first-error (पहले गलत step पर क्लिक करें, बाकी auto-marked) और per-step (प्रत्येक को स्वतंत्र रूप से rate करें)।

yaml

annotation_schemes:
  - annotation_type: process_reward
    name: prm
    mode: first_error    # or per_step

पूर्ण reference के लिए Process Reward Annotation देखें।

Per-Turn Ratings

Dialogue और बहु-चरणीय मूल्यांकनों के लिए, आपको अक्सर समग्र trace के बजाय (या उसके अतिरिक्त) व्यक्तिगत turns पर ratings की आवश्यकता होती है। Potato किसी भी display type के लिए per-turn annotation का समर्थन करता है।

yaml

annotation_schemes:
  # Overall trace rating
  - annotation_type: likert
    name: overall_quality
    description: "Rate the overall quality of this agent trace"
    min: 1
    max: 5
    labels:
      1: "Very Poor"
      5: "Excellent"
 
  # Per-turn ratings
  - annotation_type: per_turn_rating
    name: step_correctness
    description: "Was this step correct?"
    target: agentic_steps        # binds to trace steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  - annotation_type: per_turn_rating
    name: step_explanation
    description: "Explain any issues with this step"
    target: agentic_steps
    rating_type: text
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

Per-turn ratings प्रत्येक step card के बगल में inline दिखाई देती हैं। conditional block आपको follow-up प्रश्न केवल तब दिखाने देता है जब कुछ निश्चित ratings चुनी जाती हैं, जिससे interface साफ-सुथरा रहता है।

Per-Turn Output Format

Per-turn annotations step indices के साथ सहेजी जाती हैं:

json

{
  "id": "trace_042",
  "annotations": {
    "overall_quality": 3,
    "step_correctness": {
      "0": "Correct",
      "1": "Correct",
      "2": "Incorrect",
      "3": "Correct"
    },
    "step_explanation": {
      "2": "The agent searched for the wrong product name"
    }
  }
}

Agent Proxy System

ऐसे live evaluation tasks के लिए जहाँ annotators किसी agent के साथ रियल टाइम में बातचीत करते हैं, Potato एक agent proxy layer प्रदान करता है। Proxy annotation interface और agent backend के बीच बैठता है, और बाद में समीक्षा के लिए पूरी conversation को log करता है।

yaml

agentic:
  enabled: true
  display_type: interactive_chat
 
  agent_proxy:
    type: openai                 # openai, http, or echo
 
    # OpenAI proxy
    openai:
      model: "gpt-4o"
      api_key: ${OPENAI_API_KEY}
      system_prompt: "You are a helpful customer service agent."
      temperature: 0.7
      max_tokens: 1024

Proxy Types

OpenAI proxy messages को एक OpenAI-संगत API पर forward करता है:

yaml

agent_proxy:
  type: openai
  openai:
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
    system_prompt: "You are a helpful assistant."
    temperature: 0.7

HTTP proxy messages को किसी भी HTTP endpoint (आपके अपने agent server) पर forward करता है:

yaml

agent_proxy:
  type: http
  http:
    url: "https://my-agent.example.com/chat"
    method: POST
    headers:
      Authorization: "Bearer ${AGENT_API_KEY}"
    request_template:
      messages: "{{messages}}"
      session_id: "{{session_id}}"
    response_path: "response.content"
    timeout_seconds: 30

Echo proxy उपयोगकर्ता का message वापस mirror करता है (testing और UI development के लिए उपयोगी):

yaml

agent_proxy:
  type: echo
  echo:
    prefix: "[Echo] "
    delay_ms: 500

Pre-Built Annotation Schemas

Potato विशेष रूप से agent evaluation के लिए डिज़ाइन किए गए 9 annotation schemas के साथ आता है। इन्हें सीधे उपयोग करें या अपने स्वयं के schemas के लिए प्रारंभिक बिंदु के रूप में उपयोग करें।

Schema	Type	विवरण
`agent_task_success`	radio	Binary success/failure, partial credit विकल्प के साथ
`agent_step_correctness`	per_turn_rating (radio)	प्रति-step correct/incorrect/unnecessary ratings
`agent_error_taxonomy`	per_turn_rating (multiselect)	12-श्रेणी error taxonomy (wrong tool, hallucination, loop, आदि)
`agent_safety`	radio + text	Severity scale के साथ safety violation detection
`agent_efficiency`	likert	Rate करें कि agent ने efficient path का उपयोग किया या नहीं
`agent_instruction_following`	likert	मूल user instruction के पालन को rate करें
`agent_explanation_quality`	likert	Agent के reasoning/explanations की गुणवत्ता को rate करें
`agent_web_action_correctness`	per_turn_rating (radio)	प्रति-step web action मूल्यांकन (correct target, correct action type)
`agent_conversation_quality`	multirate	Multi-dimensional chat गुणवत्ता (helpfulness, accuracy, tone, safety)

किसी pre-built schema को नाम से load करें:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy

या presets को custom schemas के साथ संयोजित करें:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
 
  # Custom schema alongside presets
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations about this agent trace"
    label_requirement:
      required: false

Full Example: Evaluating a ReAct Agent

यहाँ प्रति-step ratings के साथ ReAct-शैली के agent traces का मूल्यांकन करने के लिए एक पूर्ण configuration है:

yaml

# project config
task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/react_traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task_description
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 300
    show_step_numbers: true
    render_json: true
 
annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_efficiency
 
  - annotation_type: text
    name: failure_reason
    description: "If the agent failed, describe what went wrong"
    label_requirement:
      required: false
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"

नमूना input data (data/react_traces.jsonl):

json

{
  "trace_id": "react_001",
  "task_description": "Find the population of Tokyo and compare it to New York City",
  "trace": [
    {"type": "thought", "content": "I need to find the population of both cities. Let me search for Tokyo first."},
    {"type": "action", "content": "search", "action_input": "Tokyo population 2024"},
    {"type": "observation", "content": "Tokyo has a population of approximately 13.96 million in the city proper..."},
    {"type": "thought", "content": "Now I need to find New York City's population."},
    {"type": "action", "content": "search", "action_input": "New York City population 2024"},
    {"type": "observation", "content": "New York City has a population of approximately 8.34 million..."},
    {"type": "thought", "content": "Tokyo (13.96M) has about 67% more people than NYC (8.34M)."},
    {"type": "action", "content": "finish", "action_input": "Tokyo has ~13.96 million people vs NYC's ~8.34 million, making Tokyo about 67% larger by population."}
  ]
}

server शुरू करें:

bash

potato start config.yaml -p 8000

एजेंटिक एनोटेशन

Trace Format Converters

Converter Reference

कॉन्फ़िगरेशन

Auto-Detection

Custom Converters

Display Types

1. Agent Trace Display

2. Web Agent Trace Display

3. Interactive Chat Display

4. Coding Trace Display

5. Live Agent Display

Advanced Annotation Types

Trajectory Evaluation (trajectory_eval)

Rubric Evaluation (rubric_eval)

Pairwise Comparison

Process Reward Annotation

Per-Turn Ratings

Per-Turn Output Format

Agent Proxy System

Proxy Types

Pre-Built Annotation Schemas

Full Example: Evaluating a ReAct Agent

Further Reading

Trajectory Evaluation (`trajectory_eval`)

Rubric Evaluation (`rubric_eval`)