Evaluating AI Agents: A Complete Guide to Human Annotation of Agent Traces

Automated metrics tell you how often an agent succeeds. Human evaluation tells you why it fails, where it goes wrong, and how to fix it. This guide walks through setting up a complete human evaluation pipeline for AI agents using Potato's agentic annotation features.

We will evaluate a ReAct-style agent that answers questions by searching the web. By the end of this guide you will have:

Imported agent traces into Potato
Configured the agent trace display
Set up both trace-level and per-step annotation schemas
Run an annotation task
Exported structured evaluation results

Prerequisites

Install Potato 2.3.0 or later:

bash

pip install --upgrade potato-annotation

You will also need agent traces to evaluate. This guide uses ReAct traces, but the same approach works for any of the 12 supported formats.

Step 1: Prepare Your Agent Traces

A ReAct trace is a sequence of Thought, Action, and Observation steps. Here is a minimal example. Create a file at data/traces.jsonl:

json

{
  "trace_id": "trace_001",
  "task": "What is the capital of Australia and what is its population?",
  "trace": [
    {"type": "thought", "content": "I need to find the capital of Australia and its population. Let me search for this."},
    {"type": "action", "content": "search", "action_input": "capital of Australia"},
    {"type": "observation", "content": "Canberra is the capital city of Australia. It was chosen as the capital in 1908 as a compromise between Sydney and Melbourne."},
    {"type": "thought", "content": "The capital is Canberra. Now I need to find its population."},
    {"type": "action", "content": "search", "action_input": "Canberra population 2025"},
    {"type": "observation", "content": "Canberra has a population of approximately 467,000 as of 2025, making it the eighth-largest city in Australia."},
    {"type": "thought", "content": "I now have both pieces of information. The capital of Australia is Canberra with a population of approximately 467,000."},
    {"type": "action", "content": "finish", "action_input": "The capital of Australia is Canberra, with a population of approximately 467,000 as of 2025."}
  ],
  "ground_truth": "Canberra, approximately 467,000"
}

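Each line of the JSONL file is one complete agent trace. The example above is pretty-printed for readability; in the actual file, each trace must sit on a single line. The trace field holds the step-by-step log, and the task field is what the agent was asked to do.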
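Since JSONL needs every record on one line, it can help to write the file from code rather than by hand. The following is a minimal sketch, not part of Potato: the helper names check_trace and write_jsonl are hypothetical, and the required keys and step types are assumptions taken from the ReAct example record.

```python
import json
from pathlib import Path

# Assumed from the example record; adjust if your traces carry other fields.
REQUIRED_KEYS = {"trace_id", "task", "trace"}
STEP_TYPES = {"thought", "action", "observation"}


def check_trace(record: dict) -> list:
    """Return a list of problems found in one trace record (empty if none)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(record))]
    for i, step in enumerate(record.get("trace", [])):
        if step.get("type") not in STEP_TYPES:
            problems.append(f"step {i}: unexpected type {step.get('type')!r}")
        if "content" not in step:
            problems.append(f"step {i}: missing content")
    return problems


def write_jsonl(records: list, path: str) -> None:
    """Write one compact JSON object per line, creating parent folders if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", encoding="utf-8") as f:
        for record in records:
            # json.dumps without indent emits no newlines, so each record stays on one line
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A typical use would be to run check_trace over every record first, then call write_jsonl(records, "data/traces.jsonl") once the list of problems is empty.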

Notes on Trace Format

For OpenAI function-calling traces, the format looks different:

json

{
  "trace_id": "oai_001",
  "task": "Find cheap flights from NYC to London",
  "messages": [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": null, "tool_calls": [{"function": {"name": "search_flights", "arguments": "{\"from\": \"NYC\", \"to\": \"LHR\"}"}}]},
    {"role": "tool", "name": "search_flights", "content": "{\"flights\": [{\"airline\": \"BA\", \"price\": 450}, {\"airline\": \"AA\", \"price\": 520}]}"},
    {"role": "assistant", "content": "I found flights from NYC to London. The cheapest is British Airways at $450."}
  ]
}

Potato's converter handles these differences. You only need to specify the name of the right converter.
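To make the idea of a converter concrete, here is a rough sketch of the kind of normalization involved. It is not Potato's implementation: the function name openai_messages_to_steps is hypothetical, and the output shape (type, content, action_input) simply mirrors the ReAct example from Step 1.

```python
import json


def openai_messages_to_steps(messages: list) -> list:
    """Flatten OpenAI-style chat messages into ReAct-like steps (illustrative only)."""
    steps = []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            # Each tool call becomes an action step
            for call in msg.get("tool_calls") or []:
                fn = call["function"]
                steps.append({
                    "type": "action",
                    "content": fn["name"],
                    # arguments arrive as a JSON-encoded string
                    "action_input": json.loads(fn["arguments"]),
                })
            # Assumption: plain assistant text is treated as a thought here;
            # a real converter may label the final message as the answer instead.
            if msg.get("content"):
                steps.append({"type": "thought", "content": msg["content"]})
        elif role == "tool":
            steps.append({"type": "observation", "content": msg.get("content", "")})
        # user messages carry the task and are skipped
    return steps
```

Applied to the flight-search example, this would yield an action step, an observation step, and a final text step, which is the uniform shape a step-level annotation interface needs.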

Step 2: Create the Project Configuration

Create config.yaml:

yaml

task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation settings ---
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true

This tells Potato to:

Load traces from data/traces.jsonl
Use the ReAct converter to parse the trace field
Display traces using the agent trace display, with color-coded step cards

Step 3: Design Your Annotation Schemas

Agent evaluation usually needs both trace-level judgments (did the agent succeed?) and step-level judgments (was each step correct?). Let's add both.

Add the following to config.yaml:

yaml

annotation_schemes:
  # --- Trace-level schemas ---
 
  # 1. Task success (the most important metric)
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
    sequential_key_binding: true
 
  # 2. Answer correctness (if the task has a ground truth)
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Cannot Determine"
    label_requirement:
      required: true
 
  # 3. Efficiency rating
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path to the answer?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient (many unnecessary steps)"
      3: "Average"
      5: "Optimal (no wasted steps)"
 
  # 4. Free-text notes
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  # --- Step-level schemas ---
 
  # 5. Per-step correctness
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct and useful?"
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  # 6. Per-step error type (only shown when step is not correct)
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "What type of error occurred?"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

This schema design gives you:

A three-way success metric (Success, Partial Success, Failure) for high-level analysis
A correctness rating for evaluating the final answer
An efficiency score for comparing agent strategies
Per-step ratings that pinpoint exactly where agents go wrong
A conditional error taxonomy that appears only when a step has a problem
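When editing schemas by hand, it is easy for a show_when label to drift out of sync with the labels of the schema it refers to. The sketch below checks for that. It works on the annotation_schemes section as a plain list of dicts (for example, after loading config.yaml with a YAML parser); check_conditionals is a hypothetical helper, not a Potato function.

```python
def check_conditionals(schemes: list) -> list:
    """Report show_when rules that point at unknown schemas or unknown labels."""
    # Map each schema name to the set of labels it defines
    labels_by_name = {s["name"]: set(s.get("labels", [])) for s in schemes}
    problems = []
    for scheme in schemes:
        rules = scheme.get("conditional", {}).get("show_when", {})
        for target, wanted in rules.items():
            if target not in labels_by_name:
                problems.append(f"{scheme['name']}: unknown schema {target!r}")
                continue
            for label in wanted:
                if label not in labels_by_name[target]:
                    problems.append(
                        f"{scheme['name']}: {target!r} has no label {label!r}"
                    )
    return problems
```

An empty result means every conditional rule refers to a schema and label that actually exist in the configuration.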

Step 4: Configure Output and Start the Server

Add output settings to config.yaml:

yaml

output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
# Optional: also export to Parquet for analysis
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

The complete config.yaml, for reference:

yaml

task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    render_json: true
    syntax_highlight: true
 
annotation_schemes:
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels: ["Success", "Partial Success", "Failure"]
    label_requirement:
      required: true
    sequential_key_binding: true
 
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels: ["Correct", "Partially Correct", "Incorrect", "Cannot Determine"]
    label_requirement:
      required: true
 
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient"
      3: "Average"
      5: "Optimal"
 
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct?"
    rating_type: radio
    labels: ["Correct", "Partially Correct", "Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "Error type"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
 
output_annotation_dir: "output/"
output_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

Start the server:

bash

potato start config.yaml -p 8000

Open http://localhost:8000 in your browser.

Step 5: Annotation Workflow

When an annotator opens a trace, they see:

The task description at the top (the original user query)
Step cards showing the full agent trace, color-coded by type:
- Purple cards for thoughts/reasoning
- Blue cards for actions/tool calls
- Green cards for observations/results
- Red cards for errors
Per-step rating controls next to each step card
Trace-level schemas below the trace display

A typical workflow:

Read the task description to understand what the agent was supposed to do
Go through the trace steps, rating each one
For any step rated "Partially Correct", "Incorrect", or "Unnecessary", select the error type(s)
Rate the overall trace (success, correctness, efficiency)
Add notes if needed
Submit and move on to the next trace

Tips for Annotators

Expand collapsed observations to verify that the agent processed the information correctly
Compare the final answer against the ground truth (if available) before giving a task success rating
Rate "Unnecessary" steps separately from "Incorrect" ones: an unnecessary step wastes effort but does not introduce errors
Use the step timeline sidebar to jump to specific steps in long traces

Step 6: Analyze the Results

After annotation, analyze the results programmatically.

Basic analysis with pandas

python

import pandas as pd
import json
 
# Load annotations
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        annotations.append(json.loads(line))
 
df = pd.DataFrame(annotations)
 
# Task success distribution: pull the label out of each record's annotations dict
# (grouping by the dict column itself would fail, since dicts are not hashable)
success_counts = df["annotations"].apply(
    lambda a: a.get("task_success")
).value_counts()
print("Task Success Distribution:")
print(success_counts)
 
# Average efficiency rating
efficiency_scores = [
    a["annotations"]["efficiency"]
    for a in annotations
    if "efficiency" in a["annotations"]
]
# Guard against an empty list to avoid dividing by zero
if efficiency_scores:
    print(f"\nAverage Efficiency: {sum(efficiency_scores) / len(efficiency_scores):.2f}")
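With a small number of annotated traces, a raw success percentage can be misleading, so it may be worth reporting an interval alongside it. The sketch below computes a Wilson score interval for the share of traces labeled "Success". It assumes the same record layout as the pandas example (a task_success label inside each record's annotations dict); the function names are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total == 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def success_rate(records: list, label: str = "Success") -> tuple:
    """Return (rate, low, high) for records whose task_success equals label."""
    labels = [
        r["annotations"]["task_success"]
        for r in records
        if "task_success" in r["annotations"]
    ]
    hits = sum(1 for x in labels if x == label)
    low, high = wilson_interval(hits, len(labels))
    return hits / len(labels), low, high
```

Calling success_rate(annotations) on the list loaded earlier would give the point estimate and its bounds; whether "Partial Success" should count as a hit is a project decision and is not assumed here.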

Step-Level Error Analysis

python

# Collect all step-level errors
error_counts = {}
for ann in annotations:
    step_errors = ann["annotations"].get("error_type", {})
    for step_idx, errors in step_errors.items():
        for error in errors:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("Error Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")
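Beyond counting error types, it is often useful to know how early in a trace things first go wrong. The following sketch builds a histogram of the first problematic step per trace. It assumes step_correctness is stored the same way the loop above treats error_type, that is, as a dict keyed by step index, and that those keys can be converted to integers; both are assumptions about the export layout, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional

# Labels from the step_correctness schema that indicate a problem
BAD_LABELS = {"Partially Correct", "Incorrect", "Unnecessary"}


def first_problem_step(step_ratings: dict) -> Optional[int]:
    """Return the lowest step index whose rating is in BAD_LABELS, or None."""
    bad = [int(idx) for idx, label in step_ratings.items() if label in BAD_LABELS]
    return min(bad) if bad else None


def first_problem_histogram(records: list) -> Counter:
    """Count, across records, the step index at which the first problem appears."""
    hist = Counter()
    for rec in records:
        ratings = rec["annotations"].get("step_correctness", {})
        idx = first_problem_step(ratings)
        if idx is not None:
            hist[idx] += 1
    return hist
```

A histogram concentrated at low indices suggests the agent tends to start badly (for example, with a poor first query), while late failures point to problems in synthesis or termination.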

Analysis with DuckDB (via Parquet)

python

import duckdb
 
# Overall success rate
result = duckdb.sql("""
    SELECT value, COUNT(*) as count
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'task_success'
    GROUP BY value
    ORDER BY count DESC
""")
print(result)

Step 7: Scale Up

For larger evaluation projects (hundreds or thousands of traces), consider these configurations:

Multiple Annotators

Assign several annotators per trace to measure inter-annotator agreement:

yaml

annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random
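Once each trace has several labels, agreement can be quantified. The sketch below computes Fleiss' kappa in plain Python for a categorical schema such as task_success. It expects one inner list of labels per trace, all of the same length; how you group exported records by trace depends on your output layout and is not shown. The function name is hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for items that each received the same number of labels."""
    n = len(ratings[0])  # labels per item
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of annotator pairs that agree on this item
        agreement_sum += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar = agreement_sum / len(ratings)
    # Chance agreement from the overall label proportions
    p_e = sum((c / (len(ratings) * n)) ** 2 for c in totals.values())
    if p_e == 1.0:
        # Only one label was ever used; kappa is undefined, so report full agreement
        return 1.0
    return (p_bar - p_e) / (1 - p_e)
```

For example, fleiss_kappa([["Success", "Success", "Failure"], ["Failure", "Failure", "Failure"]]) takes two traces with three labels each. Values near 1 indicate strong agreement, and values near 0 indicate agreement no better than chance.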

Use Pre-Built Schemas

For a quicker setup, use Potato's pre-built agent evaluation schemas:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_efficiency

Quality Control

Enable gold-standard instances for quality monitoring, here through a training phase:

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_traces.jsonl"
    passing_criteria:
      min_correct: 4
      total_questions: 5

Adapting to Other Agent Types

OpenAI Function Calling

yaml

agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace

Anthropic Tool Use

yaml

agentic:
  enabled: true
  trace_converter: anthropic
  display_type: agent_trace

Multi-Agent Systems (CrewAI/AutoGen)

yaml

agentic:
  enabled: true
  trace_converter: multi_agent
  display_type: agent_trace
  multi_agent:
    agent_converters:
      researcher: react
      writer: anthropic
      reviewer: openai

Web Browsing Agents

For web agents, switch to the web agent display:

yaml

agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
    filmstrip:
      enabled: true

For a dedicated guide, see Annotating Web Browsing Agents.

Summary

Human evaluation of AI agents requires specialized tooling. Potato's agentic annotation system provides:

12 converters to normalize traces from any framework
3 display types optimized for tool-use, web browsing, and conversational agents
Per-turn ratings for step-level evaluation
9 pre-built schemas covering common evaluation dimensions
Parquet export for efficient downstream analysis

The key insight is that agent evaluation is not only "did the agent get the right answer?" but also "did the agent reason correctly at every step?" Per-step annotation surfaces error patterns that aggregate metrics miss.

Further Reading

Agentic Annotation Documentation
Annotating Web Browsing Agents
Solo Mode -- combine agentic annotation with human-LLM collaborative evaluation
Best-Worst Scaling -- rank agent outputs comparatively
Parquet Export -- efficient export for analysis