Potato 2.3: Agentic Annotation, Solo Mode, and the Future of Human Evaluation

We are excited to announce Potato 2.3.0, the largest release in Potato's history. This update introduces two major new systems, agentic annotation and Solo Mode, along with Best-Worst Scaling, SSO/OAuth authentication, Parquet export, and 15 new demo projects.

The theme of this release is simple: the things we need to annotate have changed, and our tools have to keep up. Researchers are no longer just labeling text sentiment and named entities. They are evaluating multi-step AI agent traces, comparing LLM outputs at scale, and building datasets for increasingly complex tasks. Potato 2.3 is built for this new reality.

Agentic Annotation

The headline feature of Potato 2.3 is a complete system for evaluating AI agents through human annotation.

AI agents, systems that take multi-step actions to accomplish tasks, are proliferating rapidly. But evaluating them is hard. A single agent run can involve dozens of tool calls, reasoning steps, web page navigations, and intermediate outputs. Existing annotation tools show agent outputs as flat text, losing the rich structure that evaluators need to see.

Potato's agentic annotation system addresses this with three components.

12 Trace Format Converters

Agent traces come in different formats depending on the framework. Potato normalizes all of them into a unified representation:

Converter	Source
`openai`	OpenAI Assistants API / function calling
`anthropic`	Anthropic Claude tool_use / Messages API
`swebench`	SWE-bench task traces
`opentelemetry`	OpenTelemetry span exports
`mcp`	Model Context Protocol sessions
`multi_agent`	CrewAI / AutoGen / LangGraph
`langchain`	LangChain callback traces
`langfuse`	LangFuse observation exports
`react`	ReAct Thought/Action/Observation
`webarena`	WebArena / VisualWebArena
`atif`	Agent Trace Interchange Format
`raw_web`	Raw browser recordings (HAR + screenshots)
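
To make the idea of a unified representation concrete, here is a minimal sketch of normalizing a ReAct-style trace into typed steps. It is not Potato's converter: the `Step` record, the `parse_react_trace` helper, and the assumed one-prefix-per-line text format are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical unified step record (not Potato's internal schema)."""
    index: int
    kind: str      # "thought", "action", or "observation"
    content: str


# Assumed input format: each step starts with "Thought:", "Action:", or "Observation:".
_PREFIX = re.compile(r"^(Thought|Action|Observation)\s*:\s*(.*)$")


def parse_react_trace(text: str) -> list[Step]:
    steps: list[Step] = []
    for line in text.splitlines():
        stripped = line.strip()
        match = _PREFIX.match(stripped)
        if match:
            steps.append(Step(len(steps), match.group(1).lower(), match.group(2)))
        elif steps and stripped:
            # A non-prefixed, non-empty line continues the previous step.
            steps[-1].content += "\n" + stripped
    return steps


trace = """Thought: I need the current weather.
Action: search[weather in Ann Arbor]
Observation: 12 C and cloudy.
Thought: I can answer now."""

steps = parse_react_trace(trace)
for step in steps:
    print(step.index, step.kind, step.content)
```

Once every source format is reduced to a list of records like this, the display and rating layers only need to understand one shape.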

Configuration is straightforward:

yaml

agentic:
  enabled: true
  trace_converter: react
  trace_file: "data/agent_traces.jsonl"

Auto-detection is available for projects that need to ingest traces from multiple sources:

yaml

agentic:
  enabled: true
  trace_converter: auto

Three Display Types

Different agent modalities need different visualizations.

Agent Trace Display renders tool-using agent traces as color-coded step cards with collapsible observations, JSON pretty-printing, and a timeline sidebar:

yaml

agentic:
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    show_step_numbers: true

Web Agent Trace Display renders browsing agent traces with full screenshots, SVG overlays showing click targets and input fields, and a filmstrip for quick navigation:

yaml

agentic:
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
      click_marker: "circle"
      click_color: "#ef4444"
    filmstrip:
      enabled: true

Interactive Chat Display supports both trace review (evaluating a recorded conversation) and live chat (annotators interact with an agent in real time, then evaluate the conversation):

yaml

agentic:
  display_type: interactive_chat
  interactive_chat_display:
    mode: trace_review
    trace_review:
      show_token_counts: true
      show_latency: true

Per-Turn Ratings

For any display type, annotators can rate individual steps alongside the overall trace:

yaml

annotation_schemes:
  - annotation_type: likert
    name: overall_quality
    min: 1
    max: 5
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
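
When several annotators rate the same step, the per-turn labels usually need to be combined before analysis. The sketch below shows one simple option, a majority vote per step. The row layout `(trace_id, step_index, annotator, label)` is an assumption made for illustration and is not Potato's export format.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (trace_id, step_index, annotator, label)
ratings = [
    ("trace-1", 0, "ann_a", "Correct"),
    ("trace-1", 0, "ann_b", "Correct"),
    ("trace-1", 1, "ann_a", "Incorrect"),
    ("trace-1", 1, "ann_b", "Partially Correct"),
    ("trace-1", 1, "ann_c", "Incorrect"),
]


def majority_per_step(rows):
    """Return the most frequent label for each (trace_id, step_index)."""
    votes = defaultdict(Counter)
    for trace_id, step_index, _annotator, label in rows:
        votes[(trace_id, step_index)][label] += 1
    # most_common(1) yields the top label; ties fall back to first-seen order.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


step_labels = majority_per_step(ratings)
print(step_labels)
```

Majority voting is only one choice; with ordinal labels like these, a project might prefer to flag any step with disagreement for adjudication instead.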

Pre-Built Schemas

Nine annotation schemas cover common agent evaluation dimensions out of the box:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_safety

Available presets: agent_task_success, agent_step_correctness, agent_error_taxonomy, agent_safety, agent_efficiency, agent_instruction_following, agent_explanation_quality, agent_web_action_correctness, agent_conversation_quality.

Read the Agentic Annotation documentation →

Solo Mode

The second major feature of Potato 2.3 is Solo Mode: a 12-phase workflow that replaces the traditional multi-annotator paradigm with a single human expert collaborating with an LLM.

The Problem

Traditional annotation requires multiple annotators for reliability. But hiring, training, and coordinating a team is expensive and slow. For many research projects, the annotation bottleneck is not the interface; it is the logistics.

The Solution

Solo Mode lets a domain expert label a strategically selected subset of the data. An LLM learns from those labels and proposes labels for the remaining instances, and the human reviews only the cases where the LLM struggles. A 12-phase workflow orchestrates this automatically.

In internal benchmarks, Solo Mode achieved 95%+ agreement with full multi-annotator pipelines while requiring only 10-15% of the total human labels.

12 Phases

Seed Annotation -- the human labels 50 diverse instances
Initial LLM Calibration -- the LLM labels a calibration batch using the seed examples
Confusion Analysis -- identify systematic human-LLM disagreement patterns
Guideline Refinement -- the LLM proposes improved guidelines; the human approves
Labeling Function Generation -- ALCHEmist-inspired programmatic rules for easy instances
Active Labeling -- the human labels the most informative remaining instances
Automated Refinement Loop -- iterative re-labeling with updated guidelines
Disagreement Exploration -- the human resolves cases where the LLM and labeling functions conflict
Edge Case Synthesis -- the LLM generates synthetic ambiguous examples for human labeling
Cascaded Confidence Escalation -- the human reviews the lowest-confidence LLM labels
Prompt Optimization -- DSPy-inspired automated prompt search
Final Validation -- random sample review; pass or cycle back

Quick Start

yaml

solo_mode:
  enabled: true
  llm:
    endpoint_type: openai
    model: "gpt-4o"
    api_key: ${OPENAI_API_KEY}
  seed_count: 50
  accuracy_threshold: 0.92
  confidence_threshold: 0.85
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    labels: [Positive, Neutral, Negative]
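
One way to read `confidence_threshold` is as a cutoff that splits LLM proposals into auto-accepted labels and a human review queue. The sketch below illustrates that reading only. The `Proposal` record and `route` function are hypothetical, and the assumption that confidence is a number in [0, 1] compared with `>=` is mine, not a statement about Potato's internals.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical LLM label proposal; confidence is assumed to lie in [0, 1]."""
    instance_id: str
    label: str
    confidence: float


def route(proposals, confidence_threshold=0.85):
    """Split proposals into auto-accepted labels and a human review queue."""
    accepted, needs_review = [], []
    for proposal in proposals:
        if proposal.confidence >= confidence_threshold:
            accepted.append(proposal)
        else:
            needs_review.append(proposal)
    # Lowest-confidence items go to the human first.
    needs_review.sort(key=lambda p: p.confidence)
    return accepted, needs_review


proposals = [
    Proposal("r1", "Positive", 0.97),
    Proposal("r2", "Neutral", 0.61),
    Proposal("r3", "Negative", 0.85),
    Proposal("r4", "Neutral", 0.40),
]
accepted, needs_review = route(proposals)
print([p.instance_id for p in accepted])      # auto-accepted
print([p.instance_id for p in needs_review])  # queued for the human
```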

Multi-Signal Instance Prioritization

Solo Mode uses six weighted pools to select the most valuable instances for human labeling:

yaml

solo_mode:
  prioritization:
    pools:
      - name: uncertain
        weight: 0.30
      - name: disagreement
        weight: 0.25
      - name: boundary
        weight: 0.20
      - name: novel
        weight: 0.10
      - name: error_pattern
        weight: 0.10
      - name: random
        weight: 0.05
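
As a rough illustration of how weighted pools could drive selection, the sketch below draws a pool name for each labeling slot in proportion to the weights above. How Potato actually combines the pools is not described here, so treat `pick_pools` as a hypothetical helper that shows weighted sampling and nothing more.

```python
import random

# Weights copied from the configuration above; they sum to 1.0.
POOL_WEIGHTS = {
    "uncertain": 0.30,
    "disagreement": 0.25,
    "boundary": 0.20,
    "novel": 0.10,
    "error_pattern": 0.10,
    "random": 0.05,
}


def pick_pools(n, weights=POOL_WEIGHTS, seed=None):
    """Choose which pool each of the next n human-labeled instances comes from."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


draws = pick_pools(1000, seed=0)
for name in POOL_WEIGHTS:
    print(name, draws.count(name))
```

Keeping a small `random` pool is a common guard against the other signals all missing the same region of the data.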

Read the Solo Mode documentation →

Best-Worst Scaling

Potato 2.3 adds Best-Worst Scaling (BWS), also known as Maximum Difference Scaling. Annotators see a tuple of items (typically 4) and pick the best and the worst according to some criterion. BWS produces reliable scalar scores from simple binary judgments and needs far fewer annotations than Likert scales for the same statistical power.

yaml

annotation_schemes:
  - annotation_type: best_worst_scaling
    name: fluency
    description: "Select the BEST and WORST translation"
    items_key: "translations"
    tuple_size: 4
    best_label: "Most Fluent"
    worst_label: "Least Fluent"
    randomize_order: true
 
    tuple_generation:
      method: balanced_incomplete
      tuples_per_item: 5
 
    scoring:
      method: bradley_terry
      auto_compute: true
      include_confidence: true

Three scoring methods are available:

Counting -- simple (best_count - worst_count) / appearances
Bradley-Terry -- pairwise comparison model (recommended default)
Plackett-Luce -- full ranking model for maximum data efficiency
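
The counting method is simple enough to work through by hand. The sketch below applies the formula (best_count - worst_count) / appearances to a few made-up judgments; the tuple layout `(items, best, worst)` is an assumption for illustration, not Potato's storage format.

```python
from collections import Counter

# Hypothetical judgments: (items shown, item picked as best, item picked as worst)
judgments = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "c"),
    (("b", "c", "d", "e"), "b", "d"),
]


def counting_scores(judgments):
    """Score each item as (best_count - worst_count) / appearances."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, best_item, worst_item in judgments:
        seen.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


scores = counting_scores(judgments)
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(item, round(score, 3))
```

Scores fall in [-1, 1]: an item chosen as best every time it appears scores 1, and one chosen as worst every time scores -1.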

Score from the CLI:

bash

python -m potato.bws score --config config.yaml --method bradley_terry --output scores.csv

The admin dashboard includes a BWS tab with score distributions, convergence charts, and split-half reliability metrics.

Read the Best-Worst Scaling documentation →

SSO and OAuth Authentication

Production annotation deployments need proper authentication. Potato 2.3 supports three OAuth methods:

Google OAuth

yaml

authentication:
  method: google_oauth
  google_oauth:
    client_id: ${GOOGLE_CLIENT_ID}
    client_secret: ${GOOGLE_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/google/callback"
    allowed_domains:
      - "umich.edu"
    auto_register: true

GitHub OAuth with Organization Restriction

yaml

authentication:
  method: github_oauth
  github_oauth:
    client_id: ${GITHUB_CLIENT_ID}
    client_secret: ${GITHUB_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/github/callback"
    allowed_organizations:
      - "my-research-lab"
    scopes:
      - "read:user"
      - "read:org"

Generic OIDC

Connect to Okta, Azure AD, Auth0, Keycloak, or any OIDC-compliant provider:

yaml

authentication:
  method: oidc
  oidc:
    discovery_url: "https://accounts.example.com/.well-known/openid-configuration"
    client_id: ${OIDC_CLIENT_ID}
    client_secret: ${OIDC_CLIENT_SECRET}
    redirect_uri: "https://annotation.example.com/auth/oidc/callback"

All methods support domain restriction, auto-registration, and mixed mode (multiple auth methods on one login page).
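
Domain restriction comes down to checking the domain part of a verified email address against an allowlist. The sketch below shows one conservative way to do that check. The `is_allowed` helper is hypothetical, and its choice of exact, case-insensitive matching (so subdomains are rejected) is an assumption rather than documented Potato behavior.

```python
def is_allowed(email: str, allowed_domains: list[str]) -> bool:
    """Return True if the email's domain exactly matches an allowed domain."""
    # Split on the last "@" so a crafted local part cannot spoof the domain.
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return False
    return domain.lower() in {d.lower() for d in allowed_domains}


allowed = ["umich.edu"]
print(is_allowed("alice@umich.edu", allowed))
print(is_allowed("bob@evil-umich.edu", allowed))
```

A suffix check such as `email.endswith("umich.edu")` would wrongly accept `evil-umich.edu`, which is why the comparison is made on the full domain.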

Read the SSO and OAuth documentation →

Parquet Export

Annotation data is increasingly consumed by data science tools that expect columnar formats. Potato 2.3 can export annotations directly to Apache Parquet, producing three structured files:

annotations.parquet -- one row per (instance, annotator, schema), with values, timestamps, and durations
spans.parquet -- one row per annotated span, with offsets, labels, and links
items.parquet -- instance metadata with annotation counts and status

yaml

parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd
  auto_export: true

Load directly into pandas, DuckDB, PyArrow, Polars, or Hugging Face Datasets:

python

import pandas as pd
annotations = pd.read_parquet("output/parquet/annotations.parquet")
 
# Or with DuckDB for SQL queries
import duckdb
duckdb.sql("""
  SELECT instance_id, value, COUNT(*) as n
  FROM 'output/parquet/annotations.parquet'
  WHERE schema_name = 'sentiment'
  GROUP BY instance_id, value
""")

Potato supports snappy, gzip, zstd, lz4, and brotli compression, incremental export with date/annotator partitioning, and dictionary encoding for string columns.

Read the Parquet Export documentation →

15 New Demo Projects

Potato 2.3 ships with 15 new demo projects in the project-hub/ directory, covering the new features:

Agentic Annotation Demos

react-agent-eval -- Evaluate ReAct agent traces with step-level ratings
web-agent-eval -- Screenshot overlays के साथ WebArena trace evaluation
chatbot-eval -- Live agent proxy के साथ interactive chat evaluation
multi-agent-eval -- Evaluate CrewAI multi-agent systems
swebench-eval -- Coding agents के लिए SWE-bench trace evaluation

Solo Mode Demos

solo-sentiment -- Solo Mode sentiment classification on product reviews
solo-ner -- Solo Mode named entity recognition
solo-toxicity -- Solo Mode toxicity detection with edge case synthesis

Best-Worst Scaling Demos

bws-translation -- Machine translation quality ranking
bws-summarization -- Summary quality comparison
bws-image-quality -- Image generation quality ranking

Authentication Demos

google-oauth-demo -- Google OAuth setup example
github-oauth-demo -- GitHub OAuth with org restriction

Export Demos

parquet-export-demo -- Parquet export with a DuckDB analysis script
huggingface-upload -- Export to Parquet and push to the Hugging Face Hub

Each demo includes a complete config.yaml, sample data, and a README with setup instructions. Start any demo like this:

bash

cd project-hub/react-agent-eval
potato start config.yaml

Security Hardening

Potato 2.3 includes several security improvements:

Session tokens use cryptographically secure random generation with configurable expiration
CSRF protection is enabled by default for all form submissions
Rate limiting on authentication endpoints (configurable, default 10 attempts per minute)
Input sanitization for all user-provided content displayed in the annotation interface
Dependency audit -- all Python and JavaScript dependencies updated to their latest secure versions
Content Security Policy headers added to help prevent XSS

yaml

security:
  csrf_protection: true
  rate_limiting:
    auth_attempts: 10            # per minute
    api_requests: 100            # per minute
  session:
    token_length: 64
    lifetime_hours: 24
  content_security_policy: true
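
To show what a limit such as 10 authentication attempts per minute means in practice, here is a minimal sliding-window limiter. It is a sketch of the general technique, not Potato's implementation: the `SlidingWindowLimiter` class is hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_attempts per client within any window_seconds span."""

    def __init__(self, max_attempts=10, window_seconds=60.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = defaultdict(deque)  # client_id -> recent timestamps

    def allow(self, client_id, now):
        window = self._attempts[client_id]
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_attempts:
            return False
        window.append(now)
        return True


limiter = SlidingWindowLimiter()
# Eleven attempts, one per second, from the same (documentation-range) address.
results = [limiter.allow("203.0.113.7", now=float(t)) for t in range(11)]
print(results)
# After the oldest attempt ages out, a new attempt is allowed again.
later = limiter.allow("203.0.113.7", now=60.0)
print(later)
```

In a real server, `now` would come from a monotonic clock such as `time.monotonic()`, and rejected attempts are deliberately not recorded so that a blocked client is not locked out indefinitely.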

Upgrading

From Potato 2.2.x

bash

pip install --upgrade potato-annotation

All v2.2 configurations are fully backward-compatible. No changes to existing configs are required.

New Dependencies

Parquet export requires PyArrow:

bash

pip install "potato-annotation[parquet]"

Solo Mode requires an LLM provider SDK:

bash

pip install "potato-annotation[solo]"    # installs openai + anthropic SDKs

Or install everything:

bash

pip install "potato-annotation[all]"

What's Next

Potato 2.3 is a significant expansion of what annotation tools can do. We are already working on the next set of features:

Annotation diffing -- compare annotations across rounds and annotators with visual diffs
Federated annotation -- coordinate annotation across multiple Potato instances
Streaming data sources -- annotate data from Kafka, Pub/Sub, and other streaming systems
Mobile-optimized interface -- responsive annotation for tablets and phones

We would love to hear your feedback. File issues on GitHub, join the discussion in GitHub Discussions, or contact the team directly.
