Question 1

क्या मैं Claude Code, Cursor या SWE-Agent जैसे कोडिंग एजेंट्स के traces का मूल्यांकन कर सकता हूँ?

Accepted Answer

हाँ। Potato में Claude Code, OpenCode, Cursor, Aider और SWE-Agent के लिए नेटिव trace कन्वर्टर्स हैं। tool calls को विशेष रूप से डिज़ाइन की गई UI में प्रस्तुत किया जाता है: Edit/Write के लिए लाल/हरे रंग का unified diff दृश्य, Bash के लिए डार्क मोनोस्पेस टर्मिनल ब्लॉक्स, Read/Grep के लिए लाइन-नंबर वाला कोड, और एक file tree साइडबार जो सभी बदले गए फ़ाइलों को ऑपरेशन के अनुसार समूहित करती है। लंबे आउटपुट स्वचालित रूप से कोलैप्स हो जाते हैं।

Question 2

क्या मैं वेब ब्राउज़ करने वाले एजेंट्स का मूल्यांकन कर सकता हूँ?

Accepted Answer

हाँ। Potato में एक Web Agent डिस्प्ले शामिल है जिसमें क्लिक मार्कर, bounding boxes, माउस paths और scroll indicators के लिए SVG overlays हैं। दो मोड हैं: Review Mode जो पहले से रिकॉर्ड किए गए स्क्रीनशॉट्स के बीच filmstrip नेविगेशन देता है, और Creation Mode जो iframe-आधारित live वेब ब्राउज़िंग के साथ इंटरैक्शन्स की स्वचालित रिकॉर्डिंग करता है। WebArena, Mind2Web और Anthropic Computer Use फ़ॉर्मेट्स के लिए trace कन्वर्टर्स साथ आते हैं।

Question 3

क्या मैं कई सहयोगी एजेंटों वाले मल्टी-एजेंट सिस्टम का मूल्यांकन कर सकता हूँ?

Accepted Answer

हाँ। Potato एक मल्टी-एजेंट रन को एजेंटों और हैंडऑफ़ के एक क्लिक करने योग्य इंटरैक्शन ग्राफ़ के रूप में रेंडर करता है, और किसी विफलता को ज़िम्मेदार एजेंट तथा चरण पर आरोपित करने, इंटर-एजेंट गलत-संरेखण के लिए प्रत्येक हैंडऑफ़ की समीक्षा करने, प्रत्येक एजेंट और टीम को स्कोर करने, और एजेंटों के पार tool contention तथा emergent behavior को टैग करने के लिए स्कीमा जोड़ता है। मल्टी-एजेंट टीम मूल्यांकन दस्तावेज़ देखें।

Question 4

क्या मैं computer-use, voice, या video एजेंटों का मूल्यांकन कर सकता हूँ?

Accepted Answer

हाँ। Potato में मल्टीमॉडल एजेंटों के लिए विशेष रूप से बनाई गई स्कीमा हैं: प्रति-चरण स्क्रीनशॉट और click grounding के साथ GUI/computer-use ट्रेजेक्टरी, barge-in detection के साथ फ़ुल-डुप्लेक्स voice टाइमलाइन, मॉडल के पूर्वानुमान के विरुद्ध एक लाइव IoU के साथ video temporal grounding, संरेखित speech-transcript त्रुटि टैगिंग, इंटरलीव्ड मल्टीमॉडल तर्क, और document table-grid संरचना। मल्टीमॉडल-एजेंट मूल्यांकन दस्तावेज़ देखें।

Question 5

क्या एनोटेटर्स किसी AI एजेंट को वास्तविक समय में वेब ब्राउज़ करते हुए देख सकते हैं?

Accepted Answer

हाँ। Live Agent मोड एक LLM विज़न मॉडल (Playwright के माध्यम से Anthropic Claude) को एक headless ब्राउज़र से जोड़ता है। एजेंट स्क्रीनशॉट लेता है, LLM कार्रवाइयों की योजना बनाता है, और Potato सेशन को Server-Sent Events के माध्यम से एनोटेटर तक स्ट्रीम करता है। एनोटेटर सेशन के बीच में pause कर सकते हैं, निर्देश भेज सकते हैं, या manual control अपने हाथ में ले सकते हैं। `live_agent` डिस्प्ले प्रकार के माध्यम से कॉन्फ़िगर करें।

Question 6

क्या मैं मूल्यांकन के दौरान किसी एजेंट सेशन को rewind, branch या replay कर सकता हूँ?

Accepted Answer

हाँ। coding agent मोड किसी भी step पर checkpoint/rollback और वैकल्पिक trajectories देखने के लिए branching/replay का समर्थन करता है। यह counterfactual मूल्यांकन, एजेंट निर्णयों के बीच A/B तुलना और उच्च-गुणवत्ता वाले प्रशिक्षण डेटा के संग्रह के लिए उपयोगी है जहाँ एनोटेटर एजेंट के एक रन को धीरे-धीरे परिष्कृत करते हैं।

Question 7

क्या मैं एजेंट trajectory के व्यक्तिगत step स्तर पर त्रुटियों को एनोटेट कर सकता हूँ?

Accepted Answer

हाँ। trajectory_eval स्कीमा (TRAIL और AgentRewardBench पर आधारित) प्रत्येक step को एक कार्ड के रूप में दिखाता है। एनोटेटर correctness को चिह्नित करते हैं, उपप्रकारों (reasoning, execution, safety, आदि) के साथ कॉन्फ़िगर करने योग्य taxonomy से error types को वर्गीकृत करते हैं, weighted scores के साथ severity assign करते हैं, और प्रति-step rationales लिखते हैं। स्वचालित रूप से गणना किया गया quality score पूरे trajectory में severity penalties को aggregate करता है।

Question 8

क्या मैं process reward model (PRM) और code review प्रशिक्षण डेटा एकत्र कर सकता हूँ?

Accepted Answer

हाँ। Potato coding agents के step-level मूल्यांकन के लिए process reward और code review स्कीमा भेजता है। दोनों एनोटेशन प्रकार downstream RLHF प्रशिक्षण के लिए सीधे PRM और DPO फ़ॉर्मेट्स में निर्यात होते हैं। coding-agent-evaluation उदाहरण प्रोजेक्ट देखें।

Question 9

क्या एनोटेटर किसी एजेंट का मूल्यांकन करते समय LLM से मदद माँग सकते हैं?

Accepted Answer

हाँ। LLM Chat Sidebar एक collapsible AI assistant panel है जिसमें multi-turn बातचीत होती है। यह task विवरण, label set और current instance text को context के रूप में प्राप्त करता है। OpenAI, Anthropic और Ollama के लिए नेटिव multi-turn समर्थन। सभी बातचीतें बाद में एनोटेटर-LLM सहयोग के विश्लेषण के लिए behavioral data के रूप में लॉग की जाती हैं।

Question 10

Can I use Potato with agents built on LangChain?

Accepted Answer

Yes. Potato converts LangChain/LangSmith traces automatically.

Question 11

क्या मैं अपने LangChain ऐप से एजेंट traces स्वचालित रूप से capture कर सकता हूँ?

Accepted Answer

हाँ। `pip install potato-annotation[langchain]` इंस्टॉल करें और अपनी chain में `PotatoCallbackHandler` को attach करें। यह chain/LLM/tool runs के parent-child संबंधों को ट्रैक करता है और root chain के पूरा होने पर Potato को LangSmith-संगत payloads भेजता है। webhook receiver के साथ मिलकर, आप manual export के बिना annotation queues में live एजेंट traces ingest कर सकते हैं।

Question 12

Potato out-of-the-box किन एजेंट trace फ़ॉर्मेट्स का समर्थन करता है?

Accepted Answer

तीन श्रेणियों में तेरह फ़ॉर्मेट्स। **Frameworks**: LangChain, LangFuse, OpenAI, Anthropic, MCP (Model Context Protocol), OpenTelemetry, ATIF। **Web agents**: WebArena, raw web traces। **Coding agents**: Claude Code, Aider, SWE-Agent। साथ ही किसी भी कस्टम फ़ॉर्मेट के लिए `structured_turns` स्कीमा के साथ एक generic JSONL ingestion path। पूरी सूची /integrations पर देखें।

Question 13

क्या मैं एक ही एजेंट एनोटेशन task में कई मूल्यांकन स्कीमा को संयोजित कर सकता हूँ?

Accepted Answer

हाँ। एक coding-agent प्रोजेक्ट उसी trace पर trajectory_eval (per-step त्रुटियाँ), span एनोटेशन (एजेंट reasoning में hallucinations को highlight करना), pairwise तुलना (किस एजेंट ने बेहतर किया) और likert rating (overall quality) को layer कर सकता है। Potato की multi-schema architecture के कारण एनोटेटर एक ही trace के लिए सभी schemas को एक ही interface में देखते हैं।

Question 14

Do I need a GPU or API key for live agent evaluation?

Accepted Answer

No. The live agent supports Ollama for fully local inference with no API key.

Question 15

Can I evaluate multi-agent systems?

Accepted Answer

Yes. Potato supports CrewAI, AutoGen, and LangGraph trace formats.

Question 16

What if my agent framework is not listed?

Accepted Answer

Use the generic ReAct converter or the webhook API to send traces in any JSON format.

Question 17

Can annotators interact with agents during evaluation?

Accepted Answer

Yes. Live agent mode lets annotators pause the agent, send instructions, or take over manual control.

Question 18

How do I export agent annotations for training?

Accepted Answer

Use the agent_eval exporter: python -m potato.export -f agent_eval -o results/.

Agent Evaluation

Agent Evaluation

अभी भी प्रश्न हैं?