Automated metrics tell you how often an agent succeeds. Human evaluation tells you why it fails, where it goes wrong, and how to fix it. This guide walks you through setting up a complete human evaluation pipeline for AI agents using Potato's agentic annotation features.

We will evaluate a ReAct-style agent that answers questions by searching the web. By the end of this guide, you will have:

Imported agent traces into Potato
Configured the agent trace display
Created trace-level and per-step annotation schemas
Run the annotation task
Exported structured evaluation results

Prerequisites

Install Potato 2.3.0 or later:

bash

pip install --upgrade potato-annotation

You will also need agent traces to evaluate. This guide uses ReAct traces, but the same approach works with any of the 12 supported formats.

Step 1: Prepare the Agent Traces

A ReAct trace is a sequence of thought, action, and observation steps. Here is a simple example. Create a file at data/traces.jsonl:

json

{
  "trace_id": "trace_001",
  "task": "What is the capital of Australia and what is its population?",
  "trace": [
    {"type": "thought", "content": "I need to find the capital of Australia and its population. Let me search for this."},
    {"type": "action", "content": "search", "action_input": "capital of Australia"},
    {"type": "observation", "content": "Canberra is the capital city of Australia. It was chosen as the capital in 1908 as a compromise between Sydney and Melbourne."},
    {"type": "thought", "content": "The capital is Canberra. Now I need to find its population."},
    {"type": "action", "content": "search", "action_input": "Canberra population 2025"},
    {"type": "observation", "content": "Canberra has a population of approximately 467,000 as of 2025, making it the eighth-largest city in Australia."},
    {"type": "thought", "content": "I now have both pieces of information. The capital of Australia is Canberra with a population of approximately 467,000."},
    {"type": "action", "content": "finish", "action_input": "The capital of Australia is Canberra, with a population of approximately 467,000 as of 2025."}
  ],
  "ground_truth": "Canberra, approximately 467,000"
}

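Each line in the JSONL file represents one complete agent trace. The example above is pretty-printed for readability; in the actual file, each trace must occupy a single line. The trace field holds the step-by-step log, and the task field is what the agent was asked to do.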
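
Since each trace must fit on one line, it can help to check records and write the file programmatically rather than by hand. The following is a minimal sketch and not part of Potato: `validate_trace` and `write_jsonl` are hypothetical helpers, and the set of allowed step types is an assumption based on the example above and the colors configured later in this guide.

```python
import json
import tempfile
from pathlib import Path

# Assumption: step types seen in this guide's example, plus "error"
ALLOWED_STEP_TYPES = {"thought", "action", "observation", "error"}


def validate_trace(record: dict) -> list:
    """Return a list of problems found in one trace record (empty if none)."""
    problems = []
    for key in ("trace_id", "task", "trace"):
        if key not in record:
            problems.append(f"missing key: {key}")
    for i, step in enumerate(record.get("trace", [])):
        if step.get("type") not in ALLOWED_STEP_TYPES:
            problems.append(f"step {i}: unexpected type {step.get('type')!r}")
        if "content" not in step:
            problems.append(f"step {i}: missing content")
    return problems


def write_jsonl(records: list, path: Path) -> None:
    """Write one compact JSON object per line, as JSONL requires."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


example = {
    "trace_id": "trace_001",
    "task": "What is the capital of Australia and what is its population?",
    "trace": [
        {"type": "thought", "content": "I need to find the capital of Australia."},
        {"type": "action", "content": "search", "action_input": "capital of Australia"},
        {"type": "observation", "content": "Canberra is the capital city of Australia."},
    ],
}

problems = validate_trace(example)

# Written to a temporary directory here; point this at data/traces.jsonl in a real project
out_path = Path(tempfile.mkdtemp()) / "data" / "traces.jsonl"
write_jsonl([example], out_path)
line_count = len(out_path.read_text(encoding="utf-8").splitlines())
```

Running the check before starting the server surfaces missing keys or misspelled step types early, when they are cheapest to fix.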

Notes on Trace Formats

For OpenAI function-calling traces, the format looks different:

json

{
  "trace_id": "oai_001",
  "task": "Find cheap flights from NYC to London",
  "messages": [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": null, "tool_calls": [{"function": {"name": "search_flights", "arguments": "{\"from\": \"NYC\", \"to\": \"LHR\"}"}}]},
    {"role": "tool", "name": "search_flights", "content": "{\"flights\": [{\"airline\": \"BA\", \"price\": 450}, {\"airline\": \"AA\", \"price\": 520}]}"},
    {"role": "assistant", "content": "I found flights from NYC to London. The cheapest is British Airways at $450."}
  ]
}

Potato's converter handles these differences. You only need to specify the correct converter name.
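
To make the idea of a converter concrete, here is a rough sketch of how OpenAI-style messages could be flattened into a uniform step list. This is an illustration only and is not Potato's converter: `flatten_openai_messages` is a hypothetical function, and the output step shape (including the `response` type for plain assistant text) is an assumption made for this example.

```python
import json


def flatten_openai_messages(messages: list) -> list:
    """Map OpenAI-style chat messages onto a flat list of steps (illustrative only)."""
    steps = []
    for msg in messages:
        role = msg.get("role")
        if role == "assistant":
            # Each tool call becomes an action step with parsed arguments
            for call in msg.get("tool_calls") or []:
                fn = call["function"]
                steps.append({
                    "type": "action",
                    "content": fn["name"],
                    "action_input": json.loads(fn["arguments"]),
                })
            # Plain assistant text is kept as a response step
            if msg.get("content"):
                steps.append({"type": "response", "content": msg["content"]})
        elif role == "tool":
            # Tool output plays the role of an observation
            steps.append({"type": "observation", "content": msg["content"]})
        # User messages are skipped: the task field already carries the query
    return steps


messages = [
    {"role": "user", "content": "Find cheap flights from NYC to London"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"function": {"name": "search_flights", "arguments": '{"from": "NYC", "to": "LHR"}'}}
    ]},
    {"role": "tool", "name": "search_flights", "content": '{"flights": [{"airline": "BA", "price": 450}]}'},
    {"role": "assistant", "content": "The cheapest is British Airways at $450."},
]

steps = flatten_openai_messages(messages)
```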

Step 2: Create the Project Configuration

Create config.yaml:

yaml

annotation_task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
# --- Agentic annotation settings ---
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
 
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true

This tells Potato to:

Load traces from data/traces.jsonl
Use the ReAct converter to parse the trace field
Render traces with the agent trace display, using color-coded step cards

Step 3: Design the Annotation Schemas

Agent evaluation usually needs both trace-level judgments (did the agent succeed?) and step-level judgments (was each step correct?). Let's add both.

Add the following to config.yaml:

yaml

annotation_schemes:
  # --- Trace-level schemas ---
 
  # 1. Task success (the most important metric)
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels:
      - "Success"
      - "Partial Success"
      - "Failure"
    label_requirement:
      required: true
    sequential_key_binding: true
 
  # 2. Answer correctness (if the task has a ground truth)
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Cannot Determine"
    label_requirement:
      required: true
 
  # 3. Efficiency rating
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path to the answer?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient (many unnecessary steps)"
      3: "Average"
      5: "Optimal (no wasted steps)"
 
  # 4. Free-text notes
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  # --- Step-level schemas ---
 
  # 5. Per-step correctness
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct and useful?"
    rating_type: radio
    labels:
      - "Correct"
      - "Partially Correct"
      - "Incorrect"
      - "Unnecessary"
 
  # 6. Per-step error type (only shown when step is not correct)
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "What type of error occurred?"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]

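This schema design gives you:

A three-level task-success measure (success, partial success, failure) for high-level analysis
A correctness rating for judging the final answer
An efficiency score for comparing agent strategies
Per-step ratings for pinpointing exactly where agents go wrong
A conditional error classification that appears only when a step has a problem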
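
The conditional rule can be read as a simple predicate over a step's existing answers. The sketch below shows that logic under stated assumptions: `should_show` is a hypothetical helper, not Potato's implementation, and treating multiple `show_when` keys as a logical AND is an assumption.

```python
def should_show(show_when: dict, step_answers: dict) -> bool:
    """True when every referenced schema has an answer among its trigger values."""
    return all(
        step_answers.get(schema_name) in trigger_values
        for schema_name, trigger_values in show_when.items()
    )


# The rule from the config above
show_when = {"step_correctness": ["Partially Correct", "Incorrect", "Unnecessary"]}

hidden_for_correct = should_show(show_when, {"step_correctness": "Correct"})
shown_for_incorrect = should_show(show_when, {"step_correctness": "Incorrect"})
hidden_when_unrated = should_show(show_when, {})
```

Under this reading, the error-type question stays hidden for steps rated "Correct" and for steps that have not been rated yet, which keeps the interface uncluttered for the common case.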

Step 4: Configure Output and Start the Server

Add the output settings to config.yaml:

yaml

output_annotation_dir: "output/"
export_annotation_format: "jsonl"
 
# Optional: also export to Parquet for analysis
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

The complete config.yaml for reference:

yaml

annotation_task_name: "ReAct Agent Evaluation"
task_dir: "."
 
data_files:
  - "data/traces.jsonl"
 
item_properties:
  id_key: trace_id
  text_key: task
 
agentic:
  enabled: true
  trace_converter: react
  display_type: agent_trace
  agent_trace_display:
    colors:
      thought: "#6E56CF"
      action: "#3b82f6"
      observation: "#22c55e"
      error: "#ef4444"
    collapse_observations: true
    collapse_threshold: 400
    show_step_numbers: true
    show_timestamps: false
    render_json: true
    syntax_highlight: true
 
annotation_schemes:
  - annotation_type: radio
    name: task_success
    description: "Did the agent successfully complete the task?"
    labels: ["Success", "Partial Success", "Failure"]
    label_requirement:
      required: true
    sequential_key_binding: true
 
  - annotation_type: radio
    name: answer_correctness
    description: "Is the agent's final answer factually correct?"
    labels: ["Correct", "Partially Correct", "Incorrect", "Cannot Determine"]
    label_requirement:
      required: true
 
  - annotation_type: likert
    name: efficiency
    description: "Did the agent use an efficient path?"
    min: 1
    max: 5
    labels:
      1: "Very Inefficient"
      3: "Average"
      5: "Optimal"
 
  - annotation_type: text
    name: evaluator_notes
    description: "Any additional observations"
    label_requirement:
      required: false
 
  - annotation_type: per_turn_rating
    name: step_correctness
    target: agentic_steps
    description: "Was this step correct?"
    rating_type: radio
    labels: ["Correct", "Partially Correct", "Incorrect", "Unnecessary"]
 
  - annotation_type: per_turn_rating
    name: error_type
    target: agentic_steps
    description: "Error type"
    rating_type: multiselect
    labels:
      - "Wrong tool/action"
      - "Wrong arguments"
      - "Hallucinated information"
      - "Reasoning error"
      - "Redundant step"
      - "Premature termination"
      - "Other"
    conditional:
      show_when:
        step_correctness: ["Partially Correct", "Incorrect", "Unnecessary"]
 
output_annotation_dir: "output/"
export_annotation_format: "jsonl"
 
parquet_export:
  enabled: true
  output_dir: "output/parquet/"
  compression: zstd

Start the server:

bash

potato start config.yaml -p 8000

Open http://localhost:8000 in your browser.

Step 5: The Annotation Workflow

When an annotator opens a trace, they see:

The task description at the top (the original user query)
Step cards showing the full agent trace, color-coded by type:
- Purple cards for thoughts/reasoning
- Blue cards for actions/tool calls
- Green cards for observations/results
- Red cards for errors
Per-step rating controls next to each step card
Trace-level schemas below the trace display

A typical workflow:

Read the task description to understand what the agent was supposed to do
Walk through the trace steps and rate each one
For any step rated "Partially Correct", "Incorrect", or "Unnecessary", select the error type(s)
Rate the trace as a whole (success, correctness, efficiency)
Add notes if needed
Submit and move on to the next trace

Tips for Annotators

Expand collapsed observations to verify that the agent processed the information correctly
Compare the final answer against the ground truth (if available) before rating task success
Rate "Unnecessary" steps separately from "Incorrect" ones: an unnecessary step wastes effort but does not introduce errors
Use the step timeline sidebar to jump to specific steps in long traces

Step 6: Analyze the Results

After annotation, analyze the results programmatically.

Basic Analysis with pandas

python

import json

import pandas as pd

# Load annotations: one JSON object per line
annotations = []
with open("output/annotations.jsonl") as f:
    for line in f:
        if line.strip():  # skip blank lines
            annotations.append(json.loads(line))

df = pd.DataFrame(annotations)

# Task success distribution: pull the label out of each row's annotations dict
# (grouping by the dict column itself would fail, since dicts are unhashable)
success_counts = df["annotations"].apply(lambda a: a.get("task_success")).value_counts()
print("Task Success Distribution:")
print(success_counts)

# Average efficiency rating (cast to float in case ratings are stored as strings)
efficiency_scores = [
    float(a["annotations"]["efficiency"])
    for a in annotations
    if "efficiency" in a["annotations"]
]
if efficiency_scores:  # avoid dividing by zero when nothing was rated
    print(f"\nAverage Efficiency: {sum(efficiency_scores) / len(efficiency_scores):.2f}")

Step-Level Error Analysis

python

# Collect all step-level errors
error_counts = {}
for ann in annotations:
    step_errors = ann["annotations"].get("error_type", {})
    for step_idx, errors in step_errors.items():
        for error in errors:
            error_counts[error] = error_counts.get(error, 0) + 1
 
print("Error Type Distribution:")
for error, count in sorted(error_counts.items(), key=lambda x: -x[1]):
    print(f"  {error}: {count}")
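
Beyond counting error types, it is often useful to know where in a trace things first go wrong. The sketch below builds a histogram of the earliest problematic step per trace. It assumes that `step_correctness` is stored like `error_type` above, as a mapping from step index to label; that layout is an assumption, and both helper functions are hypothetical.

```python
from collections import Counter
from typing import Optional


def first_problem_step(step_labels: dict) -> Optional[int]:
    """Index of the earliest step not rated "Correct", or None if all are correct."""
    bad = [int(idx) for idx, label in step_labels.items() if label != "Correct"]
    return min(bad) if bad else None


def first_problem_histogram(annotations: list) -> Counter:
    """Count, per step index, how many traces first went wrong at that step."""
    positions = Counter()
    for ann in annotations:
        labels = ann["annotations"].get("step_correctness", {})
        pos = first_problem_step(labels)
        if pos is not None:
            positions[pos] += 1
    return positions


# Tiny made-up sample, for illustration only
sample = [
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Incorrect", "2": "Unnecessary"}}},
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Correct"}}},
    {"annotations": {"step_correctness": {"0": "Correct", "1": "Partially Correct"}}},
]

histogram = first_problem_histogram(sample)
for step_idx, count in sorted(histogram.items()):
    print(f"  first problem at step {step_idx}: {count} trace(s)")
```

A histogram skewed toward early steps suggests planning or tool-selection problems, while one skewed toward late steps points at answer synthesis.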

Analysis with DuckDB (via Parquet)

python

import duckdb
 
# Overall success rate
result = duckdb.sql("""
    SELECT value, COUNT(*) as count
    FROM 'output/parquet/annotations.parquet'
    WHERE schema_name = 'task_success'
    GROUP BY value
    ORDER BY count DESC
""")
print(result)

Step 7: Scaling Up

For larger evaluation projects (hundreds or thousands of traces), consider these settings:

Multiple Annotators

Assign several annotators to each trace to measure inter-annotator agreement:

yaml

annotation_task_config:
  total_annotations_per_instance: 3
  assignment_strategy: random
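
With three labels per trace, agreement can be summarized with Fleiss' kappa. The following is a small pure-Python sketch of the standard formula; `fleiss_kappa` is a hypothetical helper, and grouping the exported labels into one list per trace is left to you, since it depends on the export layout.

```python
from collections import Counter


def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for items that each received the same number of labels."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    observed = sum(per_item_agreement) / n_items
    # Chance agreement from the overall label proportions
    expected = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used; kappa is undefined, report full agreement
    return (observed - expected) / (1 - expected)


# Made-up task_success labels from three annotators on two traces
perfect = fleiss_kappa([["Success"] * 3, ["Failure"] * 3])
mixed = fleiss_kappa([["Success", "Success", "Failure"], ["Success", "Failure", "Failure"]])
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement, which usually means the guidelines need clarifying.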

Using Prebuilt Schemas

For a quick setup, use Potato's prebuilt agent evaluation schemas:

yaml

annotation_schemes:
  - preset: agent_task_success
  - preset: agent_step_correctness
  - preset: agent_error_taxonomy
  - preset: agent_efficiency

Quality Control

Enable gold-standard examples for quality control:

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_traces.jsonl"
    passing_criteria:
      min_correct: 4
      total_questions: 5
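
One plausible reading of these criteria is a simple threshold: an annotator qualifies after answering at least `min_correct` of the `total_questions` training items correctly. The sketch below encodes that reading; `passes_training` is a hypothetical function, and the exact semantics Potato applies are an assumption here.

```python
def passes_training(correct: int, min_correct: int = 4, total_questions: int = 5) -> bool:
    """Return True when the annotator meets the assumed passing threshold."""
    if not 0 <= correct <= total_questions:
        raise ValueError("correct must be between 0 and total_questions")
    return correct >= min_correct


four_of_five = passes_training(4)   # meets the threshold
three_of_five = passes_training(3)  # falls short
```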

Adapting to Other Agent Types

OpenAI Function Calling

yaml

agentic:
  enabled: true
  trace_converter: openai
  display_type: agent_trace

Anthropic Tool Use

yaml

agentic:
  enabled: true
  trace_converter: anthropic
  display_type: agent_trace

Multi-Agent Systems (CrewAI/AutoGen)

yaml

agentic:
  enabled: true
  trace_converter: multi_agent
  display_type: agent_trace
  multi_agent:
    agent_converters:
      researcher: react
      writer: anthropic
      reviewer: openai

Web Browsing Agents

For web agents, switch to the web agent display:

yaml

agentic:
  enabled: true
  trace_converter: webarena
  display_type: web_agent
  web_agent_display:
    screenshot_max_width: 900
    overlay:
      enabled: true
    filmstrip:
      enabled: true

See Annotating Web Browsing Agents for a dedicated guide.

Summary

Human evaluation of AI agents requires specialized tooling. Potato's agentic annotation system provides:

12 converters for normalizing traces from any framework
3 display types optimized for tool use, web browsing, and conversational agents
Per-turn ratings for step-level evaluation
9 prebuilt schemas covering common evaluation dimensions
Parquet export for efficient downstream analysis

The key insight is that agent evaluation is not just "did the agent get the right answer?" but "did the agent reason correctly at every step?" Per-step annotation reveals error patterns that aggregate metrics miss.

Further Reading

Agentic Annotation Documentation
Annotating Web Browsing Agents
Solo Mode: combine agentic annotation with collaborative human-LLM evaluation
Best-Worst Scaling: rank agent outputs comparatively
Parquet Export: efficient export for analysis