Skip to content

多模态智能体评估

评估在文本之外采取行动的智能体:computer-use 与 GUI 智能体、语音助手、视频与文档智能体。Potato 为带点击定位的 GUI 轨迹、全双工语音时间线、带实时 IoU 的视频时间定位、语音转录错误标记、交错式多模态推理以及表格网格结构提供专门构建的 schema。

智能体越来越多地在文本之外的模态中行动:它们驱动 GUI、观看视频、进行语音对话。每种模态都需要一个纯文本控件无法提供的审查界面,比如一张带智能体点击位置的截图、一条双轨语音时间线、一个带黄金区间的视频拖动条。 Potato 为这些轨迹提供专门构建的标注 schema,与它现有的图像音频视频视图并列。

每个 schema 在渲染时都从轨迹中推导出其步骤、回合或片段,并且每个都在 examples/agent-traces/ 下附带一个可运行的示例。

GUI/computer-use 轨迹(gui_trajectory

逐步评估一个 computer-use、GUI 或 OS 智能体(OSWorld,NeurIPS 2024;ScreenSpot-Pro;AndroidWorld)。每一步展示智能体所看到的截图和它所采取的动作;标注者对动作做出裁定(正确/错误元素/错误动作/幻觉)。当某一步带有点击坐标时,截图上的一个定位标记会显示该点击是否落在了正确的元素上。

带动作裁定和点击定位标记的 computer-use 步骤Review each computer-use step: action correctness plus click-grounding on the screenshot

yaml
annotation_schemes:
  - annotation_type: gui_trajectory
    name: gui_review
    description: "For each step: was the action correct and did the click land right?"
    steps_key: steps
    screenshot_key: screenshot   # field on each step holding an image URL / data-URI
    action_key: action           # field holding the action text
    coord_space: normalized      # normalized (0..1) | pixels — for the x/y grounding marker
    verdict_options: [correct, wrong_element, wrong_action, hallucinated]

每一步可提供 screenshotaction,以及可选的 x/y(或嵌套的 click: {x, y})。存储为一个 {index, step, verdict, notes} 列表。

语音/全双工交互(voice_interaction

对一段语音的人↔智能体对话标注其轮流发言和打断(barge-in)处理(Full-Duplex-Bench,2025)。一条双轨时间线(用户泳道加智能体泳道)按起止时间放置每个回合,并高亮重叠区域,即两位说话者同时说话之处。标注者对每个重叠进行分类(智能体应当回应/应当继续/附和/不确定)并对整体轮流发言进行评分;提供时,源音频会内联播放。

带高亮打断区域的双轨语音时间线A dual-track voice timeline with barge-in detection and turn-taking scoring

yaml
annotation_schemes:
  - annotation_type: voice_interaction
    name: turn_taking
    description: "Classify each barge-in/overlap and rate the overall turn-taking."
    turns_key: turns           # list of {speaker, start, end, text} (seconds)
    speaker_key: speaker
    user_speakers: [user, human, caller]   # everything else is treated as the agent
    overlap_labels: [agent_should_respond, agent_should_resume, backchannel, uncertain]
    rating_scale: 5
    # audio_key: audio         # optional per-instance audio URL to enable the player

不同说话者回合之间的重叠在渲染时计算得出。存储为 {"overlaps": {idx: label}, "rating": int}

视频时间定位(temporal_grounding

在视频中标出事件时间区间以进行时间定位评估(TimeScope,2025;ET-Bench)。对于每个事件提示,标注者设定黄金 [start, end],方式是捕获播放头位置或输入秒数。当数据带有模型预测的区间时,一个实时 IoU 和一条两条棒的迷你时间线(预测 vs. 黄金)会随你调整而更新。这是为"预测 vs. 黄金"的定位评分专门构建的,区别于通用的片段标注。

带黄金区间和实时 IoU 读数的视频拖动条Mark gold event intervals on video with a live IoU vs. the model's prediction

yaml
annotation_schemes:
  - annotation_type: temporal_grounding
    name: grounding
    description: "Mark the gold start/end interval for each event. IoU vs prediction updates live."
    video_key: video           # per-instance video URL
    events_key: events         # list of {prompt, predicted: {start, end}} (predicted optional)
    # duration: 120            # optional fixed timeline scale (else inferred from the video)

存储为 {"events": {idx: {start, end}}}

对齐转录的语音错误(speech_transcript

对一段时间对齐的语音转录逐片段标注 ASR/TTS 与语音质量错误(Speak & Improve,2025)。每个片段 {start, end, text, speaker?} 是一张显示其时间戳和文本的卡片;标注者标记错误(ASR 错误/TTS 瑕疵/发音错误/不流畅)并可输入修正后的转录。这是对 voice_interaction 中轮流发言视图的片段级补充。

带逐片段错误标记和内联修正的语音转录片段Tag ASR/TTS/pronunciation errors per segment and correct the transcript inline

yaml
annotation_schemes:
  - annotation_type: speech_transcript
    name: speech_errors
    description: "Tag speech errors on each segment and correct the transcript where needed."
    segments_key: segments       # list of {start, end, text, speaker?}
    error_types: [asr_error, tts_artifact, mispronunciation, disfluency]
    allow_correction: true
    # audio_key: audio           # optional per-item audio URL to enable the player

存储为一个 {index, start, end, errors, correction} 列表。

交错式多模态推理(multimodal_reasoning

逐步对一段交错的文本 ↔ 图像 ↔ 工具 ↔ 动作推理轨迹进行评分(Multimodal RewardBench 2,2025;Zebra-CoT)。每一步是一个有类型的块,按其类型内联渲染;标注者评判每一步的连贯性,即推理是否从图像和先前步骤中推导而来,还是视觉内容是幻觉

带被标记视觉幻觉的交错推理轨迹Rate each step of a text-image-tool reasoning trace for coherence and visual hallucination

yaml
annotation_schemes:
  - annotation_type: multimodal_reasoning
    name: reasoning_review
    description: "Judge each step: coherent reasoning and grounded visuals?"
    steps_key: steps
    type_key: type     # each step's 'type': text | image | tool | action (inferred if absent)
    verdict_options: [coherent, incoherent, visual_hallucination, uncertain]

每一步可携带 text/contentimage/image_url(+caption),或 tool/args。存储为一个 {index, step, type, verdict, notes} 列表。

表格网格结构(table_grid

标注一张表格图像的单元格结构,这是普通边界框无法捕捉的文档专属部分(OmniDocBench,CVPR 2025;RealHiTBench)。标注者设定网格尺寸并点击单元格以标记其角色(数据/列表头/行表头/空)。逐页的区域框已由对每一页运行图像标注所覆盖,因此该 schema 聚焦于那些框无法表达的结构。

将单元格标记为表头、数据和空的表格图像Annotate document-table cell structure: column and row headers, data, and empty cells

yaml
annotation_schemes:
  - annotation_type: table_grid
    name: structure
    description: "Set the grid size, then click cells to mark headers and empty cells."
    image_key: image           # per-instance table image URL / data-URI
    rows_key: rows             # optional initial dims from the data
    cols_key: cols
    roles: [data, col_header, row_header, empty]   # click cycles through these

存储为 {rows, cols, cells: {"r,c": role}},只保留非 data 的单元格。

相关内容

如需实现细节,请参阅源文档