OSWorld: Desktop Agent Task Evaluation
Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.
Configuration File: config.yaml
# OSWorld: Desktop Agent Task Evaluation
# Based on "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments" (Xie et al., NeurIPS 2024)
# Task: Evaluate agent performance on desktop tasks across Ubuntu, Windows, and macOS
annotation_task_name: "OSWorld Desktop Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout with desktop screenshots and multi-app trajectory
html_layout: |
  <div class="osworld-container" style="max-width: 900px; margin: 0 auto;">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px; border-left: 4px solid #2e7d32;">
      <h3 style="margin-top: 0; color: #2e7d32;">Task Instruction</h3>
      <div style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="env-info" style="display: flex; gap: 15px; margin-bottom: 15px;">
      <div style="background: #e3f2fd; padding: 10px 15px; border-radius: 8px; flex: 1;">
        <strong>OS:</strong> {{os_type}}
      </div>
      <div style="background: #fce4ec; padding: 10px 15px; border-radius: 8px; flex: 1;">
        <strong>Applications:</strong> {{applications}}
      </div>
    </div>
    <div class="screenshot-gallery" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #78909c;">
      <h3 style="margin-top: 0; color: #37474f;">Desktop State Sequence</h3>
      <div style="white-space: pre-wrap; font-size: 13px; line-height: 1.8; font-family: monospace; background: #fff; padding: 12px; border-radius: 6px;">{{screenshots}}</div>
    </div>
    <div class="trajectory-section" style="background: #fff3e0; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #ef6c00;">
      <h3 style="margin-top: 0; color: #e65100;">Agent Trajectory</h3>
      <div style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #ede7f6; padding: 15px; border-radius: 8px; border: 2px solid #5e35b1;">
      <h3 style="margin-top: 0; color: #5e35b1;">Final State</h3>
      <div style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Task success assessment
  - name: "task_success"
    description: "Did the agent successfully complete the desktop task?"
    annotation_type: radio
    labels:
      - "Success — task completed correctly"
      - "Partial — some steps completed"
      - "Failed — task not completed"
    keyboard_shortcuts:
      "Success — task completed correctly": "1"
      "Partial — some steps completed": "2"
      "Failed — task not completed": "3"
  # OS action types used
  - name: "os_actions"
    description: "Which OS-level actions did the agent use during the trajectory?"
    annotation_type: multiselect
    labels:
      - "Mouse Click"
      - "Keyboard Input"
      - "Scroll"
      - "Keyboard Shortcut"
      - "Drag and Drop"
      - "File Operation"
      - "Application Switch"
      - "Terminal Command"
      - "Menu Navigation"
  # Efficiency rating
  - name: "efficiency"
    description: "How efficient was the agent's approach to completing the task?"
    annotation_type: likert
    min_label: "Very Inefficient"
    max_label: "Optimal"
    size: 5
  # Failure analysis
  - name: "failure_analysis"
    description: "If the task was not fully completed, describe what went wrong."
    annotation_type: text
    required: false
    placeholder: "Describe what went wrong: wrong application used, incorrect sequence, missed steps, UI misunderstanding..."
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
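The html_layout above references six item fields through {{...}} placeholders (text, os_type, applications, screenshots, trajectory, final_state), so every record in the data file needs those keys for the display to be complete. The following is a minimal sketch of a pre-flight check, not part of Potato itself: the helper names `placeholders` and `missing_fields` are hypothetical, the layout string is reduced to its placeholders, and the example item uses shortened values.

```python
import re

# Matches {{name}} placeholders in the form used by html_layout.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def placeholders(layout: str) -> list[str]:
    """Return placeholder names in order of first appearance, without repeats."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(layout):
        if name not in seen:
            seen.append(name)
    return seen


def missing_fields(layout: str, item: dict) -> list[str]:
    """Return the placeholder names that have no matching key in the item."""
    return [name for name in placeholders(layout) if name not in item]


# Layout reduced to its placeholders (the surrounding HTML is omitted here).
LAYOUT = (
    "{{text}} {{os_type}} {{applications}} "
    "{{screenshots}} {{trajectory}} {{final_state}}"
)

# An item with shortened values and three fields deliberately left out.
incomplete_item = {
    "id": "osw_001",
    "text": "Create a new spreadsheet in LibreOffice Calc ...",
    "os_type": "Ubuntu 22.04",
    "applications": "LibreOffice Calc, Files",
}

print(missing_fields(LAYOUT, incomplete_item))
# ['screenshots', 'trajectory', 'final_state']
```

In practice the layout string and the items would be read from config.yaml and sample-data.json; an empty list for every item suggests the data and the layout agree.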
Sample Data: sample-data.json
[
{
"id": "osw_001",
"text": "Create a new spreadsheet in LibreOffice Calc with a monthly budget template, add formulas for totals, and save it as 'budget_2024.xlsx' on the Desktop.",
"os_type": "Ubuntu 22.04",
"applications": "LibreOffice Calc, Files",
"screenshots": "Screenshot 1: Ubuntu desktop with GNOME shell. Taskbar at top shows Activities, clock, system tray. Desktop is clean with a wallpaper showing the Ubuntu Jammy Jellyfish default background. The Files icon and LibreOffice Calc icon are visible in the dock on the left side.\n\nScreenshot 2: LibreOffice Calc is open with a new blank spreadsheet. The agent has typed headers in row 1: A1='Category', B1='January', C1='February', D1='March', E1='Total'. Rows 2-7 contain budget categories: Rent, Groceries, Utilities, Transport, Entertainment, Savings. Cell styling shows bold headers with a light blue background.\n\nScreenshot 3: The spreadsheet now has sample values filled in for January through March. Column E contains SUM formulas (visible in the formula bar: =SUM(B2:D2)). Row 8 shows 'Monthly Total' with SUM formulas for each column. The cell E8 shows a grand total. Number formatting shows dollar signs.\n\nScreenshot 4: Save As dialog is open. The file name field shows 'budget_2024'. The format dropdown shows 'Microsoft Excel (.xlsx)'. The location breadcrumb shows Desktop. The Save button is highlighted.",
"trajectory": "Step 1: Click LibreOffice Calc icon in dock\nStep 2: Type 'Category' in cell A1, press Tab\nStep 3: Type month headers 'January', 'February', 'March', 'Total' in B1:E1\nStep 4: Type budget categories in A2:A7 (Rent, Groceries, Utilities, Transport, Entertainment, Savings)\nStep 5: Enter sample budget values in B2:D7\nStep 6: Click cell E2, type =SUM(B2:D2), press Enter\nStep 7: Drag fill handle from E2 down to E7 to copy formula\nStep 8: Type 'Monthly Total' in A8\nStep 9: Enter =SUM(B2:B7) in B8, drag fill to E8\nStep 10: Select headers row, apply bold formatting and background color\nStep 11: Press Ctrl+Shift+S to open Save As dialog\nStep 12: Navigate to Desktop in file browser\nStep 13: Type 'budget_2024' in filename field\nStep 14: Select 'Microsoft Excel (.xlsx)' from format dropdown\nStep 15: Click Save\nStep 16: STOP — Task complete",
"final_state": "File 'budget_2024.xlsx' saved on Desktop. LibreOffice Calc shows the completed budget spreadsheet with category labels, three months of data, SUM formulas in Total column and Monthly Total row, and formatted headers. Desktop file manager confirms the file exists at ~/Desktop/budget_2024.xlsx."
},
{
"id": "osw_002",
"text": "Download an image from Firefox, open it in GIMP, resize it to 800x600, and export it as a PNG to the Documents folder.",
"os_type": "Ubuntu 22.04",
"applications": "Firefox, GIMP, Files",
"screenshots": "Screenshot 1: Firefox browser open showing an image search results page. Multiple thumbnail images are displayed in a grid layout. The agent is right-clicking on a landscape photograph showing a mountain lake at sunset.\n\nScreenshot 2: Firefox right-click context menu is visible over the image. Menu options include 'Open Image in New Tab', 'Save Image As...', 'Copy Image', 'Copy Image Link'. The cursor is hovering over 'Save Image As...'.\n\nScreenshot 3: GIMP is open with the downloaded image loaded. The image shows the mountain lake photograph at its original resolution (2400x1800 visible in the title bar). The 'Scale Image' dialog is open showing Width: 800, Height: 600, with the chain link icon (maintain aspect ratio) unlocked. Interpolation set to 'Cubic'.\n\nScreenshot 4: GIMP 'Export Image As' dialog is open. The filename field shows 'mountain_lake.png'. The file browser is navigated to the Documents folder (/home/user/Documents). PNG export options are visible below with compression level slider.",
"trajectory": "Step 1: Click Firefox icon in dock\nStep 2: Navigate to image search page\nStep 3: Right-click on mountain lake photograph\nStep 4: Click 'Save Image As...' from context menu\nStep 5: Save to Downloads folder as 'mountain_lake.jpg'\nStep 6: Open GIMP from Applications menu\nStep 7: File > Open, navigate to Downloads, select 'mountain_lake.jpg'\nStep 8: Image > Scale Image from menu bar\nStep 9: Unlock aspect ratio chain link\nStep 10: Set width to 800, height to 600\nStep 11: Click 'Scale' button\nStep 12: File > Export As\nStep 13: Navigate to Documents folder\nStep 14: Change filename to 'mountain_lake.png'\nStep 15: Click 'Export'\nStep 16: Click 'Export' in PNG options dialog\nStep 17: STOP — Task complete",
"final_state": "Image exported as 'mountain_lake.png' in Documents folder. GIMP shows the resized image at 800x600 pixels. File manager confirms /home/user/Documents/mountain_lake.png exists with the correct dimensions."
}
]
// ... and 6 more items
Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/osworld-desktop-agent-eval
potato start config.yaml
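Once the repository is cloned, the sample items can also be inspected outside the annotation interface. The sketch below splits a trajectory string into numbered steps, assuming every step follows the "Step N: description" line format seen in the sample items. The names `parse_trajectory` and `numbering_is_consecutive` are hypothetical helpers, not part of Potato or OSWorld.

```python
import re

# One step per line, in the "Step N: description" form used by the sample data.
STEP_LINE = re.compile(r"^Step (\d+): (.+)$", re.MULTILINE)


def parse_trajectory(trajectory: str) -> list[tuple[int, str]]:
    """Split a trajectory string into (step_number, description) pairs."""
    return [(int(num), desc.strip()) for num, desc in STEP_LINE.findall(trajectory)]


def numbering_is_consecutive(steps: list[tuple[int, str]]) -> bool:
    """Check that step numbers run 1, 2, 3, ... with no gaps or repeats."""
    return [num for num, _ in steps] == list(range(1, len(steps) + 1))


# First three steps of the osw_002 trajectory from the sample data.
example = (
    "Step 1: Click Firefox icon in dock\n"
    "Step 2: Navigate to image search page\n"
    "Step 3: Right-click on mountain lake photograph"
)

steps = parse_trajectory(example)
print(len(steps))                       # 3
print(steps[0])                         # (1, 'Click Firefox icon in dock')
print(numbering_is_consecutive(steps))  # True
```

To run this over the whole file, the items could be loaded with `json.load` and each item's "trajectory" value passed to `parse_trajectory`. A gap in the numbering may point to a truncated or hand-edited trajectory worth checking before annotation.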
Related Designs
AgentBoard Progress Scoring
Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.
RefactorBench Multi-File Evaluation
Evaluate multi-file refactoring operations generated by coding agents. Annotators assess whether refactorings preserve behavior, identify the types of refactoring applied, rate code improvement, and provide detailed review comments.
SayCan - Robot Task Planning Evaluation
Evaluate robot action plans generated from natural language instructions, based on the SayCan framework (Ahn et al., CoRL 2022). Annotators assess feasibility, identify primitive actions, describe plans, and rate safety of grounded language-conditioned robot manipulation tasks.