AndroidWorld: Mobile Agent Task Evaluation
Evaluation of autonomous agents performing tasks in dynamic Android environments. Annotators assess task completion, identify interaction types and touch gestures, evaluate UI understanding, and describe mobile-specific issues across diverse Android apps.
Configuration File: config.yaml
# AndroidWorld: Mobile Agent Task Evaluation
# Based on "AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents" (Rawles et al., ICLR 2025)
# Task: Evaluate agent performance on Android mobile tasks across diverse apps
annotation_task_name: "AndroidWorld Mobile Agent Evaluation"
task_dir: "."
# Data configuration
data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "text"
# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Display layout with mobile screenshots and touch gesture trajectory
html_layout: |
  <div class="android-container" style="max-width: 900px; margin: 0 auto;">
    <div class="task-section" style="background: #e8f5e9; padding: 15px; border-radius: 8px; margin-bottom: 15px; border-left: 4px solid #388e3c;">
      <h3 style="margin-top: 0; color: #388e3c;">Task Instruction</h3>
      <div style="font-size: 16px; font-weight: bold;">{{text}}</div>
    </div>
    <div class="device-info" style="display: flex; gap: 15px; margin-bottom: 15px;">
      <div style="background: #e3f2fd; padding: 10px 15px; border-radius: 8px; flex: 1;">
        <strong>App:</strong> {{app_name}}
      </div>
      <div style="background: #fff3e0; padding: 10px 15px; border-radius: 8px; flex: 1;">
        <strong>Android Version:</strong> {{android_version}}
      </div>
    </div>
    <div class="screenshot-gallery" style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #78909c;">
      <h3 style="margin-top: 0; color: #37474f;">Mobile Screen Sequence</h3>
      <div style="white-space: pre-wrap; font-size: 13px; line-height: 1.8; font-family: monospace; background: #fff; padding: 12px; border-radius: 6px;">{{screenshots}}</div>
    </div>
    <div class="trajectory-section" style="background: #fce4ec; padding: 15px; border-radius: 8px; margin-bottom: 15px; border: 2px solid #c62828;">
      <h3 style="margin-top: 0; color: #c62828;">Agent Trajectory (with Touch Gestures)</h3>
      <div style="white-space: pre-wrap; font-family: monospace; font-size: 13px; line-height: 1.6;">{{trajectory}}</div>
    </div>
    <div class="final-state-section" style="background: #e8eaf6; padding: 15px; border-radius: 8px; border: 2px solid #3f51b5;">
      <h3 style="margin-top: 0; color: #3f51b5;">Final Screen State</h3>
      <div style="white-space: pre-wrap; font-size: 14px;">{{final_state}}</div>
    </div>
  </div>
# Annotation schemes
annotation_schemes:
  # Task completion assessment
  - name: "task_completion"
    description: "Did the agent successfully complete the mobile task?"
    annotation_type: radio
    labels:
      - "Complete — task fully achieved"
      - "Partial — some steps done"
      - "Failed — task not completed"
    keyboard_shortcuts:
      "Complete — task fully achieved": "1"
      "Partial — some steps done": "2"
      "Failed — task not completed": "3"
  # Interaction types used
  - name: "interaction_types"
    description: "Which interaction types did the agent use during the trajectory?"
    annotation_type: multiselect
    labels:
      - "Tap"
      - "Long Press"
      - "Swipe"
      - "Scroll"
      - "Type Text"
      - "Back Button"
      - "Home Button"
      - "App Switch"
      - "Notification Interaction"
  # UI understanding assessment
  - name: "ui_understanding"
    description: "Did the agent correctly understand and navigate the mobile UI?"
    annotation_type: radio
    labels:
      - "Yes — correctly understood mobile UI"
      - "Mostly — minor UI misunderstandings"
      - "Partially — several UI errors"
      - "No — fundamental UI confusion"
    keyboard_shortcuts:
      "Yes — correctly understood mobile UI": "q"
      "Mostly — minor UI misunderstandings": "w"
      "Partially — several UI errors": "e"
      "No — fundamental UI confusion": "r"
  # Mobile-specific issues
  - name: "mobile_issues"
    description: "Describe any mobile-specific issues encountered by the agent."
    annotation_type: text
    required: false
    placeholder: "Describe mobile-specific issues: gesture failures, small tap targets, orientation problems, notification handling errors..."
# User configuration
allow_all_users: true
# Task assignment
instances_per_annotator: 50
annotation_per_instance: 2
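
The layout in config.yaml pulls its content from `{{...}}` placeholders (`text`, `app_name`, `android_version`, `screenshots`, `trajectory`, `final_state`), so each data item presumably needs a key for every placeholder. The sketch below is one possible pre-flight check and is not part of Potato itself: `template_fields` and `missing_fields` are hypothetical helper names, and the regular expression assumes placeholders are plain `{{name}}` tokens.

```python
import re

# Assumption: placeholders look like {{name}}, optionally padded with spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_fields(layout):
    """Return the set of field names referenced as {{name}} in a layout string."""
    return set(PLACEHOLDER.findall(layout))


def missing_fields(items, layout, id_key="id"):
    """Map each item's id to the sorted list of layout fields it lacks.

    Items that provide every referenced field are left out of the report.
    """
    needed = template_fields(layout)
    report = {}
    for item in items:
        absent = sorted(needed - set(item))
        if absent:
            report[item.get(id_key, "<no id>")] = absent
    return report


if __name__ == "__main__":
    # Shortened stand-ins for html_layout and the data items.
    layout = "<div>{{text}}</div><div>{{app_name}}</div><div>{{final_state}}</div>"
    items = [
        {"id": "aw_001", "text": "Add a contact", "app_name": "Contacts", "final_state": "Saved"},
        {"id": "aw_002", "text": "Create an event", "app_name": "Google Calendar"},
    ]
    print(missing_fields(items, layout))
```

In practice the `layout` string would be the `html_layout` value from config.yaml and `items` the list loaded from sample-data.json with `json.load`. An empty report means every item covers every placeholder.
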
Sample Data: sample-data.json
[
{
"id": "aw_001",
"text": "Add a new contact named 'Dr. Emily Watson' with phone number 555-0142 and email emily.watson@hospital.org to the Contacts app.",
"app_name": "Contacts",
"android_version": "Android 13",
"screenshots": "Screen 1: Contacts app main view in portrait orientation. Top bar shows 'Contacts' title with search icon and three-dot menu. A floating action button (blue circle with '+' icon) is in the bottom-right corner. The contact list shows 12 existing contacts sorted alphabetically with circular avatar icons and names.\n\nScreen 2: New contact form is displayed. Fields visible from top to bottom: profile photo placeholder (camera icon in a circle), First name, Last name, Company, Phone (with phone icon), Email (with envelope icon). The keyboard is open at the bottom of the screen. 'First name' field is focused with a blinking cursor.\n\nScreen 3: Contact form partially filled. First name: 'Emily', Last name: 'Watson', Name prefix field added showing 'Dr.' Phone field shows '555-0142' with type label 'Mobile'. Email field shows 'emily.watson@hospital.org' with type label 'Work'. The 'Save' button (checkmark icon) is visible in the top-right corner.\n\nScreen 4: Contact detail view for 'Dr. Emily Watson'. The screen shows the contact card with name at top, phone number 555-0142 with call/message action buttons, and email emily.watson@hospital.org with email action button. A green banner at top briefly shows 'Contact saved'.",
"trajectory": "Step 1: [TAP] Floating action button ('+' icon) in bottom-right corner\nStep 2: [TAP] Name prefix field, [TYPE] 'Dr.'\nStep 3: [TAP] First name field, [TYPE] 'Emily'\nStep 4: [TAP] Last name field, [TYPE] 'Watson'\nStep 5: [SCROLL] Down to reveal phone field\nStep 6: [TAP] Phone field, [TYPE] '555-0142'\nStep 7: [TAP] Email field, [TYPE] 'emily.watson@hospital.org'\nStep 8: [TAP] Email type label, select 'Work' from dropdown\nStep 9: [TAP] Save button (checkmark icon) in top-right\nStep 10: STOP — Task complete",
"final_state": "Contact detail screen showing 'Dr. Emily Watson'. Phone: 555-0142 (Mobile). Email: emily.watson@hospital.org (Work). Contact successfully saved and visible in the contacts list. Save confirmation banner displayed."
},
{
"id": "aw_002",
"text": "Create a calendar event for 'Team Standup' recurring every weekday at 9:00 AM for 15 minutes in the Calendar app.",
"app_name": "Google Calendar",
"android_version": "Android 14",
"screenshots": "Screen 1: Google Calendar app in month view. The current month is displayed with colored dots indicating existing events on various dates. The floating action button ('+' icon) is in the bottom-right corner. The navigation drawer icon (hamburger menu) is in the top-left. The current date is highlighted with a blue circle.\n\nScreen 2: New event creation screen. Title field at top shows 'Team Standup' typed in. Below: date picker showing today's date, start time set to '9:00 AM', end time set to '9:15 AM'. 'Add people' field, 'Add location' field, 'Add notification' option, and 'More options' link are visible. The 'Does not repeat' dropdown is shown below the date.\n\nScreen 3: Recurrence picker dialog is open, overlaying the event creation screen. Options listed: 'Does not repeat', 'Every day', 'Every week', 'Every month', 'Every year', 'Custom...'. The agent has tapped 'Custom...'. A custom recurrence dialog shows: Repeat every 1 week, with checkboxes for days — Mon, Tue, Wed, Thu, Fri are all checked (highlighted in blue). Sat and Sun are unchecked.\n\nScreen 4: Back on the event creation screen. The recurrence field now shows 'Every weekday (Mon-Fri)'. All other fields are populated. The 'Save' button is visible in the top-right corner. The event summary reads: Team Standup, 9:00 AM - 9:15 AM, Every weekday.",
"trajectory": "Step 1: [TAP] Floating action button ('+') in bottom-right\nStep 2: [TAP] Title field, [TYPE] 'Team Standup'\nStep 3: [TAP] Start time field\nStep 4: [TAP] Hour selector to '9', minute to '00', AM selected\nStep 5: [TAP] End time field\nStep 6: [TAP] Hour selector to '9', minute to '15', AM selected\nStep 7: [TAP] 'Does not repeat' dropdown\nStep 8: [TAP] 'Custom...' option\nStep 9: [TAP] Checkboxes for Mon, Tue, Wed, Thu, Fri\nStep 10: [TAP] 'Done' to confirm recurrence\nStep 11: [TAP] 'Save' button in top-right\nStep 12: STOP — Task complete",
"final_state": "Google Calendar returns to month view. The newly created 'Team Standup' event appears on all weekdays. Event details: 9:00 AM - 9:15 AM, recurring every weekday (Mon-Fri). No end date set for the recurrence."
}
]
// ... and 6 more items

Get This Design
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/agentic/androidworld-mobile-agent-eval
potato start config.yaml
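
Because config.yaml sets `annotation_per_instance: 2`, every item ends up with two `task_completion` labels, and their agreement can be measured once annotation is done. The following is a minimal sketch of Cohen's kappa. It assumes the two labels per item have already been extracted into two aligned lists and can be treated as coming from two consistent raters. The layout of the files under `annotation_output/` is not shown here, so loading is left out, and the short labels are stand-ins for the full label strings.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two aligned label sequences of equal, non-zero length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical task_completion labels for four items (shortened for readability).
    rater_a = ["Complete", "Complete", "Partial", "Failed"]
    rater_b = ["Complete", "Partial", "Partial", "Failed"]
    print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.636
```

If items are spread across many annotators so that the two labels do not come from a fixed pair, a multi-rater statistic such as Krippendorff's alpha may be a better fit than Cohen's kappa.
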
Related Designs
OSWorld: Desktop Agent Task Evaluation
Evaluation of multimodal agents performing open-ended tasks in real desktop environments. Annotators assess task success, identify OS-level actions, rate efficiency, and analyze failures across Ubuntu, Windows, and macOS environments.
VisualWebArena: Visual Web Agent Evaluation
Evaluation of multimodal web agents on visually grounded web tasks. Annotators assess task completion, visual grounding accuracy, and visual reasoning capabilities by reviewing screenshot sequences and agent trajectories.
IFEval: Instruction-Following Evaluation for LLMs
Evaluate how well large language models follow verifiable instructions with specific constraints such as word count limits, formatting requirements, keyword inclusion, and structural rules. Annotators assess pass/fail per constraint and overall response quality.