VSTAR Video-grounded Dialogue
Video-grounded dialogue annotation. Annotators watch videos and answer questions requiring situated understanding, write dialogue turns grounded in specific video moments, and mark relevant temporal segments.
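Temporal segment marks are, at heart, `[start, end]` intervals in seconds. When post-processing annotations it is often useful to measure how much a marked segment overlaps the question's reference window. A minimal sketch (the tuple representation is an illustrative assumption, not the tool's output format):

```python
def overlap_seconds(a, b):
    """Length, in seconds, of the intersection of two [start, end] intervals."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0.0, end - start)

# Hypothetical marked segment vs. a question's reference window
# (e.g. timestamp_start=15, timestamp_end=45 as in the sample data below).
segment = (20.0, 40.0)
window = (15.0, 45.0)
print(overlap_seconds(segment, window))  # -> 20.0
```

Disjoint intervals yield `0.0`, so the helper can double as an "is this segment relevant to the window at all" check.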
Configuration File: `config.yaml`
# VSTAR Video-grounded Dialogue Configuration
# Based on Wang et al., ACL 2023
# Task: Answer questions and write dialogue grounded in specific video moments

annotation_task_name: "VSTAR Video-grounded Dialogue"
task_dir: "."

# Data configuration
data_files:
  - sample-data.json

item_properties:
  id_key: "id"
  text_key: "video_url"

# Output
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
# Annotation schemes
annotation_schemes:
  # Temporal grounding - mark relevant video segments
  - name: "temporal_grounding"
    description: |
      Mark the temporal segment(s) in the video that are relevant to answering
      the question. These segments should contain the visual evidence needed
      for the answer.
    annotation_type: "video_annotation"
    mode: "segment"
    labels:
      - name: "relevant_segment"
        color: "#3B82F6"
        key_value: "r"
      - name: "supporting_context"
        color: "#22C55E"
        key_value: "s"
    frame_stepping: true
    show_timecode: true
    playback_rate_control: true
    zoom_enabled: true
    timeline_height: 80
  # Answer / dialogue response
  - name: "answer"
    description: |
      Write your answer to the question based on what you observe in the video.
      Ground your answer in specific visual evidence from the marked segments.
      Be detailed and specific about what you see.
    annotation_type: "text"
    min_length: 15
    max_length: 500
    placeholder: "e.g., 'The person in the blue shirt picks up the book from the table and hands it to the woman standing by the door...'"
  # Answer type classification
  - name: "answer_type"
    description: |
      Classify the type of reasoning required to answer this question.
      Consider what kind of understanding is needed.
    annotation_type: "radio"
    labels:
      - name: "Factual"
        tooltip: "Answer is directly observable in the video (e.g., 'What color is the car?')"
        key_value: "f"
      - name: "Inferential"
        tooltip: "Answer requires inference from visual cues (e.g., 'Why did the person leave?')"
        key_value: "i"
      - name: "Predictive"
        tooltip: "Answer requires predicting what happens next (e.g., 'What will the person do?')"
        key_value: "p"
  # Confidence assessment
  - name: "confidence"
    description: "How confident are you in your answer?"
    annotation_type: "radio"
    labels:
      - name: "Very Confident"
        tooltip: "The answer is clearly supported by the video"
        key_value: "1"
      - name: "Somewhat Confident"
        tooltip: "The answer is likely correct but some ambiguity exists"
        key_value: "2"
      - name: "Not Confident"
        tooltip: "The answer is a best guess; the video does not clearly support it"
        key_value: "3"
  # Dialogue continuation
  - name: "follow_up_question"
    description: |
      Write a natural follow-up question that could continue the dialogue
      about this video. The question should require watching a different
      part of the video or reasoning about what was discussed.
    annotation_type: "text"
    min_length: 10
    max_length: 200
    placeholder: "e.g., 'What happens after the person leaves the room?'"

# User configuration
allow_all_users: true

# Task assignment
instances_per_annotator: 25
annotation_per_instance: 3
# Instructions
annotation_instructions: |
  ## VSTAR Video-grounded Dialogue Task

  Your goal is to answer questions about videos and write grounded dialogue responses.

  ### Step 1: Watch the Video
  - Watch the entire video to understand the full context
  - Read the question and any dialogue context provided

  ### Step 2: Mark Relevant Segments
  - Identify the video segment(s) that contain evidence for your answer
  - Mark primary evidence as "relevant_segment" (blue)
  - Mark additional context as "supporting_context" (green)
  - Be precise with segment boundaries

  ### Step 3: Write Your Answer
  - Answer the question based on what you observe in the video
  - Reference specific visual details (people, objects, actions, locations)
  - Be descriptive but concise
  - Ground your answer in the marked segments

  ### Step 4: Classify Answer Type
  - **Factual (f)**: Answer is directly observable in the video
  - **Inferential (i)**: Answer requires reasoning beyond what is shown
  - **Predictive (p)**: Answer involves predicting future events

  ### Step 5: Write a Follow-up Question
  - Write a natural question that would continue the conversation
  - The question should require watching the video to answer
  - Avoid yes/no questions; ask for descriptive responses

  ### Tips
  - Read the dialogue context carefully; it provides important background
  - Some questions require understanding events across multiple time points
  - If unsure, mark your confidence accordingly
  - Follow-up questions should be genuinely interesting and answerable from the video
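Before launching the task, each item in the data file can be checked against the keys the configuration expects (`id` and `video_url`, per `item_properties`) and for a sane timestamp window. A minimal sketch in plain Python (treating the timestamp fields from the sample data as required is an assumption):

```python
import json

REQUIRED_KEYS = ("id", "video_url")  # id_key and text_key from item_properties

def validate_items(items):
    """Return (item_id, problem) pairs; an empty list means every item passed."""
    problems = []
    for item in items:
        item_id = item.get("id", "<missing id>")
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append((item_id, f"missing required key '{key}'"))
        start, end = item.get("timestamp_start"), item.get("timestamp_end")
        if start is not None and end is not None and start >= end:
            problems.append((item_id, "timestamp_start must precede timestamp_end"))
    return problems

# First item from sample-data.json, inlined so the check is self-contained.
items = json.loads("""[
  {"id": "vstar_001",
   "video_url": "https://example.com/videos/kitchen_cooking_001.mp4",
   "timestamp_start": 15, "timestamp_end": 45}
]""")
print(validate_items(items))  # -> []
```

In practice you would load `sample-data.json` from disk instead of the inlined string; running the check before `potato start` catches malformed items early.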
Sample Data: `sample-data.json`
[
  {
    "id": "vstar_001",
    "video_url": "https://example.com/videos/kitchen_cooking_001.mp4",
    "question": "What does the person do after placing the vegetables on the cutting board?",
    "dialogue_context": "Speaker A: It looks like they are preparing a meal. Speaker B: Yes, they just washed some vegetables.",
    "timestamp_start": 15,
    "timestamp_end": 45
  },
  {
    "id": "vstar_002",
    "video_url": "https://example.com/videos/office_meeting_001.mp4",
    "question": "Why does the woman standing by the whiteboard look surprised?",
    "dialogue_context": "Speaker A: The team seems to be discussing quarterly results. Speaker B: The presenter just revealed some numbers.",
    "timestamp_start": 30,
    "timestamp_end": 60
  }
]
// ... and 8 more items

Get This Design
Clone or download from the repository.

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/video/vstar-video-dialogue
potato start config.yaml
Found an issue or want to improve this design? Open an Issue.

Related Designs
Ego4D: Egocentric Video Episodic Memory Annotation
Annotate egocentric (first-person) video for episodic memory tasks including activity segmentation, hand state tracking, natural language query generation, and scene narration. Supports temporal segment annotation with multiple label tiers for the Ego4D benchmark.
HowTo100M Instructional Video Annotation
Annotate instructional video clips with step descriptions and visual grounding. Link narrated instructions to visual actions for video-language understanding.
MVBench Video Understanding
Comprehensive video understanding benchmark with multiple-choice questions, video segment annotation, and reasoning, based on MVBench (Li et al., arXiv 2023). Tests temporal perception, action recognition, and state change detection in videos.