WavCaps - Audio Captioning

Audio captioning: write natural language descriptions of audio content. Annotators listen to each audio clip and write a detailed caption describing all sounds, events, and acoustic scenes (Mei et al., IEEE TASLP 2024).

Configuration file: config.yaml

# WavCaps - Audio Captioning
# Based on Mei et al., IEEE TASLP 2024
# Paper: https://ieeexplore.ieee.org/document/10637816
# Dataset: https://github.com/XinhaoMei/WavCaps
#
# Task: Write natural language descriptions of audio content.
# Listen to audio clips and write detailed captions describing all sounds,
# events, and acoustic scenes.
#
# Guidelines:
# - Listen to the full audio clip before writing a caption
# - Describe all notable sounds, events, and background ambience
# - Use clear, concise natural language
# - Include temporal information if relevant (e.g., "A dog barks, then a door slams")
# - List individual sound events separately in the sound events field

annotation_task_name: "WavCaps: Audio Captioning"
task_dir: "."

data_files:
  - sample-data.json
item_properties:
  id_key: "id"
  text_key: "audio_url"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - annotation_type: text
    name: caption
    description: "Write a detailed natural language caption describing the audio content. Include all sounds, events, and acoustic scenes you hear."
    min_length: 10
    max_length: 500
    placeholder: "Describe what you hear in this audio clip..."

  - annotation_type: text
    name: sound_events
    description: "List the individual sound events heard in the clip, separated by commas (e.g., 'dog barking, car engine, wind blowing')"
    min_length: 3
    max_length: 300
    placeholder: "List sound events separated by commas..."

audio_display:
  show_waveform: true
  playback_controls: true
  allow_speed_control: true

allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
allow_skip: true
skip_reason_required: false
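
The two text schemes above set minimum and maximum lengths, and the sound_events field expects a comma-separated list. The sketch below shows one way to check submitted answers against those limits after export. It is not part of Potato: the helper names are hypothetical, and it assumes that lengths are counted in characters and that both bounds are inclusive, which the configuration does not state.

```python
# Hypothetical post-processing helpers; not part of Potato itself.
# Assumption: min_length / max_length count characters, bounds inclusive.
CAPTION_LIMITS = (10, 500)      # from the "caption" scheme
SOUND_EVENT_LIMITS = (3, 300)   # from the "sound_events" scheme


def within_limits(text, limits):
    """Return True if the stripped text length lies inside (min, max)."""
    lowest, highest = limits
    return lowest <= len(text.strip()) <= highest


def split_sound_events(field):
    """Split a comma-separated sound_events answer into a clean list."""
    return [part.strip() for part in field.split(",") if part.strip()]


caption = "A dog barks, then a door slams"
events = "dog barking, door slamming"

print(within_limits(caption, CAPTION_LIMITS))
print(within_limits(events, SOUND_EVENT_LIMITS))
print(split_sound_events(events))
```

Splitting on commas and dropping empty parts keeps a trailing comma or a doubled comma from producing blank sound events.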

Sample data: sample-data.json

[
  {
    "id": "wavcaps_001",
    "audio_url": "https://example.com/audio/wavcaps/urban_street_001.wav",
    "duration": 10,
    "source": "Freesound"
  },
  {
    "id": "wavcaps_002",
    "audio_url": "https://example.com/audio/wavcaps/kitchen_cooking_001.wav",
    "duration": 8.5,
    "source": "AudioSet"
  }
]

// ... and 8 more items
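
Before starting a task, it can help to confirm that every item carries the fields named under item_properties in the configuration (id_key "id" and text_key "audio_url"). The following is a minimal sketch of such a check, using only the two items shown above. The function name check_items is hypothetical, and the check itself is an illustration rather than something Potato is known to perform.

```python
import json

# Field names taken from item_properties in config.yaml.
ID_KEY = "id"
AUDIO_KEY = "audio_url"

# The two items shown in the sample; the remaining items are not reproduced.
SAMPLE = """
[
  {"id": "wavcaps_001",
   "audio_url": "https://example.com/audio/wavcaps/urban_street_001.wav",
   "duration": 10, "source": "Freesound"},
  {"id": "wavcaps_002",
   "audio_url": "https://example.com/audio/wavcaps/kitchen_cooking_001.wav",
   "duration": 8.5, "source": "AudioSet"}
]
"""


def check_items(items):
    """Return a list of problems: missing required keys or duplicate ids."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        for key in (ID_KEY, AUDIO_KEY):
            if not item.get(key):
                problems.append(f"item {index}: missing {key}")
        item_id = item.get(ID_KEY)
        if item_id is not None:
            if item_id in seen_ids:
                problems.append(f"item {index}: duplicate id {item_id}")
            seen_ids.add(item_id)
    return problems


print(check_items(json.loads(SAMPLE)))
```

To check the real file, replace the embedded string with the contents of sample-data.json read from disk.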

Get this design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/audio/wavcaps-audio-captioning
potato start config.yaml

Details

Annotation types

text

Domain

Audio Understanding, Captioning

Use cases

Audio Captioning, Sound Event Description, Audio-Language Research

Tags

audio, captioning, sound-events, audio-language, environmental-audio, taslp2024

Related designs

Audio Transcription Review

Review and correct automatic speech recognition transcriptions with waveform visualization.

likert, multiselect

Clotho Audio Captioning

Audio captioning and quality assessment based on the Clotho dataset (Drossos et al., ICASSP 2020). Annotators write natural language captions for audio clips, rate caption accuracy on a Likert scale, and classify the audio environment.

text, likert

CoVoST 2 - Speech Translation Evaluation

Speech translation quality evaluation based on the CoVoST 2 dataset (Wang et al., arXiv 2020). Annotators listen to source audio, review translations, label audio segments, and rate overall translation quality.

text, radio