intermediatepreference
OpenAssistant Conversation Quality
Rate AI assistant responses across multiple quality dimensions. Evaluate conversations for the OpenAssistant crowdsourced dataset.
Configuration Fileconfig.yaml
# OpenAssistant Conversation Quality Configuration
# Based on Köpf et al., NeurIPS 2023
# Task: Rate conversation quality across multiple dimensions
annotation_task_name: "OpenAssistant Conversation Quality"
task_dir: "."
data_files:
- data.json
item_properties:
id_key: "id"
text_key: "text"
output_annotation_dir: "annotation_output/"
output_annotation_format: "json"
annotation_schemes:
- name: "quality"
description: "Overall QUALITY of this response (0=worst, 4=best):"
annotation_type: likert
min_label: "0 - Very poor"
max_label: "4 - Excellent"
size: 5
- name: "helpfulness"
description: "How HELPFUL is this response? (0=not helpful, 4=very helpful)"
annotation_type: likert
min_label: "0 - Not helpful"
max_label: "4 - Very helpful"
size: 5
- name: "creativity"
description: "How CREATIVE is this response? (0=generic, 4=highly creative)"
annotation_type: likert
min_label: "0 - Generic"
max_label: "4 - Highly creative"
size: 5
- name: "humor"
description: "If applicable, how HUMOROUS is this response? (0=not funny, 4=very funny)"
annotation_type: likert
min_label: "0 - Not funny"
max_label: "4 - Very funny"
size: 5
- name: "truthfulness"
description: "How TRUTHFUL is this response?"
annotation_type: radio
labels:
- "Completely truthful"
- "Mostly truthful"
- "Partially truthful"
- "Mostly false"
- "Completely false"
- "Cannot assess"
- name: "harmlessness"
description: "Is this response HARMLESS?"
annotation_type: radio
labels:
- "Completely harmless"
- "Mostly harmless"
- "Potentially problematic"
- "Harmful"
- "Very harmful"
- name: "spam"
description: "Is this response SPAM or low-effort?"
annotation_type: radio
labels:
- "Not spam - genuine response"
- "Low effort but not spam"
- "Likely spam"
- "Definite spam"
- name: "lang_mismatch"
description: "Is the response in the WRONG LANGUAGE?"
annotation_type: radio
labels:
- "Correct language"
- "Wrong language"
- "Mixed languages"
- name: "feedback"
description: "Optional feedback for the response author:"
annotation_type: text
allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3
annotation_instructions: |
## OpenAssistant Conversation Quality
Rate AI assistant responses across multiple quality dimensions.
### Rating Scales (0-4):
**Quality**: Overall response quality
- 4: Excellent - couldn't be better
- 3: Good - minor issues only
- 2: Acceptable - noticeable issues
- 1: Poor - significant problems
- 0: Very poor - fails completely
**Helpfulness**: Does it help the user?
- Consider the user's apparent goal
- Is information actionable?
- Does it address the question?
**Creativity**: Is it creative/original?
- Only rate if relevant (skip for factual Q&A)
- Novel approaches or perspectives
- Engaging presentation
**Humor**: Is it funny?
- Only rate if humor was attempted
- N/A for serious responses
### Quality Flags:
**Truthfulness**: Is the content accurate?
- "Cannot assess" for opinions or unknowable things
**Harmlessness**: Could this cause harm?
- Consider advice quality
- Potential for misuse
- Respectfulness
**Spam**: Is this low-quality filler?
- Nonsense or irrelevant text
- Copied content
- No genuine attempt to help
### Guidelines:
- Rate based on the specific context
- A simple question needs a simple answer
- Don't penalize brevity if complete
Sample Datasample-data.json
[
{
"id": "oasst_001",
"conversation": [
{
"role": "user",
"content": "What's a good way to learn a new programming language?"
},
{
"role": "assistant",
"content": "Here are some effective strategies for learning a new programming language:\n\n1. **Start with the basics**: Learn syntax, data types, and control structures through tutorials or documentation.\n\n2. **Practice with small projects**: Build simple programs like calculators, to-do lists, or games.\n\n3. **Read others' code**: Study open-source projects to see real-world patterns.\n\n4. **Use interactive platforms**: Sites like Codecademy, LeetCode, or Exercism offer hands-on practice.\n\n5. **Build something you care about**: Personal projects keep you motivated.\n\n6. **Join a community**: Forums, Discord servers, and local meetups provide support and feedback.\n\nThe key is consistent practice - even 30 minutes daily adds up quickly!"
}
]
},
{
"id": "oasst_002",
"conversation": [
{
"role": "user",
"content": "Tell me a joke about computers"
},
{
"role": "assistant",
"content": "Why do programmers prefer dark mode?\n\nBecause light attracts bugs! 🐛"
}
]
}
]Get This Design
View on GitHub
Clone or download from the repository
Quick start:
git clone https://github.com/davidjurgens/potato-showcase.git cd potato-showcase/preference-learning/oasst-conversation-quality potato start config.yaml
Details
Annotation Types
likertradiotext
Domain
Conversational AINatural Language Processing
Use Cases
RLHFQuality RatingConversation Evaluation
Tags
preferencequalityconversationassistantopen-sourcecrowdsourced
Found an issue or want to improve this design?
Open an IssueRelated Designs
Constitutional AI Harmlessness Evaluation
Evaluate AI assistant responses for harmlessness and helpfulness based on the Constitutional AI framework by Anthropic. Annotators rate responses on a harmfulness scale, assess helpfulness, and provide explanations for their judgments.
radiolikert
AlpacaFarm Preference Simulation
Simulate human preferences for instruction-following responses. Create preference data for efficient RLHF research and LLM evaluation.
likertradio
Audio Transcription Review
Review and correct automatic speech recognition transcriptions with waveform visualization.
likertmultiselect