Showcase/OpenAssistant Conversation Quality

intermediatepreference

OpenAssistant Conversation Quality

Rate AI assistant responses across multiple quality dimensions. Evaluate conversations for the OpenAssistant crowdsourced dataset.

Configuration Fileconfig.yaml

This Potato config reproduces the annotation task. Save it as config.yaml and run potato start config.yaml to try it.

yaml

# OpenAssistant Conversation Quality Configuration
# Based on Köpf et al., NeurIPS 2023
# Task: Rate conversation quality across multiple dimensions

annotation_task_name: "OpenAssistant Conversation Quality"
task_dir: "."

data_files:
  - data.json
item_properties:
  id_key: "id"
  text_key: "text"

output_annotation_dir: "annotation_output/"
output_annotation_format: "json"

annotation_schemes:
  - name: "quality"
    description: "Overall QUALITY of this response (0=worst, 4=best):"
    annotation_type: likert
    min_label: "0 - Very poor"
    max_label: "4 - Excellent"
    size: 5

  - name: "helpfulness"
    description: "How HELPFUL is this response? (0=not helpful, 4=very helpful)"
    annotation_type: likert
    min_label: "0 - Not helpful"
    max_label: "4 - Very helpful"
    size: 5

  - name: "creativity"
    description: "How CREATIVE is this response? (0=generic, 4=highly creative)"
    annotation_type: likert
    min_label: "0 - Generic"
    max_label: "4 - Highly creative"
    size: 5

  - name: "humor"
    description: "If applicable, how HUMOROUS is this response? (0=not funny, 4=very funny)"
    annotation_type: likert
    min_label: "0 - Not funny"
    max_label: "4 - Very funny"
    size: 5

  - name: "truthfulness"
    description: "How TRUTHFUL is this response?"
    annotation_type: radio
    labels:
      - "Completely truthful"
      - "Mostly truthful"
      - "Partially truthful"
      - "Mostly false"
      - "Completely false"
      - "Cannot assess"

  - name: "harmlessness"
    description: "Is this response HARMLESS?"
    annotation_type: radio
    labels:
      - "Completely harmless"
      - "Mostly harmless"
      - "Potentially problematic"
      - "Harmful"
      - "Very harmful"

  - name: "spam"
    description: "Is this response SPAM or low-effort?"
    annotation_type: radio
    labels:
      - "Not spam - genuine response"
      - "Low effort but not spam"
      - "Likely spam"
      - "Definite spam"

  - name: "lang_mismatch"
    description: "Is the response in the WRONG LANGUAGE?"
    annotation_type: radio
    labels:
      - "Correct language"
      - "Wrong language"
      - "Mixed languages"

  - name: "feedback"
    description: "Optional feedback for the response author:"
    annotation_type: text

allow_all_users: true
instances_per_annotator: 100
annotation_per_instance: 3

annotation_instructions: |
  ## OpenAssistant Conversation Quality

  Rate AI assistant responses across multiple quality dimensions.

  ### Rating Scales (0-4):

  **Quality**: Overall response quality
  - 4: Excellent - couldn't be better
  - 3: Good - minor issues only
  - 2: Acceptable - noticeable issues
  - 1: Poor - significant problems
  - 0: Very poor - fails completely

  **Helpfulness**: Does it help the user?
  - Consider the user's apparent goal
  - Is information actionable?
  - Does it address the question?

  **Creativity**: Is it creative/original?
  - Only rate if relevant (skip for factual Q&A)
  - Novel approaches or perspectives
  - Engaging presentation

  **Humor**: Is it funny?
  - Only rate if humor was attempted
  - N/A for serious responses

  ### Quality Flags:

  **Truthfulness**: Is the content accurate?
  - "Cannot assess" for opinions or unknowable things

  **Harmlessness**: Could this cause harm?
  - Consider advice quality
  - Potential for misuse
  - Respectfulness

  **Spam**: Is this low-quality filler?
  - Nonsense or irrelevant text
  - Copied content
  - No genuine attempt to help

  ### Guidelines:
  - Rate based on the specific context
  - A simple question needs a simple answer
  - Don't penalize brevity if complete

Sample Datasample-data.json

json

[
  {
    "id": "oasst_001",
    "conversation": [
      {
        "role": "user",
        "content": "What's a good way to learn a new programming language?"
      },
      {
        "role": "assistant",
        "content": "Here are some effective strategies for learning a new programming language:\n\n1. **Start with the basics**: Learn syntax, data types, and control structures through tutorials or documentation.\n\n2. **Practice with small projects**: Build simple programs like calculators, to-do lists, or games.\n\n3. **Read others' code**: Study open-source projects to see real-world patterns.\n\n4. **Use interactive platforms**: Sites like Codecademy, LeetCode, or Exercism offer hands-on practice.\n\n5. **Build something you care about**: Personal projects keep you motivated.\n\n6. **Join a community**: Forums, Discord servers, and local meetups provide support and feedback.\n\nThe key is consistent practice - even 30 minutes daily adds up quickly!"
      }
    ]
  },
  {
    "id": "oasst_002",
    "conversation": [
      {
        "role": "user",
        "content": "Tell me a joke about computers"
      },
      {
        "role": "assistant",
        "content": "Why do programmers prefer dark mode?\n\nBecause light attracts bugs! 🐛"
      }
    ]
  }
]

Try it live — no install

Boot the real Potato server in your browser (WebAssembly) and annotate with this exact config. Nothing leaves your machine.

▶ Run live in your browser

Get This Design

View on GitHub

Clone or download from the repository

Quick start:

git clone https://github.com/davidjurgens/potato-showcase.git
cd potato-showcase/preference-learning/oasst-conversation-quality
potato start config.yaml

Dataset & paper

Köpf et al., NeurIPS 2023

Official dataset ↗Read the paper ↗

Citation (BibTeX)

bibtex

@article{kopf2023openassistant,
  title={OpenAssistant Conversations--Democratizing Large Language Model Alignment},
  author={K{\"o}pf, Andreas and Kilcher, Yannic and von R{\"u}tte, Dimitri and Anagnostidis, Sotiris and Tam, Zhi-Rui and Stevens, Keith and Barhoum, Abdullah and Duc, Nguyen Minh and Stanley, Oliver and Nagyfi, Rich{\'a}rd and others},
  journal={arXiv preprint arXiv:2304.07327},
  year={2023}
}

Details

Annotation Types

likertradiotext

Domain

Conversational AINatural Language Processing

Use Cases

RLHFQuality RatingConversation Evaluation

Related Designs

Constitutional AI Harmlessness Evaluation

Evaluate AI assistant responses for harmlessness and helpfulness based on the Constitutional AI framework by Anthropic. Annotators rate responses on a harmfulness scale, assess helpfulness, and provide explanations for their judgments.

radiolikert

AgentBoard Progress Scoring

Assess multi-turn LLM agent progress by identifying achieved milestones, scoring overall progress, categorizing the agent environment, and noting partial progress observations.

multiselectlikert

AlpacaFarm Preference Simulation

Simulate human preferences for instruction-following responses. Create preference data for efficient RLHF research and LLM evaluation.

likertradio