Training Phase

Train and qualify annotators with practice questions before the main task.

Potato 2.0 includes an optional training phase that helps qualify annotators before they begin the main annotation task. Annotators answer practice questions with known correct answers and receive feedback on their performance.

Use Cases

- Ensure annotators understand the task
- Filter out low-quality annotators
- Provide guided practice before real annotations
- Collect baseline quality metrics
- Teach annotation guidelines through examples

How It Works

1. Annotators complete a set of training questions
2. They receive immediate feedback on each answer
3. Progress is tracked against passing criteria
4. Only annotators who pass can proceed to the main task
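
The steps above can be sketched in Python. This is a minimal illustration, not Potato's implementation: `run_training` and `get_answer` are hypothetical names, and the sketch models only a `min_correct` threshold without retries.

```python
from typing import Callable


def run_training(
    questions: list[dict],
    get_answer: Callable[[dict], str],
    schema: str,
    min_correct: int,
) -> bool:
    """Return True if the annotator may proceed to the main task."""
    correct = 0
    for question in questions:
        answer = get_answer(question)  # 1. annotator answers a training question
        expected = question["correct_answers"][schema]
        is_correct = answer == expected
        # 2. immediate feedback on each answer
        print(f"{question['id']}: {'correct' if is_correct else 'incorrect'}")
        if is_correct:
            correct += 1  # 3. progress is tracked against the passing criteria
    return correct >= min_correct  # 4. only passing annotators proceed


questions = [
    {"id": "train_1", "text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
    {"id": "train_2", "text": "I hate this!", "correct_answers": {"sentiment": "Negative"}},
]
# Hypothetical annotator who always answers "Positive": one of two is correct.
passed = run_training(questions, lambda q: "Positive", "sentiment", min_correct=2)
```

Here `passed` is `False`, because only one of the two required answers is correct.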

Configuration

Basic Setup

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment  # Which annotation scheme to train
 
    # Passing criteria
    passing_criteria:
      min_correct: 8  # Must get at least 8 correct
      total_questions: 10
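
Before launching a task, it can help to sanity-check the passing criteria, since a `min_correct` larger than `total_questions` could never be satisfied. The helper below is a hypothetical sketch that works on the criteria as a plain dictionary; it is not part of Potato, and whether Potato itself performs such a check is not assumed here.

```python
def check_passing_criteria(criteria: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the criteria look sane."""
    problems = []
    min_correct = criteria.get("min_correct")
    total = criteria.get("total_questions")
    if min_correct is not None and min_correct < 1:
        problems.append("min_correct must be at least 1")
    if min_correct is not None and total is not None and min_correct > total:
        problems.append(f"min_correct ({min_correct}) exceeds total_questions ({total})")
    return problems


# The values from the basic setup above.
problems = check_passing_criteria({"min_correct": 8, "total_questions": 10})
```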

Full Configuration

yaml

phases:
  training:
    enabled: true
    data_file: "data/training_data.json"
    schema_name: sentiment
 
    passing_criteria:
      # Different criteria options (choose one or combine)
      min_correct: 8
      require_all_correct: false
      max_mistakes: 3
      max_mistakes_per_question: 2
 
    # Allow retries
    retries:
      enabled: true
      max_retries: 3
 
    # Show explanations for incorrect answers
    show_explanations: true
 
    # Randomize question order
    randomize: true

Passing Criteria

You can set various criteria for passing the training phase:

Minimum Correct

yaml

passing_criteria:
  min_correct: 8
  total_questions: 10

Annotator must answer at least 8 out of 10 questions correctly.

Require All Correct

yaml

passing_criteria:
  require_all_correct: true

Annotator must answer every question correctly to pass.

Maximum Mistakes

yaml

passing_criteria:
  max_mistakes: 3

Annotator is disqualified after 3 total mistakes.

Maximum Mistakes Per Question

yaml

passing_criteria:
  max_mistakes_per_question: 2

Annotator is disqualified after 2 mistakes on any single question.

Combined Criteria

yaml

passing_criteria:
  min_correct: 8
  max_mistakes_per_question: 3

The annotator must get at least 8 correct AND make fewer than 3 mistakes on any single question (the third mistake on one question disqualifies them).
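
How the criteria combine can be illustrated with a small evaluator. This is a sketch under stated assumptions rather than Potato's code: `passes_training` is a hypothetical function, every configured criterion must hold (logical AND), and a mistake limit is treated as reached on the Nth mistake, following the "disqualified after N mistakes" wording above.

```python
def passes_training(criteria: dict, mistakes: dict[str, int], correct: int, total: int) -> bool:
    """Apply every criterion present in `criteria`; all of them must hold."""
    if criteria.get("require_all_correct") and correct < total:
        return False
    if "min_correct" in criteria and correct < criteria["min_correct"]:
        return False
    # Assumption: reaching the limit disqualifies (the Nth mistake fails).
    if "max_mistakes" in criteria and sum(mistakes.values()) >= criteria["max_mistakes"]:
        return False
    if "max_mistakes_per_question" in criteria:
        limit = criteria["max_mistakes_per_question"]
        if any(count >= limit for count in mistakes.values()):
            return False
    return True


criteria = {"min_correct": 8, "max_mistakes_per_question": 3}
# `mistakes` maps question ID to the number of wrong attempts on that question.
ok = passes_training(criteria, {"train_4": 1, "train_7": 2}, correct=9, total=10)
too_many = passes_training(criteria, {"train_4": 3}, correct=9, total=10)
```

With these inputs `ok` is `True` and `too_many` is `False`: the second annotator has enough correct answers but hit the per-question limit.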

Training Data Format

Each training item must include the correct answers and may include an explanation:

json

[
  {
    "id": "train_1",
    "text": "I absolutely love this product! Best purchase ever!",
    "correct_answers": {
      "sentiment": "Positive"
    },
    "explanation": "This text expresses strong positive sentiment with words like 'love' and 'best'."
  },
  {
    "id": "train_2",
    "text": "This is the worst service I've ever experienced.",
    "correct_answers": {
      "sentiment": "Negative"
    },
    "explanation": "The words 'worst' and the overall complaint indicate negative sentiment."
  },
  {
    "id": "train_3",
    "text": "The package arrived on time.",
    "correct_answers": {
      "sentiment": "Neutral"
    },
    "explanation": "This is a factual statement without emotional indicators."
  }
]
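
A training file can be checked before it is deployed. The validator below is a hypothetical sketch using only the standard library; it assumes `id`, `text`, and `correct_answers` are the required keys (matching the example above) and treats `explanation` as optional.

```python
import json

REQUIRED_KEYS = ("id", "text", "correct_answers")


def validate_training_items(raw_json: str, schema_name: str) -> list[str]:
    """Return one message per problem found in a JSON array of training items."""
    errors = []
    for index, item in enumerate(json.loads(raw_json)):
        label = item.get("id", f"item #{index}")
        for key in REQUIRED_KEYS:
            if key not in item:
                errors.append(f"{label}: missing '{key}'")
        # Each item needs an answer for the schema named in `schema_name`.
        if schema_name not in item.get("correct_answers", {}):
            errors.append(f"{label}: no correct answer for schema '{schema_name}'")
    return errors


sample = """[
  {"id": "train_1", "text": "I absolutely love this product!",
   "correct_answers": {"sentiment": "Positive"}},
  {"id": "train_2", "text": "The package arrived on time."}
]"""
errors = validate_training_items(sample, "sentiment")
```

The second item has no `correct_answers`, so `errors` contains two messages for `train_2` and none for `train_1`.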

Multiple Schema Training

For tasks with multiple annotation schemes:

json

{
  "id": "train_1",
  "text": "Apple announced new iPhone features yesterday.",
  "correct_answers": {
    "sentiment": "Neutral",
    "topic": "Technology"
  },
  "explanation": {
    "sentiment": "This is a factual news statement.",
    "topic": "The text discusses Apple and iPhone, which are tech topics."
  }
}
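
Grading such an item means comparing each schema separately and picking the matching explanation. The sketch below is illustrative only: `grade_item` is a hypothetical helper, and it assumes `explanation` may be either a single string or a dictionary keyed by schema name, as in the two formats shown in this section.

```python
def grade_item(item: dict, submitted: dict[str, str]) -> dict[str, dict]:
    """Compare answers schema by schema and attach the relevant explanation."""
    explanation = item.get("explanation", "")
    results = {}
    for schema, expected in item["correct_answers"].items():
        # A dict holds one explanation per schema; a plain string applies to all.
        text = explanation.get(schema, "") if isinstance(explanation, dict) else explanation
        correct = submitted.get(schema) == expected
        results[schema] = {
            "correct": correct,
            "expected": expected,
            "explanation": "" if correct else text,  # only shown for wrong answers
        }
    return results


item = {
    "id": "train_1",
    "text": "Apple announced new iPhone features yesterday.",
    "correct_answers": {"sentiment": "Neutral", "topic": "Technology"},
    "explanation": {
        "sentiment": "This is a factual news statement.",
        "topic": "The text discusses Apple and iPhone, which are tech topics.",
    },
}
# Hypothetical submission: sentiment right, topic wrong.
results = grade_item(item, {"sentiment": "Neutral", "topic": "Business"})
```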

User Experience

Training Flow

1. User sees "Training Phase" indicator
2. Question is displayed with annotation form
3. User submits their answer
4. Feedback is shown immediately:
   - Correct: Green checkmark, proceed to next
   - Incorrect: Red X, explanation shown, retry option

Feedback Display

When an annotator answers incorrectly:

- The correct answer is highlighted
- The provided explanation is shown
- Retry button appears (if retries enabled)
- Progress toward passing criteria is displayed

Admin Monitoring

Track training performance in the admin dashboard:

- Completion rates
- Average correct answers
- Pass/fail rates
- Time spent on training
- Per-question accuracy

Access these metrics via the admin API endpoints:

text

GET /api/admin/training/stats
GET /api/admin/training/user/{user_id}
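
These endpoints could be queried from a script along the following lines. The sketch makes several assumptions that are not confirmed here: the server runs at `http://localhost:8000`, the endpoints return JSON, and no authentication is needed. Adjust `BASE_URL` and add whatever credentials your deployment requires.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: a local server on port 8000


def stats_url(base_url: str = BASE_URL) -> str:
    """URL for aggregate training statistics."""
    return f"{base_url}/api/admin/training/stats"


def user_url(user_id: str, base_url: str = BASE_URL) -> str:
    """URL for one user's training record."""
    # Percent-encode the ID so characters like "@" or "/" stay in one path segment.
    return f"{base_url}/api/admin/training/user/{quote(user_id, safe='')}"


def fetch_json(url: str) -> dict:
    """GET a URL and decode the body, assuming the endpoint returns JSON."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)
```

A call such as `fetch_json(stats_url())` would then return the decoded response; the shape of that response is not specified here.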

Example: Sentiment Analysis Training

yaml

task_name: "Sentiment Analysis"
task_dir: "."
port: 8000
 
# Main annotation data
data_files:
  - "data/reviews.json"
 
item_properties:
  id_key: id
  text_key: text
 
annotation_schemes:
  - annotation_type: radio
    name: sentiment
    description: "What is the sentiment of this review?"
    labels:
      - Positive
      - Negative
      - Neutral
 
# Training phase configuration
phases:
  training:
    enabled: true
    data_file: "data/training_questions.json"
    schema_name: sentiment
 
    passing_criteria:
      min_correct: 8
      total_questions: 10
      max_mistakes_per_question: 2
 
    retries:
      enabled: true
      max_retries: 3
 
    show_explanations: true
    randomize: true
 
output_annotation_dir: "output/"
output_annotation_format: "json"
allow_all_users: true

Example: NER Training

yaml

annotation_schemes:
  - annotation_type: span
    name: entities
    description: "Highlight named entities"
    labels:
      - Person
      - Organization
      - Location
      - Date
 
phases:
  training:
    enabled: true
    data_file: "data/ner_training.json"
    schema_name: entities
 
    passing_criteria:
      min_correct: 7
      total_questions: 10
 
    show_explanations: true

Training data for span annotation:

json

{
  "id": "train_1",
  "text": "Tim Cook announced that Apple will open a new store in New York on March 15.",
  "correct_answers": {
    "entities": [
      {"start": 0, "end": 8, "label": "Person"},
      {"start": 24, "end": 29, "label": "Organization"},
      {"start": 55, "end": 63, "label": "Location"},
      {"start": 67, "end": 75, "label": "Date"}
    ]
  },
  "explanation": "Tim Cook is a Person, Apple is an Organization, New York is a Location, and March 15 is a Date."
}
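
Character offsets are easy to get wrong when counted by hand, so it is safer to compute and verify them. The helpers below are a hypothetical sketch (not part of Potato) and assume 0-based offsets with an exclusive `end`, which is what the `Tim Cook` span (0 to 8) implies.

```python
def find_span(text: str, phrase: str, label: str) -> dict:
    """Locate the first occurrence of `phrase`; `end` is exclusive."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return {"start": start, "end": start + len(phrase), "label": label}


def span_matches(text: str, span: dict, phrase: str) -> bool:
    """Check that a span's offsets slice out exactly the intended phrase."""
    return text[span["start"]:span["end"]] == phrase


text = "Tim Cook announced that Apple will open a new store in New York on March 15."
entities = [
    ("Tim Cook", "Person"),
    ("Apple", "Organization"),
    ("New York", "Location"),
    ("March 15", "Date"),
]
spans = [find_span(text, phrase, label) for phrase, label in entities]
```

Running `span_matches` over existing training data before deployment catches off-by-one offsets that would otherwise mark correct annotator answers as wrong.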

Best Practices

1. Start Simple

Begin with straightforward examples before introducing edge cases:

json

[
  {"text": "I love this!", "correct_answers": {"sentiment": "Positive"}},
  {"text": "I hate this!", "correct_answers": {"sentiment": "Negative"}},
  {"text": "It arrived yesterday.", "correct_answers": {"sentiment": "Neutral"}}
]

2. Cover All Labels

Ensure training includes examples of every possible label:

json

[
  {"correct_answers": {"sentiment": "Positive"}},
  {"correct_answers": {"sentiment": "Negative"}},
  {"correct_answers": {"sentiment": "Neutral"}}
]
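
Label coverage can be checked automatically. The function below is a hypothetical sketch that assumes single-label answers (as with a `radio` scheme) and reports which labels never appear as a correct answer.

```python
def missing_labels(items: list[dict], schema: str, labels: list[str]) -> list[str]:
    """Return the labels that no training item uses as its correct answer."""
    seen = {item["correct_answers"].get(schema) for item in items}
    return [label for label in labels if label not in seen]


training_items = [
    {"correct_answers": {"sentiment": "Positive"}},
    {"correct_answers": {"sentiment": "Negative"}},
]
uncovered = missing_labels(training_items, "sentiment", ["Positive", "Negative", "Neutral"])
```

Here `uncovered` is `["Neutral"]`, signalling that a Neutral example should be added.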

3. Write Clear Explanations

Explanations should teach the annotation guidelines:

json

{
  "explanation": "While this text mentions a problem, the overall tone is constructive and the reviewer expresses satisfaction with the resolution. This makes it Positive rather than Negative."
}

4. Set Reasonable Criteria

Don't require perfection unnecessarily:

yaml

# Too strict - may lose good annotators
passing_criteria:
  require_all_correct: true
 
# Better - allows for learning
passing_criteria:
  min_correct: 8
  total_questions: 10
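
A rough calculation shows why requiring perfection is costly. The sketch below uses a simple binomial model and assumes each question is answered independently with the same accuracy and no retries; real annotators will deviate from this, so treat the numbers as an illustration only.

```python
from math import comb


def pass_probability(accuracy: float, total: int, min_correct: int) -> float:
    """Probability of getting at least `min_correct` of `total` questions right."""
    return sum(
        comb(total, k) * accuracy**k * (1 - accuracy) ** (total - k)
        for k in range(min_correct, total + 1)
    )


# A good annotator who is right 90% of the time, facing 10 questions.
strict = pass_probability(0.9, total=10, min_correct=10)  # require_all_correct: true
lenient = pass_probability(0.9, total=10, min_correct=8)  # min_correct: 8
print(f"all correct: {strict:.2f}, 8 of 10: {lenient:.2f}")
```

Under these assumptions, a 90%-accurate annotator passes the all-correct rule only about 35% of the time, but passes the 8-of-10 rule about 93% of the time.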

5. Include Edge Cases

Add tricky examples to prepare annotators:

json

{
  "text": "Not bad at all, I guess it could be worse.",
  "correct_answers": {"sentiment": "Neutral"},
  "explanation": "Despite negative words like 'not bad' and 'worse', this is actually a lukewarm endorsement - neutral rather than positive or negative."
}

Integration with Workflows

Training integrates with multi-phase workflows:

yaml

phases:
  consent:
    enabled: true
    data_file: "data/consent.json"
 
  prestudy:
    enabled: true
    data_file: "data/demographics.json"
 
  instructions:
    enabled: true
    content: "data/instructions.html"
 
  training:
    enabled: true
    data_file: "data/training.json"
    schema_name: sentiment
    passing_criteria:
      min_correct: 8
 
  annotation:
    # Main task - always enabled
    enabled: true
 
  poststudy:
    enabled: true
    data_file: "data/feedback.json"

Performance Considerations

- Training data is loaded at startup
- Progress is stored in memory per session
- Minimal performance impact on main annotation
- Consider separating complex training into multiple phases
