# Human Evaluation of Generated Text

Source: https://www.potatoannotator.com/docs/guides/human-evaluation-generated-text

**Automatic metrics like BLEU and ROUGE correlate weakly with how good generated text actually is, so human evaluation is still the standard, and it is done badly more often than not. The three things that separate a trustworthy human eval from a decorative one: define each criterion precisely, prefer relative judgments over absolute scores, and report enough detail that someone else could rerun it.** This guide is the protocol, not the rubric wording.

## Why human evaluation, and why it's hard to trust

For open-ended generation (summaries, dialogue, translations, LLM responses), [automatic metrics](https://en.wikipedia.org/wiki/Evaluation_of_machine_translation) compare against reference texts and miss most of what matters: a fluent, faithful answer phrased differently from the reference scores badly, and a fluent lie that echoes the reference's wording scores well. So human judgment remains the ground truth. The catch is that human evaluation is itself a measurement instrument, and a poorly designed one produces numbers as noisy as the metrics it replaced.

The scale of the problem is documented. [Howcroft et al. (2020)](https://aclanthology.org/2020.inlg-1.23/) surveyed twenty years of NLG evaluations and found the field could not even agree on what its own criteria meant: terms like "fluency," "adequacy," and "naturalness" were defined differently (or not at all) across papers, making results impossible to compare. Their fix is the starting point for any serious eval: pin down exactly what each criterion means before you collect a single judgment.

## Define the criteria, precisely

Vague criteria are where most human evals go wrong. "Rate the quality from 1 to 5" invites every annotator to invent their own definition of quality. Split it into named, separately-defined dimensions, and write a one-sentence operational definition for each:

- **Fluency**: is the text grammatical and well-formed, ignoring whether it's correct?
- **Coherence**: do the sentences follow from one another and hang together as a whole?
- **Faithfulness / factual accuracy**: is every claim supported by the source (for summarization/RAG) or true (for open generation)? This is where [hallucinations](/docs/guides/detecting-hallucinations) get caught.
- **Relevance**: does it actually address the prompt?
- **Helpfulness**: for assistant-style tasks, does it accomplish what the user wanted?

Measuring these separately tells you *why* one system beats another, not just that it did.

## Absolute scores or relative comparisons

The single biggest design choice is whether annotators rate one output at a time or compare several.

- **Absolute ([Likert](/docs/guides/rating-scales)) ratings** are simple but suffer from scale bias: annotators anchor differently, avoid the extremes, and drift over a session, so a "4" from one rater isn't a "4" from another.
- **Pairwise preference** (is A or B better?) largely sidesteps scale bias and is generally more reliable, which is why it underpins [RLHF preference data](/docs/guides/rlhf-preference-data) and [model comparison](/docs/guides/pairwise-model-comparison). The cost is that you get a ranking, not an absolute level.
- **[Best-worst scaling](/docs/guides/pairwise-and-best-worst)** shows a small set and asks only for the best and worst, which is a cheap way to get reliable rankings from few judgments.
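
Best-worst judgments are commonly turned into a ranking by simple counting: each item's score is the number of times it was picked best, minus the number of times it was picked worst, divided by the number of times it was shown. The sketch below assumes that counting approach; the system names and judgments are hypothetical, and this is not Potato's built-in analysis.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Score each item as (times best - times worst) / times shown.

    `judgments` is a list of (shown_items, best, worst) tuples.
    Scores fall in [-1, 1]; higher means more often preferred.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, picked_best, picked_worst in judgments:
        shown.update(items)
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical judgments: each annotator saw outputs from four systems.
systems = ("sys_a", "sys_b", "sys_c", "sys_d")
judgments = [
    (systems, "sys_a", "sys_d"),
    (systems, "sys_a", "sys_c"),
    (systems, "sys_b", "sys_d"),
]

scores = best_worst_scores(judgments)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Like pairwise preference, the result is a relative ordering: the scores say which system was preferred more often, not how good any of them is in absolute terms.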

[van der Lee et al. (2021)](https://doi.org/10.1016/j.csl.2020.101151) lay out best-practice guidelines covering exactly these choices (how many items and evaluators, which scale, which statistical analysis) and are worth reading before you commit to a design.

## Power it, and report it

Two failure modes remain even after the design is right.

First, **underpowered comparisons.** Detecting a small quality difference between two good systems takes more items than people expect; run the [power analysis](/docs/guides/statistical-power-annotation) first, use a proper significance test, and report effect sizes, not just which mean was higher.
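
For a pairwise design, one simple option is an exact sign test on the non-tied judgments, with the win rate as the effect size. The sketch below assumes that test, plus a normal-approximation sample-size estimate for a one-sample proportion against 0.5. The function names are hypothetical, and the estimate treats judgments as independent, which repeated ratings of the same item are not.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist


def sign_test_p(wins_a, wins_b):
    """Two-sided exact binomial (sign) test; ties must already be dropped."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided if A and B were equally good.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def items_needed(p_win, alpha=0.05, power=0.8):
    """Approximate number of non-tied judgments needed to detect a true
    win rate of `p_win` against the null of 0.5 (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = z_alpha * 0.5 + z_power * sqrt(p_win * (1 - p_win))
    return ceil((numerator / (p_win - 0.5)) ** 2)


# Hypothetical outcome: A preferred 60 times, B preferred 40 times.
p_value = sign_test_p(60, 40)
win_rate = 60 / (60 + 40)      # effect size to report alongside the p-value
n_small_gap = items_needed(0.55)
n_larger_gap = items_needed(0.60)
```

Under these assumptions, a 60/40 split over 100 judgments does not reach significance at the 0.05 level, and a true 55% win rate needs several hundred non-tied judgments to detect reliably.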

Second, **unreported detail.** [Belz et al. (2021)](https://aclanthology.org/2021.eacl-main.29/) reviewed reproducibility in NLP and found human evaluations especially hard to reproduce, usually because the paper omits the exact criteria, instructions, annotator pool, and analysis. Record all of it as part of the study, not as an afterthought.
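
One lightweight way to make that record hard to forget is to store it as structured data next to the annotations. The sketch below is an illustration only: `EvalRecord` and its field names are hypothetical, not a standard reporting format, and every value shown is a placeholder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    """Hypothetical study record; field names are illustrative only."""
    criteria: dict            # criterion name -> definition shown to annotators
    instructions: str         # exact annotator instructions, verbatim
    annotator_pool: str       # who rated, how recruited, how many
    design: str               # e.g. "likert-5" or "pairwise"
    items_per_system: int
    annotators_per_item: int
    analysis: str             # significance test and effect size reported


# Placeholder values; replace with the details of the actual study.
record = EvalRecord(
    criteria={"fluency": "Is the text grammatical and well-formed, "
                         "ignoring whether it's correct?"},
    instructions="<paste the full instructions here>",
    annotator_pool="<describe recruitment, count, and screening>",
    design="pairwise",
    items_per_system=200,
    annotators_per_item=3,
    analysis="two-sided sign test on non-tied judgments; win rate as effect size",
)

record_json = json.dumps(asdict(record), indent=2)
```

Writing `record_json` to a file alongside the exported labels keeps the criteria, instructions, and analysis plan together with the data they describe.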

A few mechanics that prevent avoidable bias: **randomize output order** so position doesn't leak (annotators tend to favor whichever option is shown first), **blind the system identity** so annotators can't tell which model produced what, and **pilot on a small batch** to measure [agreement](/docs/guides/inter-annotator-agreement) and fix confusing criteria before scaling up.
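
Randomizing and blinding can both be handled when the annotation items are prepared. The sketch below assumes a hypothetical input format (a list of dicts with `id`, `prompt`, and an `outputs` mapping from system name to text for exactly two systems). It shuffles which system appears as A for each item and keeps the mapping in a separate key that annotators never see.

```python
import random


def make_blind_pairs(items, seed=0):
    """Return (tasks, key): tasks carry no system names; key maps each
    item id to which system was shown as A and which as B."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    tasks, key = [], {}
    for item in items:
        systems = list(item["outputs"])
        rng.shuffle(systems)   # per-item random order removes position leaks
        first, second = systems
        tasks.append({
            "id": item["id"],
            "prompt": item["prompt"],
            "response_a": item["outputs"][first],
            "response_b": item["outputs"][second],
        })
        key[item["id"]] = {"A": first, "B": second}
    return tasks, key


# Hypothetical items with outputs from two systems.
items = [
    {"id": f"item_{i}", "prompt": f"Prompt {i}",
     "outputs": {"sys_x": f"X answer {i}", "sys_y": f"Y answer {i}"}}
    for i in range(20)
]

tasks, key = make_blind_pairs(items, seed=42)
```

Only `tasks` goes to annotators; `key` stays with the analyst and is joined back to the labels after collection, so a vote for "A" can be attributed to the right system.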

## Doing it in Potato

Potato has a scheme for each evaluation style, so the design choice above maps straight to config. For per-criterion absolute ratings:

```yaml
annotation_schemes:
  - name: faithfulness
    annotation_type: likert
    description: "Is every claim in the response supported by the source? 1 = many unsupported, 5 = fully supported."
    size: 5
  - name: fluency
    annotation_type: likert
    description: "Is the response grammatical and well-formed?"
    size: 5
```

For a blind A/B comparison, use a `pairwise` scheme and randomize which system is shown as A:

```yaml
annotation_schemes:
  - name: preference
    annotation_type: pairwise
    description: "Which response is more helpful overall?"
    labels: ["A is better", "Tie", "B is better"]
```

For structured, multi-criterion scoring in one pass, the [`rubric_eval`](/docs/guides/rubric-based-llm-evaluation) scheme collects a score per rubric dimension. Whichever you pick, keep overlap on a shared subset so you can report agreement, and keep per-annotator labels in the [export](/docs/features/export-formats) so the significance test has the variance it needs.
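
Agreement on the shared subset can be computed once the per-annotator labels are exported. The sketch below computes Cohen's kappa for two annotators from plain Python lists. The labels are hypothetical, and loading them from a Potato export is left out because the file layout depends on the export format chosen.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Agreement expected by chance, given each annotator's label frequencies.
    expected = sum(counts_1[label] * counts_2[label] for label in counts_1) / n ** 2
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise labels from two annotators on six shared items.
annotator_1 = ["A", "A", "B", "Tie", "B", "A"]
annotator_2 = ["A", "B", "B", "Tie", "B", "A"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Plain kappa treats labels as unordered categories, which suits pairwise choices. For ordinal Likert ratings, a weighted kappa or Krippendorff's alpha is usually a better fit because it gives partial credit for near misses.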

## Further reading

- [Rubric-Based LLM Evaluation](/docs/guides/rubric-based-llm-evaluation), for turning fuzzy quality into scored dimensions.
- [Pairwise Model Comparison](/docs/guides/pairwise-model-comparison), for A/B evaluation at scale.
- [Statistical Power and Sample Size](/docs/guides/statistical-power-annotation), so the comparison can actually support its claim.
- [RAG Evaluation with Human Annotation](/docs/guides/rag-evaluation), for the faithfulness/relevance case specifically.
