
Metric-Dependent Annotation Saturation for Learning from Label Distributions

Computation & Language · 2026-05 · v1

Abstract

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
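The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the use of sampling without replacement, natural-log entropy, and Pearson correlation over per-item entropies are all assumptions made for illustration. It shows how an N-annotator soft label might be subsampled from an item's full vote counts, how KL divergence and entropy correlation could be computed, and why a smoothed label has the same entropy for every item.

```python
import numpy as np

def subsample_soft_label(counts, n, rng):
    """Draw n annotator votes without replacement (assumed scheme) and normalize."""
    counts = np.asarray(counts, dtype=np.int64)
    drawn = rng.multivariate_hypergeometric(counts, n)
    return drawn / n

def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), with q clipped away from zero for numerical stability."""
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def smoothed_label(majority, alpha, num_classes=3):
    """Uniform label smoothing around the majority class."""
    y = np.full(num_classes, alpha / num_classes)
    y[majority] += 1.0 - alpha
    return y

def entropy_correlation(pred_dists, ref_dists):
    """Pearson r between per-item entropies of predicted and reference distributions."""
    pred_h = [entropy(p) for p in pred_dists]
    ref_h = [entropy(p) for p in ref_dists]
    return float(np.corrcoef(pred_h, ref_h)[0, 1])

# Toy items (hypothetical vote counts over 3 NLI classes, 100 votes each).
rng = np.random.default_rng(0)
clear_item = [95, 4, 1]
ambiguous_item = [40, 35, 25]
for n in (5, 10, 20, 50):
    soft = subsample_soft_label(ambiguous_item, n, rng)
    full = np.asarray(ambiguous_item) / 100
    print(n, soft, round(kl_divergence(full, soft), 4))

# Smoothing assigns the same entropy to clear and ambiguous items alike.
print(entropy(smoothed_label(0, 0.1)), entropy(smoothed_label(0, 0.1)))
```

In this sketch both toy items share majority class 0, so any fixed smoothing intensity gives them identical targets, whereas their subsampled soft labels differ. That is the item-specific signal the abstract attributes to soft labels.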

Comments: 16 pages, 3 figures, 14 tables

Cite

@article{arxiv.2605.29797,
  title  = {Metric-Dependent Annotation Saturation for Learning from Label Distributions},
  author = {Guneet Kohli},
  journal= {arXiv preprint arXiv:2605.29797},
  year   = {2026}
}