LLMs can construct powerful representations and streamline sample-efficient supervised learning

Artificial Intelligence 2026-05-22 v3

Abstract

As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer operational advantages: they are easy to audit, are cost-effective at scale, and facilitate tabular representations.
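
The two-stage global-rubric flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, and the `Synthesizer` and `Transformer` interfaces are all hypothetical, and the abstract does not say whether the rubric is applied by an LLM or by ordinary code, so the transform step is left as a generic callable. Stub functions stand in for any model so the sketch runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
Synthesizer = Callable[[str], str]       # prompt -> rubric text (e.g. an LLM call)
Transformer = Callable[[str, str], str]  # (rubric, serialized input) -> standardized text


def synthesize_global_rubric(synthesize: Synthesizer, task: str, subset: List[str]) -> str:
    """Show a small, diverse subset in one prompt and request a single rubric."""
    shown = "\n\n".join(f"Example {i + 1}:\n{ex}" for i, ex in enumerate(subset))
    prompt = (
        f"Task: {task}\n\n{shown}\n\n"
        "Write a rubric specifying which evidence to extract and how to organize it."
    )
    return synthesize(prompt)


def standardize_inputs(transform: Transformer, rubric: str, inputs: List[str]) -> List[str]:
    """Apply the same rubric to every naive text serialization."""
    return [transform(rubric, text) for text in inputs]


def run_pipeline(
    synthesize: Synthesizer,
    transform: Transformer,
    task: str,
    subset: List[str],
    inputs: List[str],
) -> Tuple[str, List[str]]:
    """Stage 1: one rubric per task. Stage 2: rubric-guided standardization."""
    rubric = synthesize_global_rubric(synthesize, task, subset)
    return rubric, standardize_inputs(transform, rubric, inputs)


# Stand-ins so the sketch runs without model access (purely illustrative).
def stub_synthesize(prompt: str) -> str:
    return "FIELDS: labs, medications"


def stub_transform(rubric: str, text: str) -> str:
    return f"[{rubric}] {text.strip().lower()}"


rubric, standardized = run_pipeline(
    stub_synthesize,
    stub_transform,
    task="hypothetical binary outcome",
    subset=["Record A", "Record B"],
    inputs=["  Record C ", "Record D"],
)
```

In this sketch the rubric is synthesized once per task and reused for every input, so the cost of the synthesis step does not grow with dataset size. The standardized strings would then be passed to whatever downstream supervised model is in use.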

Cite

@article{arxiv.2603.11679,
  title  = {LLMs can construct powerful representations and streamline sample-efficient supervised learning},
  author = {Ilker Demirel and Lawrence Shi and Zeshan Hussain and David Sontag},
  journal= {arXiv preprint arXiv:2603.11679},
  year   = {2026}
}