Related papers: Structurally Diverse Sampling for Sample-Efficient…

Diverse Demonstrations Improve In-context Compositional Generalization

In-context learning has shown great success in i.i.d. semantic parsing splits, where the training and test sets are drawn from the same distribution. In this setup, models are typically prompted with demonstrations that are similar to the…

Computation and Language · Computer Science 2023-06-27 Itay Levy, Ben Bogin, Jonathan Berant

Finding needles in a haystack: Sampling Structurally-diverse Training Sets from Synthetic Data for Compositional Generalization

Modern semantic parsers suffer from two principal limitations. First, training requires expensive collection of utterance-program pairs. Second, semantic parsers fail to generalize at test time to new compositions/structures that have not…

Computation and Language · Computer Science 2021-09-07 Inbar Oren, Jonathan Herzig, Jonathan Berant

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks

NLP models have progressed drastically in recent years, according to numerous datasets proposed to evaluate performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model…

Computation and Language · Computer Science 2023-10-27 Kaiser Sun, Adina Williams, Dieuwke Hupkes

A Principled Framework for Evaluating on Typologically Diverse Languages

Beyond individual languages, multilingual natural language processing (NLP) research increasingly aims to develop models that perform well across languages generally. However, evaluating these systems on all the world's languages is…

Computation and Language · Computer Science 2025-09-09 Esther Ploeger, Wessel Poelman, Andreas Holck Høeg-Petersen, Anders Schlichtkrull, Miryam de Lhoneux, Johannes Bjerva

Learning to Generalize Compositionally by Transferring Across Semantic Parsing Tasks

Neural network models often generalize poorly to mismatched domains or distributions. In NLP, this issue arises in particular when models are expected to generalize compositionally, that is, to novel combinations of familiar words and…

Computation and Language · Computer Science 2021-11-10 Wang Zhu, Peter Shaw, Tal Linzen, Fei Sha

Semantic-based Distributed Learning for Diverse and Discriminative Representations

In large-scale distributed scenarios, increasingly complex tasks demand more intelligent collaboration across networks, requiring the joint extraction of structural representations from data samples. However, conventional task-specific…

Machine Learning · Computer Science 2026-04-21 Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis

Meta-Ensemble Learning with Diverse Data Splits for Improved Respiratory Sound Classification

Training reliable respiratory sound classification models remains challenging due to the limited size and subject diversity of datasets. Ensemble methods can improve robustness, but when base models are trained on identical data, models…

Machine Learning · Computer Science 2026-04-28 June-Woo Kim, Miika Toikkanen, Heejoon Koo, Yoon Tae Kim, Doyoung Kwon, Kyunghoon Kim

Structural-Entropy-Based Sample Selection for Efficient and Effective Learning

Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent…

Machine Learning · Computer Science 2025-03-04 Tianchi Xie, Jiangning Zhu, Guozu Ma, Minzhi Lin, Wei Chen, Weikai Yang, Shixia Liu

Automatic Music Sample Identification with Multi-Track Contrastive Learning

Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is,…

Sound · Computer Science 2025-10-28 Alain Riou, Joan Serrà, Yuki Mitsufuji

Invariant Structure Learning for Better Generalization and Causal Explainability

Learning the causal structure behind data is invaluable for improving generalization and obtaining high-quality explanations. We propose a novel framework, Invariant Structure Learning (ISL), that is designed to improve causal structure…

Machine Learning · Computer Science 2022-06-15 Yunhao Ge, Sercan Ö. Arik, Jinsung Yoon, Ao Xu, Laurent Itti, Tomas Pfister

The Effect of Data Partitioning Strategy on Model Generalizability: A Case Study of Morphological Segmentation

Recent work to enhance data partitioning strategies for more realistic model evaluation faces challenges in providing a clear optimal choice. This study addresses these challenges, focusing on morphological segmentation and synthesizing…

Computation and Language · Computer Science 2024-04-16 Zoey Liu, Bonnie J. Dorr

Are Sample-Efficient NLP Models More Robust?

Recent results in image classification and extractive question answering have observed that pre-trained models trained on less in-distribution data have better out-of-distribution performance. However, it is unclear how broadly these trends…

Computation and Language · Computer Science 2023-06-01 Nelson F. Liu, Ananya Kumar, Percy Liang, Robin Jia

Learning Diverse Representations for Fast Adaptation to Distribution Shift

The i.i.d. assumption is a useful idealization that underpins many successful approaches to supervised machine learning. However, its violation can lead to models that learn to exploit spurious correlations in the training data, rendering…

Machine Learning · Computer Science 2020-06-15 Daniel Pace, Alessandra Russo, Murray Shanahan

Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes…

Computation and Language · Computer Science 2024-09-18 Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Learning from Incomplete Features by Simultaneous Training of Neural Networks and Sparse Coding

In this paper, the problem of training a classifier on a dataset with incomplete features is addressed. We assume that different subsets of features (random or structured) are available at each data instance. This situation typically occurs…

Machine Learning · Computer Science 2021-04-20 Cesar F. Caiafa, Ziyao Wang, Jordi Solé-Casals, Qibin Zhao

Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training

Standard NLP benchmarks often fail to capture vulnerabilities stemming from dataset artifacts and spurious correlations. Contrast sets address this gap by challenging models near decision boundaries but are traditionally labor-intensive to…

Computation and Language · Computer Science 2025-03-11 Hender Lin

Instance-Wise Adaptive Sampling for Dataset Construction in Approximating Inverse Problem Solutions

We propose an instance-wise adaptive sampling framework for constructing compact and informative training datasets for supervised learning of inverse problem solutions. Typical learning-based approaches aim to learn a general-purpose…

Machine Learning · Computer Science 2026-02-20 Jiequn Han, Kui Ren, Nathan Soedjak

Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment

In text-to-SQL tasks -- as in much of NLP -- compositional generalization is a major challenge: neural networks struggle with compositional generalization where training and test distributions differ. However, most recent attempts to…

Computation and Language · Computer Science 2022-05-05 Yujian Gan, Xinyun Chen, Qiuping Huang, Matthew Purver

Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks

We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on…

Machine Learning · Computer Science 2025-02-26 Thaddäus Wiedemer, Yash Sharma, Ameya Prabhu, Matthias Bethge, Wieland Brendel