Related papers: Efficient Sequence Packing without Cross-contamina…

Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training

Recent breakthroughs and successful deployment of large language and vision models in a constrained environment predominantly follow a two-phase approach. First, large models are trained to achieve peak performance, followed by a model…

Machine Learning · Computer Science 2024-11-22 Hanna Mazzawi, Pranjal Awasthi, Xavi Gonzalvo, Srikumar Ramalingam

Weighted Sampling for Masked Language Modeling

Masked Language Modeling (MLM) is widely used to pretrain language models. The standard random masking strategy in MLM causes the pre-trained language models (PLMs) to be biased toward high-frequency tokens. Representation learning of rare…

Computation and Language · Computer Science 2023-05-25 Linhan Zhang, Qian Chen, Wen Wang, Chong Deng, Xin Cao, Kongzhang Hao, Yuxin Jiang, Wei Wang

Training Long-Context LLMs Efficiently via Chunk-wise Optimization

While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential…

Machine Learning · Computer Science 2025-05-23 Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji

MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models

Recent state-of-the-art language models utilize a two-phase training procedure comprised of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have been focused…

Computation and Language · Computer Science 2019-11-15 Itzik Malkiel, Lior Wolf

Scaling Sentence Embeddings with Large Language Models

Large language models (LLMs) have recently garnered significant interest. With in-context learning, LLMs achieve impressive results in various natural language tasks. However, the application of LLMs to sentence embeddings remains an area…

Computation and Language · Computer Science 2023-08-01 Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, Fuzhen Zhuang

Advances in Very Deep Convolutional Neural Networks for LVCSR

Very deep CNNs with small 3x3 kernels have recently been shown to achieve very strong performance as acoustic models in hybrid NN-HMM speech recognition systems. In this paper we investigate how to efficiently scale these models to larger…

Computation and Language · Computer Science 2016-06-28 Tom Sercu, Vaibhava Goel

Efficient Transformers with Dynamic Token Pooling

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments…

Computation and Language · Computer Science 2023-10-25 Piotr Nawrot, Jan Chorowski, Adrian Łańcucki, Edoardo M. Ponti

KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Recent advancements in Large Language Models (LLMs)-based text embedding models primarily focus on data scaling or synthesis, yet limited exploration of training techniques and data quality, thereby constraining performance. In this work,…

Computation and Language · Computer Science 2025-10-15 Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

A Law of Next-Token Prediction in Large Language Models

Large language models (LLMs) have been widely employed across various application domains, yet their black-box nature poses significant challenges to understanding how these models process input data internally to make predictions. In this…

Machine Learning · Computer Science 2025-09-03 Hangfeng He, Weijie J. Su

Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?

Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with…

Computation and Language · Computer Science 2025-05-30 Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang

Comparing Specialised Small and General Large Language Models on Text Classification: 100 Labelled Samples to Achieve Break-Even Performance

When solving NLP tasks with limited labelled data, researchers typically either use a general large language model without further update, or use a small number of labelled samples to tune a specialised smaller model. In this work, we…

Computation and Language · Computer Science 2026-01-26 Branislav Pecher, Ivan Srba, Maria Bielikova

Accelerating Multilingual Language Model for Excessively Tokenized Languages

Recent advancements in large language models (LLMs) have remarkably enhanced performances on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment a text…

Computation and Language · Computer Science 2024-08-07 Jimin Hong, Gibbeum Lee, Jaewoong Cho

Large Language Models as General Pattern Machines

We observe that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences -- from arbitrary ones procedurally generated by probabilistic context-free grammars (PCFG), to more rich spatial…

Artificial Intelligence · Computer Science 2023-10-27 Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Arenas, Kanishka Rao, Dorsa Sadigh, Andy Zeng

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus…

Computation and Language · Computer Science 2026-05-12 Yan Sun, Guoxia Wang, Jinle Zeng, JiaBin Yang, Shuai Li, Li Shen, Dacheng Tao, DianHai Yu, Haifeng Wang

Predictive Batch Scheduling: Accelerating Language Model Training Through Loss-Aware Sample Prioritization

We introduce Predictive Batch Scheduling (PBS), a novel training optimization technique that accelerates language model convergence by dynamically prioritizing high-loss samples during batch construction. Unlike curriculum learning…

Artificial Intelligence · Computer Science 2026-02-20 Sumedh Rasal

Topic Modeling with Fine-tuning LLMs and Bag of Sentences

Large language models (LLMs) are increasingly used for topic modeling, outperforming classical topic models such as LDA. Commonly, pre-trained LLM encoders such as BERT are used out-of-the-box despite the fact that fine-tuning is known to…

Computation and Language · Computer Science 2026-02-23 Johannes Schneider

In-Context Learning with Many Demonstration Examples

Large pre-training language models (PLMs) have shown promising in-context learning abilities. However, due to the backbone transformer architecture, existing PLMs are bottlenecked by the memory and computational cost when scaling up to a…

Computation and Language · Computer Science 2023-02-13 Mukai Li, Shansan Gong, Jiangtao Feng, Yiheng Xu, Jun Zhang, Zhiyong Wu, Lingpeng Kong

CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

The advent of Large Multimodal Models (LMMs) has significantly enhanced Large Language Models (LLMs) to process and interpret diverse data modalities (e.g., image and video). However, as input complexity increases, particularly with long…

Computer Vision and Pattern Recognition · Computer Science 2025-12-23 Shilin Yan, Jiaming Han, Joey Tsai, Hongwei Xue, Rongyao Fang, Lingyi Hong, Ziyu Guo, Ray Zhang

Training for Fast Sequential Prediction Using Dynamic Feature Selection

We present paired learning and inference algorithms for significantly reducing computation and increasing speed of the vector dot products in the classifiers that are at the heart of many NLP components. This is accomplished by partitioning…

Computation and Language · Computer Science 2014-12-23 Emma Strubell, Luke Vilnis, Andrew McCallum

A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets

We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of…

Computation and Language · Computer Science 2025-06-04 Ryan Lagasse, Aidan Kierans, Avijit Ghosh, Shiri Dori-Hacohen