Related papers: A general tensor-structured compression scheme for…

LatentLLM: Attention-Aware Joint Tensor Compression

Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension…

Machine Learning · Computer Science 2025-05-27 Toshiaki Koike-Akino , Xiangyu Chen , Jing Liu , Ye Wang , Pu Wang , Matthew Brand

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational…

Computation and Language · Computer Science 2025-05-09 Weixin Liang , Lili Yu , Liang Luo , Srinivasan Iyer , Ning Dong , Chunting Zhou , Gargi Ghosh , Mike Lewis , Wen-tau Yih , Luke Zettlemoyer , Xi Victoria Lin

A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator

Large language models (LLMs) are both storage-intensive and computation-intensive, posing significant challenges when deployed on resource-constrained hardware. As linear layers in LLMs are mainly resource consuming parts, this paper…

Hardware Architecture · Computer Science 2025-02-03 Sixiao Huang , Tintin Wang , Ang Li , Ao Shen , Kai Li , Keyao Jiang , Mingqiang Huang , Hao Yu

Tensorized Embedding Layers for Efficient Model Compression

The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous,…

Computation and Language · Computer Science 2020-02-20 Oleksii Hrinchuk , Valentin Khrulkov , Leyla Mirvakhabova , Elena Orlova , Ivan Oseledets

ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations

Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the…

Computation and Language · Computer Science 2025-06-04 Ekaterina Grishina , Mikhail Gorbunov , Maxim Rakhuba

Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial…

Computation and Language · Computer Science 2026-04-17 Andrew Kiruluta

Lossless Compression for LLM Tensor Incremental Snapshots

During the training of Large Language Models (LLMs), tensor data is periodically "checkpointed" to persistent storage to allow recovery of work done in the event of failure. The volume of data that must be copied during each checkpoint,…

Machine Learning · Computer Science 2025-05-16 Daniel Waddington , Cornel Constantinescu

Compressing LLMs with MoP: Mixture of Pruners

The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a…

Machine Learning · Computer Science 2026-02-09 Bruno Lopes Yamamoto , Lucas Lauton de Alcantara , Victor Zacarias , Leandro Giusti Mugnaini , Keith Ando Ogawa , Lucas Pellicer , Rosimeire Pereira Costa , Edson Bollis , Anna Helena Reali Costa , Artur Jordao

Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation

The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems…

Computation and Language · Computer Science 2025-06-03 Yingfeng Luo , Tong Zheng , Yongyu Mu , Bei Li , Qinghong Zhang , Yongqi Gao , Ziqiang Xu , Peinan Feng , Xiaoqian Liu , Tong Xiao , Jingbo Zhu

MoDeGPT: Modular Decomposition for Large Language Model Compression

Large Language Models (LLMs) have reshaped the landscape of artificial intelligence by demonstrating exceptional performance across various tasks. However, substantial computational requirements make their deployment challenging on devices…

Machine Learning · Computer Science 2025-05-05 Chi-Heng Lin , Shangqian Gao , James Seale Smith , Abhishek Patel , Shikhar Tuli , Yilin Shen , Hongxia Jin , Yen-Chang Hsu

From Projection to Prediction: Beyond Logits for Scalable Language Models

Training Large Language Models (LLMs) typically involves a two-stage pipeline at the output layer: hidden states are projected into vocabulary logits via a linear transformation (lm_head), followed by cross-entropy loss computation against…

Machine Learning · Computer Science 2025-11-25 Jianbing Dong , Jianbin Chang

Exploring Extreme Parameter Compression for Pre-trained Language Models

Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs…

Computation and Language · Computer Science 2022-05-23 Yuxin Ren , Benyou Wang , Lifeng Shang , Xin Jiang , Qun Liu

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the…

Machine Learning · Computer Science 2024-04-10 Georgy Tyukin

Mixture Compressor for Mixture-of-Experts LLMs Gains More

Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading…

Machine Learning · Computer Science 2025-02-25 Wei Huang , Yue Liao , Jianhui Liu , Ruifei He , Haoru Tan , Shiming Zhang , Hongsheng Li , Si Liu , Xiaojuan Qi

Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and…

Machine Learning · Computer Science 2026-05-06 Chen Liu , Xingzhi Sun , Xi Xiao , Alexandre Van Tassel , Ke Xu , Kristof Reimann , Danqi Liao , Mark Gerstein , Tianyang Wang , Xiao Wang , Smita Krishnaswamy

MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since…

Machine Learning · Computer Science 2025-11-27 Ivan Novikov

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

Large Language Models (LLMs) such as ChatGPT and LlaMA are advancing rapidly in generative Artificial Intelligence (AI), but their immense size poses significant challenges, such as huge training and inference costs, substantial energy…

Computation and Language · Computer Science 2025-06-03 Andrei Tomut , Saeed S. Jahromi , Abhijoy Sarkar , Uygar Kurt , Sukhbinder Singh , Faysal Ishtiaq , Cesar Muñoz , Prabdeep Singh Bajaj , Ali Elborady , Gianni del Bimbo , Mehrazin Alizadeh , David Montero , Pablo Martin-Ramiro , Muhammad Ibrahim , Oussama Tahiri Alaoui , John Malcolm , Samuel Mugel , Roman Orus

Quantum Large Language Models via Tensor Network Disentanglers

We propose a method to enhance the performance of Large Language Models (LLMs) by integrating quantum computing and quantum-inspired techniques. Specifically, our approach involves replacing the weight matrices in the Self-Attention and…

Quantum Physics · Physics 2024-10-24 Borja Aizpurua , Saeed S. Jahromi , Sukhbinder Singh , Roman Orus

Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices

In recent years, Large Language Models (LLMs) through Transformer structures have dominated many machine learning tasks, especially text processing. However, these models require massive amounts of data for training and induce high resource…

Machine Learning · Computer Science 2025-04-17 Kilian Pfeiffer , Mohamed Aboelenien Ahmed , Ramin Khalili , Jörg Henkel

Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression

Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to…

Computation and Language · Computer Science 2026-02-06 Peijun Zhu , Ning Yang , Baoliang Tian , Jiayu Wei , Weihao Zhang , Haijun Zhang , Pin Lv