Related papers: Efficient GPT Model Pre-training using Tensor Trai…

TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition

High-dimensional token embeddings underpin Large Language Models (LLMs), as they can capture subtle semantic information and significantly enhance the modelling of complex language patterns. However, this high dimensionality also introduces…

Computation and Language · Computer Science · 2024-10-07 · Mingxue Xu, Yao Lei Xu, Danilo P. Mandic

bert2BERT: Towards Reusable Pretrained Language Models

In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources and most of the models are trained from…

Computation and Language · Computer Science · 2021-10-15 · Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, Qun Liu

Tensorized Embedding Layers for Efficient Model Compression

The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous,…

Computation and Language · Computer Science · 2020-02-20 · Oleksii Hrinchuk, Valentin Khrulkov, Leyla Mirvakhabova, Elena Orlova, Ivan Oseledets

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory…

Computation and Language · Computer Science · 2020-03-17 · Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

Exploring Extreme Parameter Compression for Pre-trained Language Models

Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs…

Computation and Language · Computer Science · 2022-05-23 · Yuxin Ren, Benyou Wang, Lifeng Shang, Xin Jiang, Qun Liu

TQCompressor: improving tensor decomposition methods in neural networks via permutations

We introduce TQCompressor, a novel method for neural network model compression with improved tensor decompositions. We explore the challenges posed by the computational and storage demands of pre-trained language models in NLP tasks and…

Machine Learning · Computer Science · 2024-01-30 · V. Abronin, A. Naumov, D. Mazur, D. Bystrov, K. Tsarova, Ar. Melnikov, I. Oseledets, S. Dolgov, R. Brasher, M. Perelshtein

Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI

Pre-trained language models have recently emerged as a powerful tool for fine-tuning a variety of language tasks. Ideally, when models are pre-trained on a large amount of data, they are expected to gain implicit knowledge. In this paper, we…

Computation and Language · Computer Science · 2023-06-22 · Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger

Training Tips for the Transformer Model

This article describes our experiments in neural machine translation using the recent Tensor2Tensor framework and the Transformer sequence-to-sequence model (Vaswani et al., 2017). We examine some of the critical parameters that affect the…

Computation and Language · Computer Science · 2018-05-03 · Martin Popel, Ondřej Bojar

Parameter-Efficient Transformer Embeddings

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which…

Computation and Language · Computer Science · 2025-05-06 · Henry Ndubuaku, Mouad Talhi

Benchmarking down-scaled (not so large) pre-trained language models

Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes. At the same time, more fundamental components, such as the pre-training objective or…

Computation and Language · Computer Science · 2021-05-12 · M. Aßenmacher, P. Schulze, C. Heumann

Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

Fine-tuning a pretrained transformer for a downstream task has become a standard method in NLP in the last few years. While the results from these models are impressive, applying them can be extremely computationally expensive, as is…

Computation and Language · Computer Science · 2020-08-18 · Davis Yoshida, Allyson Ettinger, Kevin Gimpel

A Comparative Analysis of Distributed Training Strategies for GPT-2

The rapid advancement in Large Language Models has been met with significant challenges in their training processes, primarily due to their considerable computational and memory demands. This research examines parallelization techniques…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-27 · Ishan Patwardhan, Shubham Gandhi, Om Khare, Amit Joshi, Suraj Sawant

Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its…

Computation and Language · Computer Science · 2024-06-21 · Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva

Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture

In this paper, we propose a highly parameter-efficient approach to scaling pre-trained language models (PLMs) to a deeper model depth. Unlike prior work that shares all parameters or uses extra blocks, we design a more capable…

Computation and Language · Computer Science · 2023-04-12 · Peiyu Liu, Ze-Feng Gao, Yushuo Chen, Wayne Xin Zhao, Ji-Rong Wen

Primer: Searching for Efficient Transformers for Language Modeling

Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs…

Machine Learning · Computer Science · 2022-01-26 · David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le

Kronecker Decomposition for GPT Compression

GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance in several downstream tasks. The success…

Computation and Language · Computer Science · 2021-10-18 · Ali Edalati, Marzieh Tahaei, Ahmad Rashid, Vahid Partovi Nia, James J. Clark, Mehdi Rezagholizadeh

Timer: Generative Pre-trained Transformers Are Large Time Series Models

Deep learning has contributed remarkably to the advancement of time series analysis. Still, deep models can encounter performance bottlenecks in real-world data-scarce scenarios, which can be concealed due to the performance saturation with…

Machine Learning · Computer Science · 2024-10-21 · Yong Liu, Haoran Zhang, Chenyu Li, Xiangdong Huang, Jianmin Wang, Mingsheng Long

Teaching Arithmetic to Small Transformers

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token…

Machine Learning · Computer Science · 2023-07-10 · Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

Tensor-Train Long Short-Term Memory for Monaural Speech Enhancement

In recent years, Long Short-Term Memory (LSTM) has become a popular choice for speech separation and speech enhancement tasks. The capability of an LSTM network can be enhanced by widening it and adding more layers. However, this would introduce…

Sound · Computer Science · 2018-12-27 · Suman Samui, Indrajit Chakrabarti, Soumya K. Ghosh

FPM: A Collection of Large-scale Foundation Pre-trained Language Models

Large-scale Transformer models have significantly promoted the recent development of natural language processing applications. However, little effort has been made to unify the effective models. In this paper, driven by providing a new set…

Computation and Language · Computer Science · 2022-04-12 · Dezhou Shen