Related papers: Multi-scale Transformer Language Models

Revisiting the Hierarchical Multiscale LSTM

Hierarchical Multiscale LSTM (Chung et al., 2016a) is a state-of-the-art language model that learns interpretable structure from character-level input. Such models can provide fertile ground for (cognitive) computational linguistics…

Computation and Language · Computer Science 2018-07-11 Ákos Kádár , Marc-Alexandre Côté , Grzegorz Chrupała , Afra Alishahi

Multi-Scale Self-Attention for Text Classification

In this paper, we introduce the prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer which uses multi-scale multi-head self-attention to capture features from different scales. Based on…

Computation and Language · Computer Science 2019-12-03 Qipeng Guo , Xipeng Qiu , Pengfei Liu , Xiangyang Xue , Zheng Zhang

Emergent Stack Representations in Modeling Counter Languages Using Transformers

Transformer architectures are the backbone of most modern language models, but understanding the inner workings of these models still largely remains an open problem. One way that research in the past has tackled this problem is by…

Computation and Language · Computer Science 2025-02-04 Utkarsh Tiwari , Aviral Gupta , Michael Hahn

Learning Language-Specific Layers for Multilingual Machine Translation

Multilingual Machine Translation promises to improve translation quality between non-English languages. This is advantageous for several reasons, namely lower latency (no need to translate twice), and reduced error cascades (e.g., avoiding…

Computation and Language · Computer Science 2023-05-05 Telmo Pessoa Pires , Robin M. Schmidt , Yi-Hsiu Liao , Stephan Peitz

Pretraining with hierarchical memories: separating long-tail and common knowledge

The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a…

Computation and Language · Computer Science 2026-03-24 Hadi Pouransari , David Grangier , C Thomas , Michael Kirchhof , Oncel Tuzel

Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering

We introduce a novel approach to transformers that learns hierarchical representations in multiparty dialogue. First, three language modeling tasks are used to pre-train the transformers, token- and utterance-level language modeling and…

Computation and Language · Computer Science 2020-06-01 Changmao Li , Jinho D. Choi

Wide Attention Is The Way Forward For Transformers?

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown , Yiren Zhao , Ilia Shumailov , Robert D Mullins

Improving Transformer Models by Reordering their Sublayers

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with…

Computation and Language · Computer Science 2020-04-24 Ofir Press , Noah A. Smith , Omer Levy

Memory Layers at Scale

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated…

Computation and Language · Computer Science 2024-12-23 Vincent-Pierre Berges , Barlas Oğuz , Daniel Haziza , Wen-tau Yih , Luke Zettlemoyer , Gargi Ghosh

Hierarchical Transformer for Multilingual Machine Translation

The choice of parameter sharing strategy in multilingual machine translation models determines how optimally parameter space is used and hence, directly influences ultimate translation quality. Inspired by linguistic trees that show the…

Computation and Language · Computer Science 2021-03-08 Albina Khusainova , Adil Khan , Adín Ramírez Rivera , Vitaly Romanov

Learning Multiscale Transformer Models for Sequence Generation

Multiscale feature hierarchies have been witnessed the success in the computer vision area. This further motivates researchers to design multiscale Transformer for natural language processing, mostly based on the self-attention mechanism.…

Computation and Language · Computer Science 2022-06-22 Bei Li , Tong Zheng , Yi Jing , Chengbo Jiao , Tong Xiao , Jingbo Zhu

Scaling Transformer to 1M tokens and beyond with RMT

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer…

Computation and Language · Computer Science 2024-02-07 Aydar Bulatov , Yuri Kuratov , Yermek Kapushev , Mikhail S. Burtsev

Learning to Recall with Transformers Beyond Orthogonal Embeddings

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training…

Machine Learning · Statistics 2026-03-18 Nuri Mert Vural , Alberto Bietti , Mahdi Soltanolkotabi , Denny Wu

Mixture of Chapters: Scaling Learnt Memory in Transformers

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that…

Machine Learning · Computer Science 2026-03-24 Tasmay Pankaj Tibrewal , Pritish Saha , Ankit Meda , Kunal Singh , Pradeep Moturi

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science 2023-10-25 Sunit Bhattacharya , Ondrej Bojar

Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive…

Machine Learning · Computer Science 2025-06-03 Yifan Hao , Chenlu Ye , Chi Han , Tong Zhang

Training Trajectories of Language Models Across Scales

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger…

Computation and Language · Computer Science 2023-05-31 Mengzhou Xia , Mikel Artetxe , Chunting Zhou , Xi Victoria Lin , Ramakanth Pasunuru , Danqi Chen , Luke Zettlemoyer , Ves Stoyanov

Probing Representations Learned by Multimodal Recurrent and Transformer Models

Recent literature shows that large-scale language modeling provides excellent reusable sentence representations with both recurrent and self-attentive architectures. However, there has been less clarity on the commonalities and differences…

Computation and Language · Computer Science 2019-08-30 Jindřich Libovický , Pranava Madhyastha

Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their…

Machine Learning · Computer Science 2021-10-07 Narsimha Chilkuri , Eric Hunsberger , Aaron Voelker , Gurshaant Malik , Chris Eliasmith

Learning Spectral Methods by Transformers

Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacities of Transformers in performing unsupervised learning. We show that multi-layered Transformers, given a sufficiently…

Machine Learning · Statistics 2025-01-14 Yihan He , Yuan Cao , Hong-Yu Chen , Dennis Wu , Jianqing Fan , Han Liu