
Related papers: Multi-scale Transformer Language Models

200 papers

Hierarchical Multiscale LSTM (Chung et al., 2016a) is a state-of-the-art language model that learns interpretable structure from character-level input. Such models can provide fertile ground for (cognitive) computational linguistics…

Computation and Language · Computer Science 2018-07-11 Ákos Kádár, Marc-Alexandre Côté, Grzegorz Chrupała, Afra Alishahi

In this paper, we introduce the prior knowledge, multi-scale structure, into self-attention modules. We propose a Multi-Scale Transformer which uses multi-scale multi-head self-attention to capture features from different scales. Based on…

Computation and Language · Computer Science 2019-12-03 Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang

Transformer architectures are the backbone of most modern language models, but understanding the inner workings of these models still largely remains an open problem. One way that research in the past has tackled this problem is by…

Computation and Language · Computer Science 2025-02-04 Utkarsh Tiwari, Aviral Gupta, Michael Hahn

Multilingual Machine Translation promises to improve translation quality between non-English languages. This is advantageous for several reasons, namely lower latency (no need to translate twice), and reduced error cascades (e.g., avoiding…

Computation and Language · Computer Science 2023-05-05 Telmo Pessoa Pires, Robin M. Schmidt, Yi-Hsiu Liao, Stephan Peitz

The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a…

Computation and Language · Computer Science 2026-03-24 Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

We introduce a novel approach to transformers that learns hierarchical representations in multiparty dialogue. First, three language modeling tasks are used to pre-train the transformers, token- and utterance-level language modeling and…

Computation and Language · Computer Science 2020-06-01 Changmao Li, Jinho D. Choi

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with…

Computation and Language · Computer Science 2020-04-24 Ofir Press, Noah A. Smith, Omer Levy

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated…

Computation and Language · Computer Science 2024-12-23 Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, Gargi Ghosh

The choice of parameter sharing strategy in multilingual machine translation models determines how optimally parameter space is used and hence, directly influences ultimate translation quality. Inspired by linguistic trees that show the…

Computation and Language · Computer Science 2021-03-08 Albina Khusainova, Adil Khan, Adín Ramírez Rivera, Vitaly Romanov

Multiscale feature hierarchies have been witnessed the success in the computer vision area. This further motivates researchers to design multiscale Transformer for natural language processing, mostly based on the self-attention mechanism.…

Computation and Language · Computer Science 2022-06-22 Bei Li, Tong Zheng, Yi Jing, Chengbo Jiao, Tong Xiao, Jingbo Zhu

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer…

Computation and Language · Computer Science 2024-02-07 Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail S. Burtsev

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training…

Machine Learning · Statistics 2026-03-18 Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that…

Machine Learning · Computer Science 2026-03-24 Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science 2023-10-25 Sunit Bhattacharya, Ondrej Bojar

Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive…

Machine Learning · Computer Science 2025-06-03 Yifan Hao, Chenlu Ye, Chi Han, Tong Zhang

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger…

Computation and Language · Computer Science 2023-05-31 Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov

Recent literature shows that large-scale language modeling provides excellent reusable sentence representations with both recurrent and self-attentive architectures. However, there has been less clarity on the commonalities and differences…

Computation and Language · Computer Science 2019-08-30 Jindřich Libovický, Pranava Madhyastha

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their…

Machine Learning · Computer Science 2021-10-07 Narsimha Chilkuri, Eric Hunsberger, Aaron Voelker, Gurshaant Malik, Chris Eliasmith

Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacities of Transformers in performing unsupervised learning. We show that multi-layered Transformers, given a sufficiently…

Machine Learning · Statistics 2025-01-14 Yihan He, Yuan Cao, Hong-Yu Chen, Dennis Wu, Jianqing Fan, Han Liu