Related papers: A Single-Layer Model Can Do Language Modeling

Character-Level Language Modeling with Deeper Self-Attention

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their…

Computation and Language · Computer Science 2018-12-11 Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, Llion Jones

Joint Prompt Optimization of Stacked LLMs using Variational Inference

Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can be seen as stochastic language layers in a language network, where the learnable parameters are the…

Computation and Language · Computer Science 2023-12-05 Alessandro Sordoni, Xingdi Yuan, Marc-Alexandre Côté, Matheus Pereira, Adam Trischler, Ziang Xiao, Arian Hosseini, Friederike Niedtner, Nicolas Le Roux

Using Single Layer Networks for Discrete, Sequential Data: An Example from Natural Language Processing

A natural language parser which has been successfully implemented is described. This is a hybrid system, in which neural networks operate within a rule based framework. It can be accessed via telnet for users to try on their own text. (For…

cmp-lg · Computer Science 2008-02-03 Caroline Lyon, Ray Frank

Exploring the Limits of Language Modeling

In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key challenges present in this task: corpora and…

Computation and Language · Computer Science 2016-02-15 Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

We investigate the geometry of predictive information across the layers of large language models (LLMs). We repurpose representation lenses (learned affine maps trained to predict the next token from intermediate residual streams) as…

Machine Learning · Computer Science 2026-05-12 Gianfranco Lombardo, Giuseppe Trimigno, Stefano Cagnoni

Language Modeling with Gated Convolutional Networks

The predominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite context approach…

Computation and Language · Computer Science 2017-09-12 Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier

Sequential Recurrent Neural Networks for Language Modeling

Feedforward Neural Network (FNN)-based language models estimate the probability of the next word based on the history of the last N words, whereas Recurrent Neural Networks (RNN) perform the same task based only on the last word and some…

Computation and Language · Computer Science 2017-03-24 Youssef Oualil, Clayton Greenberg, Mittul Singh, Dietrich Klakow

Revisiting Simple Neural Probabilistic Language Models

Recent progress in language modeling has been driven not only by advances in neural architectures, but also through hardware and optimization improvements. In this paper, we revisit the neural probabilistic language model (NPLM)…

Computation and Language · Computer Science 2021-04-09 Simeng Sun, Mohit Iyyer

Language Modeling through Long Term Memory Network

Recurrent Neural Networks (RNN), Long Short-Term Memory Networks (LSTM), and Memory Networks which contain memory are popularly used to learn patterns in sequential data. Sequential data has long sequences that hold relationships. RNN can…

Computation and Language · Computer Science 2019-04-22 Anupiya Nugaliyadde, Kok Wai Wong, Ferdous Sohel, Hong Xie

Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation

In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous space representations which in turn improves the quality of the…

Computation and Language · Computer Science 2021-06-21 Raj Dabre, Atsushi Fujita

Improving Language Modeling using Densely Connected Recurrent Neural Networks

In this paper, we introduce the novel concept of densely connected layers into recurrent neural networks. We evaluate our proposed architecture on the Penn Treebank language modeling task. We show that we can obtain similar perplexity…

Computation and Language · Computer Science 2017-07-20 Fréderic Godin, Joni Dambre, Wesley De Neve

Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models

In spite of their huge success, transformer models remain difficult to scale in depth. In this work, we develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the…

Computation and Language · Computer Science 2024-07-19 Akhil Kedia, Mohd Abbas Zaidi, Sushil Khyalia, Jungho Jung, Harshith Goka, Haejun Lee

Very Deep Convolutional Networks for Text Classification

The dominant approaches for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which have…

Computation and Language · Computer Science 2017-01-30 Alexis Conneau, Holger Schwenk, Loïc Barrault, Yann LeCun

On the Convergence Rate of Training Recurrent Neural Networks

How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory…

Machine Learning · Computer Science 2019-05-28 Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we…

Machine Learning · Computer Science 2026-05-15 Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao

Deep LSTM for Large Vocabulary Continuous Speech Recognition

Recurrent neural networks (RNNs), especially long short-term memory (LSTM) RNNs, are effective networks for sequential tasks like speech recognition. Deeper LSTM models perform well on large vocabulary continuous speech recognition, because…

Computation and Language · Computer Science 2017-03-22 Xu Tian, Jun Zhang, Zejun Ma, Yi He, Juan Wei, Peihao Wu, Wenchang Situ, Shuai Li, Yang Zhang

A Neural Network Approach for Mixing Language Models

The performance of Neural Network (NN)-based language models is steadily improving due to the emergence of new architectures, which are able to learn different natural language characteristics. This paper presents a novel framework, which…

Computation and Language · Computer Science 2017-08-24 Youssef Oualil, Dietrich Klakow

Pushdown Layers: Encoding Recursive Structure in Transformer Language Models

Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail…

Computation and Language · Computer Science 2023-10-31 Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning

Layer by Layer: Uncovering Hidden Representations in Language Models

From extracting features to generating text, the outputs of large language models (LLMs) typically rely on the final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that…

Machine Learning · Computer Science 2025-06-17 Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

Recurrent Stacking of Layers for Compact Neural Machine Translation Models

In neural machine translation (NMT), the most common practice is to stack a number of recurrent or feed-forward layers in the encoder and the decoder. As a result, the addition of each new layer improves the translation quality…

Computation and Language · Computer Science 2018-07-18 Raj Dabre, Atsushi Fujita