Related papers: Understanding Transformer Encoder-Decoder Represen…

Learning from Binary Multiway Data: Probabilistic Tensor Decomposition and its Statistical Optimality

We consider the problem of decomposing a higher-order tensor with binary entries. Such data problems arise frequently in applications such as neuroimaging, recommendation systems, topic modeling, and sensor network localization. We propose a…

Machine Learning · Statistics 2020-09-22 Miaoyan Wang , Lexin Li

Adaptive Network Sparsification with Dependent Variational Beta-Bernoulli Dropout

While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without consideration of the input data. With such…

Machine Learning · Statistics 2019-03-05 Juho Lee , Saehoon Kim , Jaehong Yoon , Hae Beom Lee , Eunho Yang , Sung Ju Hwang

Structured Multidimensional Representation Learning for Large Language Models

Transformer architectures achieve state-of-the-art performance across a wide range of pattern recognition and natural language processing tasks, but their scaling is accompanied by substantial parameter growth and redundancy in the…

Computation and Language · Computer Science 2026-03-09 Alaa El Ichi , Khalide Jbilou , Mohamed El Guide , Franck Dufrenois

Reducing Transformer Depth on Demand with Structured Dropout

Overparameterized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of…

Machine Learning · Computer Science 2019-09-26 Angela Fan , Edouard Grave , Armand Joulin

Entroformer: A Transformer-based Entropy Model for Learned Image Compression

One critical component in lossy deep image compression is the entropy model, which predicts the probability distribution of the quantized latent representation in the encoding and decoding modules. Previous works build entropy models upon…

Image and Video Processing · Electrical Eng. & Systems 2023-03-16 Yichen Qian , Ming Lin , Xiuyu Sun , Zhiyu Tan , Rong Jin

Estimating Probability Densities with Transformer and Denoising Diffusion

Transformers are often the go-to architecture to build foundation models that ingest a large amount of training data. But these models do not estimate the probability density distribution when trained on regression problems, yet obtaining…

Machine Learning · Computer Science 2024-07-23 Henry W. Leung , Jo Bovy , Joshua S. Speagle

Turbo Decoding on the Binary Erasure Channel: Finite-Length Analysis and Turbo Stopping Sets

This paper is devoted to the finite-length analysis of turbo decoding over the binary erasure channel (BEC). The performance of iterative belief-propagation (BP) decoding of low-density parity-check (LDPC) codes over the BEC can be…

Information Theory · Computer Science 2011-05-31 Eirik Rosnes , Øyvind Ytrehus

Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation

Predicting the altered acoustic frames is an effective way of self-supervised learning for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose to introduce two…

Audio and Speech Processing · Electrical Eng. & Systems 2021-07-12 Jian Luo , Jianzong Wang , Ning Cheng , Jing Xiao

Revisiting Transformers through the Lens of Low Entropy and Dynamic Sparsity

Compression has been a critical lens to understand the success of Transformers. In the past, we have typically taken the target distribution as a criterion to evaluate a model's compression performance. Nevertheless, it often remains…

Machine Learning · Computer Science 2025-04-29 Ruifeng Ren , Yong Liu

Learnable Bernoulli Dropout for Bayesian Deep Learning

In this work, we propose learnable Bernoulli dropout (LBD), a new model-agnostic dropout scheme that considers the dropout rates as parameters jointly optimized with other model parameters. By probabilistic modeling of Bernoulli dropout,…

Machine Learning · Computer Science 2020-02-13 Shahin Boluki , Randy Ardywibowo , Siamak Zamani Dadaneh , Mingyuan Zhou , Xiaoning Qian

Cross-Domain Lossy Compression via Constrained Minimum Entropy Coupling

This paper studies cross-domain lossy compression through the lens of minimum entropy coupling (MEC) with rate and classification constraints. In this setting, an encoder observes samples from a degraded source domain, while the decoder is…

Information Theory · Computer Science 2026-05-12 Nam Nguyen , Hassan Tavakoli , An Vuong , Thinh Nguyen , Bella Bose

Probing Word Translations in the Transformer and Trading Decoder for Encoder Layers

Due to its effectiveness and performance, the Transformer translation model has attracted wide attention, most recently in terms of probing-based approaches. Previous work focuses on using or probing source linguistic features in the…

Computation and Language · Computer Science 2021-04-21 Hongfei Xu , Josef van Genabith , Qiuhui Liu , Deyi Xiong

Robustly representing uncertainty in deep neural networks through sampling

As deep neural networks (DNNs) are applied to increasingly challenging problems, they will need to be able to represent their own uncertainty. Modeling uncertainty is one of the key features of Bayesian methods. Using Bernoulli dropout with…

Machine Learning · Computer Science 2019-09-19 Patrick McClure , Nikolaus Kriegeskorte

UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost

Transformer architecture achieves great success in abundant natural language processing tasks. The over-parameterization of the Transformer model has motivated plenty of works to alleviate its overfitting for superior performances. With…

Computation and Language · Computer Science 2021-04-13 Zhen Wu , Lijun Wu , Qi Meng , Yingce Xia , Shufang Xie , Tao Qin , Xinyu Dai , Tie-Yan Liu

Recurrent multiple shared layers in Depth for Neural Machine Translation

Learning deeper models is usually a simple and effective approach to improve model performance, but deeper models have larger model parameters and are more difficult to train. To get a deeper model, simply stacking more layers of the model…

Computation and Language · Computer Science 2021-08-27 GuoLiang Li , Yiyang Li

Adaptive Transformers for Learning Multimodal Representations

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we…

Computation and Language · Computer Science 2020-07-09 Prajjwal Bhargava

A Lower Bound on the Error Exponent of Linear Block Codes over the Erasure Channel

A lower bound on the maximum likelihood (ML) decoding error exponent of linear block code ensembles, on the erasure channel, is developed. The lower bound turns out to be positive, over an ensemble-specific interval of erasure probabilities,…

Information Theory · Computer Science 2019-01-23 Enrico Paolini , Gianluigi Liva

Optimal Remote Estimation Over Use-Dependent Packet-Drop Channels - Extended Version

Consider a discrete-time remote estimation system formed by an encoder, a transmission policy, a channel, and a remote estimator. The encoder assesses a random process that the remote estimator seeks to estimate based on information sent to…

Systems and Control · Computer Science 2016-05-04 David Ward , Nuno C. Martins

Constant Composition Distribution Matching

Distribution matching transforms independent and Bernoulli(1/2) distributed input bits into a sequence of output symbols with a desired distribution. Fixed-to-fixed length, invertible, and low complexity encoders and decoders based on…

Information Theory · Computer Science 2015-03-18 Patrick Schulte , Georg Böcherer

Function Contrastive Learning of Transferable Meta-Representations

Meta-learning algorithms adapt quickly to new tasks that are drawn from the same task distribution as the training tasks. The mechanism leading to fast adaptation is the conditioning of a downstream predictive model on the inferred…

Machine Learning · Computer Science 2021-07-23 Muhammad Waleed Gondal , Shruti Joshi , Nasim Rahaman , Stefan Bauer , Manuel Wüthrich , Bernhard Schölkopf