Related papers: IOT: Instance-wise Layer Reordering for Transforme…

Improving Transformer Models by Reordering their Sublayers

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with…

Computation and Language · Computer Science 2020-04-24 Ofir Press, Noah A. Smith, Omer Levy

Intra-Layer Recurrence in Transformers for Language Modeling

Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by…

Computation and Language · Computer Science 2025-05-27 Anthony Nguyen, Wenjun Lin

How Well Can Transformers Emulate In-context Newton's Method?

Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into their underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization…

Machine Learning · Computer Science 2024-03-06 Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression

Transformers excel at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they do so remains a mystery. Recent work suggests that Transformers may internally run Gradient Descent (GD), a…

Machine Learning · Computer Science 2024-11-19 Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan

Explicit Reordering for Neural Machine Translation

In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency, which makes the Transformer-based NMT achieve…

Computation and Language · Computer Science 2020-04-09 Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita

Evolving Attention with Residual Convolutions

The Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order

Due to their architecture and how they are trained, artificial neural networks are typically not robust toward pruning or shuffling layers at test time. However, such properties would be desirable for different applications, such as…

Computer Vision and Pattern Recognition · Computer Science 2024-12-09 Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi

Linear Transformers are Versatile In-Context Learners

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in…

Machine Learning · Computer Science 2024-10-31 Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

Transformers for Supervised Online Continual Learning

Transformers have become the dominant architecture for sequence modeling tasks such as natural language processing or audio processing, and they are now even considered for tasks that are not naturally sequential such as image…

Machine Learning · Computer Science 2024-03-05 Jorg Bornschein, Yazhe Li, Amal Rannen-Triki

Position Information in Transformers: An Overview

Transformers are arguably the main workhorse in recent Natural Language Processing research. By definition a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is…

Computation and Language · Computer Science 2021-09-10 Philipp Dufter, Martin Schmitt, Hinrich Schütze

Wide Attention Is The Way Forward For Transformers?

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

Less is More: Pay Less Attention in Vision Transformers

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works…

Computer Vision and Pattern Recognition · Computer Science 2021-12-24 Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai

A Neural ODE Interpretation of Transformer Layers

Transformer layers, which use an alternating pattern of multi-head attention and multi-layer perceptron (MLP) layers, provide an effective tool for a variety of machine learning problems. As the transformer layers use residual connections…

Machine Learning · Computer Science 2022-12-13 Yaofeng Desmond Zhong, Tongtao Zhang, Amit Chakraborty, Biswadip Dey

REOrdering Patches Improves Vision Models

Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is…

Machine Learning · Computer Science 2025-10-24 Declan Kutscher, David M. Chan, Yutong Bai, Trevor Darrell, Ritwik Gupta

Transformers predicting the future. Applying attention in next-frame and time series forecasting

Recurrent Neural Networks were, until recently, one of the best ways to capture the timely dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention-mechanisms…

Machine Learning · Computer Science 2021-08-19 Radostin Cholakov, Todor Kolev

InstaFormer: Instance-Aware Image-to-Image Translation with Transformer

We present a novel Transformer-based network architecture for instance-aware image-to-image translation, dubbed InstaFormer, to effectively integrate global- and instance-level information. By considering extracted content features from an…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Soohyun Kim, Jongbeom Baek, Jihye Park, Gyeongnyeon Kim, Seungryong Kim

Layer-Parallel Training for Transformers

We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and…

Machine Learning · Computer Science 2026-01-27 Shuai Jiang, Marc Salvadó-Benasco, Eric C. Cyr, Alena Kopaničáková, Rolf Krause, Jacob B. Schroder

Explicitly Modeling the Discriminability for Instance-Aware Visual Object Tracking

Visual object tracking performance has been dramatically improved in recent years, but some severe challenges remain open, like distractors and occlusions. We suspect the reason is that the feature representations of the tracking targets…

Computer Vision and Pattern Recognition · Computer Science 2021-10-29 Mengmeng Wang, Xiaoqian Yang, Yong Liu

IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method

High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques,…

Machine Learning · Computer Science 2025-10-14 Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be…

Computation and Language · Computer Science 2023-06-01 Jeremiah Milbauer, Annie Louis, Mohammad Javad Hosseini, Alex Fabrikant, Donald Metzler, Tal Schuster