Related papers: Foundation Transformers

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and…

Computer Vision and Pattern Recognition · Computer Science 2022-09-01 Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, Furu Wei

Can bidirectional encoder become the ultimate winner for downstream applications of foundation models?

Over the past few decades, Artificial Intelligence (AI) has progressed from the initial machine learning stage to the deep learning stage, and now to the stage of foundational models. Foundational models have the characteristics of…

Computation and Language · Computer Science 2024-11-28 Lewen Yang, Xuanyu Zhou, Juao Fan, Xinyi Xie, Shengxin Zhu

FPM: A Collection of Large-scale Foundation Pre-trained Language Models

Large-scale Transformer models have significantly promoted the recent development of natural language processing applications. However, little effort has been made to unify the effective models. In this paper, driven by providing a new set…

Computation and Language · Computer Science 2022-04-12 Dezhou Shen

An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning approaches have recently garnered a lot of attention. Having a considerably lower number of trainable weights, these methods can bring about scalability and computational effectiveness. In this paper, we look…

Computation and Language · Computer Science 2023-02-23 Mohammad Akbar-Tajari, Sara Rajaee, Mohammad Taher Pilehvar

A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT

Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks with different data modalities. A PFM (e.g., BERT, ChatGPT, and GPT-4) is trained on large-scale data which provides a reasonable parameter…

Artificial Intelligence · Computer Science 2023-05-02 Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of…

Computation and Language · Computer Science 2023-06-13 Jeremy Gwinnup, Kevin Duh

Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models

Large pre-trained models, or foundation models, have shown impressive performance when adapted to a variety of downstream tasks, often out-performing specialized models. Hypernetworks, neural networks that generate some or all of the…

Machine Learning · Computer Science 2025-03-04 Jeffrey Gu, Serena Yeung-Levy

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text

In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified transformer consisting of modality-specific…

Computer Vision and Pattern Recognition · Computer Science 2021-12-15 Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown

Efficient Transformers: A Survey

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example,…

Machine Learning · Computer Science 2022-03-15 Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

Improving Transformer Models by Reordering their Sublayers

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with…

Computation and Language · Computer Science 2020-04-24 Ofir Press, Noah A. Smith, Omer Levy

RealFormer: Transformer Likes Residual Attention

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its…

Machine Learning · Computer Science 2021-09-14 Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie

Go Wider Instead of Deeper

More transformer blocks with residual connections have recently achieved impressive results on various tasks. To achieve better performance with fewer trainable parameters, recent methods are proposed to go shallower by parameter sharing or…

Machine Learning · Computer Science 2021-09-08 Fuzhao Xue, Ziji Shi, Futao Wei, Yuxuan Lou, Yong Liu, Yang You

Talk Like a Packet: Rethinking Network Traffic Analysis with Transformer Foundation Models

Inspired by the success of Transformer-based models in natural language processing, this paper investigates their potential as foundation models for network traffic analysis. We propose a unified pre-training and fine-tuning pipeline for…

Networking and Internet Architecture · Computer Science 2026-02-09 Samara Mayhoub, Chuan Heng Foh, Mahdi Boloursaz Mashhadi, Mohammad Shojafar, Rahim Tafazolli

Robust Transfer Learning with Pretrained Language Models through Adapters

Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific…

Computation and Language · Computer Science 2021-08-06 Wenjuan Han, Bo Pang, Yingnian Wu

From Prediction to Understanding: Will AI Foundation Models Transform Brain Science?

Generative pretraining (the "GPT" in ChatGPT) enables language models to learn from vast amounts of internet text without human supervision. This approach has driven breakthroughs across AI by allowing deep neural networks to learn from…

Neurons and Cognition · Quantitative Biology 2025-09-23 Thomas Serre, Ellie Pavlick

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Attention based language models have become a critical component in state-of-the-art natural language processing systems. However, these models have significant computational requirements, due to long training times, dense operations and…

Computation and Language · Computer Science 2021-06-11 Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

Transformers: "The End of History" for NLP?

Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art…

Computation and Language · Computer Science 2021-09-24 Anton Chernyavskiy, Dmitry Ilvovsky, Preslav Nakov

Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding

Natural Language Processing (NLP) has witnessed a transformative leap with the advent of transformer-based architectures, which have significantly enhanced the ability of machines to understand and generate human-like text. This paper…

Computation and Language · Computer Science 2025-03-27 Tianhao Wu, Yu Wang, Ngoc Quach

Benchmarking down-scaled (not so large) pre-trained language models

Large Transformer-based language models are pre-trained on corpora of varying sizes, for a different number of steps and with different batch sizes. At the same time, more fundamental components, such as the pre-training objective or…

Computation and Language · Computer Science 2021-05-12 M. Aßenmacher, P. Schulze, C. Heumann

Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders

Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks. From early applications in models like BERT to fine-tuning Large Language Models (LLMs), this approach has been…

Computation and Language · Computer Science 2025-02-25 Suneel Nadipalli