Related papers: Language Modelling with Pixels

Pixology: Probing the Linguistic and Visual Capabilities of Pixel-based Language Models

Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that…

Computation and Language · Computer Science · 2024-10-17 · Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam de Lhoneux

Multilingual Pretraining for Pixel Language Models

Pixel language models operate directly on images of rendered text, eliminating the need for a fixed vocabulary. While these models have demonstrated strong capabilities for downstream cross-lingual transfer, multilingual pretraining remains…

Computation and Language · Computer Science · 2025-12-03 · Ilker Kesen, Jonas F. Lotz, Ingo Ziegler, Phillip Rust, Desmond Elliott

Text Rendering Strategies for Pixel Language Models

Pixel-based language models process text rendered as images, which allows them to handle any script, making them a promising approach to open vocabulary language modelling. However, recent approaches use text renderers that produce a large…

Computation and Language · Computer Science · 2023-11-02 · Jonas F. Lotz, Elizabeth Salesky, Phillip Rust, Desmond Elliott

Evaluating Pixel Language Models on Non-Standardized Languages

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves…

Computation and Language · Computer Science · 2024-12-13 · Alberto Muñoz-Ortiz, Verena Blaschke, Barbara Plank

Multilingual Pixel Representations for Translation and Effective Cross-lingual Transfer

We introduce and demonstrate how to effectively train multilingual machine translation models with pixel representations. We experiment with two different data settings with a variety of language and script coverage, demonstrating improved…

Computation and Language · Computer Science · 2023-10-25 · Elizabeth Salesky, Neha Verma, Philipp Koehn, Matt Post

Uncertainty in Semantic Language Modeling with PIXELS

Pixel-based language models aim to solve the vocabulary bottleneck problem in language modeling, but the challenge of uncertainty quantification remains open. The novelty of this work consists of analysing uncertainty and confidence in…

Computation and Language · Computer Science · 2025-09-25 · Stefania Radu, Marco Zullich, Matias Valdenegro-Toro

Overcoming Vocabulary Constraints with Pixel-level Fallback

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models…

Computation and Language · Computer Science · 2025-08-12 · Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva

MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle…

Computation and Language · Computer Science · 2026-04-14 · Chen Hu, Yintao Tai, Antonio Vergari, Frank Keller, Alessandro Suglia

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image…

Computer Vision and Pattern Recognition · Computer Science · 2020-06-23 · Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

PIXAR: Auto-Regressive Language Modeling in Pixel Space

Recent work showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text.…

Computation and Language · Computer Science · 2024-02-27 · Yintao Tai, Xiyang Liao, Alessandro Suglia, Antonio Vergari

PIXELS: Progressive Image Xemplar-based Editing with Latent Surgery

Recent advancements in language-guided diffusion models for image editing are often bottle-necked by cumbersome prompt engineering to precisely articulate desired changes. An intuitive alternative calls on guidance from in-the-wild image…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-20 · Shristi Das Biswas, Matthew Shreve, Xuelu Li, Prateek Singhal, Kaushik Roy

Language Through a Prism: A Spectral Approach for Multiscale Language Representations

Language exhibits structure at different scales, ranging from subwords to words, sentences, paragraphs, and documents. To what extent do deep models capture information at these scales, and can we force them to better capture structure…

Computation and Language · Computer Science · 2020-11-11 · Alex Tamkin, Dan Jurafsky, Noah Goodman

PixelWorld: How Far Are We from Perceiving Everything as Pixels?

Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-23 · Zhiheng Lyu, Xueguang Ma, Wenhu Chen

Improving Language Understanding from Screenshots

An emerging family of language models (LMs), capable of processing both text and images within a single visual view, has the promise to unlock complex tasks such as chart understanding and UI navigation. We refer to these models as…

Computation and Language · Computer Science · 2024-02-27 · Tianyu Gao, Zirui Wang, Adithya Bhaskar, Danqi Chen

PixelLM: Pixel Reasoning with Large Multimodal Model

While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-19 · Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin

Pixel Sentence Representation Learning

Pretrained language models have long been known to be subpar in capturing sentence and document-level semantics. Though heavily investigated, transferring perturbation-based methods from unsupervised visual representation learning to NLP remains…

Computation and Language · Computer Science · 2024-02-14 · Chenghao Xiao, Zhuoxu Huang, Danlu Chen, G Thomas Hudson, Yizhi Li, Haoran Duan, Chenghua Lin, Jie Fu, Jungong Han, Noura Al Moubayed

Revisiting Language Encoding in Learning Multilingual Representations

Transformer has demonstrated its great power to learn contextual word representations for multiple languages in a single model. To process multilingual sentences in the model, a learnable vector is usually assigned to each language, which…

Computation and Language · Computer Science · 2021-02-17 · Shengjie Luo, Kaiyuan Gao, Shuxin Zheng, Guolin Ke, Di He, Liwei Wang, Tie-Yan Liu

In Pursuit of Pixel Supervision for Visual Pre-training

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-18 · Lihe Yang, Shang-Wen Li, Yang Li, Xinjie Lei, Dong Wang, Abdelrahman Mohamed, Hengshuang Zhao, Hu Xu

Pixel Aligned Language Models

Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural languages, answer visual-related questions, or perform complex reasoning about…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-15 · Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

PERL: Parameter Efficient Reasoning in CLIP Latent Space

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-20 · Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi