Related papers: Decoding billions of integers per second through v…

SIMD Compression and the Intersection of Sorted Integers

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the SIMD instructions available in common processors to boost the speed of integer compression schemes. Our…

Information Retrieval · Computer Science 2020-04-22 Daniel Lemire, Leonid Boytsov, Nathan Kurz

Vectorized VByte Decoding

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a…

Information Retrieval · Computer Science 2017-01-17 Jeff Plaisance, Nathan Kurz, Daniel Lemire

A General SIMD-based Approach to Accelerating Compression Algorithms

Compression algorithms are important for data-oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets provide us an opportunity for achieving better compression performance.…

Information Retrieval · Computer Science 2015-04-15 Wayne Xin Zhao, Xudong Zhang, Daniel Lemire, Dongdong Shan, Jian-Yun Nie, Hongfei Yan, Ji-Rong Wen

Scanning HTML at Tens of Gigabytes per Second on ARM Processors

Modern processors have instructions to process 16 bytes or more at once. These instructions are called SIMD, for single instruction, multiple data. Recent advances have leveraged SIMD instructions to accelerate parsing of common Internet…

Data Structures and Algorithms · Computer Science 2025-06-05 Daniel Lemire

Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions

Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte…

Databases · Computer Science 2017-01-18 Daniel Lemire, Christoph Rupp

An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (10pJ/word on-die to 10,000pJ/word on the network). To increase memory locality at the hardware level and reduce…

Computational Physics · Physics 2018-01-17 H. Vincenti, R. Lehe, R. Sasanka, J-L. Vay

On Optimally Partitioning Variable-Byte Codes

The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other more sophisticated encoders, especially when the integers to…

Information Retrieval · Computer Science 2022-02-08 Giulio Ermanno Pibiri, Rossano Venturini

Stream VByte: Faster Byte-Oriented Integer Compression

Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are…

Information Retrieval · Computer Science 2017-10-11 Daniel Lemire, Nathan Kurz, Christoph Rupp

Adaptive SIMD optimizations in particle-in-cell codes with fine-grain particle sorting

Particle-In-Cell (PIC) codes are broadly applied to the kinetic simulation of plasmas, from laser-matter interaction to astrophysics. Their heavy simulation cost can be mitigated by using the Single Instruction Multiple Data (SIMD)…

Computational Physics · Physics 2019-10-02 Arnaud Beck, Julien Dérouillat, Mathieu Lobet, Asma Farjallah, Francesco Massimo, Imen Zemzemi, Frédéric Perez, Tommaso Vinci, Mickael Grech

Converting an Integer to a Decimal String in Under Two Nanoseconds

Converting binary integers to variable-length decimal strings is a fundamental operation in computing. Conventional fast approaches rely on recursive division and small lookup tables. We propose a SIMD-based algorithm that leverages integer…

Data Structures and Algorithms · Computer Science 2026-05-07 Jaël Champagne Gareau, Daniel Lemire

Bolt: Accelerated Data Mining with Fast Vector Compression

Vectors of data are at the heart of machine learning and data mining. Recently, vector quantization methods have shown great promise in reducing both the time and space costs of operating on vectors. We introduce a vector quantization…

Performance · Computer Science 2017-07-03 Davis W Blalock, John V Guttag

Faster Population Counts Using AVX2 Instructions

Counting the number of ones in a binary stream is a common operation in database, information-retrieval, cryptographic and machine-learning applications. Most processors have dedicated instructions to count the number of ones in a word…

Data Structures and Algorithms · Computer Science 2018-09-07 Wojciech Muła, Nathan Kurz, Daniel Lemire

Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-16 Daniel Lemire, Wojciech Muła

The Fast Fibonacci Decompression Algorithm

Data compression has been widely applied in many data processing areas. Compression methods use variable-size codes with the shorter codes assigned to symbols or groups of symbols that appear in the data frequently. Fibonacci coding, as a…

Performance · Computer Science 2007-12-19 R. Baca, V. Snasel, J. Platos, M. Kratky, E. El-Qawasmeh

Faster Base64 Encoding and Decoding Using AVX2 Instructions

Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the…

Mathematical Software · Computer Science 2026-04-07 Wojciech Muła, Daniel Lemire

An Efficient Vectorization Scheme for Stencil Computation

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization and tiling techniques, aiming at exploiting the in-core data parallelism and data…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-19 Kun Li, Liang Yuan, Yunquan Zhang, Yue Yue, Hang Cao, Pengqi Lu

Short reasons for long vectors in HPC CPUs: a study based on RISC-V

For years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially-available SIMD units process up to 8 double-precision elements with one instruction.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-14 Pablo Vizcaino, Georgios Ieronymakis, Nikolaos Dimou, Vassilis Papaefstathiou, Jesus Labarta, Filippo Mantovani

Comparing the Performance of Different x86 SIMD Instruction Sets for a Medical Imaging Application on Modern Multi- and Manycore Chips

Single Instruction, Multiple Data (SIMD) vectorization is a major driver of performance in current architectures, and is mandatory for achieving good performance with codes that are limited by instruction throughput. We investigate the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-01-30 Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein

Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More

Deep learning implementations on CPUs (Central Processing Units) are gaining more traction. Enhanced AI capabilities on commodity x86 architectures are commercially appealing due to the reuse of existing hardware and virtualization ease. A…

Machine Learning · Computer Science 2021-03-22 Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava

Simplicity Scales

The dominant data interchange formats encode integers using a variable number of bytes or represent floating-point numbers as variable-length UTF-8 strings. The decoder must inspect each byte for a continuation bit or parse each character…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Andrew Sampson, Yuta Saito, Ronny Chan