Related papers: Implementing Suffix Array Algorithm Using Apache B…

Scalable Construction of Text Indexes

The suffix array is the key to efficient solutions for myriads of string processing problems in different applications domains, like data compression, data mining, or Bioinformatics. With the rapid growth of available data, suffix array…

Data Structures and Algorithms · Computer Science 2016-10-11 Timo Bingmann , Simon Gog , Florian Kurpicz

Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays

The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix…

Data Structures and Algorithms · Computer Science 2013-03-27 Simon Gog , Alistair Moffat , J. Shane Culpepper , Andrew Turpin , Anthony Wirth

Suffixient Arrays: a New Efficient Suffix Array Compression Technique

The Suffix Array is a classic text index enabling on-line pattern matching queries via simple binary search. The main drawback of the Suffix Array is that it takes linear space in the text's length, even if the text itself is extremely…

Data Structures and Algorithms · Computer Science 2025-03-19 Davide Cenzato , Lore Depuydt , Travis Gagie , Sung-Hwan Kim , Giovanni Manzini , Francisco Olivares , Nicola Prezza

Efficient repeat finding via suffix arrays

We solve the problem of finding interspersed maximal repeats using a suffix array construction. As it is well known, all the functionality of suffix trees can be handled by suffix arrays, gaining practicality. Our solution improves the…

Data Structures and Algorithms · Computer Science 2013-04-03 Veronica Becher , Alejandro Deymonnaz , Pablo Ariel Heiber

Alphabet-dependent Parallel Algorithm for Suffix Tree Construction for Pattern Searching

Suffix trees have recently become very successful data structures in handling large data sequences such as DNA or Protein sequences. Consequently parallel architectures have become ubiquitous. We present a novel alphabet-dependent parallel…

Data Structures and Algorithms · Computer Science 2017-04-20 Freeson Kaniwa , Venu Madhav Kuthadi , Otlhapile Dinakenyane , Heiko Schroeder

Scalable and Efficient Construction of Suffix Array with MapReduce and In-Memory Data Store System

Suffix Array (SA) is a cardinal data structure in many pattern matching applications, including data compression, plagiarism detection and sequence alignment. However, as the volumes of data increase abruptly, the construction of SA is not…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-05-16 Hsiang-Huang Wu , Chien-Min Wang , Hsuan-Chi Kuo , Wei-Chun Chung , Jan-Ming Ho

Suffix arrays with a twist

The suffix array is a classic full-text index, combining effectiveness with simplicity. We discuss three approaches aiming to improve its efficiency even more: changes to the navigation, data layout and adding extra data. In short, we show…

Data Structures and Algorithms · Computer Science 2016-07-28 Tomasz Kowalski , Szymon Grabowski , Kimmo Fredriksson , Marcin Raniszewski

An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality reduction techniques are becoming mandatory. Among many techniques, feature selection has been growing in interest as an important tool to identify relevant features on…

Artificial Intelligence · Computer Science 2016-10-20 Sergio Ramírez-Gallego , Héctor Mouriño-Talín , David Martínez-Rego , Verónica Bolón-Canedo , José Manuel Benítez , Amparo Alonso-Betanzos , Francisco Herrera

Engineering a Distributed Full-Text Index

We present a distributed full-text index for big data applications in a distributed environment. Our index can answer different types of pattern matching queries (existential, counting and enumeration). We perform experiments on inputs up…

Data Structures and Algorithms · Computer Science 2016-12-07 Johannes Fischer , Florian Kurpicz , Peter Sanders

On the Optimisation of the GSACA Suffix Array Construction Algorithm

The suffix array is arguably one of the most important data structures in sequence analysis and consequently there is a multitude of suffix sorting algorithms. However, to this date the GSACA algorithm introduced in 2015 is the only known…

Data Structures and Algorithms · Computer Science 2022-08-31 Jannik Olbrich , Enno Ohlebusch , Thomas Büchler

On the Use of Suffix Arrays for Memory-Efficient Lempel-Ziv Data Compression

Much research has been devoted to optimizing algorithms of the Lempel-Ziv (LZ) 77 family, both in terms of speed and memory requirements. Binary search trees and suffix trees (ST) are data structures that have been often used for this…

Data Structures and Algorithms · Computer Science 2016-11-17 Artur Ferreira , Arlindo Oliveira , Mario Figueiredo

Suffix sorting via matching statistics

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a…

Data Structures and Algorithms · Computer Science 2024-04-16 Zsuzsanna Lipták , Francesco Masillo , Simon J. Puglisi

Compressed Spaced Suffix Arrays

Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data…

Data Structures and Algorithms · Computer Science 2014-03-11 Travis Gagie , Giovanni Manzini , Daniel Valenzuela

Using a Big Data Database to Identify Pathogens in Protein Data Space

Current metagenomic analysis algorithms require significant computing resources, can report excessive false positives (type I errors), may miss organisms (type II errors / false negatives), or scale poorly on large datasets. This paper…

Databases · Computer Science 2015-01-23 Ashley Mae Conard , Stephanie Dodson , Jeremy Kepner , Darrell Ricke

An Elegant Algorithm for the Construction of Suffix Arrays

The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. It has been introduced as a memory efficient alternative for suffix trees. The suffix array…

Data Structures and Algorithms · Computer Science 2013-07-05 Sanguthevar Rajasekaran , Marius Nicolae

Scaling associative classification for very large datasets

Supervised learning algorithms are nowadays successfully scaling up to datasets that are very large in volume, leveraging the potential of in-memory cluster-computing Big Data frameworks. Still, massive datasets with a number of…

Machine Learning · Computer Science 2018-05-11 Luca Venturini , Elena Baralis , Paolo Garza

Parallel Suffix Array Construction by Accelerated Sampling

A deterministic BSP algorithm for constructing the suffix array of a given string is presented, based on a technique which we call accelerated sampling. It runs in optimal O(n/p) local computation and communication, and requires a near…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-02-26 Matthew Felice Pace , Alexander Tiskin

Subdata selection for big data regression: an improved approach

In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may…

Methodology · Statistics 2024-11-13 Vasilis Chasiotis , Dimitris Karlis

Explaining the Inherent Tradeoffs for Suffix Array Functionality: Equivalences between String Problems and Prefix Range Queries

We study the fundamental question of how efficiently suffix array entries can be accessed when the array cannot be stored explicitly. The suffix array $SA_T[1..n]$ of a text $T$ of length $n$ encodes the lexicographic order of its suffixes…

Data Structures and Algorithms · Computer Science 2025-10-23 Dominik Kempa , Tomasz Kociumaka

DPASF: A Flink Library for Streaming Data preprocessing

Data preprocessing techniques are devoted to correct or alleviate errors in data. Discretization and feature selection are two of the most extended data preprocessing techniques. Although we can find many proposals for static Big Data…

Databases · Computer Science 2018-10-16 Alejandro Alcalde-Barros , Diego García-Gil , Salvador García , Francisco Herrera