Related papers: A framework for space-efficient string kernels

Efficient Geometric-based Computation of the String Subsequence Kernel

Kernel methods are powerful tools in machine learning. They have to be computationally efficient. In this paper, we present a novel Geometric-based approach to compute efficiently the string subsequence kernel (SSK). Our main idea is that…

Machine Learning · Computer Science 2015-03-02 Slimane Bellaouar, Hadda Cherroun, Djelloul Ziadi

Space-efficient detection of unusual words

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either…

Data Structures and Algorithms · Computer Science 2015-08-13 Djamal Belazzougui, Fabio Cunial

GaKCo: a Fast GApped k-mer string Kernel using COunting

String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we…

Machine Learning · Computer Science 2017-09-19 Ritambhara Singh, Arshdeep Sekhon, Kamran Kowsari, Jack Lanchantin, Beilun Wang, Yanjun Qi

Is Input Sparsity Time Possible for Kernel Low-Rank Approximation?

Low-rank approximation is a common tool used to accelerate kernel methods: the $n \times n$ kernel matrix $K$ is approximated via a rank-$k$ matrix $\tilde K$ which can be stored in much less space and processed more quickly. In this work…

Data Structures and Algorithms · Computer Science 2017-11-07 Cameron Musco, David P. Woodruff

Space-efficient Feature Maps for String Alignment Kernels

String kernels are attractive data analysis tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVM in various applications.…

Machine Learning · Computer Science 2019-11-15 Yasuo Tabei, Yoshihiro Yamanishi, Rasmus Pagh

Kernels for sequentially ordered data

We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample…

Machine Learning · Statistics 2016-02-01 Franz J Király, Harald Oberhauser

Efficient Global String Kernel with Random Features: Beyond Counting Substructures

Analysis of large-scale sequential data has been one of the most crucial tasks in areas such as bioinformatics, text, and audio mining. Existing string kernels, however, either (i) rely on local features of short substructures in the…

Machine Learning · Computer Science 2019-12-02 Lingfei Wu, Ian En-Hsu Yen, Siyu Huo, Liang Zhao, Kun Xu, Liang Ma, Shouling Ji, Charu Aggarwal

Data structures to represent a set of k-long DNA sequences

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a…

Data Structures and Algorithms · Computer Science 2020-06-15 Rayan Chikhi, Jan Holub, Paul Medvedev

Fast Iteration of Spaced k-mers

Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative…

Genomics · Quantitative Biology 2026-05-15 Lucas Czech

Engineering Rank/Select Data Structures for Large-Alphabet Strings

Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduces several challenges that are not witnessed in…

Data Structures and Algorithms · Computer Science 2024-05-03 Diego Arroyuelo, Gabriel Carmona, Héctor Larrañaga, Francisco Riveros, Carlos Eugenio Rojas-Morales, Erick Sepúlveda

A la Carte - Learning Fast Kernels

Kernel methods have great promise for learning rich statistical representations of large modern datasets. However, compared to neural networks, kernel methods have been perceived as lacking in scalability and flexibility. We introduce a…

Machine Learning · Computer Science 2014-12-22 Zichao Yang, Alexander J. Smola, Le Song, Andrew Gordon Wilson

The SKIM-FA Kernel: High-Dimensional Variable Selection and Nonlinear Interaction Discovery in Linear Time

Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can…

Computation · Statistics 2022-12-02 Raj Agrawal, Tamara Broderick

An External-Memory Algorithm for String Graph Construction

Some recent results have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler Transform (BWT) of the input strings. The motivations for those results stem from…

Data Structures and Algorithms · Computer Science 2015-06-12 Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi

Kernel density estimation of a multidimensional efficiency profile

Kernel density estimation is a convenient way to estimate the probability density of a distribution given the sample of data points. However, it has certain drawbacks: proper description of the density using narrow kernels needs large data…

Data Analysis, Statistics and Probability · Physics 2015-02-27 Anton Poluektov

Space-Efficient Text Indexing with Mismatches using Function Inversion

A classic data structure problem is to preprocess a string T of length $n$ so that, given a query $q$, we can quickly find all substrings of T with Hamming distance at most $k$ from the query string. Variants of this problem have seen…

Data Structures and Algorithms · Computer Science 2026-04-03 Jackson Bibbens, Levi Borevitz, Samuel McCauley

Finding Approximate Palindromes in Strings Quickly and Simply

Described are two algorithms to find long approximate palindromes in a string, for example a DNA sequence. A simple algorithm requires $O(n)$ space and almost always runs in $O(k \cdot n)$ time where $n$ is the length of the string and $k$ is the…

Data Structures and Algorithms · Computer Science 2007-05-23 L. Allison

A New Class of Searchable and Provably Highly Compressible String Transformations

The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the…

Data Structures and Algorithms · Computer Science 2019-02-05 Raffaele Giancarlo, Giovanni Manzini, Giovanna Rosone, Marinella Sciortino

Efficient pattern matching in degenerate strings with the Burrows-Wheeler transform

A degenerate or indeterminate string on an alphabet $\Sigma$ is a sequence of non-empty subsets of $\Sigma$. Given a degenerate string $t$ of length $n$, we present a new method based on the Burrows-Wheeler transform for searching for a…

Data Structures and Algorithms · Computer Science 2017-08-04 Jacqueline W. Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, Laurent Mouchard, Élise Prieur-Gaston, Bruce Watson

A covariance kernel for proteins

We propose a new kernel for biological sequences which borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for…

Genomics · Quantitative Biology 2011-01-05 Marco Cuturi, Jean-Philippe Vert

Assessing the best edit in perturbation-based iterative refinement algorithms to compute the median string

Strings are a natural representation of biological data such as DNA, RNA and protein sequences. The problem of finding a string that summarizes a set of sequences has direct application in relative compression algorithms for genome and…

Data Structures and Algorithms · Computer Science 2019-12-06 P. Mirabal, J. Abreu, D. Seco