Related papers: A framework for space-efficient string kernels
Kernel methods are powerful tools in machine learning. They have to be computationally efficient. In this paper, we present a novel Geometric-based approach to compute efficiently the string subsequence kernel (SSK). Our main idea is that…
Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either…
String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we…
Low-rank approximation is a common tool used to accelerate kernel methods: the $n \times n$ kernel matrix $K$ is approximated via a rank-$k$ matrix $\tilde K$ which can be stored in much less space and processed more quickly. In this work…
String kernels are attractive data analysis tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVM in various applications.…
We present a novel framework for kernel learning with sequential data of any kind, such as time series, sequences of graphs, or strings. Our approach is based on signature features which can be seen as an ordered variant of sample…
Analysis of large-scale sequential data has been one of the most crucial tasks in areas such as bioinformatics, text, and audio mining. Existing string kernels, however, either (i) rely on local features of short substructures in the…
The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a…
Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative…
Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduces several challenges that are not witnessed in…
Kernel methods have great promise for learning rich statistical representations of large modern datasets. However, compared to neural networks, kernel methods have been perceived as lacking in scalability and flexibility. We introduce a…
Many scientific problems require identifying a small set of covariates that are associated with a target response and estimating their effects. Often, these effects are nonlinear and include interactions, so linear and additive methods can…
Some recent results have introduced external-memory algorithms to compute self-indexes of a set of strings, mainly via computing the Burrows-Wheeler Transform (BWT) of the input strings. The motivations for those results stem from…
Kernel density estimation is a convenient way to estimate the probability density of a distribution given the sample of data points. However, it has certain drawbacks: proper description of the density using narrow kernels needs large data…
A classic data structure problem is to preprocess a string T of length $n$ so that, given a query $q$, we can quickly find all substrings of T with Hamming distance at most $k$ from the query string. Variants of this problem have seen…
Described are two algorithms to find long approximate palindromes in a string, for example a DNA sequence. A simple algorithm requires O(n)-space and almost always runs in $O(k.n)$-time where n is the length of the string and k is the…
The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the…
A degenerate or indeterminate string on an alphabet $\Sigma$ is a sequence of non-empty subsets of $\Sigma$. Given a degenerate string $t$ of length $n$, we present a new method based on the Burrows--Wheeler transform for searching for a…
We propose a new kernel for biological sequences which borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for…
Strings are a natural representation of biological data such as DNA, RNA and protein sequences. The problem of finding a string that summarizes a set of sequences has direct application in relative compression algorithms for genome and…