Related papers: Computing n-Gram Statistics in MapReduce
The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask:…
Stemming is a process that reduces inflected words to their stem or root form. It is useful for enhancing retrieval effectiveness, especially in text search, where it helps resolve mismatch problems. Previous research on Bangla…
This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and…
This report describes the MUDOS-NG summarization system, which applies a set of language-independent and generic methods for generating extractive summaries. The proposed methods are mostly combinations of simple operators on a generic…
Efficient evaluation of regular expressions (regex, for short) is crucial for text analysis, and n-gram indexes are fundamental to achieving fast regex evaluation performance. However, these indexes face scalability challenges because of…
Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks; however, the models have become increasingly computationally demanding. Therefore, this paper tackles the domain of computationally efficient…
This work extends the body of research on the popular problem of sentiment analysis in Twitter. It investigates the most popular document ("tweet") representation methods that feed sentiment evaluation mechanisms. In particular,…
In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by…
The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…
Searching techniques for Case Based Reasoning systems involve extensive methods of elimination. In this paper, we look at a new method of arriving at the right solution by performing a series of transformations upon the data. These involve…
MapReduce (and its open source implementation Hadoop) has become the de facto platform for processing large data sets. MapReduce offers a streamlined computational framework by interleaving sequential and parallel computation while hiding…
We present NN-grams, a novel, hybrid language model integrating n-grams and neural networks (NN) for speech recognition. The model takes as input both word histories as well as n-gram counts. Thus, it combines the memorization capacity and…
This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016). Our…
There have been multiple attempts to resolve various inflection matching problems in information retrieval. Stemming is a common approach to this end. Among many techniques for stemming, statistical stemming has been shown to be effective…
We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the…
The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we…
String-based (or viewpoint) models of tonal harmony often struggle with data sparsity in pattern discovery and prediction tasks, particularly when modeling composite events like triads and seventh chords, since the number of distinct n-note…
Natural language processing models have attracted much interest in the deep learning community. This branch of study comprises applications such as machine translation, sentiment analysis, named entity recognition, question and…
We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes each of which may hold a value initially, we focus on computing $\sum_{i=1}^N g(f_i)$, where $f_i$ is the…
Given a graph G and the desired size k in bits, how can we summarize G within k bits, while minimizing the information loss? Large-scale graphs have become omnipresent, posing considerable computational challenges. Analyzing such large…