Related papers: Computing n-Gram Statistics in MapReduce

Intermediate N-Gramming: Deterministic and Fast N-Grams For Large N and Large Datasets

The number of n-gram features grows exponentially in n, making it computationally demanding to compute the most frequent n-grams even for n as small as 3. Motivated by our production machine learning system built on n-gram features, we ask:…

Data Structures and Algorithms · Computer Science 2025-11-20 Ryan R. Curtin, Fred Lu, Edward Raff, Priyanka Ranade

N-gram Statistical Stemmer for Bangla Corpus

Stemming is a process that can be utilized to trim inflected words to stem or root form. It is useful for enhancing the retrieval effectiveness, especially for text search in order to solve the mismatch problems. Previous research on Bangla…

Computation and Language · Computer Science 2019-12-30 Rabeya Sadia, Md Ataur Rahman, Md Hanif Seddiqui

Handling Massive N-Gram Datasets Efficiently

This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and…

Information Retrieval · Computer Science 2022-02-08 Giulio Ermanno Pibiri, Rossano Venturini

MUDOS-NG: Multi-document Summaries Using N-gram Graphs (Tech Report)

This report describes the MUDOS-NG summarization system, which applies a set of language-independent and generic methods for generating extractive summaries. The proposed methods are mostly combinations of simple operators on a generic…

Computation and Language · Computer Science 2010-12-10 George Giannakopoulos, George Vouros, Vangelis Karkaletsis

An Evaluation of N-Gram Selection Strategies for Regular Expression Indexing in Contemporary Text Analysis Tasks. Extended Version

Efficient evaluation of regular expressions (regex, for short) is crucial for text analysis, and n-gram indexes are fundamental to achieving fast regex evaluation performance. However, these indexes face scalability challenges because of…

Databases · Computer Science 2025-09-08 Ling Zhang, Shaleen Deep, Jignesh M. Patel, Karthikeyan Sankaralingam

HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing enabled Embedding of n-gram Statistics

Recent advances in Deep Learning have led to a significant performance increase on several NLP tasks, however, the models become more and more computationally demanding. Therefore, this paper tackles the domain of computationally efficient…

Computation and Language · Computer Science 2022-05-18 Pedro Alonso, Kumar Shridhar, Denis Kleyko, Evgeny Osipov, Marcus Liwicki

Comparing methods for Twitter Sentiment Analysis

This work extends the set of works which deal with the popular problem of sentiment analysis in Twitter. It investigates the most popular document ("tweet") representation methods which feed sentiment evaluation mechanisms. In particular,…

Computation and Language · Computer Science 2015-05-14 Evangelos Psomakelis, Konstantinos Tserpes, Dimosthenis Anagnostopoulos, Theodora Varvarigou

One-Pass, One-Hash n-Gram Statistics Estimation

In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by…

Databases · Computer Science 2014-02-05 Daniel Lemire, Owen Kaser

Analyzing Large-Scale, Distributed and Uncertain Data

The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…

Databases · Computer Science 2017-12-06 Yaron Gonen

Search Using N-gram Technique Based Statistical Analysis for Knowledge Extraction in Case Based Reasoning Systems

Searching techniques for Case Based Reasoning systems involve extensive methods of elimination. In this paper, we look at a new method of arriving at the right solution by performing a series of transformations upon the data. These involve…

Artificial Intelligence · Computer Science 2007-05-23 M. N. Karthik, Moshe Davis

A Conditional Lower Bound on Graph Connectivity in MapReduce

MapReduce (and its open source implementation Hadoop) has become the de facto platform for processing large data sets. MapReduce offers a streamlined computational framework by interleaving sequential and parallel computation while hiding…

Computational Complexity · Computer Science 2019-04-22 Sungjin Im, Benjamin Moseley

NN-grams: Unifying neural network and n-gram language models for Speech Recognition

We present NN-grams, a novel, hybrid language model integrating n-grams and neural networks (NN) for speech recognition. The model takes as input both word histories as well as n-gram counts. Thus, it combines the memorization capacity and…

Computation and Language · Computer Science 2016-06-27 Babak Damavandi, Shankar Kumar, Noam Shazeer, Antoine Bruguier

Character n-gram Embeddings to Improve RNN Language Models

This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016). Our…

Computation and Language · Computer Science 2019-06-14 Sho Takase, Jun Suzuki, Masaaki Nagata

SS4MCT: A Statistical Stemmer for Morphologically Complex Texts

There have been multiple attempts to resolve various inflection matching problems in information retrieval. Stemming is a common approach to this end. Among many techniques for stemming, statistical stemming has been shown to be effective…

Information Retrieval · Computer Science 2016-06-22 Javid Dadashkarimi, Hossein Nasr Esfahani, Heshaam Faili, Azadeh Shakery

Enumerating Maximal Bicliques from a Large Graph using MapReduce

We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-04-22 Arko Provo Mukherjee, Srikanta Tirthapura

Densest Subgraph in Streaming and MapReduce

The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we…

Databases · Computer Science 2012-02-01 Bahman Bahmani, Ravi Kumar, Sergei Vassilvitskii

Modeling Harmony with Skip-Grams

String-based (or viewpoint) models of tonal harmony often struggle with data sparsity in pattern discovery and prediction tasks, particularly when modeling composite events like triads and seventh chords, since the number of distinct n-note…

Information Retrieval · Computer Science 2017-07-19 David R. W. Sears, Andreas Arzt, Harald Frostel, Reinhard Sonnleitner, Gerhard Widmer

Morphological Skip-Gram: Using morphological knowledge to improve word representation

Natural language processing models have attracted much interest in the deep learning community. This branch of study is composed of some applications such as machine translation, sentiment analysis, named entity recognition, question and…

Computation and Language · Computer Science 2020-07-22 Flávio Santos, Hendrik Macedo, Thiago Bispo, Cleber Zanchettin

Distributed Data Summarization in Well-Connected Networks

We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes each of which may hold a value initially, we focus on computing $\sum_{i=1}^N g(f_i)$, where $f_i$ is the…

Data Structures and Algorithms · Computer Science 2019-08-07 Hsin-Hao Su, Hoa T. Vu

SSumM: Sparse Summarization of Massive Graphs

Given a graph G and the desired size k in bits, how can we summarize G within k bits, while minimizing the information loss? Large-scale graphs have become omnipresent, posing considerable computational challenges. Analyzing such large…

Databases · Computer Science 2021-02-23 Kyuhan Lee, Hyeonsoo Jo, Jihoon Ko, Sungsu Lim, Kijung Shin