Related papers: Similarity Driven Approximation for Text Analytics

The Marriage of Incremental and Approximate Computing

Most data analytics systems that require low-latency execution and efficient utilization of computing resources increasingly adopt two computational paradigms, namely incremental and approximate computing. Incremental computation updates…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-28 Dhanya R Krishnan

Approximate Stream Analytics in Apache Flink and Apache Spark Streaming

Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-12 Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, Thorsten Strufe

Approximate Computation for Big Data Analytics

Over the past few years, research and development have made significant progress in big data analytics. A fundamental issue for big data analytics is efficiency. If the optimal solution is unattainable, not required, or has a…

Databases · Computer Science 2019-01-03 Shuai Ma, Jinpeng Huai

Graph-based Semantical Extractive Text Analysis

In the past few decades, there has been an explosion in the amount of available data produced from various sources on different topics. The availability of this enormous volume of data requires us to adopt effective computational tools to…

Computation and Language · Computer Science 2022-12-20 Mina Samizadeh

Description-Based Text Similarity

Identifying texts with a given semantics is central to many information-seeking scenarios. Similarity search over vector embeddings appears to be central to this ability, yet the similarity reflected in current text embeddings is…

Computation and Language · Computer Science 2024-07-25 Shauli Ravfogel, Valentina Pyatkin, Amir DN Cohen, Avshalom Manevich, Yoav Goldberg

Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning

We focus on Text-to-SQL semantic parsing from the perspective of retrieval-augmented generation. Motivated by challenges related to the size of commercial database schemata and the deployability of business intelligence solutions, we…

Computation and Language · Computer Science 2024-11-05 Zhili Shen, Pavlos Vougiouklis, Chenxin Diao, Kaustubh Vyas, Yuanyi Ji, Jeff Z. Pan

Unsupervised Matching of Data and Text

Entity resolution is a widely studied problem with several proposals to match records across relations. Matching textual content is a widespread task in many applications, such as question answering and search. While recent methods achieve…

Databases · Computer Science 2021-12-17 Naser Ahmadi, Hansjorg Sand, Paolo Papotti

A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for…

Computation and Language · Computer Science 2018-01-22 Goran Glavaš, Marc Franco-Salvador, Simone Paolo Ponzetto, Paolo Rosso

Leveraging Approximate Caching for Faster Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases the end-to-end inference time since looking for relevant documents from large…

Databases · Computer Science 2025-10-28 Shai Bergman, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos, Ji Zhang

Towards an Approximation-Aware Computational Workflow Framework for Accelerating Large-Scale Discovery Tasks

The use of approximation is fundamental in computational science. Almost all computational methods adopt approximations in some form in order to obtain a favourable cost/accuracy trade-off and there are usually many approximations that…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-06-14 Michael A. Johnston, Vassilis Vassiliadis

Improved ESP-index: a practical self-index for highly repetitive texts

While several self-indexes for highly repetitive texts exist, developing a practical self-index applicable to real world repetitive texts remains a challenge. ESP-index is a grammar-based self-index on the notion of edit-sensitive parsing…

Data Structures and Algorithms · Computer Science 2014-04-29 Yoshimasa Takabatake, Yasuo Tabei, Hiroshi Sakamoto

VeilGraph: Streaming Graph Approximations

Graphs are found in a plethora of domains, including online social networks, the World Wide Web and the study of epidemics, to name a few. With the advent of greater volumes of information and the need for continuously updated results under…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-12-19 Miguel E. Coimbra, Sérgio Esteves, Alexandre P. Francisco, Luís Veiga

Approximation with Error Bounds in Spark

We introduce a sampling framework to support approximate computing with estimated error bounds in Spark. Our framework allows sampling to be performed at the beginning of a sequence of multiple transformations ending in an aggregation…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-07 Guangyan Hu, Desheng Zhang, Sandro Rigo, Thu D. Nguyen

Proximity full-text searches of frequently occurring words with a response time guarantee

Full-text search engines are important tools for information retrieval. In a proximity full-text search, a document is relevant if it contains query terms near each other, especially if the query terms are frequently occurring words. For…

Information Retrieval · Computer Science 2020-09-09 Alexander B. Veretennikov

Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes

Full-text search engines are important tools for information retrieval. Term proximity is an important factor in relevance score measurement. In a proximity full-text search, we assume that a relevant document contains query terms near each…

Information Retrieval · Computer Science 2018-11-20 Alexander B. Veretennikov

Approximate textual retrieval

An approximate textual retrieval algorithm for searching sources with high levels of defects is presented. It considers splitting the words in a query into two overlapping segments and subsequently building composite regular expressions…

Information Retrieval · Computer Science 2007-05-23 Pere Constans

SimDoc: Topic Sequence Alignment based Document Similarity Framework

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise relevant tasks such as document clustering,…

Computation and Language · Computer Science 2017-11-15 Gaurav Maheshwari, Priyansh Trivedi, Harshita Sahijwani, Kunal Jha, Sourish Dasgupta, Jens Lehmann

Text embedding models can be great data engineers

Data engineering pipelines are essential - albeit costly - components of predictive analytics frameworks requiring significant engineering time and domain expertise for carrying out tasks such as data ingestion, preprocessing, feature…

Machine Learning · Computer Science 2025-05-22 Iman Kazemian, Paritosh Ramanan, Murat Yildirim

One of the important factors that make a search engine fast and accurate is a concise and duplicate-free index. In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate…

Information Retrieval · Computer Science 2019-09-26 Hamid Mohammadi, Seyed Hossein Khasteh

A Survey on Efficient Processing of Similarity Queries over Neural Embeddings

Similarity queries are a family of queries based on some similarity metric. Unlike traditional database queries, which are mostly based on value equality, similarity queries aim to find targets "similar enough to" the given data objects,…

Databases · Computer Science 2022-04-19 Yifan Wang