Related papers: Scalable Text Mining with Sparse Generative Models
The exponential growth of textual data presents substantial challenges in management and analysis, notably due to high storage and processing costs. Text classification, a vital aspect of text mining, provides robust solutions by enabling…
As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original…
We use commercially available text analysis technology to process interview text data from a computational social science study. We find that topical clustering and terminological enrichment provide for convenient exploration and…
Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement…
A text stream is an ordered sequence of text documents generated over time. A massive amount of such text data is generated by online social platforms every day. Designing an algorithm for such text streams to extract useful information is…
The amount of text that is generated every day is increasing dramatically. This tremendous volume of mostly unstructured text cannot be simply processed and perceived by computers. Therefore, efficient and effective techniques and…
Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire…
The avalanche quantity of the information developed by mankind has led to concept of automation of knowledge extraction - Data Mining ([1]). This direction is connected with a wide spectrum of problems - from recognition of the fuzzy set to…
While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing…
Text data mining is the process of deriving essential information from language text. Typical text mining tasks include text categorization, text clustering, topic modeling, information extraction, and text summarization. Various data sets…
Prototype-driven text generation uses non-parametric models that first choose from a library of sentence "prototypes" and then modify the prototype to generate the output text. While effective, these methods are inefficient at test time as…
Graphs and networks are a key research tool for a variety of science fields, most notably chemistry, biology, engineering and social sciences. Modeling and generation of graphs with efficient sampling is a key challenge for graphs. In…
In the implementation and use of research information systems (RIS) in scientific institutions, text data mining and semantic technologies are a key technology for the meaningful use of large amounts of data. It is not the collection of…
Earlier techniques of text mining included algorithms like k-means, Naive Bayes, SVM which classify and cluster the text document for mining relevant information about the documents. The need for improving the mining techniques has us…
With the information explosion on the Web, search and recommendation are foundational infrastructures to satisfying users' information needs. As the two sides of the same coin, both revolve around the same core research problem, matching…
Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps %(e.g.,…
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly in the number of retrieved documents, causing a dramatic…
Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set,…
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine…
Machine learning approaches to multi-label document classification have to date largely relied on discriminative modeling techniques such as support vector machines. A drawback of these approaches is that performance rapidly drops off as…