Related papers: Approximating Document Frequency with Term Count V…

A Fisher's exact test justification of the TF-IDF term-weighting scheme

Term frequency-inverse document frequency, or TF-IDF for short, is arguably the most celebrated mathematical expression in the history of information retrieval. Conceived as a simple heuristic quantifying the extent to which a given term's…

Computation and Language · Computer Science 2025-07-31 Paul Sheridan, Zeyad Ahmed, Aitazaz A. Farooque

Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to…

Information Retrieval · Computer Science 2022-11-23 Bakhyt Bakiyev

Inverse-Category-Frequency based supervised term weighting scheme for text categorization

Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval…

Machine Learning · Computer Science 2012-06-07 Deqing Wang, Hui Zhang

Learning Term Discrimination

Document indexing is a key component for efficient information retrieval (IR). After preprocessing steps such as stemming and stop-word removal, document indexes usually store term-frequencies (tf). Along with tf (that only reflects the…

Information Retrieval · Computer Science 2020-04-29 Jibril Frej, Phillipe Mulhem, Didier Schwab, Jean-Pierre Chevallet

Testing different Log Bases For Vector Model Weighting Technique

Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the…

Information Retrieval · Computer Science 2023-07-13 Kamel Assaf

Scalable Methods for Calculating Term Co-Occurrence Frequencies

Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms…

Information Retrieval · Computer Science 2020-07-20 Bodo Billerbeck, Justin Zobel, Nicholas Lester, Nick Craswell

Improving a tf-idf weighted document vector embedding

We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe. We describe two methods that can improve upon a simple weighted sum, that are…

Computation and Language · Computer Science 2019-02-27 Craig W. Schmidt

Using temporal IDF for efficient novelty detection in text streams

Novelty detection in text streams is a challenging task that emerges in quite a few different scenarios, ranging from email thread filtering to RSS news feed recommendation on a smartphone. An efficient novelty detection algorithm can save…

Information Retrieval · Computer Science 2014-11-11 Margarita Karkali, Francois Rousseau, Alexandros Ntoulas, Michalis Vazirgiannis

Semantic Sensitive TF-IDF to Determine Word Relevance in Documents

Keyword extraction has received an increasing attention as an important research topic which can lead to have advancements in diverse applications such as document context categorization, text indexing and document classification. In this…

Information Retrieval · Computer Science 2021-01-27 Amir Jalilifard, Vinicius F. Caridá, Alex F. Mansano, Rogers S. Cristo, Felipe Penhorate C. da Fonseca

The hypergeometric test performs comparably to TF-IDF on standard text analysis tasks

Term frequency-inverse document frequency, or TF-IDF for short, and its many variants form a class of term weighting functions the members of which are widely used in text analysis applications. While TF-IDF was originally proposed as a…

Information Retrieval · Computer Science 2023-06-06 Paul Sheridan, Mikael Onsjö

Fixed versus Dynamic Co-Occurrence Windows in TextRank Term Weights for Information Retrieval

TextRank is a variant of PageRank typically used in graphs that represent documents, and where vertices denote terms and edges denote relations between terms. Quite often the relation between terms is simple term co-occurrence within a…

Information Retrieval · Computer Science 2017-04-07 Wei Lu, Qikai Cheng, Christina Lioma

Association via Entropy Reduction

Prior to recent successes using neural networks, term frequency-inverse document frequency (tf-idf) was clearly regarded as the best choice for identifying documents related to a query. We provide a different score, aver, and observe, on a…

Information Retrieval · Computer Science 2025-11-10 Anthony Gamst, Lawrence Wilson

Finding Inverse Document Frequency Information in BERT

For many decades, BM25 and its variants have been the dominant document retrieval approach, where their two underlying features are Term Frequency (TF) and Inverse Document Frequency (IDF). The traditional approach, however, is being…

Information Retrieval · Computer Science 2022-02-25 Jaekeol Choi, Euna Jung, Sungjun Lim, Wonjong Rhee

Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches

Term frequency normalization is a serious issue since lengths of documents are various. Generally, documents become long due to two different reasons - verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly…

Information Retrieval · Computer Science 2015-02-10 Seung-Hoon Na, In-Su Kang, Jong-Hyeok Lee

TF-IDFC-RF: A Novel Supervised Term Weighting Scheme

Sentiment Analysis is a branch of Affective Computing usually considered a binary classification task. In this line of reasoning, Sentiment Analysis can be applied in several contexts to classify the attitude expressed in text samples, for…

Information Retrieval · Computer Science 2020-08-13 Flavio Carvalho, Gustavo Paiva Guedes

Core Lexicon and Contagious Words

We present the new empirical parameter $f_c$, the most probable usage frequency of a word in a language, computed via the distribution of documents over frequency $x$ of the word. This parameter allows for filtering the core lexicon of a…

Disordered Systems and Neural Networks · Physics 2007-05-23 Dmitri Volchenkov, Philippe Blanchard, Serge Sharoff

IDF revisited: A simple new derivation within the Robertson-Spärck Jones probabilistic model

There have been a number of prior attempts to theoretically justify the effectiveness of the inverse document frequency (IDF). Those that take as their starting point Robertson and Sparck Jones's probabilistic model are based on strong or…

Information Retrieval · Computer Science 2007-05-23 Lillian Lee

Catching Unusual Traffic Behavior using TF-IDF-based Port Access Statistics Analysis

Detecting the anomalous behavior of traffic is one of the important actions for network operators. In this study, we applied term frequency - inverse document frequency (TF-IDF), which is a popular method used in natural language…

Networking and Internet Architecture · Computer Science 2021-11-12 Keiichi Shima

Credibility Adjusted Term Frequency: A Supervised Term Weighting Scheme for Sentiment Analysis and Text Classification

We provide a simple but novel supervised weighting scheme for adjusting term frequency in tf-idf for sentiment analysis and text classification. We compare our method to baseline weighting schemes and find that it outperforms them on…

Computation and Language · Computer Science 2014-07-01 Yoon Kim, Owen Zhang

Learning Term Weights for Ad-hoc Retrieval

Most Information Retrieval models compute the relevance score of a document for a given query by summing term weights specific to a document or a query. Heuristic approaches, like TF-IDF, or probabilistic models, like BM25, are used to…

Information Retrieval · Computer Science 2016-06-15 B. Piwowarski