Related papers: Improving a tf-idf weighted document vector embedd…
This study evaluates the performance of TF-IDF weighting, averaged Word2Vec embeddings, and BERT embeddings for document similarity scoring across two contrasting textual domains. By analysing cosine similarity scores, the methods'…
Recently, doc2vec has achieved excellent results in different tasks. In this paper, we present a context aware variant of doc2vec. We introduce a novel weight estimating mechanism that generates weights for each word occurrence according to…
The inclusion of semantic information in any similarity measures improves the efficiency of the similarity measure and provides human interpretable results for further analysis. The similarity calculation method that focuses on features…
We address the problem of tuning word embeddings for specific use cases and domains. We propose a new method that automatically combines multiple domain-specific embeddings, selected from a wide range of pre-trained domain-specific…
Graph-based extractive document summarization relies on the quality of the sentence similarity graph. Bag-of-words or tf-idf based sentence similarity uses exact word matching, but fails to measure the semantic similarity between individual…
Information retrieval systems retrieves relevant documents based on a query submitted by the user. The documents are initially indexed and the words in the documents are assigned weights using a weighting technique called TFIDF which is the…
Recent advances in neural word embedding provide significant benefit to various information retrieval tasks. However as shown by recent studies, adapting the embedding models for the needs of IR tasks can bring considerable further…
The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to…
As a fundamental task in natural language processing, word embedding converts each word into a representation in a vector space. A challenge with word embedding is that as the vocabulary grows, the vector space's dimension increases, which…
A major difficulty in applying word vector embeddings in IR is in devising an effective and efficient strategy for obtaining representations of compound units of text, such as whole documents, (in comparison to the atomic words), for the…
Latent semantic representations of words or paragraphs, namely the embeddings, have been widely applied to information retrieval (IR). One of the common approaches of utilizing embeddings for IR is to estimate the document-to-query (D2Q)…
Keyword extraction has received an increasing attention as an important research topic which can lead to have advancements in diverse applications such as document context categorization, text indexing and document classification. In this…
Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This…
Sentiment analysis is one of the well-known tasks and fast growing research areas in natural language processing (NLP) and text classifications. This technique has become an essential part of a wide range of applications including politics,…
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words…
Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models to perform this task. We address the real-world problem of…
Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging,…
While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending to generate unsupervised sentences or documents embeddings. Recent work has…
Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifier and SVMs. The widely used term weighting scheme in text categorization, i.e., tf.idf, is originated from information retrieval…
Since the amount of information on the internet is growing rapidly, it is not easy for a user to find relevant information for his/her query. To tackle this issue, much attention has been paid to Automatic Document Summarization. The key…