Related papers: Determining the Unithood of Word Sequences using a…

Determining the Unithood of Word Sequences using Mutual Information and Independence Measure

Most works related to unithood were conducted as part of a larger effort for the determination of termhood. Consequently, the number of independent research that study the notion of unithood and produce dedicated techniques for measuring…

Artificial Intelligence · Computer Science 2008-10-02 Wilson Wong, Wei Liu, Mohammed Bennamoun

Uncertainty in Neural Network Word Embedding: Exploration of Threshold for Similarity

Word embedding, specially with its recent developments, promises a quantification of the similarity between terms. However, it is not clear to which extent this similarity value can be genuinely meaningful and useful for subsequent tasks.…

Computation and Language · Computer Science 2018-04-05 Navid Rekabsaz, Mihai Lupu, Allan Hanbury

Probabilistic Method of Measuring Linguistic Productivity

In this paper I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words and, unlike other popular measures, is not directly dependent upon token…

Computation and Language · Computer Science 2023-08-25 Sergei Monakhov

Measuring memorization in language models via probabilistic extraction

Large language models (LLMs) are susceptible to memorizing training data, raising concerns about the potential extraction of sensitive information at generation time. Discoverable extraction is the most common method for measuring this…

Machine Learning · Computer Science 2025-03-21 Jamie Hayes, Marika Swanberg, Harsh Chaudhari, Itay Yona, Ilia Shumailov, Milad Nasr, Christopher A. Choquette-Choo, Katherine Lee, A. Feder Cooper

Joint Semantic Synthesis and Morphological Analysis of the Derived Word

Much like sentences are composed of words, words themselves are composed of smaller units. For example, the English word questionably can be analyzed as question+able+ly. However, this structural decomposition of the word does not directly…

Computation and Language · Computer Science 2018-11-13 Ryan Cotterell, Hinrich Schütze

A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection

Compositionality in language refers to how much the meaning of some phrase can be decomposed into the meaning of its constituents and the way these constituents are combined. Based on the premise that substitution by synonyms is…

Computation and Language · Computer Science 2017-03-13 Christina Lioma, Niels Dalum Hansen

Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation

We describe an implemented system for robust domain-independent syntactic parsing of English, using a unification-based grammar of part-of-speech and punctuation labels coupled with a probabilistic LR parser. We present evaluations of the…

cmp-lg · Computer Science 2008-02-03 John Carroll, Ted Briscoe

A Comprehensive Comparative Study of Word and Sentence Similarity Measures

Sentence similarity is considered the basis of many natural language tasks such as information retrieval, question answering and text summarization. The semantic meaning between compared text fragments is based on the words semantic…

Information Retrieval · Computer Science 2016-10-17 Issa Atoum, Ahmed Otoom, Narayanan Kulathuramaiyer

Developing and Evaluating a Probabilistic LR Parser of Part-of-Speech and Punctuation Labels

We describe an approach to robust domain-independent syntactic parsing of unrestricted naturally-occurring (English) input. The technique involves parsing sequences of part-of-speech and punctuation labels using a unification-based grammar…

cmp-lg · Computer Science 2008-02-03 Ted Briscoe, John Carroll

Toward Network-based Keyword Extraction from Multitopic Web Documents

In this paper we analyse the selectivity measure calculated from the complex network in the task of the automatic keyword extraction. Texts, collected from different web sources (portals, forums), are represented as directed and weighted…

Computation and Language · Computer Science 2014-07-15 Sabina Šišović, Sanda Martinčić-Ipšić, Ana Meštrović

Locally Typical Sampling

Today's probabilistic language generators fall short when it comes to producing coherent and fluent text despite the fact that the underlying models perform well under standard metrics, e.g., perplexity. This discrepancy has puzzled the…

Computation and Language · Computer Science 2025-06-06 Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell

Implicit Representations of Grammaticality in Language Models

Grammaticality and likelihood are distinct notions in human language. Pretrained language models (LMs), which are probabilistic models of language fitted to maximize corpus likelihood, generate grammatically well-formed text and…

Computation and Language · Computer Science 2026-05-07 Yingshan Susan Wang, Linlu Qiu, Zhaofeng Wu, Roger P. Levy, Yoon Kim

Searching for PETs: Using Distributional and Sentiment-Based Methods to Find Potentially Euphemistic Terms

This paper presents a linguistically driven proof of concept for finding potentially euphemistic terms, or PETs. Acknowledging that PETs tend to be commonly used expressions for a certain range of sensitive topics, we make use of…

Computation and Language · Computer Science 2022-05-24 Patrick Lee, Martha Gavidia, Anna Feldman, Jing Peng

In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and…

Computation and Language · Computer Science 2007-05-23 Ido Dagan, Lillian Lee, Fernando C. N. Pereira

Web-based Semantic Similarity for Emotion Recognition in Web Objects

In this project we propose a new approach for emotion recognition using web-based similarity (e.g. confidence, PMI and PMING). We aim to extract basic emotions from short sentences with emotional content (e.g. news titles, tweets,…

Computation and Language · Computer Science 2017-01-12 Valentina Franzoni, Giulio Biondi, Alfredo Milani, Yuanxi Li

Scalable Methods for Calculating Term Co-Occurrence Frequencies

Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms…

Information Retrieval · Computer Science 2020-07-20 Bodo Billerbeck, Justin Zobel, Nicholas Lester, Nick Craswell

Using Word Embedding for Cross-Language Plagiarism Detection

This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity…

Computation and Language · Computer Science 2017-02-13 J. Ferrero, F. Agnes, L. Besacier, D. Schwab

Learning Probabilistic Sentence Representations from Paraphrases

Probabilistic word embeddings have shown effectiveness in capturing notions of generality and entailment, but there is very little work on doing the analogous type of investigation for sentences. In this paper we define probabilistic models…

Computation and Language · Computer Science 2020-05-19 Mingda Chen, Kevin Gimpel

An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text.…

Computation and Language · Computer Science 2007-05-23 Michael R. Brent

Measuring Sentences Similarity: A Survey

This study is to review the approaches used for measuring sentences similarity. Measuring similarity between natural language sentences is a crucial task for many Natural Language Processing applications such as text classification,…

Computation and Language · Computer Science 2019-10-10 Mamdouh Farouk