Related papers: Same or Different? Diff-Vectors for Authorship Ana…

A Step Towards Interpretable Authorship Verification

A central problem that has been researched for many years in the field of digital text forensics is the question of whether two documents were written by the same author. Authorship verification (AV) is a research branch in this field that…

Computation and Language · Computer Science · 2020-07-09 · Oren Halvani, Lukas Graner, Roey Regev

Evaluating the Utility of Document Embedding Vector Difference for Relation Learning

Recent work has demonstrated that vector offsets obtained by subtracting pretrained word embedding vectors can be used to predict lexical relations with surprising accuracy. Inspired by this finding, in this paper, we extend the idea to the…

Computation and Language · Computer Science · 2019-07-19 · Jingyuan Zhang, Timothy Baldwin

SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such documents remain…

Cryptography and Security · Computer Science · 2018-05-03 · Benjamin Weggenmann, Florian Kerschbaum

Experiments with Neural Networks for Small and Large Scale Authorship Verification

We propose two models for a special case of the authorship verification problem. The task is to investigate whether the two documents of a given pair are written by the same author. We consider the authorship verification problem for both small…

Computation and Language · Computer Science · 2018-03-20 · Marjan Hosseinia, Arjun Mukherjee

The Influence of Feature Representation of Text on the Performance of Document Classification

In this paper we perform a comparative analysis of three models for feature representation of text documents in the context of document classification. In particular, we consider the most often used family of models, bag-of-words, recently…

Computation and Language · Computer Science · 2017-07-06 · Sanda Martinčić-Ipšić, Tanja Miličić, Ljupčo Todorovski

What is the right way to represent document images?

In this article we study the problem of document image representation based on visual features. We propose a comprehensive experimental study that compares three types of visual document image representations: (1) traditional so-called…

Computer Vision and Pattern Recognition · Computer Science · 2016-12-05 · Gabriela Csurka, Diane Larlus, Albert Gordo, Jon Almazan

On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements

In this paper, we investigate the problem of classifying feature vectors with mutually independent but non-identically distributed elements. First, we show the importance of this problem. Next, we propose a classifier and derive an…

Machine Learning · Computer Science · 2021-09-01 · Farzad Shahrivari, Nikola Zlatanov

Searching for Discriminative Words in Multidimensional Continuous Feature Space

Word feature vectors have been proven to improve many NLP tasks. With recent advances in unsupervised learning of these feature vectors, it became possible to train them with much more data, which also resulted in better quality of learned…

Computation and Language · Computer Science · 2022-11-29 · Marius Sajgalik, Michal Barla, Maria Bielikova

Gram2Vec: An Interpretable Document Vectorizer

We present Gram2Vec, a grammatical style embedding system that embeds documents into a higher dimensional space by extracting the normalized relative frequencies of grammatical features present in the text. Compared to neural approaches,…

Computation and Language · Computer Science · 2025-11-27 · Peter Zeng, Hannah Stortz, Eric Sclafani, Alina Shabaeva, Maria Elizabeth Garza, Daniel Greeson, Owen Rambow

Separating Style from Substance: Enhancing Cross-Genre Authorship Attribution through Data Selection and Presentation

The task of deciding whether two documents are written by the same author is challenging for both machines and humans. This task is even more challenging when the two documents are written about different topics (e.g. baseball vs. politics)…

Computation and Language · Computer Science · 2024-08-12 · Steven Fincke, Elizabeth Boschee

Same Author or Just Same Topic? Towards Content-Independent Style Representations

Linguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The…

Computation and Language · Computer Science · 2022-04-12 · Anna Wegmann, Marijn Schraagen, Dong Nguyen

Identifying and Explaining Discriminative Attributes

Identifying what is at the center of the meaning of a word and what discriminates it from other words is a fundamental natural language inference task. This paper describes an explicit word vector representation model (WVM) to support the…

Computation and Language · Computer Science · 2019-09-13 · Armins Stepanjans, André Freitas

Improving Authorship Verification using Linguistic Divergence

We propose an unsupervised solution to the Authorship Verification task that utilizes pre-trained deep language models to compute a new metric called DV-Distance. The proposed metric is a measure of the difference between the two authors…

Computation and Language · Computer Science · 2021-03-15 · Yifan Zhang, Dainis Boumber, Marjan Hosseinia, Fan Yang, Arjun Mukherjee

Content-based Text Categorization using Wikitology

A major computational burden when performing document clustering is the calculation of a similarity measure between a pair of documents. A similarity measure is a function that assigns a real number between 0 and 1 to a pair of documents,…

Information Retrieval · Computer Science · 2012-08-20 · Muhammad Rafi, Sundus Hassan, Mohammad Shahid Shaikh

Single-sample writers -- "Document Filter" and their impacts on writer identification

Writing can be used as an important biometric modality that allows an individual to be unequivocally identified. This is because the writing of two different persons presents differences that can be explored both in terms of graphometric…

Computer Vision and Pattern Recognition · Computer Science · 2020-05-19 · Fabio Pinhelli, Alceu S. Britto, Luiz S. Oliveira, Yandre M. G. Costa, Diego Bertolini

Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship

We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other…

Computation and Language · Computer Science · 2023-10-03 · Alon Kipnis

Text Classification For Authorship Attribution Analysis

Authorship attribution mainly deals with undecided authorship of literary texts. Authorship attribution is useful in resolving issues such as uncertain authorship, recognizing the authorship of unknown texts, spotting plagiarism, and so on. Statistical…

Digital Libraries · Computer Science · 2013-10-21 · M. Sudheep Elayidom, Chinchu Jose, Anitta Puthussery, Neenu K Sasi

Innovative Methods for Non-Destructive Inspection of Handwritten Documents

Handwritten document analysis is an area of forensic science, with the goal of establishing authorship of documents through examination of inherent characteristics. Law enforcement agencies use standard protocols based on manual processing…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-17 · Eleonora Breci, Luca Guarnera, Sebastiano Battiato

Vector-based Representation is the Key: A Study on Disentanglement and Compositional Generalization

Recognizing elementary underlying concepts from observations (disentanglement) and generating novel combinations of these concepts (compositional generalization) are fundamental abilities for humans to support rapid knowledge learning and…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-30 · Tao Yang, Yuwang Wang, Cuiling Lan, Yan Lu, Nanning Zheng

Enhancing Learning with Label Differential Privacy by Vector Approximation

Label differential privacy (DP) is a framework that protects the privacy of labels in training datasets, while the feature vectors are public. Existing approaches protect the privacy of labels by flipping them randomly, and then train a…

Machine Learning · Computer Science · 2024-05-27 · Puning Zhao, Rongfei Fan, Huiwen Wu, Qingming Li, Jiafei Wu, Zhe Liu