Efficient Vector Representation for Documents through Corruption

Subjects: Computation and Language, Machine Learning
Date: 2017-07-11 (v1)

Authors: Minmin Chen

Abstract

We present an efficient document representation learning framework, Document Vector through Corruption (Doc2VecC). Doc2VecC represents each document as a simple average of its word embeddings, and the learning procedure ensures that a representation generated this way captures the semantic meaning of the document. The framework includes a corruption model, which introduces a data-dependent regularization that favors informative or rare words while forcing the embeddings of common, non-discriminative words close to zero. Doc2VecC produces significantly better word embeddings than Word2Vec. We compare Doc2VecC with several state-of-the-art document representation learning algorithms. Its simple model architecture matches or outperforms the state of the art in generating high-quality document representations for sentiment analysis, document classification, and semantic relatedness tasks. The simplicity of the model enables training on billions of words per hour on a single machine, and the model is also very efficient at generating representations of unseen documents at test time.
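The averaging idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the corruption scheme, so the version below assumes a dropout-style mask: each word is dropped with probability `q` and the surviving words are rescaled by `1 / (1 - q)`. The function names, the toy embeddings, and the parameter `q` are all hypothetical and are not taken from the paper.

```python
import random


def average_embedding(tokens, embeddings, dim):
    """Represent a document as the plain average of its word vectors."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings[tok]
        for i in range(dim):
            total[i] += vec[i]
    n = max(len(tokens), 1)  # guard against empty documents
    return [x / n for x in total]


def corrupted_average(tokens, embeddings, dim, q, rng):
    """Average after dropping each token independently with probability q.

    ASSUMPTION: kept tokens are rescaled by 1 / (1 - q), so the corrupted
    average equals the plain average in expectation. This is an
    illustrative dropout-style scheme, not a scheme stated in the abstract.
    """
    total = [0.0] * dim
    scale = 1.0 / (1.0 - q)
    for tok in tokens:
        if rng.random() < q:
            continue  # this word is corrupted (removed) for this pass
        vec = embeddings[tok]
        for i in range(dim):
            total[i] += scale * vec[i]
    n = max(len(tokens), 1)  # still normalize by the full document length
    return [x / n for x in total]


# Toy, made-up 2-dimensional embeddings for illustration only.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "the": [0.0, 0.0]}
doc = ["the", "movie", "good", "good"]

doc_vec = average_embedding(doc, toy_embeddings, dim=2)
noisy_vec = corrupted_average(doc, toy_embeddings, 2, q=0.5, rng=random.Random(0))
```

Under this sketch, representing an unseen document needs only `average_embedding`, which is one plausible reading of why test-time inference is described as efficient.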

Keywords

word embeddings, information retrieval

Cite

@article{arxiv.1707.02377,
  title   = {Efficient Vector Representation for Documents through Corruption},
  author  = {Minmin Chen},
  journal = {arXiv preprint arXiv:1707.02377},
  year    = {2017}
}

Comments

5th International Conference on Learning Representations, 2017
