Related papers: Hierarchical Document Encoder for Parallel Corpus …

Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of…

Computation and Language · Computer Science 2018-08-03 Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

In this paper, we present an approach to learn multilingual sentence embeddings using a bi-directional dual-encoder with additive margin softmax. The embeddings are able to achieve state-of-the-art results on the United Nations (UN)…

Computation and Language · Computer Science 2019-06-18 Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-hsuan Sung, Brian Strope, Ray Kurzweil

Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding.…

Computation and Language · Computer Science 2023-07-06 Sonal Sannigrahi, Josef van Genabith, Cristina Espana-Bonet

Multilingual Word Embeddings using Multigraphs

We present a family of neural-network–inspired models for computing continuous word representations, specifically designed to exploit both monolingual and multilingual text. This framework allows us to perform unsupervised training of…

Computation and Language · Computer Science 2016-12-15 Radu Soricut, Nan Ding

Multilingual Hierarchical Attention Networks for Document Classification

Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language…

Computation and Language · Computer Science 2017-09-18 Nikolaos Pappas, Andrei Popescu-Belis

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on…

Computation and Language · Computer Science 2021-05-24 Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

Contextual Document Embeddings

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are…

Computation and Language · Computer Science 2024-11-11 John X. Morris, Alexander M. Rush

Hierarchical corpus encoder: Fusing generative retrieval and dense indices

Generative retrieval employs sequence models for conditional generation of document IDs based on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, it is…

Information Retrieval · Computer Science 2025-02-27 Tongfei Chen, Ankita Sharma, Adam Pauls, Benjamin Van Durme

Cross-Modal and Hierarchical Modeling of Video and Text

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a…

Computer Vision and Pattern Recognition · Computer Science 2018-10-18 Bowen Zhang, Hexiang Hu, Fei Sha

Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition

In countries that speak multiple main languages, mixing up different languages within a conversation is commonly called code-switching. Previous works addressing this challenge mainly focused on word-level aspects such as word embeddings.…

Computation and Language · Computer Science 2019-09-19 Genta Indra Winata, Zhaojiang Lin, Jamin Shin, Zihan Liu, Pascale Fung

A General-Purpose Multilingual Document Encoder

Massively multilingual pretrained transformers (MMTs) have tremendously pushed the state of the art on multilingual NLP and cross-lingual transfer of NLP models in particular. While a large body of work leveraged MMTs to mine parallel data…

Computation and Language · Computer Science 2023-05-12 Onur Galoğlu, Robert Litschko, Goran Glavaš

An Analysis of Hierarchical Text Classification Using Word Embeddings

Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the…

Computation and Language · Computer Science 2018-09-07 Roger A. Stein, Patricia A. Jaques, Joao F. Valiati

A Multi-Resolution Word Embedding for Document Retrieval from Large Unstructured Knowledge Bases

Deep language models learning a hierarchical representation proved to be a powerful tool for natural language processing, text mining and information retrieval. However, representations that perform well for retrieval must capture semantic…

Information Retrieval · Computer Science 2019-05-24 Tolgahan Cakaloglu, Xiaowei Xu

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that…

Computation and Language · Computer Science 2021-10-22 Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau

Bilingual Distributed Word Representations from Document-Aligned Comparable Data

We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual…

Computation and Language · Computer Science 2016-03-01 Ivan Vulić, Marie-Francine Moens

Language Model Pre-training for Hierarchical Document Representations

Hierarchical neural architectures are often used to capture long-distance dependencies and have been applied to many document-level tasks such as summarization, document segmentation, and sentiment analysis. However, effective usage of such…

Computation and Language · Computer Science 2019-01-29 Ming-Wei Chang, Kristina Toutanova, Kenton Lee, Jacob Devlin

Optimizing Sentence Embedding with Pseudo-Labeling and Model Ensembles: A Hierarchical Framework for Enhanced NLP Tasks

Sentence embedding tasks are important in natural language processing (NLP), but improving their performance while keeping them reliable is still hard. This paper presents a framework that combines pseudo-label generation and model ensemble…

Computation and Language · Computer Science 2025-01-28 Ziwei Liu, Qi Zhang, Lifu Gao

Hamming Sentence Embeddings for Information Retrieval

In retrieval applications, binary hashes are known to offer significant improvements in terms of both memory and speed. We investigate the compression of sentence embeddings using a neural encoder-decoder architecture, which is trained by…

Information Retrieval · Computer Science 2019-08-16 Felix Hamann, Nadja Kurz, Adrian Ulges

Leveraging Closed-Access Multilingual Embedding for Automatic Sentence Alignment in Low Resource Languages

The importance of qualitative parallel data in machine translation has long been determined but it has always been very difficult to obtain such in sufficient quantity for the majority of world languages, mainly because of the associated…

Computation and Language · Computer Science 2023-11-22 Idris Abdulmumin, Auwal Abubakar Khalid, Shamsuddeen Hassan Muhammad, Ibrahim Said Ahmad, Lukman Jibril Aliyu, Babangida Sani, Bala Mairiga Abduljalil, Sani Ahmad Hassan

Hybrid Improved Document-level Embedding (HIDE)

In recent times, word embeddings are taking a significant role in sentiment analysis. As the generation of word embeddings needs huge corpora, many applications use pretrained embeddings. In spite of the success, word embeddings suffers…

Computation and Language · Computer Science 2020-06-03 Satanik Mitra, Mamata Jenamani