Related papers: A Simple and Efficient Probabilistic Language mode…

Evaluating Input Representation for Language Identification in Hindi-English Code Mixed Text

Natural language processing (NLP) techniques have become mainstream in the last decade. Most of these advances are attributed to the processing of a single language. More recently, with the extensive growth of social media platforms focus…

Computation and Language · Computer Science 2022-01-12 Ramchandra Joshi, Raviraj Joshi

Language Identification of Hindi-English tweets using code-mixed BERT

Language identification of social media text has been an interesting problem of study in recent years. Social media messages are predominantly code-mixed in non-English-speaking states. Prior knowledge by pre-training contextual…

Computation and Language · Computer Science 2021-07-05 Mohd Zeeshan Ansari, M M Sufyan Beg, Tanvir Ahmad, Mohd Jazib Khan, Ghazali Wasim

Learning Cross-lingual Embeddings from Twitter via Distant Supervision

Cross-lingual embeddings represent the meaning of words from different languages in the same vector space. Recent work has shown that it is possible to construct such representations by aligning independently learned monolingual embedding…

Computation and Language · Computer Science 2020-04-02 Jose Camacho-Collados, Yerai Doval, Eugenio Martínez-Cámara, Luis Espinosa-Anke, Francesco Barbieri, Steven Schockaert

Language Detection For Short Text Messages In Social Media

With the constant growth of the World Wide Web and the number of documents in different languages accordingly, the need for reliable language detection tools has increased as well. Platforms such as Twitter with predominantly short texts…

Computation and Language · Computer Science 2016-08-31 Ivana Balazevic, Mikio Braun, Klaus-Robert Müller

Learning Semantic Similarity for Very Short Texts

Leveraging data on social media, such as Twitter and Facebook, requires information retrieval algorithms that can relate very short text fragments to each other. Traditional text similarity methods such as tf-idf cosine-similarity,…

Information Retrieval · Computer Science 2015-12-03 Cedric De Boom, Steven Van Canneyt, Steven Bohez, Thomas Demeester, Bart Dhoedt

Language Identification of Bengali-English Code-Mixed data using Character & Phonetic based LSTM Models

Language identification of social media text still remains a challenging task due to properties like code-mixing and inconsistent phonetic transliterations. In this paper, we present a supervised learning approach for language…

Computation and Language · Computer Science 2018-06-28 Soumil Mandal, Sourya Dipta Das, Dipankar Das

Automatic Normalization of Word Variations in Code-Mixed Social Media Text

Social media platforms such as Twitter and Facebook are becoming popular in multilingual societies. This trend induces portmanteau of South Asian languages with English. The blend of multiple languages as code-mixed data has recently become…

Computation and Language · Computer Science 2024-03-08 Rajat Singh, Nurendra Choudhary, Manish Shrivastava

Recurrent-Neural-Network for Language Detection on Twitter Code-Switching Corpus

Mixed language data is one of the difficult yet less explored domains of natural language processing. Most research in fields like machine translation or sentiment analysis assumes monolingual input. However, people who are capable of using…

Neural and Evolutionary Computing · Computer Science 2014-12-23 Joseph Chee Chang, Chu-Cheng Lin

Leveraging Word Embeddings for Spoken Document Summarization

Owing to the rapidly growing multimedia content available on the Internet, extractive spoken document summarization, with the purpose of automatically selecting a set of representative sentences from a spoken document to concisely express…

Computation and Language · Computer Science 2015-06-16 Kuan-Yu Chen, Shih-Hung Liu, Hsin-Min Wang, Berlin Chen, Hsin-Hsi Chen

Representation learning for very short texts using weighted word embedding aggregation

Short text messages such as tweets are very noisy and sparse in their use of vocabulary. Traditional textual representations, such as tf-idf, have difficulty grasping the semantic meaning of such texts, which is important in applications…

Information Retrieval · Computer Science 2016-07-05 Cedric De Boom, Steven Van Canneyt, Thomas Demeester, Bart Dhoedt

Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models

We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call multilingual neural language models, takes sentences of multiple languages as…

Computation and Language · Computer Science 2018-09-10 Takashi Wada, Tomoharu Iwata

gundapusunil at SemEval-2020 Task 9: Syntactic Semantic LSTM Architecture for SENTIment Analysis of Code-MIXed Data

The phenomenon of mixing the vocabulary and syntax of multiple languages within the same utterance is called Code-Mixing. This is more evident in multilingual societies. In this paper, we have developed a system for SemEval 2020: Task 9 on…

Computation and Language · Computer Science 2020-10-12 Sunil Gundapu, Radhika Mamidi

Feature Selection on Noisy Twitter Short Text Messages for Language Identification

The task of written language identification typically involves detecting the languages present in a sample of text. Moreover, a sequence of text may not belong to a single inherent language but may instead be a mixture of text written in…

Computation and Language · Computer Science 2020-07-14 Mohd Zeeshan Ansari, Tanvir Ahmad, Ana Fatima

Utility of General and Specific Word Embeddings for Classifying Translational Stages of Research

Conventional text classification models make a bag-of-words assumption reducing text into word occurrence counts per document. Recent algorithms such as word2vec are capable of learning semantic meaning and similarity between words in an…

Computation and Language · Computer Science 2018-07-11 Vincent Major, Alisa Surkis, Yindalon Aphinyanaphongs

A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for…

Computation and Language · Computer Science 2018-01-22 Goran Glavaš, Marc Franco-Salvador, Simone Paolo Ponzetto, Paolo Rosso

Linking Tweets with Monolingual and Cross-Lingual News using Transformed Word Embeddings

Social media platforms have grown into an important medium to spread information about an event published by the traditional media, such as news articles. Grouping such diverse sources of information that discuss the same topic in varied…

Computation and Language · Computer Science 2017-10-26 Aditya Mogadala, Dominik Jung, Achim Rettinger

Comparative Analysis of Word Embeddings for Capturing Word Similarities

Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning…

Computation and Language · Computer Science 2020-05-11 Martina Toshevska, Frosina Stojanovska, Jovan Kalajdjieski

Learning Word Embeddings from Intrinsic and Extrinsic Views

While word embeddings are currently predominant for natural language processing, most existing models learn them solely from their contexts. However, these context-based word embeddings are limited since not all words' meanings can be…

Computation and Language · Computer Science 2016-08-23 Jifan Chen, Kan Chen, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Zheng Zhang

Code-mixed Sentiment and Hate-speech Prediction

Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring…

Computation and Language · Computer Science 2025-04-16 Anjali Yadav, Tanya Garg, Matej Klemen, Matej Ulcar, Basant Agarwal, Marko Robnik Sikonja

Interpretable probabilistic embeddings: bridging the gap between topic models and neural networks

We consider probabilistic topic models and more recent word embedding techniques from a perspective of learning hidden semantic representations. Inspired by a striking similarity of the two approaches, we merge them and learn probabilistic…

Computation and Language · Computer Science 2017-11-15 Anna Potapenko, Artem Popov, Konstantin Vorontsov