Related papers: Word segmentation granularity in Korean

Word Segmentation for Asian Languages: Chinese, Korean, and Japanese

We provide a detailed overview of various approaches to word segmentation of Asian Languages, specifically Chinese, Korean, and Japanese languages. For each language, approaches to deal with word segmentation differ. We also include our…

Computation and Language · Computer Science 2024-07-30 Matthew Rho, Yexin Tian, Qin Chen

Integrated Eojeol Embedding for Erroneous Sentence Classification in Korean Chatbots

This paper attempts to analyze the Korean sentence classification system for a chatbot. Sentence classification is the task of classifying an input sentence based on predefined categories. However, spelling or space error contained in the…

Computation and Language · Computer Science 2021-06-08 DongHyun Choi, IlNam Park, Myeong Cheol Shin, EungGyun Kim, Dong Ryeol Shin

Chart-driven Connectionist Categorial Parsing of Spoken Korean

While most of the speech and natural language systems which were developed for English and other Indo-European languages neglect the morphological processing and integrate speech and natural language at the word level, for the agglutinative…

cmp-lg · Computer Science 2008-02-03 WonIl Lee, Geunbae Lee, Jong-Hyeok Lee

Giving Space to Your Message: Assistive Word Segmentation for the Electronic Typing of Digital Minorities

For readability and disambiguation of the written text, appropriate word segmentation is recommended for documentation, and it also holds for the digitized texts. If the language is agglutinative while far from scriptio continua, for…

Computation and Language · Computer Science 2021-05-05 Won Ik Cho, Sung Jun Cheon, Woo Hyun Kang, Ji Won Kim, Nam Soo Kim

Multi-level post-processing for Korean character recognition using morphological analysis and linguistic evaluation

Most of the post-processing methods for character recognition rely on contextual information of character and word-fragment levels. However, due to linguistic characteristics of Korean, such low-level information alone is not sufficient for…

cmp-lg · Computer Science 2008-02-03 Geunbae Lee, Jong-Hyeok Lee, JinHee Yoo

Constituency Structure over Eojeol in Korean Treebanks

The design of Korean constituency treebanks raises a fundamental representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals conflates…

Computation and Language · Computer Science 2025-12-30 Jungyeul Park, Chulwoo Park

Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning

In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and…

Computation and Language · Computer Science 2025-07-08 Nayeon Kim, Eojin Jeon, Jun-Hyung Park, SangKeun Lee

Morphological annotation of Korean with Directly Maintainable Resources

This article describes an exclusively resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. Our annotator is designed to process text before the operation of a syntactic parser. In…

Computation and Language · Computer Science 2007-11-22 Ivan Berlocher, Hyun-Gue Huh, Eric Laporte, Jee-Sun Nam

K-UniMorph: Korean Universal Morphology and its Feature Schema

We present in this work a new Universal Morphology dataset for Korean. Previously, the Korean language has been underrepresented in the field of morphological paradigms amongst hundreds of diverse world languages. Hence, we propose this…

Computation and Language · Computer Science 2023-05-18 Eunkyul Leah Jo, Kyuwon Kim, Xihan Wu, KyungTae Lim, Jungyeul Park, Chulwoo Park

Investigating an Effective Character-level Embedding in Korean Sentence Classification

Different from the writing systems of many Romance and Germanic languages, some languages or language families show complex conjunct forms in character composition. For such cases where the conjuncts consist of the components representing…

Computation and Language · Computer Science 2019-09-20 Won Ik Cho, Seok Min Kim, Nam Soo Kim

A Syllable-based Technique for Word Embeddings of Korean Words

Word embedding has become a fundamental component to many NLP tasks such as named entity recognition and machine translation. However, popular models that learn such embeddings are unaware of the morphology of words, so it is not directly…

Computation and Language · Computer Science 2017-08-08 Sanghyuk Choi, Taeuk Kim, Jinseok Seol, Sang-goo Lee

Integrated speech and morphological processing in a connectionist continuous speech understanding for Korean

A new tightly coupled speech and natural language integration model is presented for a TDNN-based continuous possibly large vocabulary speech recognition system for Korean. Unlike popular n-best techniques developed for integrating mainly…

cmp-lg · Computer Science 2008-02-03 Geunbae Lee, Jong-Hyeok Lee

State-of-the-Art Vietnamese Word Segmentation

Word segmentation is the first step of any tasks in Vietnamese language processing. This paper reviews state-of-the-art approaches and systems for word segmentation in Vietnamese. To have an overview of all stages from building corpora to…

Computation and Language · Computer Science 2019-06-19 Song Nguyen Duc Cong, Quoc Hung Ngo, Rachsuda Jiamthapthaksin

Rich Character-Level Information for Korean Morphological Analysis and Part-of-Speech Tagging

Due to the fact that Korean is a highly agglutinative, character-rich language, previous work on Korean morphological analysis typically employs the use of sub-character features known as graphemes or otherwise utilizes comprehensive prior…

Computation and Language · Computer Science 2018-06-29 Andrew Matteson, Chanhee Lee, Young-Bum Kim, Heuiseok Lim

MORSE: Semantic-ally Drive-n MORpheme SEgment-er

We present in this paper a novel framework for morpheme segmentation which uses the morpho-syntactic regularities preserved by word representations, in addition to orthographic features, to segment words into morphemes. This framework is…

Computation and Language · Computer Science 2017-05-02 Tarek Sakakini, Suma Bhat, Pramod Viswanath

Using Contextual Information for Sentence-level Morpheme Segmentation

Recent advancements in morpheme segmentation primarily emphasize word-level segmentation, often neglecting the contextual relevance within the sentence. In this study, we redefine the morpheme segmentation task as a sequence-to-sequence…

Computation and Language · Computer Science 2024-12-18 Prabin Bhandari, Abhishek Paudel

Extracting Arguments from Korean Question and Command: An Annotated Corpus for Structured Paraphrasing

Intention identification is a core issue in dialog management. However, due to the non-canonicality of the spoken language, it is difficult to extract the content automatically from the conversation-style utterances. This is much more…

Computation and Language · Computer Science 2019-07-10 Won Ik Cho, Young Ki Moon, Woo Hyun Kang, Nam Soo Kim

Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or…

Computation and Language · Computer Science 2007-05-23 Rie Kubota Ando, Lillian Lee

A resource-based Korean morphological annotation system

We describe a resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. The output of our system is a graph of morphemes annotated with accurate linguistic information. The language…

Computation and Language · Computer Science 2007-11-22 Hyun-Gue Huh, Eric Laporte

Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning

Khmer text is written from left to right with optional space. Space is not served as a word boundary but instead, it is used for readability or other functional purposes. Word segmentation is a prior step for downstream tasks such as…

Computation and Language · Computer Science 2021-04-01 Rina Buoy, Nguonly Taing, Sokchea Kor