Related papers: Towards Explaining STEM Document Classification us…

Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations

In this paper, we investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents using standard machine learning algorithms: the Latent Dirichlet Allocation (LDA)…

Information Retrieval · Computer Science 2021-10-11 Michal Růžička, Petr Sojka

Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and…

Digital Libraries · Computer Science 2020-05-25 Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke, Bela Gipp

Towards Semantically Enhanced Data Understanding

In the field of machine learning, data understanding is the practice of getting initial insights in unknown datasets. Such knowledge-intensive tasks require a lot of documentation, which is necessary for data scientists to grasp the meaning…

Databases · Computer Science 2018-06-14 Markus Schröder, Christian Jilek, Jörn Hees, Andreas Dengel

Using General Large Language Models to Classify Mathematical Documents

In this article we report on an initial exploration to assess the viability of using the general large language models (LLMs), recently made public, to classify mathematical documents. Automated classification would be useful from the…

Information Retrieval · Computer Science 2024-06-18 Patrick D. F. Ion, Stephen M. Watt

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or…

Computation and Language · Computer Science 2025-06-03 Jiaru Zou, Qing Wang, Pratyush Thakur, Nickvash Kani

Understanding the Logical and Semantic Structure of Large Documents

Current language understanding approaches focus on small documents, such as newswire articles, blog posts, product reviews and discussion forum entries. Understanding and extracting information from large documents like legal briefs,…

Computation and Language · Computer Science 2017-09-05 Muhammad Mahbubur Rahman, Tim Finin

Text Classification Models for Form Entity Linking

Forms are a widespread type of template-based document used in a great variety of fields including, among others, administration, medicine, finance, or insurance. The automatic extraction of the information included in these documents is…

Computation and Language · Computer Science 2021-12-15 María Villota, César Domínguez, Jónathan Heras, Eloy Mata, Vico Pascual

A Framework for Explainable Text Classification in Legal Document Review

Companies regularly spend millions of dollars producing electronically-stored documents in legal matters. Recently, parties on both sides of the 'legal aisle' are accepting the use of machine learning techniques like text classification to…

Information Retrieval · Computer Science 2019-12-23 Christian J. Mahoney, Jianping Zhang, Nathaniel Huber-Fliflet, Peter Gronvall, Haozhen Zhao

Explainable Text Classification Techniques in Legal Document Review: Locating Rationales without Using Human Annotated Training Text Snippets

US corporations regularly spend millions of dollars reviewing electronically-stored documents in legal matters. Recently, attorneys apply text classification to efficiently cull massive volumes of data to identify responsive documents for…

Information Retrieval · Computer Science 2023-11-16 Christian Mahoney, Peter Gronvall, Nathaniel Huber-Fliflet, Jianping Zhang

Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However,…

Digital Libraries · Computer Science 2019-06-28 Norman Meuschke, Vincent Stange, Moritz Schubotz, Michael Karmer, Bela Gipp

A New Approach Towards Autoformalization

Verifying mathematical proofs is difficult, but can be automated with the assistance of a computer. Autoformalization is the task of automatically translating natural language mathematics into a formal language that can be verified by a…

Computation and Language · Computer Science 2024-07-11 Nilay Patel, Rahul Saha, Jeffrey Flanigan

Semantic Document Clustering on Named Entity Features

Keyword-based information processing has limitations due to simple treatment of words. In this paper, we introduce named entities as objectives into document clustering, which are the key elements defining document semantics and in many…

Information Retrieval · Computer Science 2018-07-23 Tru H. Cao, Vuong M. Ngo, Dung T. Hong, Tho T. Quan

Structural Regularities in Text-based Entity Vector Spaces

Entity retrieval is the task of finding entities such as people or products in response to a query, based solely on the textual documents they are associated with. Recent semantic entity retrieval algorithms represent queries and experts in…

Information Retrieval · Computer Science 2017-07-26 Christophe Van Gysel, Maarten de Rijke, Evangelos Kanoulas

Classifying text using machine learning models and determining conversation drift

Text classification helps analyse texts for semantic meaning and relevance, by mapping the words against this hierarchy. An analysis of various types of texts is invaluable to understanding both their semantic meaning, as well as their…

Machine Learning · Computer Science 2022-11-16 Chaitanya Chadha, Vandit Gupta, Deepak Gupta, Ashish Khanna

Towards the Improvement of Automated Scientific Document Categorization by Deep Learning

This master thesis describes an algorithm for automated categorization of scientific documents using deep learning techniques and compares the results to the results of existing classification algorithms. As an additional goal a reusable…

Information Retrieval · Computer Science 2017-06-20 Thomas Krause

Automatic explanation of the classification of Spanish legal judgments in jurisdiction-dependent law categories with tree estimators

Automatic legal text classification systems have been proposed in the literature to address knowledge extraction from judgments and detect their aspects. However, most of these systems are black boxes even when their models are…

Computation and Language · Computer Science 2024-04-02 Jaime González-González, Francisco de Arriba-Pérez, Silvia García-Méndez, Andrea Busto-Castiñeira, Francisco J. González-Castaño

What Makes it Difficult to Understand a Scientific Literature?

In the artificial intelligence area, one of the ultimate goals is to make computers understand human language and offer assistance. In order to achieve this ideal, researchers of computer science have put forward a lot of models and…

Computation and Language · Computer Science 2015-12-07 Mengyun Cao, Jiao Tian, Dezhi Cheng, Jin Liu, Xiaoping Sun

New Datasets and a Benchmark of Document Network Embedding Methods for Scientific Expert Finding

The scientific literature is growing faster than ever. Finding an expert in a particular scientific domain has never been as hard as today because of the increasing amount of publications and because of the ever growing diversity of…

Information Retrieval · Computer Science 2020-04-09 Robin Brochier, Antoine Gourru, Adrien Guille, Julien Velcin

Comparative Study of Long Document Classification

The amount of information stored in the form of documents on the internet has been increasing rapidly. Thus it has become a necessity to organize and maintain these documents in an optimum manner. Text classification algorithms study the…

Computation and Language · Computer Science 2022-02-22 Vedangi Wagh, Snehal Khandve, Isha Joshi, Apurva Wani, Geetanjali Kale, Raviraj Joshi

Document classification methods

Information on different fields which are collected by users requires appropriate management and organization to be structured in a standard way and retrieved fast and more easily. Document classification is a conventional method to…

Information Retrieval · Computer Science 2019-09-18 Madjid Khalilian, Shiva Hassanzadeh