Related papers: A Flexible Structured-based Representation for XML…

Expériences de classification d'une collection de documents XML de structure homogène

This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or more generally an organisational structure. Our approach integrates techniques for extracting knowledge from documents with…

Information Retrieval · Computer Science 2007-05-23 Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre

Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

This paper presents some experiments in clustering homogeneous XML documents to validate an existing classification or more generally an organisational structure. Our approach integrates techniques for extracting knowledge from documents with…

Information Retrieval · Computer Science 2007-05-23 Thierry Despeyroux, Yves Lechevallier, Brigitte Trousse, Anne-Marie Vercoustre

Document Clustering with K-tree

This paper describes the approach taken to the XML Mining track at INEX 2008 by a group at the Queensland University of Technology. We introduce the K-tree clustering algorithm in an Information Retrieval context by adapting it for document…

Information Retrieval · Computer Science 2010-01-07 Christopher M. De Vries, Shlomo Geva

Mining Semi-structured Data

The need for discovering knowledge from XML documents according to both structure and content features has become challenging, due to the increase in application contexts for which handling both structure and content information in XML data…

Databases · Computer Science 2015-04-17 Olfa Arfaoui, Minyar Sassi Hidri

Interprétation vague des contraintes structurelles pour la RI dans des corpus de documents XML - Évaluation d'une méthode approchée de RI structurée

We propose specific data structures designed for the indexing and retrieval of information elements in heterogeneous XML databases. The indexing scheme is well suited to the management of various contextual searches, expressed either at a…

Information Retrieval · Computer Science 2008-12-18 Eugen Popovici, Gildas Ménier, Pierre-François Marteau

A comparison of two suffix tree-based document clustering algorithms

Document clustering is an unsupervised approach extensively used to navigate, filter, summarize and manage large document repositories such as the World Wide Web (WWW). Recently, the focus in this domain has shifted from traditional…

Information Retrieval · Computer Science 2012-01-11 Muhammad Rafi, M. Maujood, M. M. Fazal, S. M. Ali

Document Clustering based on Topic Maps

The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents such as the World Wide Web (WWW). The next…

Information Retrieval · Computer Science 2011-12-30 Muhammad Rafi, M. Shahid Shaikh, Amir Farooq

Automated Document Indexing via Intelligent Hierarchical Clustering: A Novel Approach

With the rising quantity of textual data available in electronic format, organizing it has become a highly challenging task. In the present paper, we explore a document organization framework that exploits an intelligent hierarchical…

Information Retrieval · Computer Science 2015-04-02 Rajendra Kumar Roul, Shubham Rohan Asthana, Sanjay Kumar Sahay

Search Driven Analysis of Heterogenous XML Data

Analytical processing on XML repositories is usually enabled by designing complex data transformations that shred the documents into a common data warehousing schema. This can be very time-consuming and costly, especially if the underlying…

Databases · Computer Science 2009-09-15 Andrey Balmin, Latha Colby, Emiran Curtmola, Quanzhong Li, Fatma Ozcan

Optimizing XML Compression

The eXtensible Markup Language (XML) provides a powerful and flexible means of encoding and exchanging data. As it turns out, its main advantage as an encoding format (namely, its requirement that all open and close markup tags are present…

Databases · Computer Science 2015-05-13 Gregory Leighton, Denilson Barbosa

A distributed editing environment for XML documents

XML is based on two essential aspects: the modeling of data in a tree-like structure and the separation between the information itself and the way it is displayed. XML structures are easily serializable. The separation between an…

Software Engineering · Computer Science 2009-02-19 Claude Pasquier, Laurent Théry

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Document parsing (DP) transforms unstructured or semi-structured documents into structured, machine-readable representations, enabling downstream applications such as knowledge base construction and retrieval-augmented generation (RAG).…

Multimedia · Computer Science 2026-04-07 Qintong Zhang, Bin Wang, Victor Shea-Jay Huang, Junyuan Zhang, Zhengren Wang, Hao Liang, Conghui He, Wentao Zhang

An XML based Document Suite

We report on the current state of development of a document suite and its applications. This collection of tools for the flexible and robust processing of documents in German is based on the use of XML as a unifying formalism for encoding…

Computation and Language · Computer Science 2007-05-23 Dietmar Roesner, Manuela Kunze

Fast and Tiny Structural Self-Indexes for XML

XML document markup is highly repetitive and therefore well compressible using dictionary-based methods such as DAGs or grammars. In the context of selectivity estimation, grammar-compressed trees were used before as synopsis for structural…

Databases · Computer Science 2010-12-30 Sebastian Maneth, Tom Sebastian

Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool

With the advancement of technology and reduced storage costs, individuals and organizations are tending towards the usage of electronic media for storing textual information and documents. It is time consuming for readers to retrieve…

Information Retrieval · Computer Science 2010-07-27 Yasir Safeer, Atika Mustafa, Anis Noor Ali

Flexible queries in XML native databases

To date, most flexible querying systems for native XML databases (DB) are based on exploiting the tree structure of their semi-structured data (SSD). However, it becomes important to test the efficiency of Formal Concept Analysis (FCA)…

Information Retrieval · Computer Science 2013-12-09 Olfa Arfaoui, Minyar Sassi-Hidri

Detecting Structural Irregularity in Electronic Dictionaries Using Language Modeling

Dictionaries are often developed using tools that save to Extensible Markup Language (XML)-based standards. These standards often allow high-level repeating elements to represent lexical entries, and utilize descendants of these repeating…

Computation and Language · Computer Science 2016-02-18 Paul Rodrigues, David Zajic, David Doermann, Michael Bloodgood, Peng Ye

Boosting XML Filtering with a Scalable FPGA-based Architecture

The growing amount of XML encoded data exchanged over the Internet increases the importance of XML based publish-subscribe (pub-sub) and content based routing systems. The input in such systems typically consists of a stream of XML…

Hardware Architecture · Computer Science 2009-09-15 Abhishek Mitra, Marcos Vieira, Petko Bakalov, Walid Najjar, Vassilis Tsotras

Document clustering using graph based document representation with constraints

Document clustering is an unsupervised approach in which a large collection of documents (corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters). Meaningful representation of documents and…

Information Retrieval · Computer Science 2014-12-08 Muhammad Rafi, Farnaz Amin, Mohammad Shahid Shaikh

Ontology Based Document Clustering Using MapReduce

Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common…

Databases · Computer Science 2015-05-13 Abdelrahman Elsayed, Hoda M. O. Mokhtar, Osama Ismail