Related papers: Index wiki database: design and experiments

A Wiki for Business Rules in Open Vocabulary, Executable English

The problem of business-IT alignment is of widespread economic concern. As one way of addressing the problem, this paper describes an online system that functions as a kind of Wiki -- one that supports the collaborative writing and running…

Artificial Intelligence · Computer Science 2011-03-04 Adrian Walker

Information filtering based on wiki index database

In this paper we present a profile-based approach to information filtering by an analysis of the content of text documents. The Wikipedia index database is created and used to automatically generate the user profile from the user document…

Information Retrieval · Computer Science 2008-05-08 A. V. Smirnov, A. A. Krizhanovsky

Universal Indexes for Highly Repetitive Document Collections

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that…

Information Retrieval · Computer Science 2016-05-25 Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, Gonzalo Navarro

Utilizing citation index and synthetic quality measure to compare Wikipedia languages across various topics

This study presents a comparative analysis of 55 Wikipedia language editions employing a citation index alongside a synthetic quality measure. Specifically, we identified the most significant Wikipedia articles within distinct topical…

Information Retrieval · Computer Science 2025-05-23 Włodzimierz Lewoniewski, Krzysztof Węcel, Witold Abramowicz

WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset

Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc…

Information Retrieval · Computer Science 2020-03-18 Jibril Frej, Didier Schwab, Jean-Pierre Chevallet

Self-Index based on LZ77 (thesis)

Domains like bioinformatics, version control systems, collaborative editing systems (wiki), and others, are producing huge data collections that are very repetitive. That is, there are few differences between the elements of the collection.…

Data Structures and Algorithms · Computer Science 2011-12-21 Sebastian Kreft, Gonzalo Navarro

SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages

Text simplification research has mostly focused on sentence-level simplification, even though many desirable edits - such as adding relevant background information or reordering content - may require document-level context. Prior work has…

Computation and Language · Computer Science 2023-05-31 Philippe Laban, Jesse Vig, Wojciech Kryscinski, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu

WikiGap: Promoting Epistemic Equity by Surfacing Knowledge Gaps Between English Wikipedia and other Language Editions

With more than 11 times as many pageviews as the next largest edition, English Wikipedia dominates global knowledge access relative to other language editions. Readers are prone to assuming English Wikipedia as a superset of all language…

Human-Computer Interaction · Computer Science 2026-01-21 Zining Wang, Yuxuan Zhang, Dongwook Yoon, Nicholas Vincent, Farhan Samir, Vered Shwartz

Natural Language Web Interface for Database (NLWIDB)

It is a long-term desire of computer users to minimize the communication gap between the computer and a human. On the other hand, almost all ICT applications store information in databases and retrieve it from them. Retrieving…

Computation and Language · Computer Science 2013-08-20 Rukshan Alexander, Prashanthi Rukshan, Sinnathamby Mahesan

DBpedia NIF: Open, Large-Scale and Multilingual Knowledge Extraction Corpus

In the past decade, the DBpedia community has put a significant amount of effort into developing technical infrastructure and methods for efficient extraction of structured information from Wikipedia. These efforts have been primarily focused…

Computation and Language · Computer Science 2018-12-27 Milan Dojchinovski, Julio Hernandez, Markus Ackermann, Amit Kirschenbaum, Sebastian Hellmann

In this work, we propose an automatic evaluation and comparison of the browsing behavior of Wikipedia readers that can be applied to any language editions of Wikipedia. As an example, we focus on English, French, and Russian languages…

Social and Information Networks · Computer Science 2020-02-18 Volodymyr Miz, Joëlle Hanna, Nicolas Aspert, Benjamin Ricaud, Pierre Vandergheynst

A New Compression Based Index Structure for Efficient Information Retrieval

Finding desired information in a large data set is a difficult problem. Information retrieval is concerned with the structure, analysis, organization, storage, searching, and retrieval of information. The index is the main constituent of an IR…

Information Retrieval · Computer Science 2012-09-26 Md. Abdullah al Mamun, Md. Hanif, Md. Rakib Uddin, Tanvir Ahmed, Md. Mofizul Islam

Transformation of Wiktionary entry structure into tables and relations in a relational database schema

This paper addresses the question of automatic data extraction from the Wiktionary, which is a multilingual and multifunctional dictionary. Wiktionary is a collaborative project working on the same principles as the Wikipedia. The…

Information Retrieval · Computer Science 2010-11-08 A. A. Krizhanovsky

The comparison of Wiktionary thesauri transformed into the machine-readable format

Wiktionary is a unique, peculiar, valuable and original resource for natural language processing (NLP). The paper describes an open-source Wiktionary parser: its architecture and requirements followed by a description of Wiktionary features…

Information Retrieval · Computer Science 2010-06-28 A. A. Krizhanovsky

Assessing Wikipedia-Based Cross-Language Retrieval Models

This work compares concept models for cross-language retrieval: First, we adapt probabilistic Latent Semantic Analysis (pLSA) for multilingual documents. Experiments with different weighting schemes show that a weighting method favoring…

Information Retrieval · Computer Science 2014-01-13 Benjamin Roth

Semi-Automatic Indexing of Multilingual Documents

With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval…

Digital Libraries · Computer Science 2007-05-23 Ulrich Schiel, Ianna M. Sodre Ferreira de Souza, Edberto Ferneda

Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History

Much of the work in the semantic web that relies on Wikipedia as the main source of knowledge operates on static snapshots of the dataset. The full history of Wikipedia revisions, while containing much more useful information, is still difficult to…

Artificial Intelligence · Computer Science 2017-01-17 Tuan Tran, Tu Ngoc Nguyen

A practical approach to language complexity: a Wikipedia case study

In this paper we present statistical analysis of English texts from Wikipedia. We try to address the issue of language complexity empirically by comparing the simple English Wikipedia (Simple) to comparable samples of the main English…

Computation and Language · Computer Science 2023-01-05 Taha Yasseri, András Kornai, János Kertész

Robust clustering of languages across Wikipedia growth

Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over five million articles, comparatively little…

Digital Libraries · Computer Science 2017-10-20 Kristina Ban, Matjaz Perc, Zoran Levnajic

Sememe Prediction: Learning Semantic Knowledge from Unstructured Textual Wiki Descriptions

Huge numbers of new words emerge every day, leading to a great need for representing them with semantic meaning that is understandable to NLP systems. Sememes are defined as the minimum semantic units of human languages, the combination of…

Computation and Language · Computer Science 2018-08-17 Wei Li, Xuancheng Ren, Damai Dai, Yunfang Wu, Houfeng Wang, Xu Sun