Related papers: Index wiki database: design and experiments
The problem of business-IT alignment is of widespread economic concern. As one way of addressing the problem, this paper describes an online system that functions as a kind of Wiki -- one that supports the collaborative writing and running…
In this paper we present a profile-based approach to information filtering by an analysis of the content of text documents. The Wikipedia index database is created and used to automatically generate the user profile from the user document…
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that…
This study presents a comparative analysis of 55 Wikipedia language editions employing a citation index alongside a synthetic quality measure. Specifically, we identified the most significant Wikipedia articles within distinct topical…
Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc…
Domains like bioinformatics, version control systems, collaborative editing systems (wiki), and others, are producing huge data collections that are very repetitive. That is, there are few differences between the elements of the collection.…
Text simplification research has mostly focused on sentence-level simplification, even though many desirable edits - such as adding relevant background information or reordering content - may require document-level context. Prior work has…
With more than 11 times as many pageviews as the next largest edition, English Wikipedia dominates global knowledge access relative to other language editions. Readers are prone to assuming English Wikipedia as a superset of all language…
It is a long term desire of the computer users to minimize the communication gap between the computer and a human. On the other hand, almost all ICT applications store information in to databases and retrieve from them. Retrieving…
In the past decade, the DBpedia community has put significant amount of effort on developing technical infrastructure and methods for efficient extraction of structured information from Wikipedia. These efforts have been primarily focused…
In this work, we propose an automatic evaluation and comparison of the browsing behavior of Wikipedia readers that can be applied to any language editions of Wikipedia. As an example, we focus on English, French, and Russian languages…
Finding desired information from large data set is a difficult problem. Information retrieval is concerned with the structure, analysis, organization, storage, searching, and retrieval of information. Index is the main constituent of an IR…
This paper addresses the question of automatic data extraction from the Wiktionary, which is a multilingual and multifunctional dictionary. Wiktionary is a collaborative project working on the same principles as the Wikipedia. The…
Wiktionary is a unique, peculiar, valuable and original resource for natural language processing (NLP). The paper describes an open-source Wiktionary parser: its architecture and requirements followed by a description of Wiktionary features…
This work compares concept models for cross-language retrieval: First, we adapt probabilistic Latent Semantic Analysis (pLSA) for multilingual documents. Experiments with different weighting schemes show that a weighting method favoring…
With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically disperse public. This requires adequate tools to facilitate indexing, storage, and retrieval…
Much of work in semantic web relying on Wikipedia as the main source of knowledge often work on static snapshots of the dataset. The full history of Wikipedia revisions, while contains much more useful information, is still difficult to…
In this paper we present statistical analysis of English texts from Wikipedia. We try to address the issue of language complexity empirically by comparing the simple English Wikipedia (Simple) to comparable samples of the main English…
Wikipedia is the largest existing knowledge repository that is growing on a genuine crowdsourcing support. While the English Wikipedia is the most extensive and the most researched one with over five million articles, comparatively little…
Huge numbers of new words emerge every day, leading to a great need for representing them with semantic meaning that is understandable to NLP systems. Sememes are defined as the minimum semantic units of human languages, the combination of…