Related papers: Web Archive Analytics
Common Crawl is a multi-petabyte longitudinal dataset containing over 100 billion web pages which is widely used as a source of language data for sequence model training and in web science research. Each of its constituent archives is on…
Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future…
Curated web archive collections contain focused digital contents which are collected by archiving organizations to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In…
Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller…
Web archives preserve unique and historically valuable information. They hold a record of past events and memories published by all kinds of people, such as journalists, politicians and ordinary people who have shared their testimony and…
Web archives capture the history of the Web and are therefore an important source to study how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to…
The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries old theories of archival appraisal are being transplanted into the sociotechnical environment of…
Although the Internet Archive's Wayback Machine is the largest and most well-known web archive, there have been a number of public web archives that have emerged in the last several years. With varying resources, audiences and collection…
We present a framework for web-scale archiving of the dark web. While commonly associated with illicit and illegal activity, the dark web provides a way to privately access web information. This is a valuable and socially beneficial tool to…
The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose…
The World Wide Web is the most wide known information source that is easily available and searchable. It consists of billions of interconnected documents Web pages are authored by millions of people. Accesses made by various users to pages…
An enormous volume of security-relevant information is present on the Web, for instance in the content produced each day by millions of bloggers worldwide, but discovering and making sense of these data is very challenging. This paper…
Web archiving services play an increasingly important role in today's information ecosystem, by ensuring the continuing availability of information, or by deliberately caching content that might get deleted or removed. Among these, the…
Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful…
Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exists to allow institutions to develop, curate, and preserve collections of Web resources.…
Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide…
Web traffic is a valuable data source, typically used in the marketing space to track brand awareness and advertising effectiveness. However, web traffic is also a rich source of information for cybersecurity monitoring efforts. To better…
In recent years, journalists and other researchers have used web archives as an important resource for their study of disinformation. This paper provides several examples of this use and also brings together some of the work that the Old…
Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves,…
In order to evaluate, compare, and tune graph algorithms, experiments on well designed benchmark sets have to be performed. Together with the goal of reproducibility of experimental results, this creates a demand for a public archive to…