Related papers: Web Archive Analytics

Improved methodology for longitudinal Web analytics using Common Crawl

Common Crawl is a multi-petabyte longitudinal dataset containing over 100 billion web pages which is widely used as a source of language data for sequence model training and in web science research. Each of its constituent archives is on…

Networking and Internet Architecture · Computer Science 2024-04-16 Henry S. Thompson

ArchiveWeb: collaboratively extending and exploring web archive collections - How would you like to work with your collections?

Curated web archive collections contain focused digital content which is collected by archiving organizations, groups, and individuals to provide a representative sample covering specific topics and events to preserve them for future…

Digital Libraries · Computer Science 2017-02-03 Zeon Trevor Fernando, Ivana Marenzi, Wolfgang Nejdl

ArchiveWeb: Collaboratively Extending and Exploring Web Archive Collections

Curated web archive collections contain focused digital contents which are collected by archiving organizations to provide a representative sample covering specific topics and events to preserve them for future exploration and analysis. In…

Digital Libraries · Computer Science 2017-02-02 Zeon Trevor Fernando, Ivana Marenzi, Wolfgang Nejdl, Rishita Kalyani

ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller…

Digital Libraries · Computer Science 2017-02-06 Helge Holzmann, Vinay Goel, Avishek Anand

Big Data Science Over the Past Web

Web archives preserve unique and historically valuable information. They hold a record of past events and memories published by all kinds of people, such as journalists, politicians and ordinary people who have shared their testimony and…

Digital Libraries · Computer Science 2021-08-04 Miguel Costa, Julien Masanès

Analyzing Web Archives Through Topic and Event Focused Sub-collections

Web archives capture the history of the Web and are therefore an important source to study how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to…

Digital Libraries · Computer Science 2016-12-20 Gerhard Gossen, Elena Demidova, Thomas Risse

Bots, Seeds and People: Web Archives as Infrastructure

The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries old theories of archival appraisal are being transplanted into the sociotechnical environment of…

Digital Libraries · Computer Science 2016-11-09 Ed Summers, Ricardo Punzalan

How Much of the Web Is Archived?

Although the Internet Archive's Wayback Machine is the largest and most well-known web archive, there have been a number of public web archives that have emerged in the last several years. With varying resources, audiences and collection…

Digital Libraries · Computer Science 2013-01-08 Scott G. Ainsworth, Ahmed AlSum, Hany SalahEldeen, Michele C. Weigle, Michael L. Nelson

Introducing A Dark Web Archival Framework

We present a framework for web-scale archiving of the dark web. While commonly associated with illicit and illegal activity, the dark web provides a way to privately access web information. This is a valuable and socially beneficial tool to…

Digital Libraries · Computer Science 2021-07-12 Justin F. Brunelle, Ryan Farley, Grant Atkins, Trevor Bostic, Marites Hendrix, Zak Zebrowski

Modeling Updates of Scholarly Webpages Using Archived Data

The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose…

Digital Libraries · Computer Science 2021-04-29 Yasith Jayawardana, Alexander C. Nwala, Gavindya Jayawardena, Jian Wu, Sampath Jayarathna, Michael L. Nelson, C. Lee Giles

Web Usage mining framework for Data Cleaning and IP address Identification

The World Wide Web is the most widely known information source that is easily available and searchable. It consists of billions of interconnected documents. Web pages are authored by millions of people. Accesses made by various users to pages…

Databases · Computer Science 2014-08-26 Priyanka Verma, Nishtha Kesswani

Web Analytics for Security Informatics

An enormous volume of security-relevant information is present on the Web, for instance in the content produced each day by millions of bloggers worldwide, but discovering and making sense of these data is very challenging. This paper…

Social and Information Networks · Computer Science 2013-01-01 Kristin Glass, Richard Colbaugh

Understanding Web Archiving Services and Their (Mis)Use on Social Media

Web archiving services play an increasingly important role in today's information ecosystem, by ensuring the continuing availability of information, or by deliberately caching content that might get deleted or removed. Among these, the…

Computers and Society · Computer Science 2018-04-10 Savvas Zannettou, Jeremy Blackburn, Emiliano De Cristofaro, Michael Sirivianos, Gianluca Stringhini

Building and Querying Semantic Layers for Web Archives (Extended Version)

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful…

Digital Libraries · Computer Science 2018-10-25 Pavlos Fafalios, Helge Holzmann, Vaibhav Kasturia, Wolfgang Nejdl

Stories From the Past Web

Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources.…

Digital Libraries · Computer Science 2017-05-18 Yasmin AlNoamany, Michele C. Weigle, Michael L. Nelson

ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph

Archiving the web is socially and culturally critical, but presents problems of scale. The Internet Archive's Wayback Machine can replay captured web pages as they existed at a certain point in time, but it has limited ability to provide…

Information Retrieval · Computer Science 2013-06-12 Ahmed AlSum, Michael L. Nelson

Using Google Analytics to Support Cybersecurity Forensics

Web traffic is a valuable data source, typically used in the marketing space to track brand awareness and advertising effectiveness. However, web traffic is also a rich source of information for cybersecurity monitoring efforts. To better…

Information Retrieval · Computer Science 2019-04-04 Han Qin, Kit Riehle, Haozhen Zhao

The Use of Web Archives in Disinformation Research

In recent years, journalists and other researchers have used web archives as an important resource for their study of disinformation. This paper provides several examples of this use and also brings together some of the work that the Old…

Digital Libraries · Computer Science 2023-06-19 Michele C. Weigle

The Many Shapes of Archive-It

Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves,…

Digital Libraries · Computer Science 2021-01-26 Shawn M. Jones, Alexander Nwala, Michele C. Weigle, Michael L. Nelson

The Open Graph Archive: A Community-Driven Effort

In order to evaluate, compare, and tune graph algorithms, experiments on well designed benchmark sets have to be performed. Together with the goal of reproducibility of experimental results, this creates a demand for a public archive to…

Data Structures and Algorithms · Computer Science 2011-11-10 Christian Bachmaier, Franz J. Brandenburg, Philip Effinger, Carsten Gutwenger, Jyrki Katajainen, Karsten Klein, Miro Spönemann, Matthias Stegmaier, Michael Wybrow