Detecting and Diagnosing Errors in Serving Archived Web Pages

Jingyuan Zhu, University of Michigan; Huanchen Sun and Harsha V. Madhyastha, University of Southern California

Community Award Winner!

Web archives crawl and save copies of pages from the web, enabling users to interact with web pages in the form in which they existed in the past. Prior to serving any archived page, an archive rewrites the page’s source so that users’ browsers fetch the page’s resources from the archive, not from the servers that originally hosted them. However, on many modern pages, an archive’s edits to crawled scripts result in a loss of fidelity, i.e., an archived copy fails to accurately mimic the original page even when the archive had crawled all resources on the page.
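As a rough illustration of the rewriting step described above, the following Python sketch points the `src` and `href` attributes of a crawled page at an archive. It is a minimal sketch under stated assumptions, not the rewriting logic of any particular archival system: the archive prefix, the timestamp, and the function name `rewrite_html` are all hypothetical, and a regular expression stands in for a real HTML parser.

```python
import re
from urllib.parse import urljoin

# Hypothetical archive prefix and capture timestamp, for illustration only.
ARCHIVE_PREFIX = "https://archive.example/web"
TIMESTAMP = "20240101000000"

# Matches src="..." or href="..." attributes in HTML (a simplification;
# a real rewriter would use an HTML parser).
ATTR_RE = re.compile(r'\b(src|href)=(["\'])(.*?)\2', re.IGNORECASE)


def rewrite_html(html: str, page_url: str) -> str:
    """Point every http(s) src/href at the archive instead of the origin."""

    def repl(match: re.Match) -> str:
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        # Resolve relative URLs against the page's original URL first.
        absolute = urljoin(page_url, url)
        if not absolute.startswith(("http://", "https://")):
            # Leave data:, javascript:, mailto:, etc. untouched.
            return match.group(0)
        return f"{attr}={quote}{ARCHIVE_PREFIX}/{TIMESTAMP}/{absolute}{quote}"

    return ATTR_RE.sub(repl, html)


if __name__ == "__main__":
    page = '<script src="/js/app.js"></script>'
    print(rewrite_html(page, "https://example.com/index.html"))
```

A static attribute rewrite like this cannot see URLs that a script assembles at run time, which is one plausible reason an archive must also edit the scripts themselves, the step the abstract identifies as the source of fidelity loss.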

To help the developers of archival systems identify and fix the bugs that result in incorrect rewrites of crawled pages, we present FidEx. First, FidEx enables accurate identification of the pages on which an archive violates fidelity. It does so by tracking and comparing the execution of scripts between when a page is crawled and when its copy is loaded. In comparison to existing methods, which compare the two loads using either screenshots or the errors reported by the browser, FidEx reduces the false positive rate from around 70% to less than 10%. Second, on every page on which it identifies a loss of fidelity, FidEx pinpoints which subset of the archive’s edits to the page is erroneous. Leveraging this input to fix bugs in the most widely used archival system, we reduced the fraction of archived pages that violate fidelity from 15% to 9%.
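To make the idea of comparing script execution across the two loads more concrete, here is a toy Python sketch. It is not FidEx's algorithm, which the abstract does not detail. It assumes, purely for illustration, that each load has already been summarized as a list of hypothetical `(script URL, operation)` records, and it reports the operations seen at crawl time that the archived load failed to reproduce.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record format: (script URL, operation the script performed).
Event = Tuple[str, str]


def missing_events(crawl: Iterable[Event], replay: Iterable[Event]) -> List[Event]:
    """Return events recorded during the crawl that are absent from the replay.

    Multiset subtraction is used so that an operation performed twice at
    crawl time but only once on replay is still reported once.
    """
    diff = Counter(crawl) - Counter(replay)
    return sorted(diff.elements())


if __name__ == "__main__":
    crawl_trace = [
        ("a.js", "appendChild:div#menu"),
        ("b.js", "setAttribute:img.src"),
    ]
    replay_trace = [("a.js", "appendChild:div#menu")]
    # An empty result would suggest the archived load mirrored the crawl.
    print(missing_events(crawl_trace, replay_trace))
```

Comparing what scripts did, rather than pixels or console errors, sidesteps differences that do not reflect a real fidelity loss, such as a rotating advertisement changing a screenshot. That intuition is consistent with the lower false positive rate reported above, though the actual signals FidEx tracks may differ from this sketch.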

BibTeX
@inproceedings {316768,
author = {Jingyuan Zhu and Huanchen Sun and Harsha V. Madhyastha},
title = {Detecting and Diagnosing Errors in Serving Archived Web Pages},
booktitle = {23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI 26)},
year = {2026},
isbn = {978-1-939133-54-0},
address = {Renton, WA},
pages = {669--684},
url = {https://www.usenix.org/conference/nsdi26/presentation/zhu-jingyuan},
publisher = {USENIX Association},
month = may
}
