Towards a better QA process: Automatic detection of quality problems in archived websites using visual comparisons
For web archivists, Quality Assurance (QA) is a lengthy manual process that involves inspecting hundreds or thousands of archived websites to see whether they have been captured correctly, i.e., whether they resemble the original. This paper describes how this process can be automated by using image comparison measures to detect quality problems in archived websites. To this end, we created a suite of Python tools to 1) create screenshots of live websites and their archived counterparts, and 2) calculate the image similarity between the screenshots. We ran our code on four web archive collections to evaluate the efficacy and usefulness of six different image similarity measures. We compared their scores to human judgments of the quality of archived websites obtained from Amazon Mechanical Turk (AMT). Our results show that the Structural Similarity Index (SSIM) and the Normalized Root Mean Square Error (NRMSE) scores are able to distinguish between high-quality and low-quality archived websites. At every step, our research was informed by the specific needs and challenges of web archivists. Methods such as the one presented here can allow cultural heritage institutions or researchers to detect low-quality content more quickly and effectively and to produce high-quality web archives.
- Date created
- 2025-02-20
- Type of Item
- Conference/Workshop Presentation
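
The image comparison step described in the abstract can be illustrated with a minimal sketch. This is not the authors' tool suite: the function names `nrmse` and `global_ssim` are hypothetical, the images are assumed to be already converted to flat lists of grayscale pixel values (0 to 255) of equal size, and the SSIM shown here is a simplified single-window variant computed over the whole image rather than the usual sliding-window average. In practice, a library such as scikit-image (`skimage.metrics.structural_similarity`, `skimage.metrics.normalized_root_mse`) would likely be used instead.

```python
import math


def nrmse(reference, candidate):
    """Root mean square error divided by the RMS magnitude of the reference.

    0.0 means the two images are identical; larger values mean less similar.
    """
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    n = len(reference)
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / n
    denom = math.sqrt(sum(r * r for r in reference) / n)
    if denom == 0:
        # All-black reference: identical images score 0, anything else is unbounded.
        return 0.0 if mse == 0 else math.inf
    return math.sqrt(mse) / denom


def global_ssim(reference, candidate, data_range=255.0):
    """Simplified SSIM using one window that covers the whole image.

    1.0 means identical; values near 0 mean little structural similarity.
    """
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixels")
    n = len(reference)
    mu_x = sum(reference) / n
    mu_y = sum(candidate) / n
    var_x = sum((x - mu_x) ** 2 for x in reference) / n
    var_y = sum((y - mu_y) ** 2 for y in candidate) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(reference, candidate)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy data: a "live" screenshot and an archived capture that rendered blank white.
live = [0, 50, 100, 150, 200, 250]
archived_blank = [255] * 6

print("NRMSE:", nrmse(live, archived_blank))
print("SSIM :", global_ssim(live, archived_blank))
```

The two measures point in opposite directions: a good capture should yield a high SSIM and a low NRMSE, so any threshold used to flag low-quality captures would need to be set separately for each measure.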