Source criticism, bias, and representativeness in the digital age: A case study of digitized newspaper archives

Historians must critically scrutinize their sources, a task further complicated in the digital age by the need to evaluate the technical infrastructure of digital archives as well. This article critically examines digital newspaper archives. It reveals error rates in optical character recognition (OCR) that compromise the reliability of results, and shows that datasets based on word frequencies introduce biases arising from how the OCR corpus was shaped and later post-processed. Beyond these technical issues, copyright restrictions hinder access to crucial newspapers, and incomplete archives pose challenges of representativeness. Accessing datasets from different countries is cumbersome, commercial archives are costly, and uneven publication rates over time necessitate corrections. The use of digital archives thus presents new tasks: researchers need to account for the reliability of the digital source, which can often be achieved only in interdisciplinary working groups. Digital archives, for their part, must ensure transparency by detailing to researchers the technical manipulations performed on the original source.
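As an illustration of the kind of correction that uneven publication rates call for, the following is a minimal sketch, not the article's own method. It converts raw keyword hit counts into hits per million tokens for each year, so that years with more digitized text do not appear more prominent merely because of corpus size. The function name `relative_frequency`, its parameters, and all numbers are hypothetical and chosen only for the example.

```python
def relative_frequency(hits_per_year, tokens_per_year, per=1_000_000):
    """Return keyword hits per `per` tokens for each year.

    Both arguments are dicts keyed by year. Years with a missing or
    zero corpus size are skipped, since no rate can be computed.
    """
    rates = {}
    for year, hits in hits_per_year.items():
        total = tokens_per_year.get(year)
        if not total:
            continue
        rates[year] = hits * per / total
    return rates


# Made-up numbers: raw hits quadruple, but the corpus grows eightfold.
hits = {1900: 50, 1950: 200}
tokens = {1900: 1_000_000, 1950: 8_000_000}
rates = relative_frequency(hits, tokens)
```

In this invented case the raw counts suggest a fourfold rise, while the normalized rate halves (50 versus 25 hits per million tokens). Such normalization does not address OCR errors, which can lower hit counts unevenly across periods.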

16 pages. In: DHNB2024, From Experimentation to Experience: Lessons Learned from the Intersections between Digital Humanities and Cultural Heritage, May 29–31, 2024, Reykjavik, Iceland. Digital Humanities in the Nordic and Baltic Countries Publications, ISSN 2704-1441.
