From Scanned Pages to Semantic Graphs: Scalable Methods for Extracting Historical and Cultural Knowledge Across Heterogeneous Texts

Date
2025
Publisher
The Eurographics Association
Abstract
We present a multilayered methodology for processing digitized historical texts, enabling cross-relational analysis across time periods, languages, and subject domains. Drawing from multiple digital humanities (DH) platforms (Tsadikim, Two Enlightenments, Corporeality), we demonstrate an integrated pipeline combining adaptive OCR, noise-tolerant keyword extraction, and named entity recognition (NER). Custom preprocessing and fuzzy matching techniques allow for meaningful text recovery from degraded scans in Polish, German, and Yiddish. Data are enriched with spatial and temporal metadata, indexed by topic and linked across projects. The resulting datasets support trend analysis, social network modeling, and discourse mapping. Our approach enables researchers to trace linguistic shifts and intellectual networks over centuries without manual review of source pages. This workflow facilitates interoperable exploration of cultural data and demonstrates how machine learning can assist in recovering semantic relationships from fragmented historical records. The methodology was tested on Enlightenment-era and early 20th-century journals, revealing both technical challenges and insights into evolving ideological, medical, and theological vocabularies.
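The abstract mentions fuzzy matching for recovering keywords from degraded scans but gives no implementation details. As an illustration only, the following minimal sketch shows one way noise-tolerant keyword matching could work, using Python's standard `difflib` as a stand-in for whatever technique the authors actually use. The function names `normalize` and `fuzzy_find`, the 0.8 threshold, and the sample text are all assumptions made for this example.

```python
from difflib import SequenceMatcher
import re
import unicodedata


def normalize(token: str) -> str:
    """Lowercase and drop combining marks so lost accents do not block a match."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fuzzy_find(text: str, keywords: list[str], threshold: float = 0.8):
    """Return (keyword, matched_token, score) for every token similar to a keyword.

    The threshold is a hypothetical default; a real pipeline would tune it
    per language and per scan quality.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for keyword in keywords:
        target = normalize(keyword)
        for token in tokens:
            score = SequenceMatcher(None, target, normalize(token)).ratio()
            if score >= threshold:
                hits.append((keyword, token, round(score, 2)))
    return hits


# Hypothetical OCR output with a misread letter and missing diacritics.
ocr_text = "Die Aufklarnng und Oswiecenie"
for keyword, token, score in fuzzy_find(ocr_text, ["Aufklärung", "Oświecenie"]):
    print(keyword, "~", token, score)
```

A character-level similarity ratio is only one option; edit distance or n-gram overlap would fit the same role, and the choice would depend on the error profile of the OCR output.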
Description

CCS Concepts: Information systems → Digital libraries and archives; Computing methodologies → Natural language processing; Machine learning; Applied computing → Arts and humanities; Digital humanities; Human-centered computing → Visualization; Theory of computation → Ontologies

        
@inproceedings{10.2312:dh.20253133,
  booktitle = {Digital Heritage},
  editor = {Campana, Stefano and Ferdani, Daniele and Graf, Holger and Guidi, Gabriele and Hegarty, Zackary and Pescarin, Sofia and Remondino, Fabio},
  title = {{From Scanned Pages to Semantic Graphs: Scalable Methods for Extracting Historical and Cultural Knowledge Across Heterogeneous Texts}},
  author = {Malak, Piotr and Letowska, Agnieszka and Wodzinski, Jan},
  year = {2025},
  publisher = {The Eurographics Association},
  ISBN = {978-3-03868-277-6},
  DOI = {10.2312/dh.20253133}
}