A NLP Enhanced Visual Analytics Tool for Archives Metadata

Ozdemir, Anil; Müstecep, Dilara; Agaoglu, Orhan; Balcisoy, Selim

Date

2020

Authors

Ozdemir, Anil
Müstecep, Dilara
Agaoglu, Orhan
Balcisoy, Selim

Publisher

The Eurographics Association

Abstract

Today, almost all cultural heritage (CH) institutions are digitizing parts of their collections and archives to improve accessibility, preserve originals, and increase the publicity and visibility of the institution on the Internet. As a result, digital document collections have been multiplying. These collections span many domains, including art, history, mathematics, and physics, and they make a substantial volume of documents digitally available. This creates the need for approaches that allow users to understand latent meanings in collections, discover and investigate relationships, and extract the necessary information. To address this need, we introduce a visual exploratory tool that helps uncover hidden information and stories underlying documents by extracting the key individuals, temporal expressions, locations, entities, and keywords within the documents and by establishing a network between documents. The tool allows researchers and archivists to form and test hypotheses and to observe individual relationships, networks, and stories present in archive metadata collections. We have designed and developed this visual exploration tool for large archives with limited metadata, employing state-of-the-art Natural Language Processing (NLP) techniques to assist cultural heritage researchers. To design the tool, we collaborated with archive professionals from a cultural institution, SALT (https://saltonline.org/), which focuses on public service and produces research-based exhibitions, publications, and digitization projects. Following our conversations with the SALT team, we decided to use Waqfs of Crete, an archive consisting of official records of the Muslim inhabitants of Crete.
The documents, which span the period from 1825 to 1928 and are written in Ottoman Turkish and Greek, provide an opportunity to examine the multi-layered social structure on the island, especially from a cultural and economic perspective. The metadata covers approximately 10 thousand documents and includes each document's summary, year of publication, location, language, and picture. We extracted various features, including locations, key individuals, dates, entities, and keywords, from the document summaries in the metadata using NLP methods, including regular expressions for extraction and word embedding models for capturing similarities between documents. We have integrated all of these features into the tool to let the user see networks that represent the relationships between documents and easily access similar documents in the archive. In the network, each node corresponds to a document. To assign a weighted edge between two documents, the total number of individuals and keywords they share is computed, and edges are set based on a predetermined threshold value. This threshold was found by manual tuning, considering both the speed at which the result is reflected in the application and the average number of shared attributes. To capture similarity between documents, we used state-of-the-art embedding models, including Word2vec, FastText, and Transformer models, which provide a way to compute dense vector representations for documents. Each document was thus represented as a fixed-size vector output by each model, and the similarity between documents was calculated as the cosine similarity of their vectors.
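The edge construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_edges`, the dictionary layout, the sample records, and the choice to keep an edge when the shared count is greater than or equal to the threshold are all assumptions made for the example.

```python
from itertools import combinations


def build_edges(docs, threshold):
    """Return {(id_a, id_b): weight} for document pairs that share enough attributes.

    The weight is the number of shared individuals plus the number of shared
    keywords. Keeping pairs with weight >= threshold is an assumption; the
    abstract only says that a predetermined threshold is applied.
    """
    edges = {}
    for id_a, id_b in combinations(sorted(docs), 2):
        a, b = docs[id_a], docs[id_b]
        shared = len(a["individuals"] & b["individuals"]) + len(a["keywords"] & b["keywords"])
        if shared >= threshold:
            edges[(id_a, id_b)] = shared
    return edges


# Hypothetical records; names and keywords are invented for illustration only.
docs = {
    "doc1": {"individuals": {"Person A", "Person B"}, "keywords": {"waqf", "rent"}},
    "doc2": {"individuals": {"Person A"}, "keywords": {"waqf", "sale"}},
    "doc3": {"individuals": {"Person C"}, "keywords": {"rent"}},
}

edges = build_edges(docs, threshold=2)
```

With these sample records, only `doc1` and `doc2` are linked, since they share one individual and one keyword. Comparing all pairs is quadratic in the number of documents, which may be one reason the threshold was tuned with application speed in mind.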
The interface consists of six components, including an interactive map that allows the user to view documents in different locations and to view the document networks formed by calculating the total number of shared attributes between documents. The remaining components include an information box that contains document-specific attributes such as location, time, person, entities, and keywords, a document browser that enables users and researchers to browse documents easily, an individual and keyword search menu, and a filtering panel. In this way, users can quickly find documents that are roughly related to each other. The user can then browse each document in its network and view documents that share individuals and keywords. Thus, the user can follow the interactions between documents like a story, and can do this for all the people who lived on the island of Crete in the 19th century.
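The abstract states that document similarity is computed as the cosine similarity of fixed-size document vectors. A minimal sketch of that step, assuming the vectors have already been produced by some embedding model, might look like this. The helper names `cosine_similarity` and `most_similar`, the `top_k` parameter, and the toy two-dimensional vectors are hypothetical and stand in for real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, vectors, top_k=3):
    """Rank all other documents by cosine similarity to the query document."""
    query = vectors[query_id]
    scores = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors standing in for embedding output (assumption, not real data).
vectors = {"doc1": [1.0, 0.0], "doc2": [2.0, 0.0], "doc3": [0.0, 1.0]}
ranking = most_similar("doc1", vectors, top_k=2)
```

Here `doc2` ranks first for `doc1` because the two vectors point in the same direction, while `doc3` is orthogonal and scores zero. Cosine similarity ignores vector length, so documents are compared by direction in the embedding space rather than by magnitude.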

@inproceedings{10.2312:gch.20201297,
  booktitle = {Eurographics Workshop on Graphics and Cultural Heritage},
  editor = {Spagnuolo, Michela and Melero, Francisco Javier},
  title = {{A NLP Enhanced Visual Analytics Tool for Archives Metadata}},
  author = {Ozdemir, Anil and Müstecep, Dilara and Agaoglu, Orhan and Balcisoy, Selim},
  year = {2020},
  publisher = {The Eurographics Association},
  ISSN = {2312-6124},
  ISBN = {978-3-03868-110-6},
  DOI = {10.2312/gch.20201297}
}

URI

https://doi.org/10.2312/gch.20201297
https://diglib.eg.org:443/handle/10.2312/gch20201297

Collections

GCH 2020 - Eurographics Workshop on Graphics and Cultural Heritage
