How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

Abstract

In the last decade, progress in OCR has triggered a massive trend towards the digitisation of legacy documents, with several Digital Humanities projects exploring means of structuring retro-digitised dictionaries. However, there is a lack of awareness of the impact of OCR quality on the information extraction process. In this work, we shed light on the relationship between these two steps through experiments carried out with a TEI-based system for automatic parsing of dictionaries.

Publication
19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond
Pedro Ortiz Suarez
Senior Research Scientist

I’m a Senior Research Scientist at the Common Crawl Foundation.