By Aimee Xu, published March 14, 2024

Using Tools to Automate the Extraction of Metadata and Text for Research

Over the summer and fall quarters, Lukas Hager, a fourth-year Mathematics of Computation student on the UCLA DataSquad, explored methodologies that could aid researchers in analyzing content in UCLA Library archives. One standard method, Optical Character Recognition (OCR), encompasses a range of technologies that use pattern-matching algorithms to convert scanned images of text into PDFs with searchable text. UCLA Library collections contain texts spanning many eras, media, and languages, from political campaign images to ancient transcriptions. This work is necessary for making the content of scanned documents and images accessible for further analysis by UCLA researchers and collaborators. An essential aspect of the project was preserving the original structure of the scanned materials: maintaining the fidelity of layouts in newspapers, manuscripts, and tabular data is crucial for accurate research and analysis. In this article, we outline the steps Lukas took to implement OCR technologies effectively, enhancing the accessibility and usefulness of archival documents for research.

Step 1: Google Tesseract - Finding an OCR Engine

Hager began his research with Tesseract, an OCR engine originally developed by Hewlett-Packard in the 1980s and later improved and maintained under Google's sponsorship. Tesseract is open-source software that converts scanned documents or images into editable text, and it supports over 100 languages. The quality of OCR results can vary, and additional training may be necessary for optimal performance in some languages; Tesseract's community continues to develop the software for better quality and language support. Despite its robust features, Tesseract can still produce typos and errors, notably with non-Latin scripts such as Arabic and Hebrew, which are read from right to left. Understanding and addressing these challenges is crucial, especially for projects involving diverse text types like those in UCLA's collections, and training the engine to handle these scripts better was a vital part of Hager's project.
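For readers who want to try this step themselves, here is a minimal sketch of running Tesseract from Python through the pytesseract wrapper, one common way to call the engine. The input file name is hypothetical, and Tesseract itself plus the relevant language packs (for example, `ara` for Arabic) must be installed separately.

```python
# A minimal sketch of OCR with the pytesseract wrapper around Tesseract.
from PIL import Image
import pytesseract

# Hypothetical scanned page; any PIL-readable image format works.
page = Image.open("scanned_page.png")

# The `lang` argument accepts one or more of Tesseract's 100+ language
# packs, joined with '+', e.g. English plus Arabic.
text = pytesseract.image_to_string(page, lang="eng+ara")
print(text)
```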

Step 2: Named Entity Recognition and Named Entity Linking (NER/NEL) - Linking the Metadata

[Image: example of Named Entity Linking connecting the mention "Paris" to Paris, France]

Using the text generated by OCR, Named Entity Recognition (NER) can be applied to identify and categorize entities in the extracted text. An entity is a word or group of words that refers to a single concept; common entity categories include Person, Organization, Place, and Event. These recognized entities are then connected to relevant entries in a knowledge base through a process known as Named Entity Linking (NEL), which associates each identified entity with a unique identifier in a structured database. Entity linking is especially useful for providing a high-level overview of a large text corpus: researchers who previously had to sift through thousands of pages can use NEL to quickly ascertain a body of work's central themes and relevance.
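To make the recognition step concrete, here is a short NER sketch using spaCy, a widely used open-source NLP library. spaCy is our illustration of the general technique, not a tool named in the project (the project's linking tool, ReFinED, appears below); it assumes the `en_core_web_sm` model has been downloaded.

```python
# A minimal NER sketch with spaCy. Requires:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Lukas Hager studied OCR methods at UCLA Library in Los Angeles.")

# Each recognized entity carries a category label such as PERSON,
# ORG (organization), or GPE (geopolitical entity, i.e. a place).
for ent in doc.ents:
    print(ent.text, ent.label_)
```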

Hager explained, “NEL programs identify named figures within a text and link them to a Wikipedia and Wikidata database.” He uses ReFinED, an end-to-end NEL system developed by Amazon Research that connects entity mentions in documents to their corresponding knowledge-base entities. Users of his OCR software benefit not only from having their documents converted into searchable text but also from having frequently mentioned terms hyperlinked to relevant database records. Lukas notes, “ReFinED is a great resource for hyperlinking mentions of a text, even with typos, but may not be very user-friendly.”
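Below is a sketch of calling ReFinED, following the usage shown in the project's public README (https://github.com/amazon-science/ReFinED). The model and entity-set names come from that README and may change between releases; the first call downloads model weights.

```python
# A sketch of Named Entity Linking with ReFinED, per its README.
from refined.inference.processor import Refined

refined = Refined.from_pretrained(
    model_name="wikipedia_model_with_numbers",  # README model name
    entity_set="wikipedia",
)

# Each returned span pairs a mention in the text with the Wikipedia/
# Wikidata entity it resolved to, when one can be found.
spans = refined.process_text("Paris is the capital of France.")
print(spans)
```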

Step 3: Analyzed Layout and Text Object (ALTO) - Displaying the Text

After comprehensive research on character recognition and entity linking, Hager integrated a user interface into his OCR pipeline to display the processed text. This enhancement aims to make metadata more accessible to UCLA researchers and offers a remedy for the opaque nature of Named Entity Linking. He explored ALTO and hOCR, two prevalent formats for OCR output. Analyzed Layout and Text Object (ALTO), an XML format maintained by the Library of Congress, was initially developed to describe the text and layout of pages in digitized materials; its goal is to record a page's layout in enough detail that the original content can be reconstructed. Similarly, hOCR, another format for representing OCR output, is favored by the Internet Archive, a non-profit digital library.
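As an illustration of how these formats are produced, Tesseract can emit both of them directly; the sketch below uses pytesseract's ALTO and hOCR helpers. It assumes Tesseract 4.1 or newer (required for ALTO export), and the input file name is hypothetical.

```python
# A sketch of producing layout-preserving ALTO and hOCR output with
# pytesseract. Both helpers return raw bytes.
from PIL import Image
import pytesseract

page = Image.open("scanned_page.png")  # hypothetical input file

# ALTO: XML recording both the recognized text and its position on the
# page, so the original layout can be reconstructed.
alto_xml = pytesseract.image_to_alto_xml(page)

# hOCR: an HTML-based format carrying the same kind of layout data.
hocr = pytesseract.image_to_pdf_or_hocr(page, extension="hocr")

with open("page.alto.xml", "wb") as f:
    f.write(alto_xml)
with open("page.hocr", "wb") as f:
    f.write(hocr)
```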

Step 4: Tabular Extraction - Organizing the OCR Output

The final step in the OCR project is to transform the output of the previous steps into a readable, accessible data structure such as a CSV file. Tabular extraction involves separating tables from large documents and, ideally, recognizing their individual rows, columns, and cells. This process allows users to work with text or numeric data from newspapers or magazines in an organized spreadsheet format.
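One simple starting point, sketched below, is Tesseract's word-level data output via pytesseract and pandas. This is not a dedicated table-detection tool: it recovers each word with its bounding box and confidence score, so rows and columns must still be inferred from the coordinates. File names here are illustrative, and pandas must be installed for the DataFrame output type.

```python
# A sketch of pulling word-level OCR output into a pandas DataFrame
# and saving it as a CSV for spreadsheet-style work.
from PIL import Image
import pytesseract

page = Image.open("scanned_table.png")  # hypothetical input file

# image_to_data returns one row per detected word, including its
# bounding box (left, top, width, height) and a confidence score.
df = pytesseract.image_to_data(page, output_type=pytesseract.Output.DATAFRAME)

# Drop empty detections and keep columns useful for rebuilding a table
# (block/line numbers and coordinates group words into rows and columns).
df = df.dropna(subset=["text"])
df = df[["block_num", "line_num", "left", "top", "text", "conf"]]
df.to_csv("extracted_table.csv", index=False)
```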

Lukas Hager’s project underscores the potential of OCR, NER/NEL, and tabular extraction tools to enhance the accessibility and research impact of the Library's digital collections. By converting scanned images into machine-readable text, identifying and linking named entities, and preserving the original structure of documents, these technologies pave the way for transforming library collections into computable resources, ready for novel research applications.