Latin OCR

This project aims to develop high-quality OCR processes and results for digitized Latin books. In the first phase, we are running OCR on 21,509 Internet Archive books identified as potentially containing Latin content, using Tesseract 3.04.00 with the v0.3.0 Latin and v2.0 Ancient Greek language training files. For an overview of the project, see this poster PDF or this presentation.
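As a rough illustration of what running Tesseract with both training files on a single page might look like, here is a minimal Python sketch. It is not the project's actual pipeline. It assumes the `tesseract` executable is on the PATH, that the Latin and Ancient Greek training files are installed under the language codes `lat` and `grc`, and that hOCR output is wanted. The helper names `build_tesseract_command` and `ocr_page` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, languages=("lat", "grc")):
    """Return the argument list for one Tesseract run on a page image.

    The language codes are an assumption: they must match the names of the
    installed .traineddata files.
    """
    # Tesseract accepts several language models joined with "+" in -l.
    lang_arg = "+".join(languages)
    # The trailing "hocr" selects Tesseract's hOCR output config (assumed here).
    return ["tesseract", str(image), str(output_base), "-l", lang_arg, "hocr"]


def ocr_page(image, output_base):
    """Run Tesseract on one page image, raising if the run fails."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base), check=True)
```

Calling `ocr_page("page_0001.png", "page_0001")` would then process one (hypothetical) page image. Listing `lat` first is a choice made for this sketch, on the assumption that Latin is the main language of the page.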

This project is run by Ryan Baumann of the Duke Collaboratory for Classics Computing.

Search

Search the current OCR results by entering terms in the form below: