How-To Guide: Using Text Editors and Grammar Tools to Clean Up OCR Text

Naomi Salmon

Depending on the text you are converting, you may already have access to a high-quality, live-text scan. If you are working with a scan you made yourself, however, you may need to run OCR (“Optical Character Recognition”) on your text and then hand-correct the results. OCR software has improved in recent years, and there are a number of programs you can use to recognize text in an image. Adobe Acrobat does a decent job of recognizing text in many cases, but it isn’t always as adept at recognizing Victorian fonts as it is at recognizing the standard 21st-century font sets we work with most often today.

In some cases, you may have to hand-correct substantial portions of text, which may be fine for shorter articles but can be a particular challenge for novels. If you’re working with a popular text and can find the same edition in a web archive or database, you may be able to work from existing OCR text rather than reinventing the wheel by hand-editing all of your scans yourself. While some databases make copyright or end-user license agreement claims over their digitized scans, faithful transcriptions of public domain texts aren’t protected by copyright law. This means that you should be able to replicate and then clean up the OCR text from such databases to use as the starting point for your participatory editions.

One thing to note: The internet is filled with cataloging errors, so even when a text has a title or metadata that appears to align with your chosen edition, it’s a good idea to verify by comparing significant portions of your scans to any existing OCR text you’re able to locate.
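For longer passages, this kind of spot check can also be scripted. The following is a minimal sketch in Python using the standard library's difflib module; the function names (normalize, similarity, word_differences) are hypothetical and chosen for illustration, and the approach is offered as one possible aid to a manual comparison, not as a replacement for it.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(scan_text: str, archive_text: str) -> float:
    """Return a ratio between 0.0 and 1.0 for how closely two passages match."""
    matcher = difflib.SequenceMatcher(
        None, normalize(scan_text), normalize(archive_text), autojunk=False
    )
    return matcher.ratio()


def word_differences(scan_text: str, archive_text: str) -> list[str]:
    """List words found only in the scan ("- ") or only in the archive ("+ ")."""
    scan_words = normalize(scan_text).split()
    archive_words = normalize(archive_text).split()
    return [
        entry
        for entry in difflib.ndiff(scan_words, archive_words)
        if entry.startswith(("- ", "+ "))
    ]
```

A low similarity score for a passage that should match, or a cluster of differing words around chapter headings, may suggest that the archive text comes from a different edition than your scan. What counts as "low" depends on the quality of both OCR passes, so treat any threshold as a judgment call.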

For example, while Project Gutenberg is excellent in many ways, its text transcriptions can be inconsistent, sometimes leaving out references to a novel’s edition entirely and sometimes hybridizing multiple editions of a text. This was the case with its text of Wilkie Collins’s The Woman in White. The dominant Project Gutenberg text doesn’t specify any edition or publication information, but from variations in chapter details, it appears to be a combination of the 1860 three-volume edition and a later one-volume text published in the 1870s.

The following guide outlines ways to make your editing process more efficient once you have a tolerably accurate OCR scan to work with. In many cases, even fairly clean text will still have odd spacing and the occasional misplaced punctuation. Two free tools I have found useful for catching the ‘low-hanging fruit’ errors in these scans are the text editor Sublime Text and the browser plugin Grammarly. However, the following general strategies will also work with Word and other text editors or spellcheckers.
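As an illustration of the kind of spacing and punctuation errors described above, here is a minimal Python sketch that applies a few conservative regular-expression fixes. The function name tidy_ocr_text and the particular patterns are assumptions made for this example, not a transcription of the steps shown in the video; similar patterns could likely be adapted to the regular-expression find-and-replace feature in a text editor.

```python
import re


def tidy_ocr_text(text: str) -> str:
    """Apply a few conservative fixes for common OCR spacing errors."""
    # Rejoin words split by a hyphen at a line break: "recog-\nnize" -> "recognize".
    # Caution: this also joins genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove stray spaces before closing punctuation: "word ," -> "word,".
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    # Strip spaces left at the ends of lines.
    text = re.sub(r" +\n", "\n", text)
    return text.strip()
```

For example, tidy_ocr_text("The  house ,  which was recog-\nnized ; stood  \nalone .") returns "The house, which was recognized; stood\nalone." Because some nineteenth-century typesetting deliberately places a space before semicolons or colons, it is worth checking a sample against your scans before running fixes like these across a whole novel.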

The interactive video below is silent with text accompaniment.

(A caveat: at two points, the video speed accelerates in a way that might affect viewers with photosensitivity.)

License

Icon for the Creative Commons Attribution 4.0 International License