I’ve been having fun with Tesseract, an open source OCR engine. It works from the command line, taking image files (TIFF and JPEG work for me) and outputting plain text.
That’s all. It doesn’t do anything fancy like overlaying text on an image to generate a searchable PDF (it does output hOCR and handles multiple columns, so I assume the output can be processed further, although I’ve not looked into that). I assume most people who want to scan a document with OCR will want a facsimile of that document, just with searchable text.
That makes Tesseract’s usefulness a bit marginal. But on the other hand, I am a marginal usage case. I just want the text, nothing fancy. Why? Because (awful hipster that I am) I typed this on a typewriter.
Tesseract is very good at doing what it does. I’ve trialled commercial OCR software, and its accuracy on single-column text from my typewriter doesn’t come close to Tesseract’s. It’s not perfect, but it’s something I can live with, and typing on the Olympia beats staring at a screen.
So, if by chance you’re also a filthy typewriter fetishist who wants to use their machine more often but is held back by the need to get text in electronic format, give it a try. I can’t comment on Windows, but the Macports version installed just fine on both Snow Leopard and Tiger.
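If you'd rather script it than type the command each time, here is a minimal sketch in Python. It assumes the `tesseract` binary is on your PATH and uses the usual `tesseract IMAGE OUTPUTBASE` form, with the optional `hocr` config name for hOCR output. The function names and the `page1.tif` file name are my own, purely for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image_path, output_base, hocr=False):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE [hocr]."""
    command = ["tesseract", str(image_path), str(output_base)]
    if hocr:
        # The "hocr" config name asks for hOCR output instead of plain text.
        command.append("hocr")
    return command


def run_ocr(image_path, hocr=False):
    """OCR one image; Tesseract appends the file extension to the base name."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # page1.tif -> page1
    subprocess.run(build_command(image_path, output_base, hocr), check=True)
    return output_base
```

Calling `run_ocr("page1.tif")` should leave the plain text in `page1.txt` alongside the scan.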
Scanning settings are not something I’ve looked into too much; the best results seem to be using high contrast B&W for photographs, rather than default settings for documents. I confess however that I’ve not been too adventurous with my scanner, sticking with the default Canon drivers because I couldn’t get SANE to work just yet.
Tesseract does require a bit of post-processing. I’m happy to say the above text was produced with 100% accuracy (including typos); however, it did insert the odd line or two. The main frustration is the hard line breaks, e.g. the paragraph above beginning “So, if by chance” will output as
So, if by chance you’re also a filthy typewriter fetishist who
wants to use their machine more often but is held back by the
need to get text in electronic format, give it a try. I can’t
comment on Windows, but the Macports version installed just fine
on both Snow Leopard and Tiger.
The quickest fix is probably to shove it into Word and do a special Find/Replace to swap paragraph marks (^p; the OpenOffice equivalent is \n) with spaces.
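If you'd rather not open a word processor at all, the same clean-up can be sketched in a few lines of Python. This is only a rough version and it makes one assumption: that paragraphs in the OCR output are separated by blank lines. It does not try to rejoin words hyphenated across a line break. The function name is my own.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    # Split wherever there is at least one blank (or whitespace-only) line.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within a paragraph, collapse every run of whitespace,
    # including the hard line breaks, into a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


sample = (
    "wants to use their machine more often but is held back by the\n"
    "need to get text in electronic format, give it a try.\n"
    "\n"
    "Scanning settings are not something\n"
    "I've looked into too much.\n"
)
print(unwrap_paragraphs(sample))
```

To run it over a real file, read the text Tesseract produced, pass it through `unwrap_paragraphs`, and write the result back out.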