Friday, 18 April 2014

Tesseract

I’ve been having fun with Tesseract, an open source OCR engine. It works from the command line, taking image files (TIFF and JPEG work for me) and outputting plain text.
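For anyone who would rather script it than type the command each time, here is a minimal sketch in Python. It assumes the `tesseract` binary is on your PATH and uses the standard `tesseract IMAGE OUTPUT_BASE [CONFIG]` form; the function names and the `page1.tif` file name are my own inventions for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, config=None):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE [CONFIG]."""
    cmd = ["tesseract", str(image), str(output_base)]
    if config:
        # e.g. "hocr" asks for hOCR output instead of plain text
        cmd.append(config)
    return cmd


def ocr_to_text(image):
    """Run Tesseract on one image and return the recognised text."""
    image = Path(image)
    output_base = image.with_suffix("")  # page1.tif -> page1
    # check=True raises CalledProcessError if Tesseract exits with an error
    subprocess.run(tesseract_command(image, output_base), check=True)
    # Tesseract appends .txt to the output base for plain-text output
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_to_text("page1.tif")` should leave a `page1.txt` next to the scan and hand back its contents as a string.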

That’s all. It doesn’t do anything fancy like overlaying text on an image to generate a searchable PDF (it does output hOCR and handles multiple columns, so I assume the output can be processed further, although I’ve not looked into that). I assume most people who want to scan a document with OCR will want a facsimile of that document, just with searchable text.

That makes Tesseract’s usefulness a bit marginal. But on the other hand, I am a marginal usage case. I just want the text, nothing fancy. Why? Because (awful hipster that I am) I typed this on a typewriter.

Olympia

Tesseract is very good at doing what it does. I’ve trialled commercial OCR software, and its accuracy when scanning single-column text from my typewriter doesn’t come close to Tesseract’s. It’s not perfect, but it’s something I can live with, and typing on the Olympia beats staring at a screen.

So, if by chance you’re also a filthy typewriter fetishist who wants to use their machine more often but is held back by the need to get text in electronic format, give it a try. I can’t comment on Windows, but the Macports version installed just fine on both Snow Leopard and Tiger.

Scanning settings are not something I’ve looked into too much; the best results seem to be using high contrast B&W for photographs, rather than default settings for documents. I confess however that I’ve not been too adventurous with my scanner, sticking with the default Canon drivers because I couldn’t get SANE to work just yet.

Afterword:

Tesseract does require a bit of post-processing. I’m happy to say the above text was produced with 100% accuracy (including typos); however it did insert the odd line or two. The main frustration is the hard line breaks, e.g.

Blog Tesseract Crop

will output as

So, if by chance you’re also a filthy typewriter fetishist who
wants to use their machine more often but is held back by the
need to get text in electronic format, give it a try. I can’t
comment on Windows, but the Macports version installed just fine
on both Snow Leopard and Tiger.

The quickest way is probably to shove it into Word and do a special Find/Replace to swap paragraph marks (^p; the OpenOffice equivalent is \n) with spaces.
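If you would rather skip the word processor, the same clean-up can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: blank lines are treated as real paragraph breaks and every other line break is treated as a hard wrap. The function name `unwrap` is made up for this example.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines into paragraphs; blank lines stay as breaks."""
    # Split on one or more blank (or whitespace-only) lines
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within a paragraph, collapse every run of whitespace
    # (including newlines) into a single space
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Because blank lines count as paragraph breaks here, any stray blank line in the OCR output will still split a paragraph in two, and words hyphenated across a line end are not rejoined, so a quick read-through is still needed.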

Sunday, 19 January 2014

Wednesday, 30 October 2013

Happy Halloween!

The nice people who pay my salary have kept me busy of late, so October has been a lean month for posting. Still, it’s been chock full of actual stuff happening, so in no particular order:

The End

I finished Dreadful Secrets of Candlewick Manor, after a six-month hiatus between the penultimate session and the last one (whoops). I think the players liked it. There was way more mayhem than expected, several people died or nearly died, and the players forgot they were playing children and played the monsters I’d hoped they were going to become. I call that a result.

We came to the conclusion that ORE, or at least the way I ran it, has a bias in favour of hits to the face and head — so the top tip for MaoCT powergamers is to go for a PC with a really big forehead. I could analyse further but I’m unlikely to run an ORE game again, much as I respect the effort that’s gone into some of the titles.

(I do still like the simultaneous rolling and sets counting — but whereas Hollowpoint gets it right, I think ORE is a bit flawed).

Bundle of Holding

Check out the latest Bundle of Holding! I donated and got a whole lot of Cthulhu goodness for it. I really wanted the Trail of Cthulhu and Eldritch Skies titles, but I’m looking forward to reading the Cubicle 7 offerings as well, and the Kenneth Hite Tarot of Cthulhu is proper fun. Recommended, and it’s for two great charities: the Alzheimer’s Association and Cancer Research UK.

Hurry, you have a few days left!

Birthdays

My birthday was low key on account of being jet-lagged (though that didn’t stop me taking part in a fun quarterstaff class with Paul Wagner). This is what my lovely wife had waiting for me when I arrived back in the UK:

Traveller 2

It’s an Olympia Traveller de Luxe S, approximately as old as I am. It’s not too much bigger (or heavier) than either of my laptops:

Traveller

I expected that (a) I wouldn’t hit the keys hard enough, (b) I’d injure my fingers and (c) I wouldn’t be able to touch type but actually I’m doing fine on all 3 counts. Words are coming out mostly with the letters in the right order, and fast. The pages even scan OK for an electronic copy, though OCR is a bit hit and miss.

Traveller 1

(typing my impressions of Lacuna, for another post — soon!)

And that’s about it. I’ve been travelling for two weeks out of four, and you’ll be pleased to hear it’s nice and warm and sunny if you’re not in the UK. Well, based on my sample set of two.

More in November!