Monday 16 September 2013

How to get text from a screenshot using OCR

An interesting conundrum today - how to get a plain text copy of a long list of options in a bit of software that isn't copy/paste-able.
OCR This! :-)

I should point out before I start that I'm using Ubuntu 12.04.

So, anyway, I screen-captured the areas of the screen that formed the list into a series of image files. At this point I tried OCRing one of them straight off the bat, using the Tesseract OCR engine:



tesseract input.png output

It didn't go too well. Tesseract writes its result to output.txt, which looked a bit like this:

tH3 i tt s3f 1 
5ce eg p e3t s 1nage amd
c0mf 1 1 b@d c tot ck!


Not the most useful.

I figured it couldn't read the images because they were too small or blurry, so I scaled them up to twice the size of the original and reduced the colour depth from 16 million colours to just 2 - black and white.
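The post doesn't say which tool did the scaling and colour reduction, so here is a minimal sketch assuming ImageMagick's convert command is installed (newer ImageMagick releases call it magick). The function name preprocess and the 50% threshold are my own illustrative choices, not something from the original workflow; a different threshold may suit darker or lighter screenshots.

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick's `convert` is on the PATH.
# `preprocess` is a hypothetical helper name, not part of any tool.
preprocess() {
    # $1 = original screenshot, $2 = file to write the cleaned-up image to
    # -resize 200%      doubles the width and height
    # -colorspace Gray  drops the colour information
    # -threshold 50%    forces every pixel to pure black or pure white
    convert "$1" -resize 200% -colorspace Gray -threshold 50% "$2"
}
```

Calling preprocess input.png input2.png would produce the kind of file used in the second tesseract run.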

I tried it again:

tesseract input2.png output2

And this time, it was more like this:


This is a test to see if l can 
screengrab text as an image and
convert it back to taxt!



So, still not quite 100% (it read "I" as "l" and "text" as "taxt") but very, very close. Good enough for my purposes.
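Since the list was spread over a series of image files, the same command can be looped over all of them. This is a rough sketch, assuming the screenshots are PNGs sitting in one directory; the function name ocr_all and the file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes tesseract is on the PATH and the images are *.png.
# `ocr_all` is a hypothetical helper name.
ocr_all() {
    # $1 = directory holding the screenshots, $2 = combined text file to write
    : > "$2"                          # start with an empty output file
    for img in "$1"/*.png; do
        [ -e "$img" ] || continue     # nothing to do if the glob matched no files
        base="${img%.png}"            # tesseract appends .txt to this base name
        # Only append the text if the OCR run succeeded
        tesseract "$img" "$base" >/dev/null 2>&1 && cat "$base.txt" >> "$2"
    done
}
```

Something like ocr_all ./screenshots list.txt would then leave the whole list in a single file, in the alphabetical order of the image names.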

To install tesseract, if you are using Ubuntu 12.04 (or similar):

sudo apt-get install tesseract-ocr


