Converting images to text

Now we get to the interesting part: converting the images to text.
I’m using pgm2txt for this purpose. Running it in database-only mode (“-d”) ensures that we have full control. Relying on gocr’s internal heuristics produced many errors, e.g. “l” was read as “1”.

pgm2txt will ask you to enter the text whenever it cannot identify the characters. This means that in database-only mode you will have to enter every character at least once. That alone would not be a problem, but as the letters aren’t always clearly separated, you will often end up having to enter the text for combinations like “KWY” as well. Nevertheless, once the database has been trained for some time, all you need is patience and CPU power. After some time (1000 pictures take several hours) you will end up with a lot of text files, each containing one part of the base64-encoded file.
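
Once the text files exist, they still have to be glued back together and checked. The following is only a minimal sketch of how that could look in Python, not the exact script I used. It assumes the parts are named so that sorting the file names gives the right order (for example the hypothetical `part_0001.txt`, `part_0002.txt`, ...), and the function names are made up for illustration. Note that the alphabet check only catches characters that can never occur in base64; a misread like “l” versus “1” passes it, because both are valid base64 characters.

```python
import base64
from pathlib import Path

# Characters that may legally appear in standard base64 text.
VALID = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/="
)


def join_parts(directory, pattern="*.txt"):
    """Concatenate all OCR output files in file-name order.

    Assumption: file names sort into the correct order
    (e.g. zero-padded numbers). All whitespace is dropped.
    """
    chunks = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="ascii", errors="replace")
        chunks.append("".join(text.split()))
    return "".join(chunks)


def find_invalid(text):
    """Return (position, character) pairs that are not valid base64."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in VALID]


def decode(text):
    """Decode the joined text; raises binascii.Error on malformed input."""
    return base64.b64decode(text, validate=True)
```

A typical run would call `join_parts()` on the output directory, look at whatever `find_invalid()` reports, fix those spots by hand in the text files, and only then call `decode()` and write the result to disk.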
