Now we get to the interesting part:
Converting the images to text.
I’m using pgm2txt for this purpose. Using database-only mode (“-d”) ensures that we have full control over the recognition. Relying on gocr’s internal heuristics gave many errors, e.g. “l” was read as “1”.
pgm2txt will ask you to enter the displayed text whenever it cannot identify the characters. This means that in database-only mode you will have to enter every character at least once. That alone wouldn’t be a problem, but as the letters aren’t always clearly separated, you will often end up having to enter the text for combinations like “KWY” as well. Nevertheless, after having trained the database for some time, all you need is patience and CPU power. After some time (1000 pics => several hours) you will end up with a lot of text files, each containing one part of the base64-encoded file.
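Before decoding, it can help to check the recognized text for characters that cannot occur in base64 at all, since those point directly at OCR mistakes. The following is only a rough sketch of how the parts might be joined and checked in Python. The directory layout and file naming in read_parts (zero-padded names such as part0001.txt, so that sorting by name gives the right order) are assumptions for illustration, not something pgm2txt guarantees.

```python
import base64
import re
from pathlib import Path

# One character of the standard base64 alphabet (RFC 4648), including padding.
BASE64_CHAR = re.compile(r"[A-Za-z0-9+/=]")


def join_parts(parts):
    """Concatenate the OCR text fragments, dropping all whitespace."""
    return "".join("".join(part.split()) for part in parts)


def find_bad_chars(text):
    """Return (offset, char) for every character outside the base64 alphabet."""
    return [(i, c) for i, c in enumerate(text) if not BASE64_CHAR.fullmatch(c)]


def decode_parts(parts):
    """Join the fragments, refuse obviously misread text, then decode."""
    text = join_parts(parts)
    bad = find_bad_chars(text)
    if bad:
        raise ValueError(f"non-base64 characters found (offset, char): {bad[:10]}")
    # validate=True makes b64decode raise instead of silently skipping junk.
    return base64.b64decode(text, validate=True)


def read_parts(directory, pattern="*.txt"):
    """Hypothetical helper: read all part files, ordered by file name.

    Assumes zero-padded names (part0001.txt, part0002.txt, ...).
    """
    return [p.read_text() for p in sorted(Path(directory).glob(pattern))]
```

Note that this check only catches characters that are impossible in base64. A misread such as “l” versus “1” stays invisible to it, because both characters are valid, so the decoded file still needs to be verified by other means.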