After we extracted and decoded the files, we now want to use them. So, what are they, how to use them?
Edit: I fixed the links
After my last post, you will end up with a lot of text files. We will now convert these text files to the actual binary files.
To do that we will first have to remove duplicated files. Transcode will sometimes extract one subtitle twice. Use your favourite image viewer (best with automatic search for duplicated files) to identify those. Do not delete them, as it could of course be possible (even if not very probable) that it wasn’t transcode’s fault but they are actually two identical subtitles.
The next step is to join all txt-files and to remove accidentally inserted spaces. A simple
sed -e “s/ //g” subtitles-*txt > base64.txt
entered on the console will do.
Now we have one file containing the base64-encoded source, Read the rest of this entry »
Now we get to the interesting part:
Converting the images to text.
I’m using pgm2txt for this purpose. Using database only mode (“-d”) ensures that we have full control. Using gocr intern heuristics gave many errors, e.g. “l” was read as “1”.
pgm2txt will ask you to enter the display text, whenever it cannot identify the characters. This means in database only mode, you will have to enter every character at least once. That would be a problem, but as the letters aren’t always clearly separated, you will often end up having to enter the text for combinations like “KWY”. Nevertheless, after having trained the db for some time, all you need is patience and cpu power. After some time (1000 pics => several hours) you will end up with a lot of text files, each containing one part of base64 encoded file.