We’ve got the files, but what now?

November 9, 2007

Having extracted and decoded the files, we now want to use them. So what are they, and how do we use them?

At this stage, we take a look at http://www.ilovewillies.com (only available in the Google cache).

Edit: I fixed the links

Read the rest of this entry »

Getting the files

November 8, 2007

If you followed my last post, you now have a lot of text files. We will now convert these text files into the actual binary files.

To do that, we first have to deal with duplicated files. Transcode will sometimes extract one subtitle twice. Use your favourite image viewer (ideally one with an automatic search for duplicated files) to identify those. Do not simply delete every match, though: it is possible (if not very probable) that transcode was not at fault and the two subtitles really are identical.
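If no image viewer with duplicate search is at hand, exact duplicates can also be listed with a few lines of Python. This is only a sketch of a different technique (comparing SHA-256 hashes of the file contents), so it finds byte-identical files only, not images that merely look the same. The function name and the file pattern in the usage part are made up for illustration; adjust the pattern to whatever names transcode produced.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_identical_files(paths):
    """Group files whose bytes are exactly identical; return only groups of 2+."""
    groups = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        groups[digest].append(str(path))
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    # Hypothetical naming pattern, not taken from the post.
    candidates = sorted(Path(".").glob("subtitles-*.pgm"))
    for group in find_identical_files(candidates):
        # Only report the groups; deciding what to delete stays a manual step.
        print("identical:", ", ".join(group))
```

The script only reports groups and deletes nothing, because two identical subtitles may both be legitimate.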

The next step is to join all txt-files and to remove accidentally inserted spaces. A simple
sed -e "s/ //g" subtitles-*txt > base64.txt
entered on the console will do.
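The same join-and-strip step can be sketched in Python, which may help if the file numbers are not zero-padded: the shell expands the wildcard in lexical order, so subtitles-10 would come before subtitles-2. The sketch below sorts by the numbers in the file names instead. That the files are numbered in subtitle order is an assumption, and both function names are invented for this example.

```python
import re
from pathlib import Path


def natural_key(name):
    """Sort key that compares runs of digits as numbers, so 'x-10' follows 'x-9'."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def join_without_spaces(paths):
    """Concatenate the files in natural order and drop every space character."""
    ordered = sorted(paths, key=lambda p: natural_key(Path(p).name))
    text = "".join(Path(p).read_text() for p in ordered)
    # Like the sed call, this removes spaces only and keeps line breaks.
    return text.replace(" ", "")


if __name__ == "__main__":
    files = list(Path(".").glob("subtitles-*txt"))
    Path("base64.txt").write_text(join_without_spaces(files))
```

If the file names are zero-padded, the sed one-liner and this sketch should produce the same base64.txt.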

Now we have one file containing the base64-encoded source, Read the rest of this entry »


Converting images to text

November 5, 2007

Now we get to the interesting part:
Converting the images to text.
I’m using pgm2txt for this purpose. Using database-only mode (“-d”) ensures that we have full control. Using gocr’s internal heuristics gave many errors, e.g. “l” was read as “1”.

pgm2txt will ask you to enter the displayed text whenever it cannot identify the characters. In database-only mode, this means you will have to enter every character at least once. That alone would not be a problem, but as the letters aren’t always clearly separated, you will often end up having to enter the text for combinations like “KWY”. Nevertheless, once you have trained the database for a while, all you need is patience and CPU power. After some time (1000 pictures take several hours) you will end up with a lot of text files, each containing one part of the base64-encoded file.
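Since every text file should contain nothing but base64, a quick sanity check can catch some recognition slips before going further. The sketch below assumes the standard base64 alphabet (A-Z, a-z, 0-9, "+", "/" and "=" for padding) and reports every other character with its position. It cannot detect mix-ups such as “l” versus “1”, because both are valid base64 characters. The function name is made up for this example.

```python
import string

# Standard base64 alphabet plus the padding character (an assumption about the data).
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def invalid_base64_chars(text):
    """Return (line, column, char) for each non-base64 character; whitespace is ignored."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch.isspace():
                continue
            if ch not in BASE64_ALPHABET:
                problems.append((line_no, col, ch))
    return problems
```

Running it over the contents of each text file and looking at the reported positions shows where a subtitle needs to be re-entered by hand.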