Getting the files

If you followed my last post, you now have a lot of text files. We will now convert these text files into the actual binary files.

To do that we will first have to sort out duplicate files. Transcode will sometimes extract one subtitle twice. Use your favourite image viewer (ideally one with an automatic search for duplicate files) to identify those. Do not delete them, though: it is possible (even if not very probable) that it wasn’t transcode’s fault and they really are two identical subtitles.
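If your image viewer has no duplicate search, a small script can at least list candidates for manual review. The following is only a minimal sketch, not part of the original workflow: it assumes the duplicates are byte-for-byte identical, and the function name `find_duplicates` and the `*.pgm` pattern are made up for illustration. It reports groups of identical files and deletes nothing.

```python
import glob
import hashlib
from collections import defaultdict

def find_duplicates(pattern):
    """Group files matching `pattern` by the MD5 of their contents."""
    groups = defaultdict(list)
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        groups[digest].append(path)
    # Keep only the digests that are shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]

# Assumed pattern for the extracted subtitle images; adjust to your file names.
for paths in find_duplicates("*.pgm"):
    print("identical:", " ".join(paths))
```

Two images that look the same but differ in a single pixel will not be reported, so a visual check is still worthwhile.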

The next step is to join all txt-files and to remove accidentally inserted spaces. A simple
sed -e "s/ //g" subtitles-*txt > base64.txt
entered on the console will do.
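For readers without sed, roughly the same step can be sketched in Python. This is an assumed equivalent, not what I actually ran: the function name `join_and_strip` is hypothetical, and the files are joined in sorted name order, which may differ from your shell's glob order in some locales.

```python
import glob

def join_and_strip(pattern, out_path):
    """Concatenate all files matching `pattern`, dropping every space character."""
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as src:
                for line in src:
                    # Like sed's s/ //g: remove spaces only, keep the newlines.
                    out.write(line.replace(" ", ""))

if __name__ == "__main__":
    join_and_strip("subtitles-*txt", "base64.txt")
```

Check that the files come out in the right order before decoding, because base64 data joined in the wrong order decodes to garbage.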

Now we have one file containing the base64-encoded source, the md5sum for that file and the original file name. As I didn’t find a program which could handle these files the way I wanted, I wrote the following Python script to extract the file and check the md5sum:


#!/usr/bin/env python3
import sys, base64, hashlib

def hexToString(md5bin):
    # Render every byte as two hex digits; without the zero padding,
    # bytes below 0x10 would lose their leading zero.
    result = ""
    for ch in md5bin:
        realCh = "%02x" % ch
        result += realCh
    return result

if len(sys.argv) < 2:
    print("""Usage: %s in_b64_enc_file

in_b64_enc_file - The Base64 encoded file to be converted
""" % sys.argv[0])
    sys.exit(0)

f = open(sys.argv[1], 'r')
s = f.readline()
md5string = ""
# The header lines run up to the first empty line.
while (s.strip()):
    sarray = s.strip().split(':')
    if (sarray[0] == "Content-MD5"):
        # The expected checksum is stored base64-encoded.
        md5base64 = sarray[1]
        md5bin = base64.b64decode(md5base64)
        md5string = hexToString(md5bin)
        print("MD5 should be: " + md5string)
    if (len(sarray) < 2):
        # A line without ':' holds the file name as key="value".
        sarray = s.strip().split("=")
        filename = sarray[1].strip('"')
        print(filename)
    s = f.readline()

# Everything after the empty line is the base64 payload.
fout = open(filename, 'wb')
s = f.read()
decoded = base64.b64decode(s)
fout.write(decoded)
f.close()
fout.close()
m = hashlib.md5(decoded)
print("MD5 is: " + hexToString(m.digest()))
print("MD5 matches: " + str(hexToString(m.digest()) == md5string))
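
The checksum comparison works because a Content-MD5 header carries the raw 16-byte MD5 digest encoded as base64, not the hex string that md5sum prints. Here is a small sketch of that conversion using Python 3's hashlib and base64 modules; the helper name `content_md5_to_hex` and the sample payload are made up for illustration.

```python
import base64
import hashlib

def content_md5_to_hex(header_value):
    """Convert a base64 Content-MD5 value to the familiar hex digest."""
    return base64.b64decode(header_value).hex()

# Made-up payload, only to show the round trip.
payload = b"hello world\n"
header_value = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")

# The hex form of the header must equal the digest of the decoded data.
assert content_md5_to_hex(header_value) == hashlib.md5(payload).hexdigest()
print(header_value, "->", content_md5_to_hex(header_value))
```

The hex form is always 32 characters long, so a shorter string is a sign that leading zeros were dropped somewhere.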

If you were not careful enough (like me) and entered a wrong letter, or the right letter in the wrong case, when pgm2txt asked you, you will notice it now. Well, it’s only a question of hours to convert the images to text again :-). If anyone gets the right md5sum for Episode 2 (Barber.z5), please let me know. I always get it wrong even though I am able to use the file.

The next post will talk about the extracted files and what to do with them.


