
What encoding is used to display the contents of a docx file

Post by mhl on Tue Aug 20, 2013 4:17 pm

Hi all. Bit of a newb question... I've googled it to death and come up with nothing, so here I am :)

When I open a docx file in Sublime, it comes up in what I think is hex (with the 0x removed). See sample below.

But I've tried to put it through some online translators and it's not what I'm expecting. Different translators return different values too, so now I'm really confused.... :?

Could anyone confirm what notation it actually is and how I could go about translating it?




Code:
4b01 022d 0014 0006 0008 0000 0021 00a9
c85c aa8c 0000 00da 0000 0013 0000 0000
0000 0000 0000 0000 00aa 8800 0063 7573
746f 6d58 6d6c 2f69 7465 6d31 2e78 6d6c
504b 0102 2d00 1400 0600 0800 0000 2100
1943 a3f0 f202 0000 ab0b 0000 1200 0000
0000 0000 0000 0000 0000 8f89 0000 776f
7264 2f66 6f6e 7454 6162 6c65 2e78 6d6c
504b 0102 2d00 1400 0600 0800 0000 2100
a204 be70 9703 0000 3e32 0000 1400 0000
0000 0000 0000 0000 0000 b18c 0000 776f
7264 2f77 6562 5365 7474 696e 6773 2e78
6d6c 504b 0102 2d00 1400 0600 0800 0000
2100 3def ea7f 8b09 0000 c136 0000 1a00
0000 0000 0000 0000 0000 0000 7a90 0000
776f 7264 2f73 7479 6c65 7357 6974 6845
Last edited by mhl on Tue Aug 20, 2013 8:10 pm, edited 1 time in total.
mhl
 
Posts: 5
Joined: Tue Aug 20, 2013 4:04 pm

Re: How can I decode the hex like contents of a docx file?

Post by iamntz on Tue Aug 20, 2013 4:36 pm

AFAIK docx is an archive. You could try to open it with winrar/7zip/whatever.
iamntz
 
Posts: 910
Joined: Fri Apr 29, 2011 8:52 am
Location: Romania

Re: How can I decode the hex like contents of a docx file?

Post by mhl on Tue Aug 20, 2013 5:13 pm

Hi iamntz, thanks for your help.

I have a corrupt docx document that I'm trying to debug.

Unzipping the archive didn't show anything useful. Opening it in sublime however, did. The endings are different. So I'm trying to work out exactly what those endings mean!

Ending One

Code:
6f72 642f 7374 796c 6573 2e78 6d6c 504b
0506 0000 0000 0b00 0b00


Ending Two

Code:
6f72 642f 7374 796c 6573 2e78 6d6c 504b
0506 0000 0000 0b00 0b00 c102 0000 ed24
mhl
 
Posts: 5
Joined: Tue Aug 20, 2013 4:04 pm

Re: How can I decode the hex like contents of a docx file?

Post by Veedrac on Tue Aug 20, 2013 11:12 pm

mhl wrote: I have a corrupt docx document that I'm trying to debug.

Is it private?
Veedrac
 
Posts: 19
Joined: Sun Aug 18, 2013 4:16 pm

Re: What encoding is used to display the contents of a docx file

Post by mhl on Wed Aug 21, 2013 4:36 pm

No not at all! It won't allow me to upload docx extension files, but I have put them here:

http://fresherandprosper.com/cvsamples/testcv.corrupted.docx

http://fresherandprosper.com/cvsamples/testcv.notcorrupted.docx

However, the trouble I'm having is that my file is corrupted in a slightly different way every time. That's why I'm trying to find a way to decode the endings of files myself.
mhl
 
Posts: 5
Joined: Tue Aug 20, 2013 4:04 pm

Re: What encoding is used to display the contents of a docx file

Post by Veedrac on Wed Aug 21, 2013 6:42 pm

The only difference between the files is that the "corrupted" one ends with an extra "--------" (2d2d 2d2d 2d2d 2d2d).

Additionally, with
mhl wrote: my file is corrupted in a slightly different way every time

I no longer have any semblance of what your purpose is.

If you're trying to decode how things are stored, you might want to look online for a zipfile "spec sheet", because that's what it is.
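
As a rough illustration of what that spec describes: a zip file (and so a docx) normally ends with an "end of central directory" record that starts with the bytes 50 4b 05 06 ("PK" followed by 05 06), which is the `504b 0506` visible in both endings posted earlier. The Python sketch below looks for the last such record and returns whatever comes after it. It assumes the last occurrence of the signature really is that record, and `trailing_bytes_after_zip` is a name made up for this example, not a library function.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # hex 50 4b 05 06
EOCD_FORMAT = "<4sHHHHIIH"       # little-endian fixed part of the record
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)   # 22 bytes


def trailing_bytes_after_zip(data):
    """Return the bytes that follow the end-of-central-directory record.

    An empty result means the data stops exactly where the record
    (plus its optional comment) says it should.
    Hypothetical helper written for this illustration.
    """
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no complete end-of-central-directory record found")
    # Fields: signature, disk number, disk holding the directory,
    # entries on this disk, total entries, directory size,
    # directory offset, comment length.
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    comment_len = fields[7]
    expected_end = pos + EOCD_SIZE + comment_len
    return data[expected_end:]
```

Feeding it the contents of a file opened in binary mode (`open(path, "rb").read()`) should give an empty result for a clean archive and the stray bytes for one that had something appended.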
Veedrac
 
Posts: 19
Joined: Sun Aug 18, 2013 4:16 pm

Re: What encoding is used to display the contents of a docx file

Post by mhl on Thu Aug 22, 2013 1:42 pm

Ok I'll try to be clearer - it's been a long and winding road that's taken me to this point, so I didn't want to go off topic.

I've been trying to post binary files to an API. All appeared to be working fine, except for docx files, that were becoming corrupted during the transfer.

Because the docx files were OK before the transfer, I suspected that I was unintentionally adding extra data to files during the binary post.

That led me to investigate the binary contents of the docx files with Sublime, and to observe extra characters being added to the end (in this case "2d2d 2d2d 2d2d 2d2d").

HOWEVER... each time I post the file, the extra characters are slightly different (it appears to be the same sequence, but truncated in a different position - e.g. "2d2d 2d" or "2d2d 2d2d 2d2d 2d2d").

That's where I'm at now.

I figured it would be useful to translate those endings into ascii. But I'm very newb to this and am not exactly sure what I am translating from. I have seen the code in Sublime referred to as binary, but it clearly isn't binary. It looks like hex, but then there are no "0x" characters in there. :?

My immediate question is: How do you translate 2d2d2d into --- ?
mhl
 
Posts: 5
Joined: Tue Aug 20, 2013 4:04 pm

Re: What encoding is used to display the contents of a docx file

Post by ToddFiske on Thu Aug 22, 2013 7:09 pm

You need to look at an ASCII chart, and learn a little bit more about hexadecimal and other numbering systems. You might also do well to find a binary editor or file viewer to look at your files with.

2D is the hex representation of the ASCII value for a hyphen, 45 in decimal.

I have seen the code in Sublime referred to as binary, but it clearly isn't binary. It looks like hex, but then there are no "0x" characters in there.


"Binary" is often used in a general way to mean anything that isn't text. However everything on a computer is represented in binary at some point, even text. Everything is represented in memory as 0 and 1 (or -5 and +5 volts). Hex is commonly used to show binary data since it would be extremely cumbersome to actually show everything in base 2 (eg, 2D vs 00101101).

The leading 0x characters are only necessary when writing hexadecimal values in a programming language or other places to distinguish them from text or numbers in other bases. 0x2D and 2d mean the same thing in this case. The "binary" view of a file in SublimeText shows the contents as hexadecimal words (16-bit values). Adding all of the "0x" characters would just be noise so this is left off in these types of displays.

About these two files, I looked at them in a file comparison program (BeyondCompare from Scooter Software) and saw that the corrupted version ends with a line feed (0x0A) and 9 hyphens. Something in your process is appending these bytes. Do you print "\n---------" anywhere in your code? Your two files are 10180 and 10190 bytes for the non-corrupted and corrupted versions respectively. You may be seeing different amounts on different files because something is padding them up to a multiple of the file's size. Look at different sized input files and compare the amounts of hyphens that show up and see if you find any pattern.
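
If no comparison program is handy, a few lines of Python can do the same check. This is only a sketch: it assumes the received file is the original with extra bytes added at the end, and `appended_bytes` is a name invented for the example.

```python
def appended_bytes(original, received):
    """Return the bytes added to the end of `received`.

    Returns None if the two byte strings already differ somewhere
    before the end of `original`, i.e. the damage is not a plain append.
    """
    if received[:len(original)] != original:
        return None
    return received[len(original):]


# Stand-in data; with real files, read both with open(path, "rb").read().
original = b"PK\x05\x06" + b"\x00" * 18
received = original + b"\n" + b"-" * 9
extra = appended_bytes(original, received)
print(len(extra), repr(extra))
```

With the stand-in data the extra part is 10 bytes long, one line feed plus 9 hyphens, which matches the 10180 versus 10190 byte sizes mentioned above.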

What kind of API is it, and what programming language are you using?

You can translate a hex value into its corresponding character by entering something like this in the SublimeText console (press Ctrl+`):

Code:
    s = '2D'
    print(chr(int(s, 16)))  # hex string -> character: prints -

    # or
    n = 0x2D
    print(n)       # 45
    print(chr(n))  # -

    # going the other way
    s = '-'
    print(ord(s))       # 45
    print(hex(ord(s)))  # 0x2d
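
The same conversion works on a whole row of the dump. A small sketch, assuming Python 3 (`bytes.fromhex` is not available in a Python 2 console); `decode_hex_dump` is a made-up helper name, and any byte outside printable ASCII is shown as a dot:

```python
def decode_hex_dump(dump):
    """Turn a string such as '2d2d 2d2d' into (raw bytes, readable text)."""
    hex_digits = "".join(dump.split())   # drop spaces and newlines between groups
    raw = bytes.fromhex(hex_digits)
    text = "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
    return raw, text


# A row taken from the first post in this thread.
raw, text = decode_hex_dump("776f 7264 2f66 6f6e 7454 6162 6c65 2e78 6d6c")
print(text)   # word/fontTable.xml

# The stray ending from the corrupted file.
raw, text = decode_hex_dump("2d2d 2d")
print(text)   # ---
```

The spaces in the dump are only there for display, so they are stripped before converting.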
ToddFiske
 
Posts: 38
Joined: Wed Nov 04, 2009 10:43 pm

Re: What encoding is used to display the contents of a docx file

Post by mhl on Fri Aug 23, 2013 6:30 pm

Hi Todd,

Thank you ever so much for the explanations, each with the perfect level of detail. Bang on, much appreciated.

The post is to a job board API and we're sending over resumes. We're working with (classic) ASP. We did have \n------ set as part of a mime boundary - that was fixed along with another issue, which is otherwise not relevant.
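
For reference, this is roughly how a boundary frames a file in a multipart/form-data body. It is a generic Python sketch, not the ASP code from this thread; the field name, file name and boundary string are placeholders invented for the example, and both helper functions are hypothetical.

```python
CRLF = b"\r\n"


def build_multipart(field, filename, payload, boundary):
    """Build a multipart/form-data body containing a single file part."""
    delimiter = b"--" + boundary
    headers = (
        b'Content-Disposition: form-data; name="' + field
        + b'"; filename="' + filename + b'"' + CRLF
        + b"Content-Type: application/octet-stream" + CRLF
    )
    # Layout: delimiter, part headers, blank line, raw file bytes,
    # then the closing delimiter. The CRLF in front of the closing
    # delimiter belongs to the delimiter, not to the file.
    return (delimiter + CRLF + headers + CRLF + payload
            + CRLF + delimiter + b"--" + CRLF)


def extract_payload(body, boundary):
    """Recover the file bytes from a body built by build_multipart."""
    delimiter = b"--" + boundary
    start = body.index(CRLF + CRLF) + len(CRLF + CRLF)
    end = body.rindex(CRLF + delimiter + b"--")
    return body[start:end]


boundary = b"example-boundary-123"            # placeholder value
payload = b"PK\x05\x06" + b"\x00" * 18        # stand-in for the docx bytes
body = build_multipart(b"resume", b"testcv.docx", payload, boundary)
print(extract_payload(body, boundary) == payload)
```

If the sender and the receiver disagree about the delimiter (for example, the boundary written in the body differs from the one announced in the Content-Type header), the line break and hyphens in front of it can end up attached to the file, which would look like the trailing `2d` bytes seen earlier.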

You may be seeing different amounts on different files because something is padding them up to a multiple of the file's size. Look at different sized input files and compare the amounts of hyphens that show up and see if you find any pattern.


It ALMOST seems like that. After fixing the mime boundary issue, we now have the seemingly random addition of null / zero padding onto the end of files. Except that the length of the padding can differ when sending across the exact same file (with the exact same code). I looked for patterns in the length of extra padding, but I couldn't see anything obvious. For example the first attempt had 14 lots of "00". Press refresh and the next send had 8. The next had 23. And so on.

The destination URL for the post is https, and a friend suggested that our server recognised this and was adding the random null padding as part of the encryption. This sounds kind of unlikely, but I don't have any better suggestions.

FWIW there is also an upload form to add files to the API. That works perfectly. So I think it's something OUR server is doing as it sends. I'm all out of ideas though. If anybody can suggest a likely avenue for investigation...
mhl
 
Posts: 5
Joined: Tue Aug 20, 2013 4:04 pm

Re: What encoding is used to display the contents of a docx file

Post by ToddFiske on Fri Aug 23, 2013 9:54 pm

It sounds like it's most likely in your code somewhere since APIs and other systems don't generally do random things like that. However, I think we may be getting outside the scope of SublimeText so if you like you can email me about it. I sent my details in a PM.
ToddFiske
 
Posts: 38
Joined: Wed Nov 04, 2009 10:43 pm

