29 January 2013

Searching for NULL: Making hg and git recognize text as text.


Recently, my boss sent me her version of the LaTeX source for the paper we're working on together, and I entered it into our Mercurial repository. (Why she isn't using Hg herself is a story for another time, but doing her commits for her isn't hard; I just make sure to track which commit is the parent.) I wanted to see what changes she had made, so I ran "hg diff" and was informed that I was comparing a binary file. I tried both "traditional" diff and git-diff, and both treated the file as binary as well. Yet I could still open the file in a text editor without any problems.

I had read in various places that the BOM on UTF-16 could cause problems, so I made sure I was saving the file as UTF-8 without a BOM (UTF-8 is a must for us, since we have some German examples with umlauts and the Eszett, ß). I was growing increasingly frustrated and was about to damn the torpedoes and commit anyway, since the diffs are calculated and stored the same way whether or not the file is binary and only the display is adjusted. Then I read that the presence of a NULL byte (NUL, 0x00) is one of the ways a file is determined to be binary. So I found a way to remove the NULLs, and everything worked as desired.

Still, I was curious about where exactly the NULLs were in this file. I searched a bit more, found a way to grep for them, and it turned out that a NULL was being inserted as the very last byte of the file. Whether this was an issue with the text editor on my boss's end or a by-product of encoding 8-bit formats into 7 bits for email, I don't know. Her emails are usually encoded in Western ISO 8859-1, which would mean that two different 8-bit formats were being encoded into 7-bit ASCII. Anyway, here's a summary of ways to deal with NULL in plain text files.
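One way to do the two steps described above (find where the NULLs are, then strip them) is a few lines of Python. This is only a minimal sketch, not necessarily the tools I used at the time: the function names are made up for illustration, and it assumes the file is small enough to read into memory and is meant to be UTF-8 (or another encoding where 0x00 never appears inside a legitimate character).

```python
from pathlib import Path


def find_nul_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every NUL (0x00) byte in data."""
    return [i for i, byte in enumerate(data) if byte == 0]


def strip_nul(data: bytes) -> bytes:
    """Return data with every NUL byte removed.

    Assumption: the text is UTF-8 (or ASCII / ISO 8859-1), where 0x00 only
    ever encodes the NUL character itself. Do not use this on UTF-16 files.
    """
    return data.replace(b"\x00", b"")


def clean_file(path: str) -> list[int]:
    """Strip NUL bytes from the file in place; return the offsets found."""
    target = Path(path)
    data = target.read_bytes()
    offsets = find_nul_offsets(data)
    if offsets:  # only rewrite the file if there is something to remove
        target.write_bytes(strip_nul(data))
    return offsets
```

Calling `clean_file("paper.tex")` (a hypothetical filename) returns the list of offsets where NULLs were found. If the only entry is the file size minus one, the stray byte was the very last one, as in my case. Working on raw bytes rather than decoded text means the umlauts and ß pass through untouched.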
