29 January 2013

Searching for NULL: Making hg and git recognize text as text.


Recently, my boss sent me her version of the LaTeX source for the paper we're working on together, which I then proceeded to enter into the Mercurial repository. (Why she herself isn't using Hg is a story for another time, but it's also not that hard to do her commits for her; I just make sure to track which commit is the parent.) I wanted to see what changes she had made, so I ran "hg diff" and was informed that I was comparing a binary file. I tried both "traditional" diff and git-diff, and both treated the file as binary. However, I was still able to open the file in a text editor without any problems.

I had read in various places that the BOM on UTF-16 could cause problems, so I made sure that I was saving the file as UTF-8 without a BOM (UTF-8 is a must for us since we had some German examples with umlauts and the Eszett, ß). I was growing increasingly frustrated and was about to just damn the torpedoes and commit anyway -- the diffs are calculated and stored the same way, regardless of whether or not the file is binary; only the display is adjusted -- when I read that the presence of a NULL byte (0x00) is one of the ways that a file is determined to be binary. So I found a way to remove the NULLs and everything worked as desired.

Still, I was kinda curious about where exactly the NULLs were in this file. So I searched a bit more and found a way to grep for them, and it turns out that a NULL was being inserted as the very last byte of the file. Whether this was an issue with the text editor on my boss's end or a by-product of encoding 8-bit formats into 7 bits for email -- especially given that her emails are usually encoded in Western ISO 8859-1, which means that two different 8-bit formats were being encoded into 7-bit ASCII -- I don't know. Anyway, here's a summary of ways to deal with NULL in plain text files.
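One way to locate and strip NULL bytes, sketched here in Python rather than with grep. This is only an illustration of the idea: the function names and the sample text are made up for this example, and the "contains a NULL" check is a simplification of the heuristic described above, not the exact test that hg or git runs.

```python
from typing import List


def find_nul_offsets(data: bytes) -> List[int]:
    """Return the byte offset of every NULL (0x00) byte in data."""
    return [offset for offset, byte in enumerate(data) if byte == 0]


def looks_binary(data: bytes) -> bool:
    """Simplified heuristic: call the data binary if it contains a NULL."""
    return b"\x00" in data


def strip_nuls(data: bytes) -> bytes:
    """Return a copy of data with every NULL byte removed."""
    return data.replace(b"\x00", b"")


# Hypothetical sample: UTF-8 text with a stray NULL as the very last byte.
sample = "Straße\n".encode("utf-8") + b"\x00"

offsets = find_nul_offsets(sample)   # where the NULLs are
cleaned = strip_nuls(sample)         # the same text without them
```

For a real file, read it with open(path, "rb").read(), pass the bytes through strip_nuls, and write the result back in binary mode. Working on bytes instead of decoded text means the UTF-8 umlauts and ß pass through untouched.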

Diffs for LaTeX in Version Control

I use Mercurial to track changes in my LaTeX documents. While there's latexdiff and the older texdiff to produce a conveniently marked up difference document (like Track Changes in Word or OpenOffice), those depend on having both versions available at the same time -- a bit of a pain when using version control. You have to update to the old version, rename it, update to the new version and then compare them -- far from trivial for documents with many files. Well, now there's a convenient utility to do that for you with Mercurial and Git, scm-latexdiff. Check it out.
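For a single file, the manual steps can be shortened a little by using "hg cat -r" to pull each revision out of the repository instead of updating the working copy twice. The following is a rough sketch of that idea in Python, not how scm-latexdiff itself works: the function names are invented for this example, it assumes hg and latexdiff are on the PATH and that it is run inside the repository, and it does not follow \input or \include files, which is exactly the multi-file case where a dedicated tool earns its keep.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def hg_cat_command(path: str, rev: str) -> List[str]:
    """Build the command that prints `path` as it was at revision `rev`."""
    return ["hg", "cat", "-r", rev, path]


def latexdiff_between(path: str, old_rev: str, new_rev: str) -> bytes:
    """Return latexdiff's marked-up LaTeX for one file between two revisions."""
    with tempfile.TemporaryDirectory() as tmp:
        old_file = Path(tmp) / "old.tex"
        new_file = Path(tmp) / "new.tex"
        # Write both revisions to temporary files; the working copy is untouched.
        for rev, target in ((old_rev, old_file), (new_rev, new_file)):
            result = subprocess.run(
                hg_cat_command(path, rev), check=True, capture_output=True
            )
            target.write_bytes(result.stdout)
        # latexdiff writes the difference document to standard output.
        diff = subprocess.run(
            ["latexdiff", str(old_file), str(new_file)],
            check=True,
            capture_output=True,
        )
        return diff.stdout
```

Something like Path("diff.tex").write_bytes(latexdiff_between("paper.tex", "10", "tip")) would then give a file to run through LaTeX; the file name and revision numbers here are placeholders.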



Pushing all bookmarks in mercurial