31 May 2013

Speed up R (on Linux)

You can drastically speed up R in many cases by using a better/more tuned BLAS (linear algebra library) implementation. ATLAS (Automatically Tuned Linear Algebra Software) is one such option, but the basic compiled version distributed with Debian and Ubuntu won't help you out much -- you get the most benefit when you compile it yourself so that it's optimally tuned for your system. (This is also why Debian stopped distributing semi-optimized builds as binary packages.)

No worries though, installing ATLAS from source isn’t hard, and Debian and Ubuntu even maintain a source package.

On Ubuntu and Debian, installing ATLAS goes like this:

user@localhost:~$ sudo apt-get build-dep atlas
user@localhost:~$ sudo apt-get install build-essential dpkg-dev cdbs devscripts gfortran liblapack-dev liblapack-pic
user@localhost:~$ sudo apt-get source atlas
user@localhost:~$ cd atlas*
user@localhost:~$ sudo fakeroot debian/rules custom

If you get a warning about cpufreq not being set to performance mode, then you’ll have to change that for each CPU (numbered from 0). This will turn off frequency-scaling, which offers a speed boost in and of itself, but perhaps isn’t the best option for laptops as it increases power consumption.

Set n equal to the number of CPUs (including virtual HT CPUs) minus one. Note that n has to be a literal number here, since bash brace expansion won't expand a variable:

user@localhost:~$ for i in {0..n}; do sudo cpufreq-set -g performance -c $i; done
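
If you'd rather not count CPUs by hand, something like the following should also work (a sketch assuming the coreutils nproc utility is available; cpufreq-info, from the same package as cpufreq-set, will show you the current governor):

user@localhost:~$ for i in $(seq 0 $(($(nproc) - 1))); do sudo cpufreq-set -g performance -c $i; done
user@localhost:~$ cpufreq-info | grep governor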

After you set your CPU to performance mode, you can try again:

user@localhost:~$ sudo fakeroot debian/rules custom
user@localhost:~$ sudo apt-get install libatlas-base-dev libatlas-base
user@localhost:~$ sudo dpkg -i ../libatlas3gf-*.deb

I recommend using tab completion instead of wildcards (the built packages land in the parent directory), but either way ATLAS should now be installed. You can switch back and forth between different BLAS implementations (and see which one is active) with:

user@localhost:~$ sudo update-alternatives --config libblas.so.3

There are 2 choices for the alternative libblas.so.3 (providing /usr/lib/libblas.so.3).
Selection    Path                                    Priority   Status
------------------------------------------------------------
* 0            /usr/lib/atlas-base/atlas/libblas.so.3   35        auto mode
  1            /usr/lib/atlas-base/atlas/libblas.so.3   35        manual mode
  2            /usr/lib/libblas/libblas.so.3            10        manual mode

Press enter to keep the current choice[*], or type selection number: 

I was happy with auto-selection favoring ATLAS, so I just hit enter.
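
If you want to double-check which BLAS R will actually load, one quick way (assuming the Debian/Ubuntu build of R, which ships a shared libR.so) is:

user@localhost:~$ ldd $(R RHOME)/lib/libR.so | grep blas

The path it prints should match the alternative selected above.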

Now for the benchmarks using large matrix multiplication in R.

Before:

> a = matrix(rnorm(5000*5000), 5000, 5000) 
> b = matrix(rnorm(5000*5000), 5000, 5000) 
> system.time( a%*%b )
   user  system elapsed 
191.668   0.108 191.828 

After:

> a = matrix(rnorm(5000*5000), 5000, 5000) 
> b = matrix(rnorm(5000*5000), 5000, 5000) 
> system.time(a%*%b)
   user  system elapsed 
 34.726   0.200  17.687

That's more than a 10x speedup in elapsed time! (Note that user time now exceeds elapsed time: the tuned build is running the multiplication across multiple threads, which is part of where the win comes from.)

Sources

  • http://packages.debian.org/unstable/libatlas3-base
  • http://anonscm.debian.org/viewvc/debian-science/packages/atlas/tags/3.8.4-2/README.Debian?revision=38541&view=markup
  • http://wiki.debian.org/DebianScience/LinearAlgebraLibraries
  • https://gist.github.com/palday/5685150

02 May 2013

Totally Open Science: A Proposal for a New Type of Preregistration


Methodological rigor has been the center of a growing debate in the behavioral and brain sciences. A big problem thus far is that we've largely only published results. Preregistration forces us to publish methods and hypotheses ahead of time, which can help curb p-value hacking, post-hoc storytelling and the "file drawer" approach to dealing with negative or unwanted results. Even prominent journals like Cortex are getting in on preregistration with a publication guarantee, effectively focusing peer review on methods and hypotheses and not on "interesting" results. Some journals, including Cortex in its new initiative, also require data sharing via public hosting services like FigShare.


I want to go one step further and suggest that it's time to share data, method and process. I want every box in that little chart covered, and moreover, I want to be able to look at how we get from box to box.

Preregistration is great and should help us to avoid a lot of post-hoc tomfoolery. But preregistration is difficult to use for certain types of exploratory or simulation-based research. While the reporting of incidental results is still allowed under certain forms of preregistration (including the Cortex model), purely exploratory studies, including iterative simulation-development studies, don't fit in well with preregistration. The Neuroskeptic agrees that such results don't fit the preregistration model per se, but should be marked as exploratory (perhaps implicitly via their missing registration) so that it's clear that any interesting patterns could be the result of careful selection:
"We all know that any such finding might be a picked cherry from a rich fruit basket."
By opening up process, we can still learn a lot about the fruit left in the basket.

The following is my proposal for a variant of preregistration compatible with exploratory and simulation-based research. It is based on open-access and open-source principles and will discourage the post-hoc practices that lead to unreliable results. The key idea is transparency at every step: making the context of an experiment and an analysis available and apparent not only encourages "honesty" from individual researchers in their claims, but also allows us to get away from the binary world of "significance". This is just an initial proposal, so I won't go into all the details, and I am aware that there are a few kinks to work out.

Beyond Preregistration: Full Logging


My basic proposal is this: public, append-only tracking of research process and iteration via distributed version control systems like Mercurial and Git. In essence, this is a form of extensive, semi-automated logging/snapshotting. For the individual user, this also has the nice advantage of letting you go back in time to older versions, compare different versions, and even track down inconsistencies between analyses.
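
If you haven't used a DVCS before, a few everyday commands give the flavor (Git shown here, and Mercurial has close equivalents; the file name is just a hypothetical example):

user@localhost:~$ git log --oneline analysis.R       # the full history of one analysis script
user@localhost:~$ git diff HEAD~5 -- analysis.R      # what changed over the last five commits
user@localhost:~$ git checkout HEAD~5 -- analysis.R  # restore that older version into the working copy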



The initial entry in the log should clearly state whether the study is confirmatory or exploratory. Whether a study is simulation-based is orthogonal to the confirmatory/exploratory distinction: if you're just testing whether a new model fits the data reasonably well, then you should define ahead of time what you mean by "reasonably well" and test that as you would any hypothesis in a real-world experimental investigation. If you're trying to develop a model/simulation in the first place and just want to see how good you can make it for the data at hand, then that is exploratory research and should be marked as such. Texas sharpshooting is just as problematic, if not more so, in simulation-based research as in research in the real world.

This should then dovetail nicely with a Frontiers-type publishing model with an iterative, interactive review. The review process would just be part of the log.

Context and Curiosity


A fundamental problem with our statistics is that we think in binary: "significant" or "not significant" and often completely ignore the context and assumptions of statistical tests. Even xkcd has touched upon many of the common issues in understanding statistics. Many issues arise from post-hoc thinking, and this is what preregistration tries to prevent. Post-hoc thinking violates statistical assumptions. Odds are there are some interesting patterns in your data that occurred by chance. If you test them after you've already seen that they're there, then you're begging the question. If you report that you found this pattern while doing something else, then it presents a direction for future research. But if you present that pattern as the one you were looking for all along, then you've violated the assumption of randomness that null hypothesis testing is based on.

By recording the little steps, we give our data the full context needed to understand and interpret them, even if they are "just" exploratory data. Exploratory data has a different context, and it's that context -- not some label like "significant" -- that we need to fully evaluate a result.

Isaac Asimov supposedly once said that "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka' but 'That’s funny...'".  Even if serendipitous success is the exception and not the rule, we need a forum to get all the data we have out in the open in a way that doesn't distort its meaning.

Some Details 


The following gets a tad more technical, but should make my idea a bit more concrete. There are a lot more details that I have given serious thought to, but won't address here.  

Implementation


More precisely, I'm suggesting something like GitHub or Bitbucket, but with the key difference that history is immutable and repositories cannot be deleted (to prevent ad-hoc mutation via deletion and recreation). The preregistration component would be the initial commit, in which a README-type document would outline the plan.

For confirmatory research, this README would follow the same form as preregistration. (Indeed, the initial commit could even be done automatically following a traditional, web-based preregistration form.) For exploratory research (e.g. mining data corpora for interesting patterns as hints for planning confirmatory research), the README would be a description of the corpora (or a description of the planned corpora), including the planned size (i.e. test subjects and trials) of the corpus (optional stopping is bad). For simulation-based research, the README would include a description of the methodology for testing goodness of fit as well as an outline of the theoretical background being implemented computationally (lowering the bar for your model post hoc is bad). Exploratory dead ends would be apparent as unmerged branches.
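
To make that concrete, here is a minimal sketch of what the initial, preregistration-style commit might look like (Git shown here; the repository name, remote name and commit message are hypothetical examples):

user@localhost:~$ git init my-study && cd my-study
user@localhost:~$ $EDITOR README                  # hypotheses, design, planned N, analysis plan
user@localhost:~$ git add README
user@localhost:~$ git commit -m "Preregistration: hypotheses, design, planned sample and analyses"
user@localhost:~$ git push preregistry master     # the central server would refuse rewritten history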

As stated above, this should tie nicely into a Frontiers-type publishing model with an iterative, interactive review. Publications coming from a particular experimental undertaking would have to be included in the repository (or a fork thereof, if you're analyzing somebody else's data), which would both make it clear when somebody's been double-dipping and quickly give an overview of all relevant publications. As part of this, all the scripts that go into generating figures should be present in the repository. This of course requires that you write a script for everything, even if it's just one line to document exactly what command-line options you gave.

The open repository nature also supports reproducibility via extensive documentation/logging and openness of the original data. The latter is also important for "warm-ups" to confirmatory research: getting a good preregistration protocol outlined often requires playing with some existing data beforehand to work out kinks in the design. For exploratory and simulation-based work, everything is documented: you know what was tried out, what did and didn't work, as well as both the results of statistical tests and their context -- all of which is required to figure out useful future directions.

A Few Obvious Concerns


Now, there are a few obvious problems that need to be addressed, despite my trying to avoid too many details.
  1. The log given by the full repository is far too big to be reviewed in its entirety. This is certainly true, but a full review should rarely be necessary, and the presence of such data would both discourage dishonesty and provide a better means to track it down. Of course, this assumes that people publish the intermediate steps where data were falsified or distorted, but then again, the presence of large, opaque jumps in the log would also be an indicator of something odd going on. ("Jumps" in the logical sense and not necessarily in the temporal sense. Commit messages can and should provide additional context.) For more general perusal or tracking down a particular error source, there are many well-known methods for finding a particular entry -- and many of them are already part of both Mercurial and Git!
  2. Data often comes in large binary formats. I know, neither Mercurial nor Git does too well with large binary files; however, the original data files (measurements) should be immutable, which means that there will be no changes to track. Derived data files (e.g. filtered data in EEG research -- anything that can be generated from the original measurements) should not be tracked, but their creation should be trivial if all support scripts are included in the repository (see the ignore-file sketch after this list). This will also reduce the amount of data that has to be hosted somewhere.
  3. Even if we get people to submit to this idea, they can still lie by modifying history before pushing to the central repository. I don't have a full answer to this yet beyond "cheating is always possible, but this system should make it harder." Even under traditional preregistration, it's still possible to cheat by playing with the time stamps on your files and preregistering afterwards. Non-trivial, but possible. And so it is here. However, as pointed out above, the form of the record should also give some indication that something fishy is going on. Moreover, the initial commit reduces to traditional preregistration in the case of confirmatory research. Finally, this approach is about getting everything out in the sunlight; it does not guarantee publication if, for example, there is a fundamental flaw in your methodology. But the openness may allow somebody to comment and help you before you've gone too far off the path!
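
Regarding point 2, a minimal sketch of what such an ignore file might look like (Git syntax; the directory names and file extensions are made-up examples, not a standard):

# .gitignore -- derived files that can be regenerated from the raw data plus the tracked scripts
derived/
*.filtered.set
figures/*.pdf
# the immutable raw measurements (e.g. data/raw/) stay tracked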

Open (for Comments)


More so than even with traditional preregistration, the system proposed above should encourage and enforce a radical openness in science. For the edge cases of preregistration (exploratory and simulation-based work), you can escape some of the rigidity of preregistration, but at a heavy price: everything is open, and it is very clear that your work is exploratory -- indeed, it's clear exactly when you found interesting data. It's clear when you find something after a long fishing expedition, which means it's clear that the result is to be taken with a grain of salt. But it also provides an unbelievably open format for showing people interesting patterns in the data -- patterns which potentially support existing research but also demand further investigation with a more focused experiment.

It's not science if it's not open.

(Special thanks to Jona Sassenhagen for his extensive feedback on previous drafts and long discussions in the office. Long a fan of preregistration, he has already made a first foray into the system proposed here, which you can find on GitHub.)

19 February 2013

Backups: Your Relationship to Your Data and Your IT Guy

My plan for this blog was a series of tips, tricks and suggestions for good practices in using the various pieces of technology I work with on a daily basis, as well as my thoughts on using those to support good practices in science. Well, today I have a few tips not just on the technology side, but also on the social side, all inspired by an email I got this morning (loosely translated and anonymized):

Good Morning [my name misspelled], 
I'm turning to you, because I don't know where else to go. I somehow -- I really don't know how -- managed to delete my pictures folder, and no, I don't have a Time Machine backup... [everything is gone, list of important events whose pictures are gone]
I tried the trial version of Data Rescue 3 and saw that the pictures are still "there", but the trial version will only rescue a small amount of data. I really can't afford the 50€ to buy the program at the moment, and who knows how often you [impersonal -- this is clear in the original] really need it. I'll definitely start using Time Machine now!
I've asked around and don't know anybody who has such a program. Can you help me? 
It would be really great if you could, because those are really unique and precious memories for me. I'll gladly make sure that you get good, strong coffee this month and next. [This last part sounded better in the original, but the literal meaning is correct.]
Best,
Anna [name changed]

Now, the person in question is a passing acquaintance who worked as the student assistant for a workgroup on the same floor as mine, where a few of my friends work, and the computer in question is a personal machine. Finally, I'm not actually an IT person in terms of my contract; I was just more or less drafted into the role because I can do it and generally enjoy working with computers.

So that's the baseline information. Now on to what we can learn from all this.  I'm going to discuss:
  1. How deletion works and why programs like Data Rescue 3 can (sometimes) undelete
  2. What this means if you find yourself needing such a program
  3. Why you should still be using real backup software, and a few recommendations on that front (i.e. there is no excuse for not backing up given the utilities built into modern OSes)
  4. What we can learn from Anna's experience in terms of dealing with your IT guy (and for the IT guys: how to not come off as a jerk yet not get abused by coworkers) 
This is clearly going to be a long one…

29 January 2013

Searching for NULL: Making hg and git recognize text as text.


Recently, my boss sent me her version of the LaTeX source for the paper we're working on together, which I then proceeded to enter into the Mercurial repository. (Why she herself isn't using Hg is a story for another time, but it's also not that hard to do her commits for her; I just make sure to track which commit is the parent.) I wanted to see what changes she had made, so I did "hg diff" and was informed that I was comparing a binary file. I tried both "traditional" diff and git-diff, and both attempted to handle the file as binary. However, I was still able to open the file in a text editor without any problems.

I had read in various places that the BOM on UTF-16 could cause problems, so I made sure that I was saving the file as UTF-8 without a BOM (UTF-8 is a must for us, since we have some German examples with umlauts and the eszett ß). I was growing increasingly frustrated and was about to just damn the torpedoes and commit anyway -- the diffs are calculated and stored the same way regardless of whether or not the file is binary; only the display is adjusted -- when I read that the presence of a NULL byte is one of the ways a file is determined to be binary. So I found a way to remove NULL, and everything worked as desired.

Still, I was kind of curious about where exactly the NULLs were in this file, so I searched a bit more and found a way to grep for them. It turns out that a NULL was being inserted as the very last byte in the file. Whether this was an issue with the text editor on my boss's end or a by-product of encoding 8-bit formats into 7 bits for email -- especially given that her emails are usually encoded in Western ISO 8859-1, which means that two different 8-bit formats were being encoded into 7-bit ASCII -- I don't know. Anyway, here's a summary of ways to deal with NULL in plain text files.
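
For reference, here's one way to do both the finding and the removing (assuming GNU grep with PCRE support and the coreutils tr; the file name is just an example):

user@localhost:~$ grep -cP '\x00' paper.tex             # count lines containing a NULL byte
user@localhost:~$ tr -d '\000' < paper.tex > clean.tex  # strip every NULL byte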

Diffs for LaTeX in Version Control

I use Mercurial to track changes in my LaTeX documents. While there's latexdiff and the older texdiff to produce a conveniently marked up difference document (like Track Changes in Word or OpenOffice), those depend on having both versions available at the same time -- a bit of a pain when using version control. You have to update to the old version, rename it, update to the new version and then compare them -- far from trivial for documents with many files. Well, now there's a convenient utility to do that for you with Mercurial and Git, scm-latexdiff. Check it out.
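
For comparison, the manual dance (or a slightly less painful variant using hg cat) looks roughly like this (a sketch; the revision number and file names are just examples):

user@localhost:~$ hg cat -r 42 paper.tex > paper-old.tex    # pull the old revision out without updating
user@localhost:~$ latexdiff paper-old.tex paper.tex > paper-diff.tex
user@localhost:~$ pdflatex paper-diff.tex                   # compile the marked-up changes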



Pushing all bookmarks in mercurial