Archive for May, 2012

A course in statistical programming

25 May 2012

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses tend to focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) and not on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

  • be self-sufficient
  • get the right answer
  • document what you did (so that you will understand what you did 6 months later)
  • if primary data change, be able to re-run the analysis without a lot of work
  • make your simulation results reproducible
  • reuse of code (others’ and your own) rather than starting from scratch every time
  • make methods accessible to (and used by) others
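
As a small illustration of the simulation-reproducibility principle, here is a minimal sketch in Python (not the R one would use in the course itself). The function name `simulate_mean` and the seed value are hypothetical, chosen only for the example; the point is just that a simulation driven by an explicitly seeded generator gives the same answer every time it is re-run.

```python
import random
import statistics

def simulate_mean(n_draws, seed):
    """Return the mean of n_draws standard normal draws from a seeded generator."""
    # A private generator keeps the result independent of any global random state.
    rng = random.Random(seed)
    return statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_draws))

# Hypothetical seed; any fixed, recorded integer would do.
first = simulate_mean(1000, seed=20120525)
second = simulate_mean(1000, seed=20120525)
print(first == second)  # prints True: same seed, same result
```

Recording the seed alongside the results (in the script or the report) is what lets you, or someone else, regenerate the same numbers six months later.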

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

  • Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
  • Emacs/vim/other editors (RStudio/Eclipse)
  • LaTeX (for papers; for presentations)
  • slides for talks; posters; figures/tables
  • Advanced R (fancy data structures; functions; object-oriented stuff)
  • Advanced R graphics
  • R packages
  • Sweave/asciidoc/knitr
  • minimal Perl (or Python or Ruby); example of data manipulation
  • Minimal C (or C++); examples of speed-up
  • version control (e.g., git or mercurial); backups
  • reproducible research ideas
  • data management
  • managing projects: data, analyses, results, papers
  • programming style (readable, modular); general but not too general
  • debugging/profiling/testing
  • high-throughput computing; parallel computing; managing big jobs
  • finding answers to questions: man pages; documentation; web
  • more on visualization; dynamic graphics
  • making a web page; HTML & CSS; simple CGI-type web forms?
  • writing and managing email
  • managing references to journal articles


24 May 2012

The other day, I was sitting on campus, in the woods overlooking Lake Mendota, reading a journal article, and a campus police officer came up to me.

The conversation went something like this:

Police: What’s going on?
Me: I’m enjoying the day.
Police: Are you staff?
Me: I’m faculty.
Police: I’m enjoying the day, too.

And then he walked off.

I’m not sure who was more suspicious, him of me or me of him. Perhaps he was thinking through the Prof or Hobo quiz.

What should I do badly?

23 May 2012

One of the more painful things to learn as a faculty member is to accept that you will do some things badly. There are too many things to do, so you can’t do them all well. What should you do badly?

For an undergraduate, the scope of work is reasonably well defined, and problem sets have relatively simple solutions. I finished things on time or even early, and I seldom felt guilty for not studying.

In the first year of graduate school, problem sets are much harder, but they still all have solutions. In attempting an exam problem, the most important thing to recognize is generally that it must have a short solution.

Later in graduate school, identifying a solvable problem becomes part of the research effort, but you can focus all of your effort on that work and so do everything well. For me, there was little need for time management, since I had loads of time.

Even as a postdoc, when I’d started to do many more things at once, I still had few responsibilities aside from my own work, and so I could do everything well. And basically what you need to get done is just defined as what you actually get done.

Faculty life is quite a bit different. I’m making real commitments to people. And then there’s teaching, and reviews of papers and grants, and committees, and letters of recommendation or evaluation. My personal research efforts often are pushed aside in favor of rush-rush collaborative work. Or I’ll spend weeks doing little but read and comment on others’ work.

There’s just too much to do and not enough time to do it as well as you’d like.

So, what should I do badly?

Well, definitely not my own research papers; those will outlast me. Maybe some day someone will read them. Similarly, not the seminars I give: through such research talks, I may gain a reader. I suppose family, health, and sleep should be mentioned here. And biking.


  • Committee work. If someone says, “I enjoyed reading the report you wrote”, you’ve clearly been focusing on the wrong things.
  • Reviews of papers. Sure, I want to help out the authors, but I can’t be doing their research for them.
  • Teaching. If you try to do your best, you’ll do nothing else. The students may notice the difference between just dusting off last year’s notes and actually preparing (and fixing past mistakes), but will they really learn much more in the latter case? To learn, they must struggle a bit, e.g., through the errors in my notes. Is that just a rationalization?
  • Reviews of grants. Someone’s livelihood is at stake, but does it really matter if the text of my review is awkwardly phrased?
  • My own grant. Some of my colleagues will start writing a grant months in advance. I say: if you start months in advance, then you’re going to spend months on the thing! You should start as late as possible. A perfect score on a grant is an indication that you spent too much time on it. The best grant score is the worst possible that still gets funded.

As I said, it’s painful to do things that you know are crap (i.e., could be much better). It’s hard to accept one’s limitations.

Should I be nice?

18 May 2012

I got the following email.

Subject: i have a question?
Date: May 18, 2012 7:57:56 AM CDT

how can i enter the data of QTL analysis.

That was the whole thing.

I presume that the writer wishes to use my R/qtl software.

I could probably respond helpfully (for example, “See the sample data files and code at the R/qtl web site.”), but can’t I expect more from people seeking my help?

I suppose there are three options:

  • Just answer the inferred question.
  • Answer the inferred question, but in an offensive way.
  • Ask the writer to provide further details.

I chose the third option, but I probably should have chosen the first. A fabulously useful person whom I greatly admire is well known for his use of the middle option.


I responded

Your question is not answerable without further details.

to which the correspondent replied

How can i analysis data the by RQTL?

I responded with links to tutorials, sample data files, and my book with Śaunak Sen.


17 May 2012

I’m frequently picking pieces of paper up off the floor — bits of the kids’ artwork. Today I found one with this:

That’s actually mine, so I thought I’d share it. Here’s a link to the full page.

These scratchings are related to a bit of theory I’ve been working on (with considerable help), regarding phylogenetic trees. It’s part of this paper, which I’m getting ready to re-submit.

Genetic testing of sperm donors

16 May 2012

A recent New York Times article discusses the potential need for genetic screening of sperm donors. A couple of points struck me as rather odd.

One, the mention of fragile X syndrome, has been corrected:

An earlier version of this article erroneously included fragile X syndrome in a list of genetic diseases that have affected children conceived with donated sperm. (Although a sperm donor who is a carrier of the fragile X premutation may pass it along to his daughters, the disease, which affects twice as many boys as girls, is almost always inherited from the mother.)

The other:

Sperm donors are no more likely to carry genetic diseases than anybody else, but they can father a far greater number of children: 50, 100 or even 150, each a potential inheritor of flawed genes, and each a vector for making those genes more pervasive in the general population.

But the frequency of genetic disease among the offspring of sperm donors would be the same whether there were many donors with one child each or fewer donors with many children each. So it’s not clear how sperm donors fathering many children is an argument for genetic testing, except that with fewer donors for the same number of offspring, fewer genetic tests would be needed, and so there would be a considerable cost savings.
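
To make the arithmetic concrete, here is a rough sketch in Python. It assumes a simple model that is mine, not the article's: each donor independently carries a single disease allele with some frequency, and a carrier passes it to each child with probability 1/2. The carrier frequency of 1/25 and the donor counts are hypothetical, picked only for illustration.

```python
from fractions import Fraction

def expected_inheritors(n_donors, children_per_donor, carrier_freq):
    """Expected number of children who inherit the disease allele from their donor."""
    # Assumed model: a carrier transmits the allele to each child with probability 1/2.
    transmit = Fraction(1, 2)
    per_donor = carrier_freq * children_per_donor * transmit
    return n_donors * per_donor

q = Fraction(1, 25)  # hypothetical carrier frequency, for illustration only

# The same 1500 children, spread over many donors or over few.
many_donors = expected_inheritors(1500, 1, q)
few_donors = expected_inheritors(10, 150, q)
print(many_donors, few_donors)  # prints: 30 30
```

Under these assumptions the expected count depends only on the total number of children, not on how they are divided among donors.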

The article led with a story about a couple whose child, fathered by a sperm donor, has cystic fibrosis. But the chance of that is the same whether the sperm came from a donor or in the usual way.

No commuter bikes on sidewalk

8 May 2012

I’m pretty sure you won’t find a street sign like this in Baltimore:

"No commuter bikes on sidewalk"

It’s on my usual commute to campus.

Boring us to submission

7 May 2012

The University of Wisconsin-Madison has embarked on a big human resources redesign. I think the university administration has devised the ideal strategy to ensure little faculty input in the process: they’ve made it so dreadfully dull and full of business speak that we can’t bear to read through the material or show up to one of the many “engagement sessions”.

What reasonable faculty member would seek to attend an “engagement session”?

And then there are diagrams like this:

And stuff like this:

When they talk about market-based blah blah blah, my eyes glaze over and I want to take a nap.

There was a great deal of discussion of this stuff at the faculty senate meeting today.

Noah Feinstein asked an important question: on what evidence are the various recommendations based? I’m pretty sure it involves a lot of exploding pie charts.

Sara Goldrick-Rab made the most interesting point of the meeting: the market doesn’t reward good teaching. She has much more to say at her blog.

That I’ve been introduced to Sara’s blog may be viewed as an early positive outcome of the process.

Rob Tibshirani and Andy Clark named to NAS

3 May 2012

Rob Tibshirani and Andy Clark are now members of the National Academy of Sciences.