A course in statistical programming

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses usually focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) rather than on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

  • be self-sufficient
  • get the right answer
  • document what you did (so that you will understand what you did 6 months later)
  • if primary data change, be able to re-run the analysis without a lot of work
  • make your simulation results reproducible
  • reuse code (others’ and your own) rather than starting from scratch every time
  • make methods accessible to (and used by) others
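
As one illustration of the simulation-reproducibility principle above, here is a minimal sketch in Python (a course like this would presumably use R; the function name `simulate_means`, its parameters, and the seed value are all hypothetical). The idea is simply to fix the random seed so that a simulation gives identical results every time it is run.

```python
import random
import statistics


def simulate_means(n_sims, n_obs, seed):
    """Return the sample mean of n_obs standard normal draws for each of n_sims runs.

    Hypothetical helper for illustration. A dedicated random.Random instance
    seeded with `seed` is used instead of the global generator, so other code
    that draws random numbers cannot change these results.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_sims)
    ]


if __name__ == "__main__":
    first = simulate_means(n_sims=5, n_obs=100, seed=20120401)
    second = simulate_means(n_sims=5, n_obs=100, seed=20120401)
    # Same seed, so the two runs produce identical results.
    print(first == second)
```

Recording the seed alongside the results (in the script or its output) is what lets someone else, or you 6 months later, regenerate exactly the same numbers.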

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

  • Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
  • Emacs/vim/other editors (RStudio/Eclipse)
  • LaTeX (for papers; for presentations)
  • slides for talks; posters; figures/tables
  • Advanced R (fancy data structures; functions; object-oriented stuff)
  • Advanced R graphics
  • R packages
  • Sweave/asciidoc/knitr
  • minimal Perl (or Python or Ruby); example of data manipulation
  • Minimal C (or C++); examples of speed-up
  • version control (e.g., git or mercurial); backups
  • reproducible research ideas
  • data management
  • managing projects: data, analyses, results, papers
  • programming style (readable, modular); general but not too general
  • debugging/profiling/testing
  • high-throughput computing; parallel computing; managing big jobs
  • finding answers to questions: man pages; documentation; web
  • more on visualization; dynamic graphics
  • making a web page; HTML & CSS; simple CGI-type web forms?
  • writing and managing email
  • managing references to journal articles

11 Responses to “A course in statistical programming”

  1. mark leeds Says:

    Hi: I just glanced, but that’s a really nice course. I graduated with
    a Ph.D. in statistics in 2000 and I wish I had taken a course like
    that (there was no class like that or I would have).

    I’m sort of up to speed now on a decent percentage of the areas you
    cover (12 years later), but it would have been so much better
    to have had a course like that. Thanks for the link. I hope
    to get back to it in detail at some point.


  2. Mel. B. Says:

    Excellent list. I couldn’t agree more. I would stick in relational databases and SQL under data management. I am tired of training fresh economics and statistics PhD graduates in basic computing and programming skills.

    • Justin R. Says:

      Absolutely. Knowing how to store and organize data is just as important as knowing how to analyze it.

  3. Barry Rowlingson Says:

    Reading through your syllabus in “statistical programming”, I don’t see the statistics! Nope, no distributions, no hypothesis tests. What you’ve done here is designed a course that should be called Data Science! We need to get all scientists working this way, not just those whose work involves means and variances.

  4. Be Bolker Says:

    You might want to look at http://software-carpentry.org/ to see how much overlap there is …

    • Karl Broman Says:

      Thanks! That’s a great resource.

    • richierocks Says:

      Software Carpentry is ace. And I agree with their position that version control should be lesson 1. Learning version control has had more of an impact on my software development than anything else.

      (Numbers 2 and 3 are “adopt a style guide”, and “functions should only do one thing or be broken down into subfunctions” that together let you write more readable code.)

  5. Sarah Henderson Says:

    I’m totally self-taught in R, and I had the good fortune to take Software Carpentry with Greg Wilson in 2009. It slingshotted me to a whole new level, and would have been even better if we’d been using R instead of Python (learning a new syntax while learning new concepts is more challenging than necessary). There is a huge need for this sort of thing — it should be required for all graduate students coding anything for their research.

  6. richierocks Says:

    If you are teaching stats courses, you might want to check out Live-R (http://live-analytics.com/), which is R in a browser.

    Aside from not having to worry about getting R installed on machines, Live-R has extra features to help lecturers, so you can integrate course materials and textbooks, and you can easily share code and workspaces with the students.

    (Disclosure: I’m an employee of Live Analytics.)

    Live-R is in beta now, but if you’re interested, the sign-up link is
    and I’ll make sure you get priority access.

  7. wesesque Says:

    Any chance this idea is gonna happen, Karl?
