Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but such courses usually focus on methods (like numerical linear algebra, the EM algorithm, and MCMC) and not on actual coding.
For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.
Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.
Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.
One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.
Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.
I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.
I have in mind several basic principles:
- be self-sufficient
- get the right answer
- document what you did (so that you will understand what you did 6 months later)
- if primary data change, be able to re-run the analysis without a lot of work
- make simulation results reproducible
- reuse code (others’ and your own) rather than starting from scratch every time
- make methods accessible to (and used by) others
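As one illustration of the reproducibility principle, here is a minimal sketch of a seeded simulation. It is in Python only to keep it self-contained; the function name `simulate_mean` and the seed value are hypothetical, chosen just for this example, and the same idea applies to `set.seed()` in R.

```python
import random
import statistics


def simulate_mean(n_draws, seed):
    """Return the mean of n_draws standard-normal draws from a seeded generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
    return statistics.fmean(draws)


SEED = 20120525  # arbitrary value for this sketch; record it alongside the results

first_run = simulate_mean(1000, SEED)
second_run = simulate_mean(1000, SEED)      # same seed: identical result
other_seed = simulate_mean(1000, SEED + 1)  # different seed: different draws

print(first_run == second_run)
print(first_run == other_seed)
```

Passing the seed in explicitly, rather than relying on global state, means the result depends only on the function's arguments, which is what makes a re-run six months later comparable.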
Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.
- Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
- Emacs/vim/other editors (RStudio/Eclipse)
- LaTeX (for papers; for presentations)
- slides for talks; posters; figures/tables
- Advanced R (fancy data structures; functions; object-oriented stuff)
- Advanced R graphics
- R packages
- Sweave/asciidoc/knitr
- minimal Perl (or Python or Ruby); example of data manipulation
- Minimal C (or C++); examples of speed-up
- version control (e.g., git or Mercurial); backups
- reproducible research ideas
- data management
- managing projects: data, analyses, results, papers
- programming style (readable, modular); general but not too general
- debugging/profiling/testing
- high-throughput computing; parallel computing; managing big jobs
- finding answers to questions: man pages; documentation; web
- more on visualization; dynamic graphics
- making a web page; HTML & CSS; simple CGI-type web forms?
- writing and managing email
- managing references to journal articles
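To give a sense of the scale intended by "minimal Python; example of data manipulation", here is a small sketch using only the standard library. The data, the column names, and the function `mean_by_group` are all made up for illustration; they are not from any actual course material.

```python
import csv
import io
from collections import defaultdict

# Made-up measurements standing in for a raw data file.
RAW = """subject,treatment,response
s1,control,4.1
s2,control,3.9
s3,drug,5.6
s4,drug,
s5,drug,6.0
"""


def mean_by_group(text):
    """Return {treatment: mean response}, skipping rows with a missing response."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        if row["response"].strip():  # handle missing values explicitly
            values[row["treatment"]].append(float(row["response"]))
    return {group: sum(v) / len(v) for group, v in values.items()}


group_means = mean_by_group(RAW)
for group, mean in sorted(group_means.items()):
    print(f"{group}: {mean:.2f}")
```

The point of an exercise like this is less the arithmetic than the habits: read the raw file programmatically, deal with missing values in code rather than by hand, and keep the cleaning step in a function that can be re-run when the data change.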
Tags: computing, reproducible research, teaching
25 May 2012 at 11:25 pm |
Hi: I just glanced but that’s a really nice course. I graduated with a Ph.D. in statistics in 2000 and I wish I had taken a course like that. (There was no class like that or I would have.) I’m sort of up to speed now on a decent percentage of the areas you cover (12 years later) but it would have been so much better to have had a course like that. Thanks for the link. I hope to get back to it in detail at some point.
mark
25 May 2012 at 11:32 pm |
Excellent list. I couldn’t agree more. I would stick in relational databases and SQL under data management. I am tired of training fresh economics and statistics PhD graduates in basic computing and programming skills.
26 May 2012 at 7:12 pm |
Absolutely. Knowing how to store and organize data is just as important as knowing how to analyze it.
26 May 2012 at 8:54 am |
Reading through your syllabus in “statistical programming”, I don’t see the statistics! Nope, no distributions, no hypothesis tests. What you’ve done here is design a course that should be called Data Science! We need to get all scientists working this way, not just those whose work involves means and variances.
26 May 2012 at 1:49 pm |
You might want to look at http://software-carpentry.org/ to see how much overlap there is …
26 May 2012 at 9:33 pm |
Thanks! That’s a great resource.
30 May 2012 at 8:09 am |
Software Carpentry is ace. And I agree with their position that version control should be lesson 1. Learning version control has had more of an impact on my software development than anything else.
(Numbers 2 and 3 are “adopt a style guide”, and “functions should only do one thing or be broken down into subfunctions” that together let you write more readable code.)
27 May 2012 at 9:38 pm |
I’m totally self-taught in R, and I had the good fortune to take Software Carpentry with Greg Wilson in 2009. It slingshotted me to a whole new level, and would have been even better if we’d been using R instead of Python (learning a new syntax while learning new concepts is more challenging than necessary). There is a huge need for this sort of thing — it should be required for all graduate students coding anything for their research.
30 May 2012 at 8:00 am |
If you are teaching stats courses, you might want to check out Live-R (http://live-analytics.com/), which is R in a browser.
Aside from not having to worry about getting R installed on machines, Live-R has extra features to help lecturers, so you can integrate course materials and textbooks, and you can easily share code and workspaces with the students.
(Disclosure: I’m an employee of Live Analytics.)
Live-R is in beta now, but if you’re interested, the sign-up link is
http://live-analytics.com/?page_id=2044
and I’ll make sure you get priority access.
11 May 2013 at 8:54 pm |
Any chance this idea is gonna happen, Karl?
13 May 2013 at 10:14 am |
There’s talk, but probably not soon enough for it to be useful to you, unfortunately.