Posts Tagged ‘computing’

Startling lack of training in statistical computing

14 Mar 2014

It is shocking to me that a statistics department would offer a graduate-level statistical computing course only every fourth year.

I had been arguing for a statistical programming course: that we supplement the usual course on the theory of statistical computing (numerical linear algebra, EM algorithm, MCMC, etc.) with a course on the practice of statistical computing.

But I was assuming that the more theoretical statistical computing course was actually being taught.

If a department teaches a course less often than every other year, the course is unavailable to many students, or it comes too late in their training to be useful. And statistical computing should really be considered part of a statistics department’s core curriculum.

Update: As you might have anticipated, I’ve been asked to teach the course.

Googling errors

14 Feb 2014

@roguelynn tweeted the other day:

If attendees of this weekend’s intro to python workshop leave with one thing, it’ll be to Google your error messages first and foremost.

I had just talked about the technique in my Tools for Reproducible Research course, and I had a few recent examples.

Gtk-WARNING **: cannot open display:

I was logged into a department server, trying to clone a private repository from GitHub, and got an error like

(gnome-ssh-askpass:1731): Gtk-WARNING **: cannot open display:

I googled that, and the first result was a Stack Overflow question, whose answer said “unset SSH_ASKPASS”, which totally worked.
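The warning comes from the graphical prompt program named in the SSH_ASKPASS environment variable (here gnome-ssh-askpass), which has no display to open in a remote login session. In a shell the fix is just the quoted `unset SSH_ASKPASS`. If the clone is launched from a script instead, the same idea can be sketched in Python by passing the child process an environment without that variable. The helper name `env_without_askpass` and the repository URL are placeholders for illustration, not anything from the Stack Overflow answer.

```python
import os


def env_without_askpass(environ=None):
    """Return a copy of the environment with SSH_ASKPASS removed."""
    env = dict(os.environ if environ is None else environ)
    env.pop("SSH_ASKPASS", None)  # no error if the variable is absent
    return env


# Usage sketch (not run here); the repository URL is a placeholder:
# import subprocess
# subprocess.run(
#     ["git", "clone", "git@github.com:USER/REPO.git"],
#     env=env_without_askpass(),
#     check=True,
# )
```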

except KeyError, k: raise AttributeError, k

AsciiDoc was giving me this error:

asciidoc -a data-uri -a toc -a toclevels=4 -a num example2.txt
  File "/usr/local/bin/asciidoc", line 101
    except KeyError, k: raise AttributeError, k
SyntaxError: invalid syntax

Google the “except KeyError” line, and you get to a Q&A on the AsciiDoc Google group, which says “Asciidoc is Python 2, not 3.”
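The offending line is written in Python 2 syntax, which a Python 3 interpreter rejects when it parses the file, before any code runs. Here is a small sketch of the difference, meant to be run under Python 3. The `Record` class and its `fields` attribute are made up for illustration; this is not AsciiDoc's actual code.

```python
# Python 2 spelling, using the line from the traceback above.
PY2_SOURCE = "try:\n    pass\nexcept KeyError, k: raise AttributeError, k\n"


def parses_under_this_python(source):
    """Return True if the running interpreter can parse `source`."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


class Record:
    """Hypothetical class that exposes dictionary keys as attributes."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def __getattr__(self, name):
        # Python 3 spelling of the same idea: `as` binds the exception,
        # and the new exception is built with an ordinary call.
        try:
            return self.fields[name]
        except KeyError as k:
            raise AttributeError(k)
```

Under Python 3, `parses_under_this_python(PY2_SOURCE)` returns False, which is the same SyntaxError the asciidoc script ran into. Running the script with a Python 2 interpreter, or using a version of the tool ported to Python 3, avoids it.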

mclapply isn’t working in windows

I got a report that parallel processing in my R/qtl package wasn’t working in Windows.

I googled “mclapply isn’t working windows” (because mclapply was the function I was using) and got this Stack Overflow page, which says:

since Windows does not have fork(), it will run standard lapply instead – no parallelization
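The behavior described in that answer, forking workers where fork() exists and quietly running serially otherwise, can be sketched in Python. This is an analogue for illustration, not how R's mclapply is implemented; the function name `parallel_map` and its `cores` argument are my own.

```python
import multiprocessing
import os


def parallel_map(func, items, cores=2):
    """Map func over items, forking workers only where fork() exists."""
    items = list(items)
    if cores < 2 or not hasattr(os, "fork"):
        # No fork() (e.g., on Windows) or only one core requested:
        # fall back to an ordinary serial map.
        return [func(x) for x in items]
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=cores) as pool:
        # func must be picklable, e.g. defined at module level.
        return pool.map(func, items)
```

The caller gets the same results either way, which is exactly why the lack of parallelization is easy to miss: nothing fails, the job is just slower.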

A course in statistical programming

25 May 2012

Graduate students in statistics often take (or at least have the opportunity to take) a statistical computing course, but often such courses are focused on methods (like numerical linear algebra, the EM algorithm, and MCMC) and not on actual coding.

For example, here’s a course in “advanced statistical computing” that I taught at Johns Hopkins back in 2001.

Many (perhaps most) good programmers learned to code outside of formal courses. But many statisticians are terrible programmers and would benefit from a formal course.

Moreover, applied statisticians spend the vast majority of their time interacting with a computer and would likely benefit from more formal presentations of how to do it well. And I think this sort of training is particularly important for ensuring that research is reproducible.

One really learns to code in private, struggling over problems, but I benefited enormously from a statistical computing course I took from Phil Spector at Berkeley.

Brian Caffo, Ingo Ruczinski, Roger Peng, Rafael Irizarry, and I developed a statistical programming course at Hopkins that (I think) really did the job.

I would like to develop a similar course at Wisconsin: on statistical programming, in the most general sense.

I have in mind several basic principles:

  • be self-sufficient
  • get the right answer
  • document what you did (so that you will understand what you did 6 months later)
  • if primary data change, be able to re-run the analysis without a lot of work
  • make your simulation results reproducible
  • reuse code (others’ and your own) rather than starting from scratch every time
  • make methods accessible to (and used by) others
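As one concrete case, the principle about reproducible simulation results mostly comes down to setting and recording the random seed. Here is a minimal Python sketch; the function `simulate_means` and its arguments are made up for illustration.

```python
import random


def simulate_means(n_reps, n_obs, seed):
    """Simulate n_reps sample means, each from n_obs standard normal draws."""
    rng = random.Random(seed)  # private generator, seeded explicitly
    means = []
    for _ in range(n_reps):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(sum(draws) / n_obs)
    return means
```

Calling `simulate_means(100, 20, seed=12345)` twice gives identical lists, so a result can be regenerated later as long as the seed is saved with the analysis. Using a private `random.Random` instance rather than the module-level generator keeps other code from disturbing the sequence.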

Here are my current thoughts about the topics to include in such a course. The key aim would be to make students aware of the basic principles and issues: to give them a good base from which to learn on their own. Homework would include interesting and realistic programming assignments, plus creating a Sweave-type document and an R package.

  • Basic Unix tools (find; df; top; ps ux; grep); Unix on Mac and Windows
  • Emacs/vim/other editors (RStudio/Eclipse)
  • LaTeX (for papers; for presentations)
  • slides for talks; posters; figures/tables
  • Advanced R (fancy data structures; functions; object-oriented stuff)
  • Advanced R graphics
  • R packages
  • Sweave/asciidoc/knitr
  • minimal Perl (or Python or Ruby); example of data manipulation
  • Minimal C (or C++); examples of speed-up
  • version control (e.g., git or Mercurial); backups
  • reproducible research ideas
  • data management
  • managing projects: data, analyses, results, papers
  • programming style (readable, modular); general but not too general
  • debugging/profiling/testing
  • high-throughput computing; parallel computing; managing big jobs
  • finding answers to questions: man pages; documentation; web
  • more on visualization; dynamic graphics
  • making a web page; html & css; simple cgi-type web forms?
  • writing and managing email
  • managing references to journal articles