Initial steps towards reproducible research

4 Dec 2014

In anticipation of next week’s Reproducible Science Hackathon at NESCent, I was thinking about Christie Bahlai’s post on “Baby steps for the open-curious.”

Moving from Ye Olde Standard Computational Science Practice to a fully reproducible workflow seems a monumental task, but partially reproducible is better than not-at-all reproducible, and it’d be good to give people some advice on how to get started – to encourage them to get started.

So, I spent some time today writing another of my minimal tutorials, on initial steps towards reproducible research.

It’s a bit rough, and it could really use some examples, but it helped me to get my thoughts together for the Hackathon and hopefully will be useful to people (and something to build upon).

The value of thesis intro/discussion

3 Dec 2014

Last week, Kelly Weinersmith tweeted:

Is any task a more monumental waste of time than writing an introduction and discussion for a dissertation where the chapters are published?

I think many (or most?) of my colleagues would agree with her. The research and the papers are the important things, and theses are hardly read. Why spend time writing chapters that won’t be read?

My response was:

Intro & disc of thesis get the student to think about the broader context of their work.

I’d like to expand on that just a bit.

In the old days, a PhD dissertation was more of a monograph. The new style is to have three or so papers (published or ready-to-submit) as chapters, sandwiched between introductory and discussion chapters. Those intro and discussion chapters are sometimes quite thin. I would prefer them to be more substantial.

The focus on papers is a good thing, as they will be easier to find and more widely read. But a thesis/dissertation is not just a research product, but also a vehicle to get a student to think more deeply and broadly.

The individual papers will include introductory and discussion sections, but journal articles tend to be aimed towards a relatively narrow and specialized audience. More substantive introductory and discussion chapters can help to make the work accessible to a broader audience. They also help to tie the separate papers together: what is the larger scientific context, and how do these pieces of work fit into that?

I don’t want students wasting time on “busy work,” and writing a thesis does seem like busy work. But I think a thesis deserves more than a ten-paragraph introduction. And the value of that introduction is not so much in demonstrating the student’s knowledge, but in being part of the development of that knowledge.

Car crash stats revisited: My measurement errors

3 Nov 2014

Last week, I created revised versions of graphs of car crash statistics by state (including an interactive version), from a post by Mona Chalabi at 538.

Since I was working on those at the last minute in the middle of the night, to be included as an example in a lecture on creating effective figures and tables, I just read the data off printed versions of the bar charts, using a ruler.

I later emailed Mona Chalabi, and she and Andrew Flowers quickly posted the data to github.com/fivethirtyeight/data. (That repository has a lot of interesting data, and if you see data at 538 that you’re interested in, just ask them!)

I was curious to look at how I’d done with my measurements and data entry. Here’s a plot of my percent errors:

Percent measurement errors in Karl's car crash stats

Not too bad, really. Here are the biggest problems:

  • Mississippi, non-distracted: off by 6%, but that corresponded to 0.5 mm.
  • Rhode Island and Ohio, speeding: off by 40 and 35%, respectively. I’d written down 8 and 9 mm rather than 13 and 14 mm.
  • Maine and Indiana, alcohol: wrote 15.5 and 14.5 mm, but typed 13.5 and 13 mm. In the former, I think I just misinterpreted my writing; in the latter, I think I wrote the number for the state below (Iowa).
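
As a rough check on the speeding numbers above, here is a minimal sketch of the percent-error calculation. It assumes percent error is defined as 100 × (measured − actual) / actual, and it treats the 13 and 14 mm values as the correct bar lengths; the variable names are made up for illustration.

```r
# Bar lengths in mm for Rhode Island and Ohio (speeding), taken from the list above
measured <- c("Rhode Island" = 8, "Ohio" = 9)     # what I wrote down
actual   <- c("Rhode Island" = 13, "Ohio" = 14)   # what I should have written

# Assumed definition: percent error relative to the correct value
pct_error <- 100 * (measured - actual) / actual

print(round(pct_error, 1))   # -38.5 and -35.7
```

Both values are negative (under-estimates) and are close to the "40 and 35%" quoted above.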

It’s also interesting to note that my “total” and “non-distracted” were almost entirely under-estimates: probably an error in the measurement of the overall width of the bar chart.

Also note: @brycem had recommended using WebPlotDigitizer for digitizing data from images.

Interactive plot of car crash stats

30 Oct 2014

I spent the afternoon making a D3-based interactive version of the graphs of car crash statistics by state that I’d discussed yesterday: my attempt to improve on the graphs in Mona Chalabi’s post at 538.

Screen shot of interactive graph of car crash statistics

See it in action here.

Code on github.

Scholarly Publishing Symposium at UW-Madison

30 Oct 2014

I’m at the Scholarly Publishing Symposium at UW-Madison today. There’s an interesting list of supplemental materials, but apparently only on paper:

Supplemental materials from UW-Madison Scholarly Publishing Symposium

So here they are electronically.

Improved graphs of car crash stats

29 Oct 2014

Last week, Mona Chalabi wrote an interesting post on car crash statistics by state, at fivethirtyeight.com.

I didn’t like the figures so much, though. There were a number of them like this:

chalabi-dearmona-drinking

I’m giving a talk today about data visualization [slides | github], and I thought this would make a good example, so I spent some time creating versions that I like better.

Error notifications from R

4 Sep 2014

I’m enthusiastic about having R notify me when my script is done.

But in one of my early uses of this, my script threw an error, and I never got a text or pushbullet about that. And really, I’m even more interested in being notified about such errors than anything else.

It’s relatively easy to get notified of errors. At the top of your script, include code like options(error = function() { } )

Fill in the function with your notification code. If there’s an error, the error message will be printed and then that function will be called. (In an interactive session, the script will then halt; for scripts run with R CMD BATCH, see the second update below.)

You can use geterrmessage() to grab the error message to include in your notification.

For example, if you want to use RPushbullet for the notification, you could put, at the top of your script, something like this:

options(error = function() {
    # send the error message as a Pushbullet note titled "Error"
    library(RPushbullet)
    pbPost("note", "Error", geterrmessage())
})

Then if the script gives an error, you’ll get a note with title “Error” and with the error message as the body of the note.

Update: I knew I’d heard about this sort of thing somewhere, but I couldn’t remember where. Duh; Rasmus mentioned it on twitter just a couple of days ago! Fortunately, he reminded me of that in the comments below.

Another update: Ian Kyle pointed out in the comments that the above function, if used in a script run with R CMD BATCH, won’t actually halt the script. The simplest solution is to add stop(geterrmessage()), like this:

options(error = function() {
    library(RPushbullet)
    pbPost("note", "Error", geterrmessage())
    # when run non-interactively (e.g., with R CMD BATCH), halt the script
    if (!interactive()) stop(geterrmessage())
})
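
If you’d like to try the idea without setting up RPushbullet, here is a minimal sketch that appends the error to a log file instead. The `notify()` helper and the `notifications.log` file name are hypothetical stand-ins for whatever notification code you prefer. Note that this version halts a non-interactive script with `q()` rather than the `stop()` call shown above; according to the R documentation for `options`, that is effectively what R’s default non-interactive error handling does.

```r
# Hypothetical helper: replace the body with your own notification code
# (a text message, an email, a Pushbullet note, ...)
notify <- function(title, body, logfile = "notifications.log") {
    # geterrmessage() ends with a newline, so trim it before formatting
    line <- sprintf("[%s] %s: %s", format(Sys.time()), title, trimws(body))
    cat(line, "\n", sep = "", file = logfile, append = TRUE)
    invisible(line)
}

options(error = function() {
    notify("Error", geterrmessage())
    # In a non-interactive run, quit with a non-zero exit status
    if (!interactive()) q("no", status = 1, runLast = FALSE)
})
```

Put this at the top of a script; any later error should add a time-stamped line to the log file before the script exits.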

Notifications from R

3 Sep 2014

You’ve just set a long R job running. How will you know when it’s done? Have it notify you by beeping, sending you a text, or sending you a notification via pushbullet.
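
The simplest of these, beeping, can be sketched with `alarm()` from the utils package that ships with R. The `run_and_beep()` wrapper and the toy job below are made up for illustration; whether you actually hear anything depends on your terminal or GUI.

```r
# Hypothetical wrapper: run a job, then give an audible (or visual) signal
run_and_beep <- function(job) {
    result <- job()
    alarm()          # rings the terminal bell, where the console supports it
    result
}

# Stand-in for a long-running job
out <- run_and_beep(function() {
    Sys.sleep(1)
    sum(1:10)
})
```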


The mustache photo

28 Aug 2014

A certain photo of me has been following me around for some time.

Karl with a mustache, 15 Nov 2002

The thing is sitting on my website, so I suppose I have only myself to blame. I actually quite like the photo. I look happy. I was happy. I’m not always happy.


Yet another R package primer

28 Aug 2014

Hadley Wickham is writing what will surely be a great book about the basics of R packages. And Hilary Parker wrote a very influential post on how to write an R package. So it seems like that topic is well covered.

Nevertheless, I’d been thinking for some time that I should write another minimal tutorial with an alliterative name, on how to turn R code into a package. And it does seem valuable to have a diversity of resources on such an important topic. (R packages are the best way to distribute R code, or just to keep track of your own personal R code, as part of a reproducible research process.)

So I’m going ahead with it, even though it doesn’t seem necessary: the R package primer.

It’s not completely done, but the basic stuff is there.

