Posts Tagged ‘exploratory data analysis’

Car crash stats revisited: My measurement errors

3 Nov 2014

Last week, I created revised versions of graphs of car crash statistics by state (including an interactive version), from a post by Mona Chalabi at 538.

Since I was working on those at the last minute in the middle of the night, to be included as an example in a lecture on creating effective figures and tables, I just read the data off printed versions of the bar charts, using a ruler.

I later emailed Mona Chalabi, and she and Andrew Flowers quickly posted the data to a public repository. (That repository has a lot of interesting data, and if you see data at 538 that you’re interested in, just ask them!)

I was curious to look at how I’d done with my measurements and data entry. Here’s a plot of my percent errors:

Percent measurement errors in Karl's car crash stats

Not too bad, really. Here are the biggest problems:

  • Mississippi, non-distracted: off by 6%, but that corresponded to 0.5 mm.
  • Rhode Island and Ohio, speeding: off by 40 and 35%, respectively. I’d written down 8 and 9 mm rather than 13 and 14 mm.
  • Maine and Indiana, alcohol: wrote 15.5 and 14.5 mm, but typed 13.5 and 13 mm. In the former, I think I just misinterpreted my writing; in the latter, I think I wrote the number for the state below (Iowa).

It’s also interesting to note that my “total” and “non-distracted” measurements were almost entirely under-estimates: probably due to an error in my measurement of the overall width of the bar chart.
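The percent errors plotted above are just the signed differences relative to the true values. A quick sketch in Python, using the Rhode Island and Ohio measurements mentioned above (the quoted 40% and 35% were presumably computed on the recovered data values rather than the raw mm, so the raw-mm figures differ slightly):

```python
def percent_error(measured, actual):
    """Signed percent error of a measurement relative to the true value."""
    return 100.0 * (measured - actual) / actual

# The measurements mentioned above: I'd written down 8 and 9 mm for the
# Rhode Island and Ohio speeding bars, which should have been 13 and 14 mm.
for state, (measured, actual) in {"Rhode Island": (8, 13), "Ohio": (9, 14)}.items():
    print(f"{state}: {percent_error(measured, actual):+.0f}%")
```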

Also note: @brycem had recommended using WebPlotDigitizer for digitizing data from images.

Interactive plot of car crash stats

30 Oct 2014

I spent the afternoon making a D3-based interactive version of the graphs of car crash statistics by state that I’d discussed yesterday: my attempt to improve on the graphs in Mona Chalabi’s post at 538.

Screen shot of interactive graph of car crash statistics

See it in action here.

Code on github.

Improved graphs of car crash stats

29 Oct 2014

Last week, Mona Chalabi wrote an interesting post on car crash statistics by state, at 538.

I didn’t like the figures so much, though. There were a number of them like this:


I’m giving a talk today about data visualization [slides | github], and I thought this would make a good example, so I spent some time creating versions that I like better.

Microarrays suck

25 Apr 2012

Maybe it’s just that I’m stupid and haven’t been paying proper attention to Rafa’s work in the past decade, but microarray data have really been kicking my ass the last few weeks. Here are a few of the lessons that I’ve re-learned.

Lesson 1: What the hell is that?

Many (or all) of my interesting findings have been completely fortuitous results of exploring apparent artifacts in data (autozygosity in the CEPH families, a large, common human inversion polymorphism, and most recently sample mixups).

A couple of weeks ago, I was working on writing up a paper, and saw an unusual plot, and thought, “Hmm, that’s interesting,” and so I made some more plots. And then I saw this:

This is an image of the correlations between samples, across genes, for a study with 500 microarrays on each of 6 tissues. The samples are sorted by sex (female on bottom left; male on top right) and then by ID. I was initially puzzling over the sex difference in other tissues (same-sex pairs are positively correlated; opposite-sex pairs are negatively correlated), but when I saw this tissue, I was struck by the plaid pattern.

Following up on this, it seems that the plaid pattern is the result of a set of bad arrays that should have been detected before but hadn’t been.
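The plaid pattern can be reproduced on toy data: give a block of “bad” arrays a shared artifact and image the between-array correlation matrix. A rough sketch in Python with numpy/matplotlib (all numbers here are made up, much smaller than the real study):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Toy stand-in for one tissue: rows = arrays (samples), columns = genes
n_arrays, n_genes, n_bad = 60, 200, 10
expr = rng.normal(size=(n_arrays, n_genes))

# Give the first n_bad arrays a shared gene-wise artifact; they become
# positively correlated with each other, producing a block in the image.
artifact = rng.normal(size=n_genes)
expr[:n_bad] += artifact

# Correlations between arrays, computed across genes
corr = np.corrcoef(expr)

plt.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
plt.colorbar(label="correlation")
plt.title("Between-array correlations (toy data)")
plt.savefig("corr_image.png", dpi=150)
```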

Lesson 2: Look

Of course, I wouldn’t have found this array problem if I hadn’t looked. And in my several years working on this project, I hadn’t looked.

I was overwhelmed by the 500 microarrays × 6 tissues × 40,000 probes, and so didn’t make the sorts of plots that I should have.

Plus, this is the first real microarray project I’ve been involved in, and I haven’t known what to look for. (I should have looked at images of correlation matrices with arrays sorted by similarity as determined from hierarchical clustering.)
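That idea — reorder the arrays by similarity and then image the correlation matrix — can be sketched with scipy’s hierarchical clustering (toy data again; the variable names are mine, not from the project):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)

# Toy data: 40 arrays x 100 genes, with arrays 0-7 sharing an artifact
expr = rng.normal(size=(40, 100))
expr[:8] += 2.0 * rng.normal(size=100)

corr = np.corrcoef(expr)

# Cluster arrays on dissimilarity (1 - correlation), then reorder the
# correlation matrix by the dendrogram's leaf order, so that similar
# arrays sit next to each other and a bad batch shows up as a block.
dissim = squareform(1.0 - corr, checks=False)
order = leaves_list(linkage(dissim, method="average"))
corr_sorted = corr[np.ix_(order, order)]
```

Imaging `corr_sorted` (rather than `corr`) puts the bad batch in a contiguous block, even when the arrays arrive in an arbitrary order.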

You can’t make all 3000-choose-2 pairwise scatterplots, but you can look at summaries of the marginal distributions of the 3000 arrays. I was thinking about how to depict 500 box plots, and came up with this:

There are ~500 arrays here, with lines marking the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th, and 99th percentiles. The first batch of arrays are the bad ones. You can see that those have a shift upward in median but also a heavy lower tail.

I quite like the picture.

The code is available within my R/broman package on github, though there’s not much to it.
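A rough Python equivalent of the idea (the real code is R, in R/broman; the data here are simulated, with the first 50 arrays made “bad” by construction):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Simulated intensities: 500 arrays x 2000 probes; the first 50 arrays
# get a shifted median plus a heavy lower tail (entirely made-up numbers).
data = rng.normal(size=(500, 2000))
data[:50] += 0.3
low = data[:50] < -0.5
data[:50][low] -= 1.5

# One line per quantile, across arrays, in place of 500 box plots
qs = [1, 5, 10, 25, 50, 75, 90, 95, 99]
quantiles = np.percentile(data, qs, axis=1)   # shape (len(qs), 500)

for q, line in zip(qs, quantiles):
    plt.plot(line, lw=0.6, label=f"{q}%ile")
plt.xlabel("array index")
plt.ylabel("intensity")
plt.legend(fontsize=6, ncol=3)
plt.savefig("quantile_lines.png", dpi=150)
```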

Lesson 3: Don’t trust anyone

Don’t even trust yourself.

I relied on others to do the data cleaning. I should have checked things more carefully.

I looked at the data quite closely a year or so ago and detected sample mix-ups. In correcting those mistakes, I spent a lot of time looking at the array data and trying to identify bad arrays. Clearly I didn’t look closely enough, or I just looked at the wrong things.

Lesson 4: Take your time

Scientists are under a lot of pressure to produce and often are not exploring data as thoroughly as the data deserve.

But if you rush through data diagnostics, you could embarrass yourself later.

More importantly, take the time to follow up on apparent artifacts. For me, the unintended analyses are more interesting than the planned ones.

My basic strategy for data diagnostics

  • Expect errors
  • Think about what might have gone wrong and how such problems might be revealed
  • Make lots of specially-tailored plots
  • Create a (potentially informal) pipeline of all of the useful diagnostics you’ve used in the past
  • Make more plots just because you think they might be interesting
  • If something looks odd, figure out why
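The “(potentially informal) pipeline” item might be nothing more than a list of check functions you accumulate over time. A minimal sketch; the two checks here are hypothetical examples, not a prescribed set:

```python
import numpy as np

def check_missing(data):
    """Flag missing values."""
    n = int(np.isnan(data).sum())
    return [f"{n} missing value(s)"] if n else []

def check_extreme_medians(data, z=4.0):
    """Flag arrays whose median intensity is far from the rest."""
    med = np.median(data, axis=1)
    score = (med - med.mean()) / med.std()
    return [f"array {i}: extreme median" for i in np.where(np.abs(score) > z)[0]]

# The "pipeline": every diagnostic that has been useful in the past
CHECKS = [check_missing, check_extreme_medians]

def run_diagnostics(data):
    """Run every check; an empty list means nothing looked odd."""
    warnings = []
    for check in CHECKS:
        warnings.extend(check(data))
    return warnings
```

Each new artifact you chase down becomes another function appended to `CHECKS`, so the next data set gets every lesson you have already learned.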