Archive for March, 2013

Data structures are important

19 Mar 2013

I’ve created another D3 example, of QTL analysis for a phenotype measured over time. (Click on the image for the interactive version.)

QTL analysis with phenotype over time

The code is on github. It took me about a day.

The hardest part was figuring out the right data structures. A pixel here is linked to curves over there and over there and over there. You need to set things up so it’s easy to traverse such linkages.
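A minimal sketch of the kind of structure I mean (all names hypothetical, not the actual code in the repository): rather than searching for the linked curves at hover time, precompute a lookup table keyed by pixel, so that the hover handler can find them immediately.

```javascript
// Toy data: each pixel (row, col) in the heat map is linked to some curves.
var pixels = [
  {row: 0, col: 0, curves: ["c1", "c2"]},
  {row: 0, col: 1, curves: ["c2"]},
  {row: 1, col: 0, curves: ["c3"]}
];

// Build a lookup table keyed by "row,col" once, up front.
var pixelIndex = {};
pixels.forEach(function(p) {
  pixelIndex[p.row + "," + p.col] = p.curves;
});

// A hover handler can then traverse the linkage in constant time.
function curvesAt(row, col) {
  return pixelIndex[row + "," + col] || [];
}
```

The point is just that the traversal (pixel to curves) is decided when you build the data structure, not when the mouse moves.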

If you hover over a point in the top-left image, you get views of the vertical and horizontal cross-sections. If you click on a point, pointwise confidence bands are added to the “QTL effect” plot. (You have to click, because including those confidence bands automatically made the graph painfully slow to refresh.)

I’m not completely happy with the layout of the graph; it’s not particularly intuitive.


ENAR highs and lows

19 Mar 2013

I attended the ENAR meeting in Orlando, Florida, last week. (ENAR = “Eastern North American Region” of the International Biometric Society.)

I had a great time, but I did come to the strong realization that what I view as important is distinctly different from what the typical ENAR attendee views as important. (Rafa said, incredulously, “You knew that already!”)

Let me tell you about the high- and lowlights, for me.

Also see the ENAR-related comments from Yihui Xie and Alyssa Frazee.


LaTeX + Unicode → XeTeX

19 Mar 2013

I’m co-organizing a scientific meeting at the end of May. The abstracts are all in.

We get them in an Excel file, and I was working on a Perl script to parse the file to create a LaTeX file with the abstracts, so we could have nicely formatted versions for review. (I’m using Spreadsheet::XLSX for the first time; it’s really easy. Why have I always converted Excel files to CSV before parsing them?)

I spent way too much time trying to deal with special characters. I was looking to do a search-and-replace for all possible Unicode characters (for example, to change \xE9 aka é into \'{e}, or \xD7 aka × into $\times$).


But then I discovered that XeTeX supports Unicode, so there’s no need to do these sorts of substitutions.

I changed pdflatex to xelatex in my Makefile, and I’m done. I think.

Update: Now that I think about it, CSV is way more convenient than XLS(X) for simple data files, as you don’t have to deal with the whole $cell->{Val} business. But working with the Excel file directly is easier when the cells may contain lots of text with commas and such, like my abstracts.

Why aren’t all of our graphs interactive?

16 Mar 2013

I’ve come to believe that, for high-dimensional data, visualizations (aka graphs), and particularly interactive graphs, can be more important than precise statistical inference.

We first need to be able to view and explore the data, and when it is unusually abundant, that is especially hard. This was a primary contributor to my recent embarrassments, in which clear problems in the data were not discovered when they should have been.

I gave a talk on interactive graphs (with the title above) at Johns Hopkins last fall, and then a related talk at ENAR earlier this week, and I have a few thoughts to add here.

A brief digression

I’m giving a talk at a plant breeding symposium at Kansas State in a couple of weeks, and I’ve been pondering what to talk about. A principal problem is that I don’t really work on plant breeding. My most relevant talks are a bit too technical, and my more interesting talks are not relevant.

Then I had the idea to talk about some of my recent work with my graduate student, Il-youp Kwak, on the genetic analysis of phenotypes measured over time.

I realized that I could incorporate some interactive graphs into the talk. Initially I was just thinking that the interactive graphs would make the talk more interesting and would allow me to talk about things that weren’t necessarily relevant but were interesting to me.

But then I realized that this work really cries out for interactive graphs. And as I began to construct one of them, I thought of a whole bunch more I might create. More importantly, I realized that these interactive graphs are extremely useful teaching tools.

More D3 examples

Here’s an image of the first graph I created for the talk; click on it to jump to the interactive version.

Statisticians are often confronted with a large set of curves. We’d like to show the individual curves, but there are too many. The resulting spaghetti plot is a total mess. An image plot (like the lasagna plot) allows us to see all of the curves, but it can be hard to get a sense of what the actual curves look like. The interactive version solves the problem.

Many curves

Here’s a second example; again click on the image to jump to the interactive version. (I’ve shown this before, but I want to use it to make another point.)

Typically, in a lecture on complex trait analysis, I’d show one LOD curve (like the top panel in the image below) and a few different plots of phenotype vs genotype (the lower-right panel in the image). I think the exploratory tool will be much more effective, in a lecture, for explaining what it all means.

LOD and QTL effects

Statisticians need to be doing this routinely

In constructing a graph, one must make some difficult choices. For high-dimensional data, one must greatly compress the available information. The resulting summaries, while potentially informative, take one far away from the original data.

Interactive graphs provide a means through which one may view the overall summary but have immediate access to the underlying details.

Towards making my own papers reproducible

10 Mar 2013

Much has been written about reproducible research: that scientific papers should be accompanied by the data and software sufficient to reproduce the results. It’s obviously a Good Thing. But it can be hard to stick to this ideal in practice.

For my early papers, I’m not sure I can find the materials anymore, and that’s just 15 years back.

For my recent papers, I have developed a sort of system so that I can reproduce the results myself. I use a combination of tools, including R, Sweave, perl, and, of course, make.
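A minimal sketch of the sort of Makefile chain I mean (the file names here are made up for illustration): make tracks the dependencies, so that editing the analysis code triggers a re-run of the analysis, a re-weave of the Sweave document, and a re-build of the PDF.

```make
# Hypothetical pipeline: R analysis -> Sweave document -> PDF

mypaper.pdf: mypaper.tex
	pdflatex mypaper

mypaper.tex: mypaper.Rnw results.RData
	R CMD Sweave mypaper.Rnw

results.RData: analysis.R data.csv
	R CMD BATCH analysis.R
```

Then `make` rebuilds only what is out of date, which is most of what I mean by being able to reproduce my own results.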

But I’ve not been distributing the detailed code behind my papers. It’s not always pretty, and it is not well documented. (And “not always” is a bit of a stretch — more like “seldom.”)

When Victoria Stodden visited Madison last fall, I was inspired to release this code, but I never carried through on that.

But then last night I twittered an example graph from a paper, was (appropriately) asked to produce code, and so created a github repository with the bulk of the code for that paper.

The repository is incomplete: it doesn’t include the code to do the main analysis and simulations, but just to make the figures, starting from those results. I’ll work to add those additional details.

And, even once complete, it will be far from perfect. The code is (or will be) there, but it would take a bit of work for an interested reader to figure out what it is doing, since much of it is undocumented and poorly written.

But if we ask for perfection, we’ll get nothing. If we ask for the minimal materials for reproducibility, we may get them.

So that’s my goal: to focus first on minimal accessibility of the code and data behind a paper, even if it is minimally readable and so might take quite a bit of effort for someone else to follow.

One last point: I use local git repositories for my draft analyses and for the whole process of writing a paper. I could post that whole history, but as I said before:

Open source means everyone can see my stupid mistakes. Version control means everyone can see every stupid mistake I’ve ever made.

It would be easy to make my working repository public, but it would include things like referees’ reports and my responses to them, as well as the gory details on all of the stupid things that I might do along the way to publication.

I’m more comfortable releasing just a snapshot of the final product.

What a waste of paper

7 Mar 2013

A university press asked me to review a book manuscript, and the author “has asked that we not use electronic copies.” So they’re going to send me a hard copy.

My response: “If you only give me a paper copy, I’m going to just scan it and toss the paper. That seems like a waste of time (and paper and postage).”

I should probably have said, “Then forget it.”

Their response included, “What you do with it when it arrives I am just going to take a hear-no-evil approach to.”

The Hopkins SPH logo, part 3: Karl’s revenge

6 Mar 2013

I’m finally back to the story of the Johns Hopkins Bloomberg School of Public Health logo.

I wrote Part 1 in mid-November and Part 2 a week later. I said that Part 3 was “coming soon,” but I’ve been a bit shy about revealing this. You’ll see why.


Interactive eQTL plot with d3.js

6 Mar 2013

I just finished an interactive eQTL plot using D3, in preparation for my talk on interactive graphics at the ENAR meeting next week.

Static view of interactive eQTL plot

The code (in CoffeeScript) is available at github. But beware: it’s pretty awful.

The hardest part was setting up the data files. Well, that plus the fact that I just barely know what I’m doing in D3.


Use charset="utf-8" when loading d3.js

2 Mar 2013

To use the latest version of D3, you need to use charset="utf-8" in the call to <script>.

I’m giving a talk at ENAR in just over a week, on interactive graphics. My slides (still in preparation) are on the web.

The slides were working fine locally on my laptop, but they weren’t working on my web server…I was getting a syntax error regarding an illegal character.

I figured out that I needed to add charset="utf-8", like so:

<script charset="utf-8" type="text/javascript" src="js/d3.js"></script>

A bit more on the fire

1 Mar 2013

The Daily Cardinal reports that there were no overhead sprinklers in the area of the fire yesterday. Fire fighters thought that sprinklers had gone off, but it was really a broken water pipe.

And there are no sprinklers in my office, either. I thought they were required, but I guess only in new construction.

There’s a short TV report (after a commercial) at the channel 15 site.

A graduate student interviewed said, “I have multiple copies of my data, but they’re all in that building.”

I hope we all learn from this: Off-site backups (at least with something like Dropbox) are important.