MongoDB with D3.js

22 Jun 2015

I consider interactive data visualization to be the critical tool for exploration of high-dimensional data.

That’s led me to spend a good amount of time in the last few years learning some new skills (D3 and CoffeeScript) and developing some new tools, particularly the R package R/qtlcharts, which provides interactive versions of the many data visualizations in R/qtl, my long-in-development R package for mapping genetic loci (called quantitative trait loci, QTL) that underlie complex trait variation in experimental organisms.

R/qtlcharts is rough in spots, and while it works well for moderate-sized data sets, it can’t handle truly large-scale data well, as it just dumps all of the data into the file viewed by a web browser.

For large-scale data, one needs to dynamically load slices of the data based on user interactions. It seems best to have a formal database behind the scenes. But I think I’m not unusual, among statisticians, in having almost no experience working with databases. My collaborators tend to keep things in Excel. Even for quite large problems, I keep things in flat files.

So, I’ve been trying to come to understand the whole database business, and how I might use one for larger-scale data visualizations. I think I’ve finally made that last little conceptual step, where I can see what I need to do. I made a small illustration in my d3examples repository on GitHub.
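To make the idea concrete, here’s a minimal sketch of the sort of server-side query I have in mind, using R’s mongolite package. The database, collection, and field names are entirely hypothetical; the point is just that the browser asks for one slice and the database returns only that slice, rather than the whole dataset.

```r
# Hypothetical setup: a MongoDB database "expr" with a collection "lod",
# each document like {"lodcolumn": "spleen", "pos": 5.1, "lod": 3.2}.
# (All of these names are made up, for illustration.)
library(mongolite)

con <- mongo(collection = "lod", db = "expr",
             url = "mongodb://localhost")

# Fetch just the curve the user clicked on, not the full dataset.
slice <- con$find(query  = '{"lodcolumn": "spleen"}',
                  fields = '{"pos": true, "lod": true, "_id": false}')
```

A little web server sitting in front of queries like this could then hand each slice to the D3 code as JSON, in response to user interactions.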


Randomized Hobbit

22 Jun 2015

@wrathematics pointed me to his ngram R package, for constructing n-grams from text and simulating from them.

I’d recently grabbed the text of The Hobbit, and so I applied the package to that text, with amusing results.

Here’s the code I used to grab the text.

library(XML)

# scrape the 74 pages of the book from 5novels.com
stem <- "http://www.5novels.com/classics/u5688"
hobbit <- NULL
for(i in 1:74) {
    cat(i,"\n")   # progress indicator

    # the first page has no "_i" suffix in its URL
    if(i==1) {
        url <- paste0(stem, ".html")
    } else {
        url <- paste0(stem, "_", i, ".html")
    }

    # parse the page and pull out the text of the <p> elements
    x <- htmlTreeParse(url, useInternalNodes=TRUE)
    xx <- xpathApply(x, "//p", xmlValue)

    # drop the last paragraph (navigation links) and strip carriage returns
    hobbit <- c(hobbit, gsub("\r", "", xx[-length(xx)]))

    Sys.sleep(0.5)   # pause between requests, to be polite to the server
}

Then calculate the ngrams with n=2.

library(ngram)
ng2 <- ngram(hobbit, n=2)

Simulate some number of words with babble(). If you use the seed argument, you can get reproducible results.

babble(ng2, 48, seed=53482175)

into trees, and then bore to the Mountain to go through?” groaned the hobbit. “Well, are you doing, And where are you doing, And where are you?” it squeaked, as it was no answer. They were surly and angry and puzzled at finding them here in their holes

Update: @wrathematics suggested that I mix two texts, so here’s a bit from the Hobbit in the Hat (The Hobbit with 59× Cat in the Hat — up-sampled to match lengths.) But there’s maybe not enough overlap between the two texts to get much of a mixture.
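The mixing step might look something like the following sketch, where cat_in_hat is a hypothetical character vector holding the text of The Cat in the Hat, prepared the same way as hobbit above.

```r
library(ngram)

# Up-sample the much shorter Cat in the Hat text 59x, so that the two
# books contribute roughly equal numbers of words, then build 2-grams
# from the combined text as before.
# (cat_in_hat is hypothetical; assume it was scraped like hobbit above.)
mixed <- c(hobbit, rep(cat_in_hat, 59))
ng2_mixed <- ngram(mixed, n = 2)

babble(ng2_mixed, 48)
```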

“I am Gandalf,” said the fish. This is no way at all!

already off his horse and among the goblin and the dragon, who had remained behind to guard the door. “Something is outside!” Bilbo’s heart jumped into his boat on to sharp rocks below; but there was a good game, Said our fish No! No! Those Things should not fly.

Cheat sheets for R-based Software Carpentry course

29 Apr 2015

At the Software Carpentry workshop at UW-Madison in August, 2014, one of the students suggested that we hand out some cheat sheets on each topic. I thought that was a really good idea.

So at the SWC workshop at Washington State University this week, we handed out the following five pages:

I really appreciate the work (and design sense) that went into these.

2015 AAAS in San Jose

13 Feb 2015

I’m at the 2015 AAAS meeting in San Jose, California. This is definitely not my typical meeting: too big, too broad, and I hardly know anyone here. But here’s a quick (ha ha; yah sure) summary of the meeting so far.

Opening night

Gerald Fink gave the President’s Address last night. He’s the AAAS President, so I guess that’s appropriate. But after five minutes of really lame simplistic crap (for example, he said something like, “A single picture can destroy our known understanding of the universe,” like innovation and improving our understanding is a bad thing), I left.

Oh, and before that: the emcee of the evening, who introduced Janet Napolitano, totally couldn’t pronounce her last name. (Her remarks, particularly her comments in support of public universities, were quite powerful.) Old dude: practice such things! Your ineptness reveals that you haven’t paid proper attention to her.

Sightings

A huge meeting, but I know next to no one here. But I ran into Sanjay Shete in the exhibit hall, where I attempted to get two of every tchotchke. (My kids will pitch a fit if one gets something, no matter how lame the thing, and the other doesn’t.) Sanjay was named an AAAS Fellow; that’s why he’s here.

I also ran into Steve Goodman (not the folk singer who died too young, but a singer, nevertheless). Gotta love Steve Goodman! He produced Behind the tan door.

Highlights

I went to a dozen talks. A half-dozen I really liked.

Alan Aspuru-Guzik talked about how to find (and visualize) useful organic molecules among the 10^60 (or 10^180?) possible. Cool high-throughput computing and interactive graphics to produce better solar panels (particularly for developing countries) and huge batteries to store wind- and solar-based power.

Russ Altman talked about how to search databases, web-search histories, and social media, to identify pairs of drugs that, together, give bad (or good) side effects that wouldn’t be predicted from their on-their-own side effects.

David Altshuler had a hilarious outline slide for his talk, but the rest was really awesome. A key point: to develop precision medicine will require hard work and there’s no magic bullet. And basic (not just translational) research is critical: we can’t make a medicine that gets to the precise cause (and that’s what precision medicine is about) if we don’t understand that basic biology.

I gave a talk myself, in a session on visualization of biomedical data, but it was definitely not the best talk in the session, nor the second best. Mine might have been the worst of the five talks in the session. But that’s okay; I think I did fine. It’s just that Sean Hanlon (brother of my UW–Madison colleague, Bret Hanlon) put together a superb, but thinly-attended, session.

Miriah Meyer’s was my favorite talk of the day. She develops visualization tools to help scientists make sense of their data. And her approach is much like mine: specific solutions to specific data and questions. She talked about MulteeSum, PathLine, and MizBee. Favorite quote: “It’s amazing how much people like circles these days.”

Frederick Streitz from Lawrence Livermore National Lab talked about simulating and visualizing the electrophysiology of the human heart at super-high resolution using a frigging huge cluster, with 1.5 million cores. I loved his analogies: if you are painting your house, having a friend or two over to help will reduce the time by the expected factor, but what about 1000 friends, or 100k friends? At that scale, you need to rethink what you’ll use the computers for.

His second analogy: The DOE cluster at Livermore is 100k times a desktop computer. That’s like the difference between Pac-Man (1980, 2.1 megaFLOPS) and Assassin’s Creed (2011, 260 gigaFLOPS). And their cluster is 100k times that.

At the end of the day, Daphne Koller talked about Coursera. She’s awesome; Coursera’s awesome; I’m a crappy teacher. That’s my thinking at the moment, anyway. (A video of her talk is online. Have I mentioned how much I hate it when people screw up the aspect ratio? It seems like they screwed up the aspect ratio.) University faculty exist to help people, and with Coursera and other MOOCs, we can help a lot of people. Key lessons: the value of peer grading (for learning), not being constrained by the classroom or the 60-min format, ability to explore possible teaching innovations, and just having a hugely broad reach.

I don’t think I’d heard the quote that Daphne mentioned, attributed to Edwin Emery Slosson:

College is a place where a professor’s lecture notes go straight to the students’ lecture notes, without passing through the brains of either.

Still laughing!

Food

As I mentioned on Twitter, I’ve eaten a lot of tacos. But I also had some donuts.

Boy, am I old

I seem to be staying at the same hotel as the American Junior Academy of Sciences (AJAS). Are these high school or college students? Man, do I feel old.

My contribution to education, today: if all of the elevators going down are too packed to accept passengers, press the up button and ride it up and then down. (Later I learned, from one of the AJAS youth, that the “alarm will sound” sign at the bottom of the stairs is a lie. You can take the stairs.)

Initial steps towards reproducible research

4 Dec 2014

In anticipation of next week’s Reproducible Science Hackathon at NESCent, I was thinking about Christie Bahlai’s post on “Baby steps for the open-curious.”

Moving from Ye Olde Standard Computational Science Practice to a fully reproducible workflow seems a monumental task, but partially reproducible is better than not-at-all reproducible, and it’d be good to give people some advice on how to get started – to encourage them to get started.

So, I spent some time today writing another of my minimal tutorials, on initial steps towards reproducible research.

It’s a bit rough, and it could really use some examples, but it helped me to get my thoughts together for the Hackathon and hopefully will be useful to people (and something to build upon).

The value of thesis intro/discussion

3 Dec 2014

Last week, Kelly Weinersmith tweeted:

Is any task a more monumental waste of time than writing an introduction and discussion for a dissertation where the chapters are published?

I think many (or most?) of my colleagues would agree with her. The research and the papers are the important things, and theses are hardly read. Why spend time writing chapters that won’t be read?

My response was:

Intro & disc of thesis get the student to think about the broader context of their work.

I’d like to expand on that just a bit.

In the old days, a PhD dissertation was more of a monograph. The new style is to have three or so papers (published or ready-to-submit) as chapters, sandwiched between introductory and discussion chapters. Those intro and discussion chapters are sometimes quite thin. I would prefer them to be more substantial.

The focus on papers is a good thing, as they will be easier to find and more widely read. But a thesis/dissertation is not just a research product, but also a vehicle to get a student to think more deeply and broadly.

The individual papers will include introductory and discussion sections, but journal articles tend to be aimed towards a relatively narrow and specialized audience. More substantive introductory and discussion chapters can help to make the work accessible to a broader audience. They also help to tie the separate papers together: what is the larger scientific context, and how do these pieces of work fit into that?

I don’t want students wasting time on “busy work,” and writing a thesis does seem like busy work. But I think a thesis deserves more than a ten-paragraph introduction. And the value of that introduction is not so much in demonstrating the student’s knowledge, but in being part of the development of that knowledge.

Car crash stats revisited: My measurement errors

3 Nov 2014

Last week, I created revised versions of graphs of car crash statistics by state (including an interactive version), from a post by Mona Chalabi at 538.

Since I was working on those at the last minute in the middle of the night, to be included as an example in a lecture on creating effective figures and tables, I just read the data off printed versions of the bar charts, using a ruler.

I later emailed Mona Chalabi, and she and Andrew Flowers quickly posted the data to github.com/fivethirtyeight/data. (That repository has a lot of interesting data, and if you see data at 538 that you’re interested in, just ask them!)

I was curious to look at how I’d done with my measurements and data entry. Here’s a plot of my percent errors:

Percent measurement errors in Karl's car crash stats

Not too bad, really. Here are the biggest problems:

  • Mississippi, non-distracted: off by 6%, but that corresponded to 0.5 mm.
  • Rhode Island and Ohio, speeding: off by 40 and 35%, respectively. I’d written down 8 and 9 mm rather than 13 and 14 mm.
  • Maine and Indiana, alcohol: wrote 15.5 and 14.5 mm, but typed 13.5 and 13 mm. In the former, I think I just misinterpreted my writing; in the latter, I think I wrote the number for the state below (Iowa).

It’s also interesting to note that my “total” and “non-distracted” were almost entirely under-estimates: probably an error in the measurement of the overall width of the bar chart.
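For the record, the percent errors here are just the measured bar lengths relative to the true values. A small sketch with the Rhode Island and Ohio numbers mentioned above (in mm); note that this doesn’t exactly reproduce the “40 and 35%” quoted above, since the denominators in the post may have been the true percentages from 538’s data rather than bar lengths.

```r
# measured = what I wrote down from the printed chart; actual = correct values
measured <- c("Rhode Island" = 8,  "Ohio" = 9)
actual   <- c("Rhode Island" = 13, "Ohio" = 14)

percent_error <- 100 * (measured - actual) / actual
round(percent_error)   # about -38 and -36
```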

Also note: @brycem had recommended using WebPlotDigitizer for digitizing data from images.

Interactive plot of car crash stats

30 Oct 2014

I spent the afternoon making a D3-based interactive version of the graphs of car crash statistics by state that I’d discussed yesterday: my attempt to improve on the graphs in Mona Chalabi’s post at 538.

Screen shot of interactive graph of car crash statistics

See it in action here.

Code on GitHub.

Scholarly Publishing Symposium at UW-Madison

30 Oct 2014

At the Scholarly Publishing Symposium at UW-Madison today. They have an interesting list of supplemental materials, but apparently only on paper:

Supplemental materials from UW-Madison Scholarly Publishing Symposium

So here they are electronically.

Improved graphs of car crash stats

29 Oct 2014

Last week, Mona Chalabi wrote an interesting post on car crash statistics by state, at fivethirtyeight.com.

I didn’t like the figures so much, though. There were a number of them like this:

[image: chalabi-dearmona-drinking]

I’m giving a talk today about data visualization [slides | github], and I thought this would make a good example, so I spent some time creating versions that I like better.

