Statistics | The stupidest thing...

Archive for the ‘Statistics’ Category

Halloween 2016 count

31 Oct 2016

Here’s a graph of the numbers of trick-or-treaters we saw this evening, by time. Ten of the 25 kids arrived in one big group. (Compare this to our 2011 experience.)

[Figure: Halloween 2016 count]

Tags: graphics, news
Posted in Statistics | Comments Off on Halloween 2016 count

My JSM 2016 itinerary

27 Jul 2016

The Joint Statistical Meetings are in Chicago next week. I thought I’d write down the set of sessions that I plan to attend. Please let me know if you have further suggestions.

First things first: snacks. Search the program for “spotlight” or “while supplies last” for the free snacks being offered. Or go to the page with the full list.

(more…)

Tags: conference
Posted in Statistics | 3 Comments »

I am a data scientist

Three years ago this week, I wrote a blog post, “Data science is statistics”. I was fiercely against the term at that time, as I felt that we already had a data science, and it was called Statistics.

It was a short post, so I might as well quote the whole thing:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly.

You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

I still sort of feel that way, but I must admit that my definition of “statistics” is rather different from most others’. In my view, a good statistician will consider all aspects of the data analysis process:

the broader context of a scientific question
study design
data handling, organization, and integration
data cleaning
data visualization
exploratory data analysis
formal inference methods
clear communication of results
development of useful and trustworthy software tools
actually answering real questions

I’m sure I missed some things there, but my main point is that most academic statisticians focus solely on developing “sophisticated” methods for formal inference. I agree that that is an important piece, but in my experience as an applied statistician, the other aspects are often of vastly greater importance. In many cases, we don’t need to develop sophisticated new methods, and most of my effort is devoted to those other aspects, which academic statisticians generally treat as unworthy of consideration.

As I wrote in a later post, “Reform academic statistics”, we as a field appear satisfied with

Papers that report new methods with no usable software
Applications that focus on toy problems
Talks that skip the details of the scientific context of a problem
Data visualizations that are both ugly and ineffective

Discussions of Data Science generally recognize the full range of activities required for the analysis of data, and place greater value on such things as data visualization and software tools, which are obviously important but are not viewed that way by many statisticians.

And so I’ve come to embrace the term Data Science.

Data Science is also a much more straightforward and understandable label for what I do. I don’t think we should need a new term, and I think we should argue against misunderstandings of Statistics rather than slink off to a new “brand”. But in general, when I talk about Data Science, I feel I can better trust that folks will understand that I am talking about the broad set of activities required in good data analysis.

If people ask me what I do, I’ll continue to say that I’m a Statistician, even though I do tend to stumble over the word. But I am also a Data Scientist.

One last thing: I’ve also come to realize that computer science folks working in computational biology are really just like me. They have expertise in a somewhat different set of tools, but then that’s true for pretty much every statistician, too: they’re much like me but they have expertise in a somewhat different set of tools. And it’s nice to be able to say that we’re all data scientists.

It should be recognized, too, that academic computer science suffers from many of the same problems that academic statistics has suffered: an overemphasis on novelty, sophistication, and toy applications, and an under-appreciation for solving real problems, for data visualization, and for useful software tools.

Tags: data science, statistics
Posted in Statistics | 10 Comments »

Fitting linear mixed models for QTL mapping

24 Nov 2015

Linear mixed models (LMMs) have become widely used for dealing with population structure in human GWAS, and they’re becoming increasingly important for QTL mapping in model organisms, particularly for the analysis of advanced intercross lines (AIL), which often exhibit variation in the relationships among individuals.

In my efforts on R/qtl2, a reimplementation of R/qtl to better handle high-dimensional data and more complex cross designs, it was clear that I’d need to figure out LMMs. But while papers explaining the fit of LMMs seem quite explicit and clear, I’d never quite turned the corner to actually seeing how I’d implement it. In both reading papers and studying code (e.g., lme4), I’d be going along fine and then get completely lost part-way through.

But I now finally understand LMMs, or at least a particular, simple LMM, and I’ve been able to write an implementation: the R package lmmlite.

It seemed worthwhile to write down some of the details.

(more…)

Tags: mixed models, programming, QTL mapping, R
Posted in Genetics, Programming, R, Statistics | 8 Comments »

Reproducibility is hard

9 Sep 2015

Reproducibility is hard. It will probably always be hard, because it’s hard keeping things organized.

I recently had a paper accepted at G3, concerning a huge set of sample mix-ups in a large eQTL study. I’d discovered and worked out the issue back in December, 2010. I gave a talk about it at the Mouse Genetics meeting in Washington, DC, in June, 2011. But for reasons that I will leave unexplained, I didn’t write it up until much later. I did the bulk of the writing in October, 2012, but it wasn’t until February, 2014, that I posted a preprint at arXiv, which I then finally submitted to G3 in June this year.

In writing up the paper in late 2012, I re-did my entire analysis from scratch, to make the whole thing more cleanly reproducible. So with the paper now in press, I’ve placed all of that in a GitHub repository, but as it turned out, there was still a lot more to do. (You can tell, from the repository, that this is an old project, because there are a couple of Perl scripts in there. It’s been a long time since I switched from Perl to Python and Ruby. I still can’t commit to just one of Python or Ruby: I want to stick with Python, as everyone else is using it, but I much prefer Ruby.)

The basic issue is that the raw data is about 1 GB. The clean version of the data is another 1 GB. And then there are the results of various intermediate calculations, some of which are rather slow to calculate, which take up another 100 MB. I can’t reasonably put all of that within the GitHub repository.

Both the raw and clean data have been posted in the Mouse Phenome Database. (Thanks to Petr Simecek, Gary Churchill, Molly Bogue, and Elissa Chesler for that!) But the data are in a form that I thought suitable for others, and not quite in the form in which I actually used them.

So, I needed to write a script that would grab the data files from MPD and reorganize them in the way that I’d been using them.

In working on that, I discovered some mistakes in the data posted to MPD: there were a couple of bugs in my code to convert the data from the format I was using into the format I was going to post. (So it was good to spend the time on the script that did the reverse!)

In addition to the raw and clean data on MPD, I posted a zip file with the 110 MB of intermediate results on figshare.

In the end, I’m hoping that one can clone the GitHub repository and just run make and it will download the data and do all of the stuff. If you want to save some time, you could download the zip file from figshare and unzip that, and then run make.
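The “download it unless it’s already there” behavior can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the repository: `ensure_file`, `fake_fetch`, the file name, and the URL are all made up for the example, and the demo uses a stand-in fetcher so that it runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def ensure_file(path, url, fetch=urlretrieve):
    """Download url to path unless the file already exists; return True if fetched."""
    path = Path(path)
    if path.exists():
        return False  # already downloaded (or unzipped by hand), so nothing to do
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(path))
    return True


def fake_fetch(url, filename):
    """Stand-in for a real download, so the demo below runs without a network."""
    Path(filename).write_text(f"contents of {url}\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "RawData" / "expression.csv"  # hypothetical file name
    url = "https://example.org/expression.csv"         # placeholder, not a real data URL
    print(ensure_file(target, url, fetch=fake_fetch))  # first call fetches the file
    print(ensure_file(target, url, fetch=fake_fetch))  # second call finds it and skips
```

Leaving out the `fetch` argument falls back to `urllib.request.urlretrieve`, which does a real download. Skipping files that already exist is what lets someone unzip the figshare archive first and save the time.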

I’m not quite there, but I think I’m close.

Aspects I’m happy with

For the most part, my work on this project wasn’t terrible.

I wrote an R package, R/lineup, with the main analysis methods.
Re-deriving the entire analysis cleanly, in a separate, reproducible document (I used AsciiDoc and knitr), was a good thing.
The code for the figures and tables is all reasonably clean, and draws from either the original data files or from intermediate calculations produced by the AsciiDoc document.
I automated everything with GNU Make.

What should I have done differently?

There was a lot more after-the-fact work that I would rather not have to do.

Making a project reproducible is easier if the data aren’t that large and so can be bundled into the GitHub repository with all of the code.

With a larger data set, I guess the thing to do is to recognize, from the start, that the data are going to be sitting elsewhere. So then, I think you should organize the data in the form that you expect to make public, and work from those files.

When you write a script to convert data from one form to another, also write some tests, to make sure that it worked correctly.
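Here’s a minimal sketch of that sort of test, in Python. The two converters are hypothetical stand-ins (not the scripts in my repository), and the sample and gene names are made up: the “working” form is a dictionary keyed by sample ID, the “public” form is a list of rows, and the test checks that going to the public form and back recovers the original data.

```python
"""Round-trip test for a pair of data-format converters (illustrative only)."""


def to_public(working):
    """Convert {sample_id: {gene: value}} into a sorted list of (sample_id, gene, value) rows."""
    return sorted(
        (sample_id, gene, value)
        for sample_id, genes in working.items()
        for gene, value in genes.items()
    )


def from_public(rows):
    """Rebuild the {sample_id: {gene: value}} form from (sample_id, gene, value) rows."""
    working = {}
    for sample_id, gene, value in rows:
        working.setdefault(sample_id, {})[gene] = value
    return working


def check_round_trip(working):
    """Return True if converting to the public form and back loses nothing."""
    return from_public(to_public(working)) == working


# Made-up example data: two samples, two genes each.
example = {
    "Mouse3001": {"geneA": 1.5, "geneB": -0.2},
    "Mouse3002": {"geneA": 0.7, "geneB": 0.0},
}
assert check_round_trip(example)
print("round trip OK")
```

A check like this would likely have caught the sort of conversion bugs I described above, since those only surfaced once I wrote the script that did the reverse.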

And then document, document, document! As with software development, it’s hard to document data or analyses after the fact.

Tags: papers, reproducible research
Posted in Academics, Statistics | 4 Comments »

God-awful conference websites

5 Aug 2015

What do I want in a conference website? Not this.

I want to be able to browse sessions to find the ones I’m interested in. That means being able to see the session title and time as well as the speakers and talk titles. A super-long web page is perfectly fine.
If you can’t show me everything at once, at least let me click-to-expand: for the talk titles, and then for the abstracts. Otherwise I have to keep clicking and going back.
I want to be able to search for people. And if I’m searching for Hao Wu, I don’t want to look at all of the Wus. Or all of the Haos. I just want the Hao Wus. If I can’t search on "Hao Wu", at least let me search on "Wu, Hao".
If my search returns nothing and I go back, bring me back to the same search form. Don’t make me have to click “Search for people” again.
I’d like to be able to form a schedule of the sessions to attend. (JSM2015 does that okay, but it’s not what I’d call “secure” and you have to find the damned things, first.) Really, I want to pick particular talks: this one in that session and that one in the other. But yeah, that seems a bit much to ask.
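The people search I’m asking for isn’t hard. Here’s a rough sketch in Python, assuming a plain list of names (the attendee list is made up, and this is not how the JSM site works): split the query into whole-word tokens, ignore commas and case, and return only the people whose names contain every token.

```python
def normalize(text):
    """Turn 'Wu, Hao' or 'Hao Wu' into a set of lowercase name tokens."""
    return set(text.replace(",", " ").lower().split())


def search_people(query, people):
    """Return the people whose names contain every query token as a whole word."""
    wanted = normalize(query)
    return [person for person in people if wanted <= normalize(person)]


# Made-up attendee list, for illustration only.
people = ["Hao Wu", "Wei Wu", "Hao Chen", "Pat Wulfhorst"]

print(search_people("Hao Wu", people))   # only the Hao Wus
print(search_people("Wu, Hao", people))  # same result, surname first
print(search_people("Wu", people))       # all of the Wus, but no Wulfhorst
```

Matching on whole tokens rather than prefixes is the design choice that matters here: it is what keeps a search for one name from dragging in every longer name that starts the same way.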

The JSM 2015 site is so terrible for browsing, I was happy to get the pdf of the program. (Good luck finding it on the website on your own; ASA tweeted the link to me, due to my bitching and moaning.) You can browse the pdf. That’s the way I ended up finding the sessions I wanted to attend. It also had an ad for the JSM 2015 mobile app. Did you know there was one? Good luck finding a link to that on their website, either.

The pdf is useable, but much like the website, it fails to make use of the medium. I want:

Bookmarks. I want to jump to where Monday’s sessions start without having to flip through the whole thing.
Hyperlinks. If you don’t include the abstracts, with links from the talk titles to the abstracts, at least include links to the web page that has the abstract so I don’t have to search on the web.
More hyperlinks. The pdf has an index, with people and page numbers. Why not link those page numbers to the corresponding page?

I helped organize a small meeting in 2013. The program on the web and the corresponding pdf illustrate much of what I want. (No scheduling feature, but that meeting had no simultaneous sessions.) I included gratuitous network graphs of the authors and abstracts. It’s 2015. No conference site is truly complete without interactive network graphs.

Update

As Thomas Lumley commented below, if you search on “Wu” you get all of the “Wu”s but also there’s one “Wulfhorst”. And if you search on “Hao” you get only people whose last name is “Hao”.

He further pointed out that if you search for the affiliation “Auckland” the results don’t include “University of Auckland” but only “Auckland University of Technology”. And actually, if you search for “University of Auckland” you get nothing. You need to search for “The University of Auckland”.

Tags: conference, stupid
Posted in Statistics, Things that annoy me | 2 Comments »

Memories of past JSMs

3 Aug 2015

The Joint Statistical Meetings (JSM) are the big statistics meetings in North America, “joint” among the American Statistical Association, Institute of Mathematical Statistics, International Biometric Society (ENAR and WNAR), Statistical Society of Canada, and others.

JSM 2015 is next week, in Seattle. In anticipation, I thought I’d write down some of my main memories of past JSMs.
(more…)

Tags: conference, statistics
Posted in Statistics | Comments Off on Memories of past JSMs

Initial steps towards reproducible research

4 Dec 2014

In anticipation of next week’s Reproducible Science Hackathon at NESCent, I was thinking about Christie Bahlai’s post on “Baby steps for the open-curious.”

Moving from Ye Olde Standard Computational Science Practice to a fully reproducible workflow seems a monumental task, but partially reproducible is better than not-at-all reproducible, and it’d be good to give people some advice on how to get started – to encourage them to get started.

So, I spent some time today writing another of my minimal tutorials, on initial steps towards reproducible research.

It’s a bit rough, and it could really use some examples, but it helped me to get my thoughts together for the Hackathon and hopefully will be useful to people (and something to build upon).

Tags: data analysis, data science, R, reproducible research
Posted in Academics, R, Statistics | 1 Comment »

Car crash stats revisited: My measurement errors

3 Nov 2014

Last week, I created revised versions of graphs of car crash statistics by state (including an interactive version), from a post by Mona Chalabi at 538.

Since I was working on those at the last minute in the middle of the night, to be included as an example in a lecture on creating effective figures and tables, I just read the data off printed versions of the bar charts, using a ruler.

I later emailed Mona Chalabi, and she and Andrew Flowers quickly posted the data to github.com/fivethirtyeight/data. (That repository has a lot of interesting data, and if you see data at 538 that you’re interested in, just ask them!)

I was curious to look at how I’d done with my measurements and data entry. Here’s a plot of my percent errors:

Not too bad, really. Here are the biggest problems:

Mississippi, non-distracted: off by 6%, but that corresponded to 0.5 mm.
Rhode Island and Ohio, speeding: off by 40 and 35%, respectively. I’d written down 8 and 9 mm rather than 13 and 14 mm.
Maine and Indiana, alcohol: wrote 15.5 and 14.5 mm, but typed 13.5 and 13 mm. In the former, I think I just misinterpreted my writing; in the latter, I think I wrote the number for the state below (Iowa).
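For the record, here’s the arithmetic behind those last two items, as a small Python sketch. It assumes the percent error is the recorded value minus the correct one, divided by the correct one, and it works directly from the millimeter readings quoted above rather than from the posted data, so the numbers needn’t match the plot exactly.

```python
def percent_error(recorded, correct):
    """Signed percent error of a recorded value relative to the correct one."""
    return 100.0 * (recorded - correct) / correct


# (state, category, recorded mm, correct mm), taken from the list above
readings = [
    ("Rhode Island", "speeding", 8.0, 13.0),
    ("Ohio", "speeding", 9.0, 14.0),
    ("Maine", "alcohol", 13.5, 15.5),
    ("Indiana", "alcohol", 13.0, 14.5),
]

for state, category, recorded, correct in readings:
    err = percent_error(recorded, correct)
    print(f"{state:>12} {category:<9} {err:6.1f}%")
```

On these readings, the speeding mis-measurements come out at roughly -38% and -36%, and the alcohol typos at roughly -13% and -10%.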

It’s also interesting to note that my “total” and “non-distracted” were almost entirely under-estimates: probably an error in the measurement of the overall width of the bar chart.

Also note: @brycem had recommended using WebPlotDigitizer for digitizing data from images.

Tags: exploratory data analysis, graphics
Posted in Statistics | 1 Comment »

Interactive plot of car crash stats

30 Oct 2014

I spent the afternoon making a D3-based interactive version of the graphs of car crash statistics by state that I’d discussed yesterday: my attempt to improve on the graphs in Mona Chalabi’s post at 538.

See it in action here.

Code on github.

Tags: D3, exploratory data analysis, graphics
Posted in Statistics | 3 Comments »
