## More on Chutes & Ladders

20 May 2013

Matt Maenner asked about the sawtooth pattern in the figure in my last post on Chutes & Ladders.

Damn you, Matt! I thought I was done with this. Don’t feed my obsession.

My response was that if the game ends early, it’s even more likely that it’ll be the kid who went first who won. But, my intuition was wrong: exactly the opposite is true. It is the advantage to the first player that causes the sawtooth pattern, but that advantage increases with the number of rounds rather than decreases.

### Numerical results

While it’s fast and easy to study the Chutes & Ladders game by simulation, if you want to answer questions more precisely, it’s best to switch to more exact results.

Consider a single individual playing the game, and let $X_n$ be his/her location at round $n$. The $X_n$ form a Markov chain, in that the future ($X_{n+1}$), given the present ($X_n$), is conditionally independent of the history ($X_1, \dots, X_{n-1}$).

It’s relatively easy to construct the transition matrix of the chain. (See my R code.) This is a matrix $P$, with $P_{ij} = \Pr(X_{n+1} = j \mid X_n = i)$.

Then the probability that a player has reached state 100 by round $n$ is $(1, 0, \dots, 0)\, P^n\, (0, \dots, 0, 1)'$. That’s the cumulative distribution function (cdf) of the number of rounds for a single player to finish the game. Call this $q_n$. You can get the probability distribution by differences, say $p_n = q_n - q_{n-1}$.

To calculate the number of rounds to complete a game with $k$ players, you want the minimum of $k$ independent draws from this distribution. The probability that a game with $k$ players is complete by round $n$ is $1 - (1-q_n)^k$. And again you can get the probability distributions by differences. Here’s a picture.
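The calculation above can be sketched in a few lines of R. This is not the R code linked from the post, just a minimal illustration under stated assumptions: the chute and ladder positions are the standard Milton Bradley layout, the spinner gives 1 to 6 with equal probability, and a spin that would overshoot 100 leaves the player where they are. The names `from`, `to`, `max_rounds`, and `done_by` are made up for the example.

```r
# Chute and ladder jumps (assumed: standard Milton Bradley board)
from <- c(1, 4, 9, 21, 28, 36, 51, 71, 80,            # ladder bottoms
          16, 47, 49, 56, 62, 64, 87, 93, 95, 98)     # chute tops
to   <- c(38, 14, 31, 42, 84, 44, 67, 91, 100,
          6, 26, 11, 53, 19, 60, 24, 73, 75, 78)

# States 0..100 are stored in rows/columns 1..101 (state 0 = off the board)
P <- matrix(0, nrow = 101, ncol = 101)
for (pos in 0:99) {
  for (spin in 1:6) {
    dest <- pos + spin
    if (dest > 100) dest <- pos          # overshoot: stay put (assumed rule)
    hit <- match(dest, from)
    if (!is.na(hit)) dest <- to[hit]     # follow the chute or ladder
    P[pos + 1, dest + 1] <- P[pos + 1, dest + 1] + 1/6
  }
}
P[101, 101] <- 1                         # square 100 is absorbing

# q[n] = Pr(a single player has reached 100 by round n)
max_rounds <- 300
q <- numeric(max_rounds)
state <- c(1, rep(0, 100))               # start in state 0
for (n in seq_len(max_rounds)) {
  state <- as.vector(state %*% P)        # one more round
  q[n] <- state[101]
}
p <- diff(c(0, q))                       # p[n] = q[n] - q[n-1]

# cdf of the number of rounds for a game with k players
done_by <- function(k) 1 - (1 - q)^k
```

Propagating the state vector one round at a time avoids computing matrix powers explicitly; `diff(c(0, done_by(k)))` then gives the distribution of the number of rounds for a `k`-player game.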

### Advantage to the first player

Now, regarding the advantage to the first player: note that the first player wins in exactly $n$ steps if he gets to the finish at $n$ steps and none of the other players are done by $n-1$ steps. So, with $k$ players, the probability that the first player wins in exactly $n$ steps is $p_n (1-q_{n-1})^{k-1}$.

The chance that the second player wins in exactly $n$ steps is $(1-q_n)\, p_n\, (1-q_{n-1})^{k-2}$, with the last term included only if there are $k > 2$ players.

From this idea, it’s straightforward to calculate the probability that the first player wins given that the game is complete at round n. Here’s a plot of that probability as a function of the number of players, relative to the nominal probability (1/2, 1/3, 1/4).

Note that n=7 is the minimum number of rounds to complete the game. I’d thought that the first player’s advantage went down over time, but the opposite is true.
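Here is a rough sketch of that calculation, generalizing the two formulas above: player $i$ wins in exactly $n$ rounds if the $i-1$ players ahead of them are not done by round $n$, they finish at round $n$, and the $k-i$ players behind them are not done by round $n-1$. The function name `win_exact` and the short cdf `q_toy` are made up for illustration; in practice you would pass in the single-player cdf computed from the transition matrix.

```r
# Pr(player i wins in exactly n rounds), as a k x length(q) matrix.
# q is the single-player cdf of the number of rounds to finish.
win_exact <- function(q, k) {
  q_prev <- c(0, head(q, -1))            # q[n-1], with q[0] = 0
  p <- q - q_prev                        # Pr(finish at exactly round n)
  t(sapply(seq_len(k), function(i)
    (1 - q)^(i - 1) * p * (1 - q_prev)^(k - i)))
}

# Hypothetical cdf, for illustration only (not the Chutes & Ladders one)
q_toy <- c(0.1, 0.3, 0.6, 0.85, 1)
w <- win_exact(q_toy, k = 3)

# Pr(first player wins | game ends at round n), relative to nominal 1/k
rel_adv <- (w[1, ] / colSums(w)) / (1 / 3)
```

The column sums of `w` give the probability that the game ends at exactly round `n`, so dividing the first row by them gives the conditional probability described above.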

### No. spins to end the game

Combining these two results (on the number of rounds to complete the game and the probability that player i will win in n rounds), we can get a more precise version of the simulation-based figure in my last post:

As you can see, the sawtooth pattern becomes more pronounced with the number of rounds, but then it gets lost in the downward slope of the distribution on the right side. (Again, see my R code.)

## Chutes & ladders: How long is this going to take?

17 May 2013

I was playing Chutes & Ladders with my four-year-old daughter yesterday, and I thought, “How long is this going to take?”

I saw an interesting mathematical analysis of the game a few years ago, but it seems to be offline, though you can read it via the Wayback Machine.

But that didn’t answer my specific question, namely, “How long is this going to take?”

So I wrote a bit of R code to simulate the game.

Here’s the distribution of the number of spins to complete the game, by number of players:

With two players, the average number of spins is 52, with a 90th percentile of 88.

If you add a third player, the average increases to 65, and the 90th percentile increases to 103. You’re playing fewer rounds, but each round is three times as long. If you add a fourth player, the average is 76 and the 90th percentile is 117.
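A simulation along these lines might look like the sketch below. It is not the code that produced the numbers above. It assumes the standard Milton Bradley chute and ladder positions, a fair spinner with values 1 to 6, and the rule that a spin overshooting 100 is wasted. The names `move` and `sim_game` are made up for the example.

```r
# Chute and ladder jumps (assumed: standard Milton Bradley board)
from <- c(1, 4, 9, 21, 28, 36, 51, 71, 80,
          16, 47, 49, 56, 62, 64, 87, 93, 95, 98)
to   <- c(38, 14, 31, 42, 84, 44, 67, 91, 100,
          6, 26, 11, 53, 19, 60, 24, 73, 75, 78)

# Position after one spin from square pos
move <- function(pos, spin) {
  dest <- pos + spin
  if (dest > 100) return(pos)            # overshoot: stay put (assumed rule)
  hit <- match(dest, from)
  if (is.na(hit)) dest else to[hit]
}

# Play one game with k players; return total spins and the winner's index
sim_game <- function(k) {
  pos <- rep(0, k)
  spins <- 0
  repeat {
    for (i in seq_len(k)) {
      spins <- spins + 1
      pos[i] <- move(pos[i], sample(6, 1))
      if (pos[i] == 100) return(c(spins = spins, winner = i))
    }
  }
}

set.seed(20130517)
res <- t(replicate(2000, sim_game(2)))   # columns: spins, winner

mean(res[, "spins"])                     # average number of spins
quantile(res[, "spins"], 0.9)            # 90th percentile
mean(res[, "winner"] == 1)               # how often the first player wins
```

Changing the argument to `sim_game()` gives the three- and four-player cases; more replicates tighten the estimates at the cost of run time.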

So, in trying to minimize the agony, it seems best to not encourage my eight-year-old son to join us in the game. If he plays with us, there’s a 63% chance that it will take longer.

And that’s particularly true because then the chance of my daughter winning drops from about 1/2 to about 1/3.

That raises another question: if I let her go first, what advantage does that give her? Not much. The chance that the person who goes first will win is 50.9%, 34.4%, and 25.9%, respectively, when there are 2, 3, and 4 players. So not a noticeable amount. Thus I cheat (on her behalf). Really, though, I’m cheating in order to shorten the game as much as to ensure that she wins.

Note: There’s a close connection between this problem and my work on the multiple-strain recombinant inbred lines. (See this and that.) I’m tempted to play around with it some more.

## Stack Exchange: Why I dropped out

13 May 2013

Stack Exchange is a series of question-and-answer sites, including Stack Overflow for programming and Cross Validated for statistics. I was introduced to these sites at a short talk by Barry Rowlingson at the 2011 UseR! meeting, “Why R-help must die!”

These sites have a lot of advantages over R-help: The format is easier to read, math and code can be nicely formatted, the questions are tagged, search is easier, and there should be less redundancy.

• It’s good to help people.
• It’s fun to rack up reputation points for helping people.
• It’s good exercise, in both thinking about statistical questions and in articulating useful answers (and there are some interesting questions).

### So I gave up

I started spending time on Stack Overflow and Cross Validated soon after returning from UseR! 2011, but I lost my patience and quit within three months.

One needs to treat each question with respect, and I eventually seemed to lose my ability to sustain such goodwill. I think I take things too personally.

### Update

I should clarify: I do continue to use Stack Exchange, mostly through Google. Many problems I run into have already been answered. I just don’t have the right temperament to participate regularly in answering others’ questions.

## Tutorials on git/github and GNU make

10 May 2013

If you’re not using version control, you should be. Learn git.

If you’re not on github, you should be. That’s real open source.

To help some colleagues get started with git and github, I wrote a minimal tutorial. There are lots of git and github resources available, but I thought I’d give just the bare minimum to get started; after using git and github for a while, other resources make a lot more sense and seem much more worthwhile.

And for R folks, note that it’s easy to install R packages that are hosted on github, using Hadley Wickham’s devtools package. For example, to install Nacho Caballero’s clickme package:

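```r
install.packages("devtools")
library(devtools)
# Current devtools expects "username/repo";
# the 2013 form was install_github("clickme", "nachocab")
install_github("nachocab/clickme")
```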

Having written that git/github tutorial, I thought: I should write more such!

So I immediately wrote a similar short tutorial on GNU make, which I think is the most important tool for reproducible research.

## “My” chromosome 8p inversion

8 May 2013

There was lots of discussion on twitter yesterday about Graham Coop’s paper with Peter Ralph (or vice versa), on The geography of recent genetic ancestry across Europe, particularly regarding the FAQ they’d created.

I was eager to take a look, and, it’s slightly embarrassing to say, I first did a search to see if they’d made a connection to any of my work. (I’m probably not the only one to do that.) Sure enough, they cited a paper of mine, but it was Giglio et al. (2001) Am J Hum Genet 68: 874–883, on “my” chr 8p inversion, and not what I’d expected, my autozygosity paper.

What did the chr 8p inversion have to do with this? Search for “[36]” and you’ll find:

We find that the local density of IBD blocks of all lengths is relatively constant across the genome, but in certain regions the length distribution is systematically perturbed (see Figure S1), including around certain centromeres and the large inversion on chromosome 8 [36], also seen by [35].

The chr 8p inversion presents an interesting data analysis story from my postdoc years. In a nutshell: I was studying human crossover interference and found poor model fit for maternal chr 8, due to tight apparent triple-crossovers in two individuals in each of two families. I hypothesized that there was an inversion in the region, though it would have to be long, with both orientations being common. The inversion was confirmed via FISH; it’s something like 5 Mbp long, with the frequencies of the two orientations being 40 and 60% in people of European ancestry.

## \$18 for a two page PDF? I still don’t get it.

2 May 2013

Yesterday, I saw this tweet by @Ananyo

Time that biologists stopped telling the public oversimplistic fairy tales on Darwinian evolution, says P Ball (\$) nature.com/nature/journal…

So I clicked the link to the Nature paper and realized, “Oh, yeah. I’ve got to enter through the UW library website.”

But then I thought, “Wait…\$18 for a two-page Nature comment? WTF?”

So I tweeted:

DNA: Celebrate the unknowns, like this Nature comment, which costs \$18. nature.com/nature/journal…

And thinking about it some more, I got more annoyed, and tweeted:

Why do publishers charge such high per-article fees? At \$18/artcl, you’d have to be desperate or stupid to pay; at \$1-2, prob’ly lots would.

And then I thought, I’ll ask Nature directly:

@NatureMagazine Why is the per-article charge so high? It seems like you’d make more profit at \$2/article.

And they responded:

@kwbroman For a while now, individual papers can be rented through @readcube for \$3-5. A full tablet subscription to Nature costs \$35.

.@NatureMagazine So is the \$18 charge for a 2 pg PDF just to discourage piracy?

I thought a lot about whether to put “piracy” in quotes or not, or whether to write “copyright infringement” instead.

But anyway, they responded:

@kwbroman just as with any product, the more you buy, the more you save. Media/publishing subscriptions have worked this way for decades.

That again didn’t quite answer my question.

### It’s a scam

I still don’t understand the \$18 business. It’s not “The more you buy, the more you save.” It’s, “Buy the whole season for \$35, or buy 5 min from Episode 1 for \$18.”

I understand that the cover price of Wired is \$5 per issue, while I could get a year’s subscription for \$15-20. But that’s not the same as \$18 for one article vs \$200 per year.

The \$18 for a two-page PDF is like 900 numbers and paycheck advances. These are scams taking advantage of desperate or stupid people.

If they don’t want to sell the PDFs for individual articles for a reasonable price, they should just not sell them at all.

## Methods before results

29 Apr 2013

It’s great that, in a step towards improved reproducibility, the Nature journals are removing page limits on Methods sections:

To allow authors to describe their experimental designs and methods in enough detail for others to interpret and replicate them, the participating journals are removing length restrictions on Methods sections.

But couldn’t they include the Methods section in the pdf for the article? For example, consider this article in Nature Genetics; the Methods section is only available in the html version of the paper. The PDF says:

Methods and any associated references are available in the online version of the paper.

Methods are important.

• They shouldn’t be separated from the main text.
• They shouldn’t be placed after the results (as so many journals, including PLoS, do).
• They shouldn’t be in a smaller font than the main text (as PNAS does).
• They certainly shouldn’t be endnotes (as Science used to do).

### Supplements annoy me too

I love supplemental material: authors can give the full details, and they can provide as many supplemental figures and tables as they want.

But supplements can be a real pain.

• I don’t want to have to click on 10 different links. Put it all in one document.
• I don’t want to have to open Word. Put text and figures in a PDF.
• I don’t want to have to open Excel. Put data in a plain text file, preferably as part of a git repository with related code.

At least supplements are now included at the journal sites!

This paper in Bioinformatics refers to a separate site for supplemental information:

Expression data and supplementary information are available at

But `rii.com` doesn’t exist anymore. I was able to find the supplement using the Wayback Machine, but

• The link in the paper was wrong: It should be `.html` not `.htm`
• The final version on Wayback has a corrupted PDF, though one can go back to previous versions that are okay.

### I like Genetics and G3

Genetics and G3 put the Methods where they belong (before the results), and when you download the PDF for an article in Genetics, it includes the supplement. For a G3 article, the supplement isn’t included in the article PDF, but at least you can get the whole supplement as a single PDF.

For example, consider my recent Genetics articles:

If you click on “Full Text (PDF),” you get the article plus the 3 supplemental figures and 23 supplemental tables in the former case, and article plus the 17 supplemental figures and 2 supplemental tables in the latter case.

## Use meaningful URLs

10 Apr 2013

QR codes are stupid. See the well-known flowchart.

And I don’t like Drupal. Sites that use it give things URLs like `http://www.genetics.wisc.edu/node/577` for their seminar list.

And can we get rid of the `www`?

“double-u double-u double-u …”

“Zzz…”

URLs should be meaningful and short. I like deep hierarchies of folders, but it makes for long URLs.

URL-shorteners help, but you don’t really want to read out (or type) one of those short URLs. And they tell you nothing about where they’re going.

What you want is something like `bcaffo.com` or `stodden.net`. Or `rqtl.org`.

But…I guess you could just say “I’ll send you an email.”

### And customize `<title>`

And while I have your attention, note that the title of your web page shows up on Google (and at the top of the browser).

It’s nice to see others make use of my html code, but you shouldn’t leave my name in the title of your publication page.

Put the important words first (not like the title for my “official page”), and perhaps nothing else. For example, the title shouldn’t include “Drupal”.

Update: Read this: “URLs are for People, not Computers”

I could have just given the URL:
`http://www.not-implemented.com/urls-are-for-people-not-computers`

## Knuth: Journal referees should assist authors

8 Apr 2013

When serving as referee for a journal, who are you working for?

• The editor: Will the paper add to the journal’s prestige?
• The author: How can it be improved?

I’d long thought that the referee’s duty was to the journal editors and then to the readers.

But Donald Knuth’s comments on refereeing persuaded me that I should focus primarily on helping the author to improve the manuscript.

See pages 31-35 (as numbered; actually 33-37 in the pdf) in his notes on mathematical writing. And here’s the missing page on “Hints for referees”.

Even a terrible manuscript can be published, if the author is sufficiently persistent. Your primary job as referee should be to help the author to make it as good as it can be.

Almost immediately after I first read Donald Knuth’s comments (back in 2002), I received one of the worst manuscripts I’ve ever read. It was one of those cases where I really wish the authors were anonymous, because I can’t forget who was responsible for it.

It was hard for me to say, “You have no idea what you’re doing” in a constructive way. (“You should abandon this manuscript” is not constructive, but it could be good advice. The scientific literature could use a bit more self-censorship.)

And I’ve learned to use the “Comments to the editor” as my opportunity to vent. (I would pity the poor editor on the other end, but she/he sent the thing to me!) I’d give an example of my venting, but I think I’ll leave that to another time.

## Data science is statistics

5 Apr 2013

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly.

You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.