Archive for May, 2014

hipsteR: re-educating people who learned R before it was cool

15 May 2014

This morning, I started a tutorial for folks whose knowledge of R is (like mine) stuck in 2001.

Yesterday I started reading the Rcpp book, and on page 4 there’s an example using the R function replicate, which (a) I’d never heard before, and (b) is super useful.

I mean, I often write code like this, for a bootstrap:

x <- rnorm(2500)
sapply(1:1000, function(a) quantile(sample(x, replace=TRUE), c(0.025, 0.975)))

But I could just be writing

x <- rnorm(100)
replicate(1000, quantile(sample(x, replace=TRUE), c(0.025, 0.975)))

“Oh, replicate must be some new function.” Yeah, new in R version 1.8, in 2003!

I’m in serious need of some re-education (e.g., I should be using more of Hadley Wickham’s packages). Hence the beginnings of a tutorial.


Note: the title was suggested by Thomas Lumley. No connection to @hspter; it’s not really so hip. I probably should have written “geeksteR.”

Advertisements

Further points on crayon colors

9 May 2014

I saw this great post on crayola crayon colors at the Learning R blog, reproducing a nice graph of the Crayola crayon colors over time. (Also see this even nicer version.)

The Learning R post shows how to grab the crayon colors from the wikipedia page, “List of Crayola crayon colors,” directly in R. Here’s the code (after some slight modifications due to changes in the page since 2010):

library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors"
crayontable <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]
crayons <- crayontable[,grep("Hex", colnames(crayontable))]
names(crayons) <- crayontable[,"Color"]

Comparing these to what I’d grabbed, I noted one small discrepancy on the Wikipedia page: Yellow Orange was listed as "#FFAE42" but the background color for the Yellow Orange cell in the table was "#FFB653".

So I created a Wikipedia account and edited the Wikipedia page.

(Then I realized that I’d made a mistake in my edit, undid my change, thought the whole thing through again, and edited the page again.)

The Learning R post also showed a different way to sort the colors: convert to HSV, and then sort by the H then S then V. So I edited my plot_crayons() function again, to create the following picture:

Crayon colors, again

Two more points about crayon colors

8 May 2014

If you want to use crayon colors in R but you don’t want to rely on my R/broman package, you can just grab the code. Copy the relevant lines from the R/brocolors.R file:

crayons = c("Almond"="#efdecd",
            "Antique Brass"="#cd9575",
            "Apricot"="#fdd9b5",
            ...
            "Yellow Green"="#c5e384",
            "Yellow Orange"="#ffb653")

I spent a bit of time thinking about how best to sort the colors in a meaningful way, for the plot_crayons() function. But then decided to stop thinking and just do something brainless: measure distance between colors by RMS difference of the RGB values, and then use hierarchical clustering. Here’s the code from plot_crayons():

# get rgb 
colval <- t(col2rgb(crayons))

# hclust to order the colors
ord <- hclust(dist(colval))$order

It’s not perfect, but I think it worked remarkably well:

Crayon colors

Crayon colors in R

7 May 2014

Last night I was working on a talk on creating effective graphs. Mostly, I needed to update the colors, as there’d been some gaudy ones in its previous form (e.g., slide 22).

I usually pick colors using the crayons in the Mac Color Picker. But that has just 40 crayons, and I wanted more choices.

That led me to the list of Crayola crayon colors on wikipedia. I wrote a ruby script to grab the color names and codes and added them to my R/broman package.

Use brocolors("crayons") to get the list of colors. For example, to get “Tickle Me Pink,” use

library(broman)
pink <- brocolors("crayons")["Tickle Me Pink"]

Use plot_crayons() to get the following summary plot of the colors:

Crayon colors

You can install the R/broman package using install_github in devtools, (specifically, install_github("kbroman/broman")) or wait a day or two and the version with this code will be on CRAN.

Update: See also Two more points about crayon colors.

Reform academic statistics

1 May 2014

Terry Speed recently gave a talk on the role of statisticians in “Big Data” initiatives (see the video or just look at the slides). He points to the history of statisticians’ discussions of massive data sets (e.g., the Proceedings of a 1998 NRC workshop on Massive data sets) and how this history is being ignored in the current Big Data hype, and that statisticians, generally, are being ignored.

I was thinking of writing a polemic on the need for reform of academic statistics and biostatistics, but in reading back over Simply Statistics posts, I’ve decided that Rafael Irizarry and Jeff Leek have already said what I wanted to say, and so I think I’ll just summarize their points.

Following the RSS Future of the Statistical Sciences Workshop, Rafael was quite optimistic about the prospects for academic statistics, as he noted considerable consensus on the following points:

  • We need to engage in real present-day problems
  • Computing should be a big part of our PhD curriculum
  • We need to deliver solutions
  • We need to improve our communication skills

Jeff said, “Data science only poses a threat to (bio)statistics if we don’t adapt,” and made the following series of proposals:

  • Remove some theoretical requirements and add computing requirements to statistics curricula.
  • Focus on statistical writing, presentation, and communication as a main part of the curriculum.
  • Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.
  • Add a unit on translating scientific problems to statistical problems.
  • Add a unit on data munging and getting data from databases.
  • Integrating real and live data analyses into our curricula.
  • Make all our students create an R package (a data product) before they graduate.
  • Most important of all have a “big tent” attitude about what constitutes statistics.

I agree strongly with what they’ve written. To make it happen, we ultimately need to reform our values.

Currently, we (as a field) appear satisfied with

  • Papers that report new methods with no usable software
  • Applications that focus on toy problems
  • Talks that skip the details of the scientific context of a problem
  • Data visualizations that are both ugly and ineffective

Further, we tend to get more excited about the fanciness of a method than its usefulness.

We should value

  • Usefulness above fanciness
  • Tool building (e.g., usable software)
  • Data visualization
  • In-depth knowledge of the scientific context of a problem

In evaluating (bio)statistics faculty, we should consider not just the number of JASA or Biometrics papers they’ve published, but also whether they’ve made themselves useful, and to the scientific community and well as to other statisticians.