If I could do it over again, I’d self-publish

12 Aug 2014

In 2009, Śaunak Sen and I wrote a book about QTL mapping and the R/qtl software. We started working on it in the fall of 2006, and it was a heck of a lot of work.

We’d talked to several publishers, and ended up publishing with Springer. John Kimmel was the editor we worked with; I like John, and I felt that Springer (or John) did a good job of keeping prices reasonable. We were able to publish in full color with a list price of $99, so that on Amazon it was about $65. (In April, 2013, there was a brief period where it was just $42 at Amazon!)

Springer did arrange several rounds of reviews; they typically pay reviewers $100 or a few books. But the copy editing was terrible (at the very least, you want a copy editor to actually read the book, and it was pretty clear that ours hadn’t), and the typesetting and the construction of the index were left to us, the authors.

It feels nice to have written a proper book, but I don’t think it makes that big of a difference, for me or for readers.

And John Kimmel has since left Springer to go to Chapman & Hall/CRC, and Springer has raised the price of our book to $169, so it’s now selling for $130 at Amazon. I think that’s obnoxious. It’s not like they’ve gone back and printed extra copies, so it’s hard to see how their costs could have gone up. But in the publishing agreement we signed, we gave Springer full rights to set the price of the book.

I have a hard time recommending the book at that price; I’m tempted to help people find pirated PDFs online. (And seriously, if you can’t find a pirated copy, you should work on your internet skills.)

I corresponded with an editor at Springer about why our book has become so expensive and whether there’s anything we can do about it. They responded:

  • If we do a new edition, it could be listed as $129.
  • If the book is adopted by university classes, “the pricing grid it is based on would have lower prices.”
  • Our book is available electronically, for purchase by chapter as well.

Purchase by chapter? Yeah, for $30 per chapter!

Springer has published books and allowed the authors to post a PDF, but only for really big sellers, and ours is definitely not in that category.

I’m both disgusted and embarrassed by this situation. If I could do it all over again, I’d self-publish: post everything on the web, and arrange some way for folks to have it printed cheaply.

Testing an R package’s interactive graphs

1 Aug 2014

I’ve been working on an R package, R/qtlcharts, with D3-based interactive graphs for quantitative trait locus mapping experiments.

Testing the interactive charts it produces is a bit of a pain. It seems like I pretty much have to just open a series of examples in a web browser and tab through them manually, checking that they look okay, that the interactions seem to work, and that they’re not giving any sort of errors.

But if I want to post the package to CRAN, it seems (from the CRAN policy) that the examples in the .Rd files shouldn’t be opening a web browser. Thus, I need to surround the example code with \dontrun{}.

But I had been using those examples, via R CMD check, to open the full series of examples for manual checking.

So, what I’ve decided to do:

  • Include examples that open a browser, but within \dontrun{} so that no browser is opened during R CMD check.
  • Also include examples that don’t open the browser, within \dontshow{}, so that R CMD check will at least check the basics. (A sketch of such an \examples section appears after this list.)
  • Write a ruby script that pulls out all of the examples from the .Rd files, stripping off the \dontrun{} and \dontshow{} and pasting it all into a .R file.
  • Periodically run R CMD BATCH on that set of examples, to do the manual checking of the interactive graphs.
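For concreteness, here’s a rough sketch of what such an \examples section in an .Rd file might look like. The function name and its argument are made up for illustration; the real examples in R/qtlcharts differ.

\examples{
\dontshow{
# quick, non-interactive check that R CMD check will run
# (hypothetical function and argument, for illustration only)
iplot_example(open_browser=FALSE)
}
\dontrun{
# full interactive example: opens a web browser for manual inspection
iplot_example()
}
}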

This will always be a bit of a pain, but with this approach I can do my manual testing in a straightforward way and still fulfill the CRAN policies.

Update: Hadley Wickham pointed me to \donttest{}, added in R version 2.7 (in 2008). (More value from blog + twitter!)

So I replaced my \dontrun{} bits with \donttest{}. And I can use devtools::run_examples() to run all of the examples, for my manual checks.
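For example (a sketch; check your version of devtools for the arguments that control how \donttest{} and \dontrun{} blocks are treated):

# run all of the package's examples from its source directory
library(devtools)
run_examples()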

UseR 2014, days 3-4

21 Jul 2014

Three weeks ago, I’d commented on the first two days of the UseR 2014 conference. I’m finally back to talk about the second half.

Dirk Eddelbuettel on Rcpp

Dirk Eddelbuettel gave a keynote on Rcpp [slides]. The goal of Rcpp is to have “the speed of C++ with the ease and clarity of R.” He gave a series of examples that left me (who still uses .C() to access C code) thinking, “Holy crap this is so much easier than what I do!”

Take a look at the Rcpp Gallery, and the slides from Dirk’s Rcpp tutorial.
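To give a flavor of it, here’s a minimal sketch of my own (not one of Dirk’s examples); it requires the Rcpp package and a working C++ compiler:

library(Rcpp)

# compile a small C++ function and make it callable from R
cppFunction('
double sumC(NumericVector x) {
    double total = 0;
    for(int i = 0; i < x.size(); i++)
        total += x[i];
    return total;
}')

sumC(c(1, 2, 3, 4))   # 10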

Dirk ended with a detailed discussion of Docker: a system for virtual machines as portable containers. I didn’t fully appreciate this part, but according to Dirk, Docker “changes how we build and test R….It’s like pushing to GitHub.”

Sponsors Talk

After Dirk’s talk was the Sponsor’s Talk. But if I’m going to skip a session (and I strongly recommend that you skip at least some sessions at any conference), anything called “Sponsor’s Talk” is going to be high on my list to skip.

Lunch at Venice Beach

Karthik Ram and I met up with Hilary Parker and Sandy Griffith for lunch at Venice Beach.

It took us a bit longer to get back than we’d anticipated. But I did get a chance to meet up with Leonid Kruglyak at his office at UCLA.

R and reproducibility

David Smith from Revolution Analytics and JJ Allaire from RStudio each spoke about package management systems to enhance reproducibility with R.

For your R-based project to be reproducible, the many packages that you’ve used need to be available. And future versions of those packages may not work the same way, so ideally you should keep copies of the particular versions that you used.

David Smith spoke about the R reproducibility toolkit (RRT). The focus was more on business analytics, and the need to maintain a group of versioned packages that are known to work together. CRAN runs checks on all packages so that they’re all known to work together. As I understand it, RRT manages snapshots of sets of packages from CRAN.

JJ Allaire spoke about RStudio’s solution: packrat. This seems more suitable for scientific work. It basically creates private sets of packages, specific to a project.

I’ve not thought much about this issue. packrat seems the best fit for my sort of work. I should start using it.
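For reference, basic packrat usage looks roughly like the following (a sketch based on the package’s documented interface; the project path is made up):

install.packages("packrat")

packrat::init("~/Projects/myproject")  # give the project its own private package library
# ...install and use packages within the project as usual...
packrat::snapshot()                    # record the exact package versions the project uses
packrat::restore()                     # later, or on another machine, reinstall those versions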

Poster session

The second poster session was in a different location with more space. It was still a bit cramped, being in a hallway, but it was way better than the first day. There were a number of interesting posters, including Hilary’s on testdat, for testing CSV files; Sandy’s on using Shiny apps for teaching; and Mine Çetinkaya-Rundel and Andrew Bray’s poster on “Teaching data analysis in R through the lens of reproducibility” [pdf].

Met more folks

The main purpose of conferences is to meet people. I was glad to be able to chat with Dirk Eddelbuettel, Ramnath Vaidyanathan, and also Tim Triche. Also karaoke with Sandy, Karthik, Hilary, Rasmus, and Romain.

Wish I’d seen

I had a bit of a late night on Wednesday night, and then I was in a hurry to get down (via public transit!) to the Science Center to meet up with my family. So I’m sorry that I didn’t get to see Yihui Xie’s talk on Knitr Ninja.

Looking back through the program, there are a number of other talks I wish I’d seen.

Summary

UseR 2014 was a great meeting. In addition to the packages mentioned in my post on days 1-2, I need to pick up Rcpp and packrat.

Slides for many of the talks and tutorials are on the UseR 2014 web site. If you know of others, you can add them via the site’s GitHub repository and make a pull request.

Why hadn’t I written a function for that?

16 Jul 2014

I’m often typing the same bits of code over and over. Those bits of code really should be made into functions.

For example, I’m still using base graphics. (ggplot2 is on my “to do” list, really!) Often things get drawn with a slight overlap of the border of the plotting region, and in heatmaps made with image(), the border is often obscured. I want a nice black rectangle around the outside.

So I’ll write the following:

u <- par("usr")                # plot region limits: c(x1, x2, y1, y2)
rect(u[1], u[3], u[2], u[4])   # rectangle around the full plot region

I don’t know how many times I’ve typed that! Today I realized that I should put those two lines in a function add_border(). And then I added add_border() to my R/broman package.

It was a bit more work adding the Roxygen2 comments for the documentation, but now I’ve got a proper function that is easier to use and much clearer.
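Roughly, the function plus its Roxygen2 comments looked like the following (a sketch; the version I added to R/broman may have differed in details):

#' Add a border around the plot region
#'
#' Draw a rectangle exactly on the boundary of the current plot region,
#' for base graphics (e.g., after a call to image()).
#'
#' @param ... Passed to rect() (for example, lwd or border).
#'
#' @export
add_border <- function(...) {
    u <- par("usr")                    # plot region limits: c(x1, x2, y1, y2)
    rect(u[1], u[3], u[2], u[4], ...)  # rectangle spanning the full region
}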

Update: @tpoi pointed out that box() does the same thing as my add_border(). My general point still stands, and this raises the additional point: twitter + blog → education.

I want to add, “I’m an idiot” but I think I’ll just say that there’s always more that I can learn about R. And I’ll remove add_border from R/broman and just use box().

2014 UseR conference, days 1-2

2 Jul 2014

I’m at UCLA for the UseR Conference. I attended once before, and I really enjoyed it. And I’m really enjoying this one. I’m learning a ton, and I find the talks very inspiring.

In my comments below, I give short shrift to some speakers (largely by not having attended their talks), and I’m critical in some places about the conference organization. Having co-organized a small conference last year, I appreciate the difficulties. I think the organizers of this meeting have done a great job, but there are some ways in which it might have been better (e.g., no tiny rooms, a better time slot for the posters, and more space for the posters).


hipsteR: re-educating people who learned R before it was cool

15 May 2014

This morning, I started a tutorial for folks whose knowledge of R is (like mine) stuck in 2001.

Yesterday I started reading the Rcpp book, and on page 4 there’s an example using the R function replicate, which (a) I’d never heard of before, and (b) is super useful.

I mean, I often write code like this, for a bootstrap:

x <- rnorm(2500)
sapply(1:1000, function(a) quantile(sample(x, replace=TRUE), c(0.025, 0.975)))

But I could just be writing

x <- rnorm(2500)
replicate(1000, quantile(sample(x, replace=TRUE), c(0.025, 0.975)))

“Oh, replicate must be some new function.” Yeah, new in R version 1.8, in 2003!

I’m in serious need of some re-education (e.g., I should be using more of Hadley Wickham’s packages). Hence the beginnings of a tutorial.


Note: the title was suggested by Thomas Lumley. No connection to @hspter; it’s not really so hip. I probably should have written “geeksteR.”

Further points on crayon colors

9 May 2014

I saw this great post on crayola crayon colors at the Learning R blog, reproducing a nice graph of the Crayola crayon colors over time. (Also see this even nicer version.)

The Learning R post shows how to grab the crayon colors from the wikipedia page, “List of Crayola crayon colors,” directly in R. Here’s the code (after some slight modifications due to changes in the page since 2010):

library(XML)
theurl <- "http://en.wikipedia.org/wiki/List_of_Crayola_crayon_colors"
crayontable <- readHTMLTable(theurl, stringsAsFactors = FALSE)[[1]]  # first table on the page
crayons <- crayontable[, grep("Hex", colnames(crayontable))]         # column of hex color codes
names(crayons) <- crayontable[, "Color"]                             # name each code by its crayon color

Comparing these to what I’d grabbed, I noted one small discrepancy on the Wikipedia page: Yellow Orange was listed as "#FFAE42" but the background color for the Yellow Orange cell in the table was "#FFB653".

So I created a Wikipedia account and edited the Wikipedia page.

(Then I realized that I’d made a mistake in my edit, undid my change, thought the whole thing through again, and edited the page again.)

The Learning R post also showed a different way to sort the colors: convert to HSV, and then sort by H, then S, then V. So I edited my plot_crayons() function again, to create the following picture:

Crayon colors, again
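For reference, that HSV-based ordering is roughly as follows (assuming crayons is the named vector of hex colors from the code above):

# convert the hex colors to HSV and order by hue, then saturation, then value
hsv_vals <- rgb2hsv(col2rgb(crayons))
ord <- order(hsv_vals["h", ], hsv_vals["s", ], hsv_vals["v", ])
crayons_sorted <- crayons[ord]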

Two more points about crayon colors

8 May 2014

If you want to use crayon colors in R but you don’t want to rely on my R/broman package, you can just grab the code. Copy the relevant lines from the R/brocolors.R file:

crayons = c("Almond"="#efdecd",
            "Antique Brass"="#cd9575",
            "Apricot"="#fdd9b5",
            ...
            "Yellow Green"="#c5e384",
            "Yellow Orange"="#ffb653")

I spent a bit of time thinking about how best to sort the colors in a meaningful way, for the plot_crayons() function. But then decided to stop thinking and just do something brainless: measure distance between colors by RMS difference of the RGB values, and then use hierarchical clustering. Here’s the code from plot_crayons():

# get rgb 
colval <- t(col2rgb(crayons))

# hclust to order the colors
ord <- hclust(dist(colval))$order

It’s not perfect, but I think it worked remarkably well:

Crayon colors

Crayon colors in R

7 May 2014

Last night I was working on a talk on creating effective graphs. Mostly, I needed to update the colors, as there’d been some gaudy ones in its previous form (e.g., slide 22).

I usually pick colors using the crayons in the Mac Color Picker. But that has just 40 crayons, and I wanted more choices.

That led me to the list of Crayola crayon colors on wikipedia. I wrote a ruby script to grab the color names and codes and added them to my R/broman package.

Use brocolors("crayons") to get the list of colors. For example, to get “Tickle Me Pink,” use

library(broman)
pink <- brocolors("crayons")["Tickle Me Pink"]

Use plot_crayons() to get the following summary plot of the colors:

Crayon colors

You can install the R/broman package using install_github in devtools (specifically, install_github("kbroman/broman")), or wait a day or two and the version with this code will be on CRAN.

Update: See also Two more points about crayon colors.

Reform academic statistics

1 May 2014

Terry Speed recently gave a talk on the role of statisticians in “Big Data” initiatives (see the video or just look at the slides). He points to the history of statisticians’ discussions of massive data sets (e.g., the Proceedings of a 1998 NRC workshop on Massive data sets), notes how that history is being ignored in the current Big Data hype, and argues that statisticians, generally, are being ignored.

I was thinking of writing a polemic on the need for reform of academic statistics and biostatistics, but in reading back over Simply Statistics posts, I’ve decided that Rafael Irizarry and Jeff Leek have already said what I wanted to say, and so I think I’ll just summarize their points.

Following the RSS Future of the Statistical Sciences Workshop, Rafael was quite optimistic about the prospects for academic statistics, as he noted considerable consensus on the following points:

  • We need to engage in real present-day problems
  • Computing should be a big part of our PhD curriculum
  • We need to deliver solutions
  • We need to improve our communication skills

Jeff said, “Data science only poses a threat to (bio)statistics if we don’t adapt,” and made the following series of proposals:

  • Remove some theoretical requirements and add computing requirements to statistics curricula.
  • Focus on statistical writing, presentation, and communication as a main part of the curriculum.
  • Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.
  • Add a unit on translating scientific problems to statistical problems.
  • Add a unit on data munging and getting data from databases.
  • Integrate real and live data analyses into our curricula.
  • Make all our students create an R package (a data product) before they graduate.
  • Most important of all, have a “big tent” attitude about what constitutes statistics.

I agree strongly with what they’ve written. To make it happen, we ultimately need to reform our values.

Currently, we (as a field) appear satisfied with

  • Papers that report new methods with no usable software
  • Applications that focus on toy problems
  • Talks that skip the details of the scientific context of a problem
  • Data visualizations that are both ugly and ineffective

Further, we tend to get more excited about the fanciness of a method than its usefulness.

We should value

  • Usefulness above fanciness
  • Tool building (e.g., usable software)
  • Data visualization
  • In-depth knowledge of the scientific context of a problem

In evaluating (bio)statistics faculty, we should consider not just the number of JASA or Biometrics papers they’ve published, but also whether they’ve made themselves useful, both to the scientific community and to other statisticians.

