Posts Tagged ‘data science’

I am a data scientist

8 Apr 2016

Three years ago this week, I wrote a blog post, “Data science is statistics”. I was fiercely against the term at that time, as I felt that we already had a data science, and it was called Statistics.

It was a short post, so I might as well quote the whole thing:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly.

You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

I still sort of feel that way, but I must admit that my definition of “statistics” is rather different from most people’s. In my view, a good statistician will consider all aspects of the data analysis process:

  • the broader context of a scientific question
  • study design
  • data handling, organization, and integration
  • data cleaning
  • data visualization
  • exploratory data analysis
  • formal inference methods
  • clear communication of results
  • development of useful and trustworthy software tools
  • actually answering real questions

I’m sure I missed some things there, but my main point is that most academic statisticians focus solely on developing “sophisticated” methods for formal inference. I agree that that is an important piece, but in my experience as an applied statistician, the other aspects are often of vastly greater importance. In many cases, we don’t need to develop sophisticated new methods, and most of my effort is devoted to the other aspects, which academic statisticians generally treat as unworthy of consideration.

As I wrote in a later post, “Reform academic statistics”, we as a field appear satisfied with

  • Papers that report new methods with no usable software
  • Applications that focus on toy problems
  • Talks that skip the details of the scientific context of a problem
  • Data visualizations that are both ugly and ineffective

Discussions of Data Science generally recognize the full range of activities required for the analysis of data, and place greater value on such things as data visualization and software tools, which are obviously important but are not viewed that way by many statisticians.

And so I’ve come to embrace the term Data Science.

Data Science is also a much more straightforward and understandable label for what I do. I don’t think we should need a new term, and I think we should argue against misunderstandings of Statistics rather than slink off to a new “brand”. But in general, when I talk about Data Science, I feel I can better trust that folks will understand that I am talking about the broad set of activities required in good data analysis.

If people ask me what I do, I’ll continue to say that I’m a Statistician, even though I do tend to stumble over the word. But I am also a Data Scientist.

One last thing: I’ve also come to realize that computer science folks working in computational biology are really just like me. They have expertise in a somewhat different set of tools, but then that’s true for pretty much every statistician, too: they’re much like me but they have expertise in a somewhat different set of tools. And it’s nice to be able to say that we’re all data scientists.

It should be recognized, too, that academic computer science suffers from many of the same problems that academic statistics has suffered: an overemphasis on novelty, sophistication, and toy applications, and an under-appreciation for solving real problems, for data visualization, and for useful software tools.

Initial steps towards reproducible research

4 Dec 2014

In anticipation of next week’s Reproducible Science Hackathon at NESCent, I was thinking about Christie Bahlai’s post on “Baby steps for the open-curious.”

Moving from Ye Olde Standard Computational Science Practice to a fully reproducible workflow seems a monumental task, but partially reproducible is better than not-at-all reproducible, and it’d be good to give people some advice on how to get started – to encourage them to get started.

So, I spent some time today writing another of my minimal tutorials, on initial steps towards reproducible research.

It’s a bit rough, and it could really use some examples, but it helped me to get my thoughts together for the Hackathon and hopefully will be useful to people (and something to build upon).

Reform academic statistics

1 May 2014

Terry Speed recently gave a talk on the role of statisticians in “Big Data” initiatives (see the video or just look at the slides). He points to the history of statisticians’ discussions of massive data sets (e.g., the Proceedings of a 1998 NRC workshop on Massive data sets) and notes that this history, like statisticians generally, is being ignored in the current Big Data hype.

I was thinking of writing a polemic on the need for reform of academic statistics and biostatistics, but in reading back over Simply Statistics posts, I’ve decided that Rafael Irizarry and Jeff Leek have already said what I wanted to say, and so I think I’ll just summarize their points.

Following the RSS Future of the Statistical Sciences Workshop, Rafael was quite optimistic about the prospects for academic statistics, as he noted considerable consensus on the following points:

  • We need to engage in real present-day problems
  • Computing should be a big part of our PhD curriculum
  • We need to deliver solutions
  • We need to improve our communication skills

Jeff said, “Data science only poses a threat to (bio)statistics if we don’t adapt,” and made the following series of proposals:

  • Remove some theoretical requirements and add computing requirements to statistics curricula.
  • Focus on statistical writing, presentation, and communication as a main part of the curriculum.
  • Focus on positive interactions with collaborators (being a scientist) rather than immediately going to the referee attitude.
  • Add a unit on translating scientific problems to statistical problems.
  • Add a unit on data munging and getting data from databases.
  • Integrate real and live data analyses into our curricula.
  • Make all our students create an R package (a data product) before they graduate.
  • Most important of all, have a “big tent” attitude about what constitutes statistics.

I agree strongly with what they’ve written. To make it happen, we ultimately need to reform our values.

Currently, we (as a field) appear satisfied with

  • Papers that report new methods with no usable software
  • Applications that focus on toy problems
  • Talks that skip the details of the scientific context of a problem
  • Data visualizations that are both ugly and ineffective

Further, we tend to get more excited about the fanciness of a method than its usefulness.

We should value

  • Usefulness above fanciness
  • Tool building (e.g., usable software)
  • Data visualization
  • In-depth knowledge of the scientific context of a problem

In evaluating (bio)statistics faculty, we should consider not just the number of JASA or Biometrics papers they’ve published, but also whether they’ve made themselves useful, both to the scientific community and to other statisticians.

Data science is statistics

5 Apr 2013

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly.

You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

NIH to hire Associate Director for Data Science

15 Jan 2013

See the press release.

Eric Green, currently head of NHGRI, has been appointed the Acting Associate Director for Data Science.

5/9 data journalists use Excel

15 Jan 2013

…if I’ve counted correctly.

I’d not heard the term “data journalist” before. It’s nice that they have an open handbook.