I am a data scientist

Three years ago this week, I wrote a blog post, “Data science is statistics”. I was fiercely against the term “data science” at that time, as I felt that we already had a data science, and it was called Statistics.

It was a short post, so I might as well quote the whole thing:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly.

You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

I still sort of feel that way, but I must admit that my definition of “statistics” is rather different from most others’. In my view, a good statistician will consider all aspects of the data analysis process:

the broader context of a scientific question
study design
data handling, organization, and integration
data cleaning
data visualization
exploratory data analysis
formal inference methods
clear communication of results
development of useful and trustworthy software tools
actually answering real questions

I’m sure I missed some things there, but my main point is that most academic statisticians focus solely on developing “sophisticated” methods for formal inference. I agree that this is an important piece, but in my experience as an applied statistician, the other aspects are often of vastly greater importance. In many cases, we don’t need to develop sophisticated new methods; most of my effort is devoted to the other aspects, which academic statisticians generally treat as unworthy of consideration.

As I wrote in a later post, “Reform academic statistics”, we as a field appear satisfied with

Papers that report new methods with no usable software
Applications that focus on toy problems
Talks that skip the details of the scientific context of a problem
Data visualizations that are both ugly and ineffective

Discussions of Data Science generally recognize the full range of activities required for the analysis of data, and place greater value on things like data visualization and software tools, which are obviously important but are not viewed that way by many statisticians.

And so I’ve come to embrace the term Data Science.

Data Science is also a much more straightforward and understandable label for what I do. I don’t think we should need a new term, and I think we should argue against misunderstandings of Statistics rather than slink off to a new “brand”. But in general, when I talk about Data Science, I feel I can better trust that folks will understand that I am talking about the broad set of activities required in good data analysis.

If people ask me what I do, I’ll continue to say that I’m a Statistician, even though I do tend to stumble over the word. But I am also a Data Scientist.

One last thing: I’ve also come to realize that computer science folks working in computational biology are really just like me. They have expertise in a somewhat different set of tools, but then that’s true for pretty much every statistician, too: they’re much like me but they have expertise in a somewhat different set of tools. And it’s nice to be able to say that we’re all data scientists.

It should be recognized, too, that academic computer science suffers from many of the same problems that academic statistics has suffered: an overemphasis on novelty, sophistication, and toy applications, and an under-appreciation for solving real problems, for data visualization, and for useful software tools.

Tags: data science, statistics

This entry was posted on 8 Apr 2016 at 10:33 am and is filed under Statistics.

10 Responses to “I am a data scientist”

Bryan Britten Says:
8 Apr 2016 at 11:10 am
Do you feel as though data science has grown to implicitly mean that the analyst only deals with large data over highly distributed systems? Does it now mean that they’re primarily focused on machine learning and cutting edge data visualization?

What about the analysts that deal with what I’ve been calling “micro data”? Maybe 2,000 pages of bank statements in a PDF that needs to be parsed, cleaned, and analyzed. Or webscraping forums for comments on which to perform text analytics? Is this data science, statistics, or something else? I’ve struggled with the notion of calling myself a data scientist, but I don’t think I’m a statistician either. I almost feel like “data hacker” is a more appropriate term as it’s more about munging what’s available into what I want.

My question, in a nutshell, is do you think the term “data scientist” will continue to grow such that it no longer overshadows “statistician” and there become two clear roles?
- Karl Broman Says:
  8 Apr 2016 at 3:06 pm
  I wouldn’t say data science is exclusively concerning massive data. I don’t think the analysis of massive data sets is fundamentally different from the analysis of normal-sized data.
  
  Data hacker is a cool term. I say go with it. Much better than data munger, in my opinion.
  
  Personally, I’d prefer fewer labels, but also that people can participate in the full process of data analysis, rather than have separate, tightly defined roles. It’s more fun that way. I’d just like people to recognize that there can be as much creativity involved in parsing 2,000 pages of bank statements as there is in a proof of consistency.
Robert Flight Says:
8 Apr 2016 at 12:11 pm
I think this jibes very well with Donoho’s 50 years of data science http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf
Ruth Duerr Says:
12 Apr 2016 at 5:55 pm
I have been a “data scientist” for many years now (OK – more or less my entire working life); but I have never been a statistician and moreover, at this point, I am not interested in becoming a statistician. That said, I work with data, make it available and useful for others to use, make it discoverable, understandable and accessible; as well as to ensure that it actually still exists in usable form for years into the future, forms that vary depending on the audience and their needs. The Data Science Journal is one of the places I publish – that journal has been around since at least 2002. It doesn’t often publish articles by “statisticians” and I would hope that it never does.

I am the flip side of Karl Broman. A person who doesn’t like the way “statisticians” seem to be taking over the data science term. A person who doesn’t like the way “big data” is attempting to take it over either. These things do all exist; but I would much prefer that Karl keep his “statistician” hat, the “big data” folks keep theirs and allow me to keep my “data science” hat.

I also don’t like having the very valuable work folks like me do denigrated into “garbage collection” as I’ve heard some call it (it’s only garbage if you treat it that way – folks like me are why data can become a valuable resource).
- Karl Broman Says:
  12 Apr 2016 at 6:20 pm
  Thanks for your comments; I appreciate your perspective.
  
  The worst thing would be if statisticians stole the “data science” term but didn’t actually revise their values (and courses).
  - Ruth Duerr Says:
    12 Apr 2016 at 6:50 pm
    Agreed!
