I am a data scientist

Three years ago this week, I wrote a blog post, “Data science is statistics”. I was fiercely against the term “data science” at that time, as I felt that we already had a data science, and it was called Statistics.

It was a short post, so I might as well quote the whole thing:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

If you say that one kind of data analysis is statistics and another kind is not, you’re not allowing innovation. We need to define the field broadly.

You may not like what some statisticians do. You may feel they don’t share your values. They may embarrass you. But that shouldn’t lead us to abandon the term “statistics”.

I still sort of feel that way, but I must admit that my definition of “statistics” is rather different from most others’. In my view, a good statistician will consider all aspects of the data analysis process:

  • the broader context of a scientific question
  • study design
  • data handling, organization, and integration
  • data cleaning
  • data visualization
  • exploratory data analysis
  • formal inference methods
  • clear communication of results
  • development of useful and trustworthy software tools
  • actually answering real questions

I’m sure I missed some things there, but my main point is that most academic statisticians focus solely on developing “sophisticated” methods for formal inference. I agree that that is an important piece, but in my experience as an applied statistician, the other aspects are often of vastly greater importance. In many cases, we don’t need to develop sophisticated new methods; most of my effort is devoted to the other aspects, which academic statisticians generally treat as unworthy of consideration.

As I wrote in a later post, “Reform academic statistics”, we as a field appear satisfied with

  • Papers that report new methods with no usable software
  • Applications that focus on toy problems
  • Talks that skip the details of the scientific context of a problem
  • Data visualizations that are both ugly and ineffective

Discussions of Data Science generally recognize the full range of activities required for the analysis of data, and place greater value on such things as data visualization and software tools, which are obviously important but are not viewed that way by many statisticians.

And so I’ve come to embrace the term Data Science.

Data Science is also a much more straightforward and understandable label for what I do. I don’t think we should need a new term, and I think we should argue against misunderstandings of Statistics rather than slink off to a new “brand”. But in general, when I talk about Data Science, I feel I can better trust that folks will understand that I am talking about the broad set of activities required in good data analysis.

If people ask me what I do, I’ll continue to say that I’m a Statistician, even though I do tend to stumble over the word. But I am also a Data Scientist.

One last thing: I’ve also come to realize that computer science folks working in computational biology are really just like me. They have expertise in a somewhat different set of tools, but then that’s true for pretty much every statistician, too: they’re much like me but they have expertise in a somewhat different set of tools. And it’s nice to be able to say that we’re all data scientists.

It should be recognized, too, that academic computer science suffers from many of the same problems that academic statistics has suffered: an overemphasis on novelty, sophistication, and toy applications, and an under-appreciation for solving real problems, for data visualization, and for useful software tools.

10 Responses to “I am a data scientist”

  1. Bryan Britten Says:

    Do you feel as though data science has grown to implicitly mean that the analyst only deals with large data over highly distributed systems? Does it now mean that they’re primarily focused on machine learning and cutting edge data visualization?

    What about the analysts that deal with what I’ve been calling “micro data”? Maybe 2,000 pages of bank statements in a PDF that needs to be parsed, cleaned, and analyzed. Or webscraping forums for comments on which to perform text analytics? Is this data science, statistics, or something else? I’ve struggled with the notion of calling myself a data scientist, but I don’t think I’m a statistician either. I almost feel like “data hacker” is a more appropriate term, as it’s more about munging what’s available into what I want.

    My question, in a nutshell, is do you think the term “data scientist” will continue to grow such that it no longer overshadows “statistician” and there become two clear roles?

    • Karl Broman Says:

      I wouldn’t say data science is exclusively concerning massive data. I don’t think the analysis of massive data sets is fundamentally different from the analysis of normal-sized data.

      Data hacker is a cool term. I say go with it. Much better than data munger, in my opinion.

      Personally, I’d prefer fewer labels, but also that people can participate in the full process of data analysis, rather than have separate, tightly defined roles. It’s more fun that way. I’d just like people to recognize that there can be as much creativity involved in parsing 2,000 pages of bank statements as there is in a proof of consistency.

  2. Robert Flight Says:

    I think this jibes very well with Donoho’s “50 Years of Data Science”: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

  3. Ruth Duerr Says:

    I have been a “data scientist” for many years now (OK – more or less my entire working life); but I have never been a statistician and moreover, at this point, I am not interested in becoming a statistician. That said, I work with data, make it available and useful for others to use, make it discoverable, understandable and accessible; as well as to ensure that it actually still exists in usable form for years into the future, forms that vary depending on the audience and their needs. The Data Science Journal is one of the places I publish – that journal has been around since at least 2002. It doesn’t often publish articles by “statisticians” and I would hope that it never does.

    I am the flip side of Karl Broman. A person who doesn’t like the way “statisticians” seem to be taking over the data science term. A person who doesn’t like the way “big data” is attempting to take it over either. These things do all exist; but I would much prefer that Karl keep his “statistician” hat, the “big data” folks keep theirs, and allow me to keep my “data science” hat.

    I also don’t like having the very valuable work that folks like me do denigrated into “garbage collection” as I’ve heard some call it (it’s only garbage if you treat it that way – folks like me are why data can become a valuable resource).
