Am I a data scientist?

Last night I gave a very short talk (less than 5 minutes) at the Melbourne Analytics Charity Christmas Gala, a combined event of the Statistical Society of Australia, Data Science Melbourne, Big Data Analytics and Melbourne Users of R Network.

This is (roughly) what I said.


Statisticians seem to go through regular periods of existential crisis as they worry about other groups of people who do data analysis. A common theme is: all these other people (usually computer scientists) are doing our job! Don’t they know that statisticians are the best people to do data analysis? How dare they take over our discipline!

I take a completely different view. I think our discipline is in the best position it has ever been in. The demand for data analysis skills is greater than ever. Our graduates are highly sought after, and well paid. Being a statistician has even been described as a sexy profession (which presumably is a good thing to be!).

The different perspectives are all about inclusiveness. If we treat statistics as a narrow discipline, fitting models to data, and studying the properties of those models, then statistics is in trouble. But if we treat what we do as a broad discipline involving data analysis and understanding uncertainty, then the future is incredibly bright.

Here are two quotes from well-known bloggers in the last year or two:

April 2013: Larry Wasserman blog
Data science: the end of statistics?
If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

November 2013: Andrew Gelman blog
Statistics is the least important part of data science
There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics as a subset of data science …

Statistics is important—don’t get me wrong—statistics helps us correct biases … estimate causal effects … regularize so that we’re not overwhelmed by noise … fit models … visualize data … I love statistics! But it’s not the most important part of data science, or even close.

How can two professors of statistics have such different views on their discipline? The same perspectives can be seen in the following two diagrams (both reproduced with permission).

[Figure: Data science Venn diagram]

Source: Drew Conway, Sept 2010. Reproduced under a Creative Commons Licence.

[Figure: Venn diagram of data scientist skills]

In the first narrow view, to be a data scientist you have to know a great deal about statistics, mathematics, computer science, programming, and the application discipline. If that’s true, I’ve never met a data scientist. I don’t believe they exist.

In the second broader view, everyone here is a data scientist, although we have different specializations and different perspectives and training.

I take the broad inclusive view. I am a data scientist because I do data analysis, and I do research on the methodology of data analysis. The way I would express it is that I’m a data scientist with a statistical perspective and training. Other data scientists will have different perspectives and different training.

We are comfortable with having medical specialists, and we will go to a GP, endocrinologist, physiotherapist, etc., when we have medical problems. We also need to take a team perspective on data science.

None of us can realistically cover the whole field, and so we specialise in certain problems and techniques. It is crazy to think that a doctor must know everything, and it is just as crazy to think a data scientist should be an expert in statistics, mathematics, computing, programming, the application discipline, etc. Instead, we need teams of data scientists with different skills, with each being aware of the boundary of their expertise, and who to call in for help when required.

Let’s not be too sectarian about our disciplines, thinking everyone not trained in the same way we were is a heretic.

It reminds me of a famous joke, written by comedian Emo Philips:

I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. I immediately ran over and said “Stop! Don’t do it!”
“Why shouldn’t I?” he said.
I said, “Well, there’s so much to live for!”
“Like what?”
“Well … are you religious or atheist?”
“Religious.”
“Me too! Are you Christian or Jewish?”
“Christian.”
“Me too! Are you Catholic or Protestant?”
“Protestant.”
“Me too! What franchise?”
“Baptist.”
“Wow! Me too! Northern Baptist or Southern Baptist?”
“Northern Baptist.”
“Me too! Are you Northern Conservative Baptist or Northern Liberal Baptist?”
“Northern Conservative Baptist.”
“Me too! Are you Northern Conservative Fundamentalist Baptist or Northern Conservative Reformed Baptist?”
“Northern Conservative Fundamentalist Baptist.”
To which I said, “Die, heretic scum!” and pushed him off.




  • zbicyclist

    Great post, and I’ve always loved that Emo Philips joke.

    Statisticians (and related disciplines) don’t have union cards and union work rules. In applied settings, whoever is able to solve the problem most appropriately (for some value of “most appropriately”) will end up owning the problem, regardless of their initial affiliation.

    To judge from my many years of experience (and the vast majority of questions on CrossValidated), the problem is mostly solved if you can figure out what the problem is and properly frame it. The wide view is appropriate given the need to have a broad amount of general training to diagnose the problem.

  • Thomas Speidel

    At first I was a bit skeptical, but the more I read your article, the more I sympathize with it.

    However, let’s not forget who statisticians are and where they came from. Before the term data science existed, before it was “cool” to be a statistician, statisticians had little help: they had to venture into the computer science circle of the Venn diagram to solve problems (R is living proof of this need). Most applied statisticians I know do not enjoy doing data cleaning and management, yet, for several decades, they had to do precisely that (record linkage in health research or in the census is an example). In a time when no apps existed and there was no RStudio Shiny, they still managed to create tools to use results from a predictive model (for example, nomograms or even interactive nomograms in S-PLUS).

    To many, the frustration is that so much of our profession is mischaracterized, wilfully misunderstood and reinvented.

    To continue with the medical analogy, it’s as if a family doc attempted open heart surgery on one of his patients. This problem runs disproportionately in one direction: seldom does the family doc refer the patient to the specialist for a simple cold. So, it’s not the ability to recognize problems that’s at stake. Rather, it is one’s (over)confidence in solving problems with the tools one has. This perception is quite obvious on the internet (death of statistician, unicorns everywhere).

    It is true that we can’t realistically cover the field. We need to collaborate, but in order to do so, everyone needs a clear understanding of what they’re bringing to the table.

    • Good points. If I ever give the talk again, I will cite you! I was saying something similar, although much too briefly, in my comment “with each being aware of the boundary of their expertise, and who to call in for help when required.”

      • Will you feel able to mention my view?

      • With your point in mind I believe I will maintain the ‘Scientist’ at least until we have
        discovered the whole shape of this discipline, enabling us to, with enlightened
        eyes, decide who is best positioned to claim responsibility for which data set
        & also how that data set should be manipulated, stored & presented.

  • buggyfunbunny

    As my Momma used to say, “it is wise to never let the children touch the sharp cutlery”.

    The problem with self-appointed data scientists is that we get messes. The problem began, at least, with the Management Suit who would tell “his girl” to run that 1-2-3 macro-laden spreadsheet (some macros his, some hers, and some he found lying around the bulletin boards) for the capex presentation (or Fed meeting or …). No one seemed to notice that the numbers didn’t quite add up; until it was too late. Today, we have Li and his copula and the London Whale and his Excel macros. The failure in Li’s case was that others who presented as data scientists were clueless, but I’d argue that the copula as such is much too easy to punt. The Whale, by all accounts, simply made it up as he went along.

    Even pros make messes: there was the Mars Climate Orbiter that failed because some segment of code had the unit measure backward. But that doesn’t mean self-appointed data scientists should muck around in data whose use/interpretation/decision driving can harm others if messed up. I couldn’t care less if Twitter goes down for a week because the data scientists messed up the advert cycles or somesuch. I do care if clueless folks, particularly those with an agenda, bend macro-economic data to that agenda just because the tools are available to allow any knucklehead to do and claim ‘data science’ as justification.

    And, no, we don’t need hackers, even in the classic definition (the one before it became the synonym for cracker). I’m becoming more certain that data science boils down to self-taught operations research, which at one time was viewed by the stat crowd as the prime interloper out to steal their women.

    And the best Data Science book is Janert’s (sorry Rob).

  • Randeroid

    The purpose of a Venn Diagram is to help you think about sets that fit into a Venn Diagram. After seeing these particular Venn Diagrams for a while, it is apparent that this topic does not fit.

  • Randeroid

    RE: How can two professors of statistics have such different views on their discipline?

    RESP: Statisticians are heterogeneous. I wish to make one point and it regards more blogs and articles than just this nicely written one above. The views of statistics professors and applied statisticians are not in alignment. Hence, as we have seen many times, it is misleading to quote statistics professors as if they are the experts on what we ALL think or do.

    How so, you may well ask. The focus of professors of statistics is to publish scholarly works. The center of gravity for applied statisticians (in the field) is to analyze data–all kinds of data, under all kinds of circumstances. The perspective of statistics professors regarding the relationship between data science and statistics will lean toward what various groups are publishing. Many of these publications contain proofs and no data–this means no data analysis either. Also, statistics professors are less likely to republish what we already know. Professors in other fields are unfamiliar with the statistics literature and will republish ‘old news,’ … or data science material. Some statistics professors do not want to be associated with that. Next, I will add that some statistics professors are unfamiliar with what is going on in the field AND YET, they have no inhibitions in speaking their mind about it. Finally, statistics professors are not homogeneous either and Larry Wasserman gets it.

    For most of us, when we consider the relationship between data science and statistics, these publishing issues are not top of mind.

    Quotes from two applied statisticians would better resonate with those of us in the field. Even so, you can find two who disagree on anything. To understand what is happening in the field, we need a carefully designed survey of those in the field.

  • Harold Baize

    Great post. I also have a problem with the term “data scientist.” The thing that is lacking is… science! I have read a couple of popular books about doing data science and at no point in these books was there a discussion of applying the scientific method. Science is a method, not an honorary title. It seems as if some data analysts just want more recognition for what they do. I am trained as a social scientist. When I apply the scientific method to the study of psychology, I am a scientist; when I perform data analysis, I’m an analyst, no matter how advanced those analytic techniques may be.

    • I can agree with the dropping of science. Please read my posts; you’ll see I may be approaching this from a fresh angle.

  • Randeroid

    RE: A common theme is: all these other people (usually computer scientists) are doing our job! Don’t they know that statisticians are the best people to do data analysis?
    RESP: Computer scientists are only the latest and there is little precedent for their involvement. Usually, it is just about everyone else. Consumers of data analysis have been plagued by a dearth of qualified professionals and trouble identifying qualifications. Also, specialists should be led by specialists, like in accounting, finance, et al. Failures due to having the wrong people making statistics decisions include: AIG, Fannie Mae, Moody’s, S&P, Fitch, et al. That is the harm.

    • Can I interest you in commenting on my perspective, possibly a view that tackles your computer scientist complaint?

      • Randeroid

        Sure, see my remarks above, and I am a computer scientist too (I have no complaint with myself at this time, except occasionally I wish I had gone into the National Park Service).

        We computer scientists have been receiving disinformation for a long time now from carnival barkers claiming to know what statistics is and is not; and what ALL statisticians do, think, do not do, do not think, et al. We must be careful not to place any weight on feel-good social-media disinformation (including rags like Wired and Information Magazine) that is readily contradicted by our real life experiences. Unfortunately, we are enabling this behavior. Gaining a deep understanding of statistics (data analysis) will require an investment. No matter how many new terms like ‘data science’ straddle them, there remain two distinct applications: data management and data analysis. Mathematics (certainty) and statistics (partial information) are two distinct tool boxes. Finally, if data science includes data analysis (statistical data science), then it includes statistics.

  • Ralph Winters

    Replace the Venn Diagram with 1) Statistician 2) Programmer 3) Database Person and 4) Business Analyst/Project Manager (to do datasciencey things like ‘story telling’ and to seek out ‘subject matter expertise’), and you have simplified things greatly. No overlap needed. Just 4 professionals respecting each other. This is the way it used to be. The only thing that has really changed is that open software has enabled everyone to do everything. Other than that, these 4 titles can do everything in either of the diagrams. Given the right people of course!

  • I propose
    Florence Nightingale (Data Diagrams),
    Charles Babbage (Computers (Hardware))
    Alan Turing (Computers (Software)) …
    as Honorary Patrons it may help set the scene.

    And thinking more about this, perhaps Tim Berners-Lee should be the fourth, for the World Wide Web. I’ll let him speak for himself though.

    • Randeroid

      Please add John Vincent Atanasoff.

      • With pleasure, I just read his Wiki page (not always wise, a
        good start point though) He certainly fits the list, very quickly we are aware
        the complete honorary list will be extensive. I still hold that as Data
        Managers, scientist or not, this page is poorly structured to properly deal
        with discussion, also I think we will need all Academic Disciplines to
        represent their case, if we stand a chance of understanding the whole data set
        & its correct subdivisions.

  • Quote from my Facebook
    “Data Management Science,
    Computing Is A Sub-Discipline. After proposing this new Educational Discipline I suppose I should provide more info, my intent is that it slot in between Language & Maths as the third core subject”
    I apologise, I at the time thought I was the only one, it’s nice to be in a crowd

    To continue, it’s my view that “Data Management” bridges the gap between Language & Maths & should be introduced to children from 5yrs, giving them a platform to negotiate the WWW by the age of exploration.

    Continuing further, Data Management is the skill we undertake before learning Language or Maths so it should take slight lead in the core (in my view)

  • With so much more to discuss, Cave Painters, Stonehenge, Werner Heisenberg this clearly is not the place for discussion. Hard to follow, see you guys around 😉

  • Adrian

    I suppose it depends on what kind of glasses/hat one wears. If one wears only
    his statistical glasses, he’ll see everything from the respective point of
    view. In other words, if one has a hammer, everything will look like a nail.

    The second quote recognizes that statistics is just one tool in the process,
    and that there are more important tools/components. From my point of view the
    weight falls especially on Programming, Visualization, Data/Text Mining,
    Machine Learning, Business and Tools Know-How. Nowadays the average scientist/professional
    must be statistically literate, a fact that considerably diminishes the importance
    of Statistics when considered as part of a broader set of tools.

  • Cal H.

    Nowadays, computers are used extensively in research in the sciences, where Ph.D. students and postdocs are actually building applications to do their work. And, using Physics as an example, these people are already trained in Math and Statistics and have expertise in their field. So does this make modern scientists data scientists as well? That said, I have not heard of machine learning being used in modern research, though that may change in the future.