The term/position "data scientist" has become increasingly common, but it has still been unclear to me how someone becomes a data scientist, what it actually means, and what they evolved from. Of course, like most of the things I write about here in RTBlog and for RTM Daily, it's mostly new and I wanted to know more. So I figured who better to talk data science with than the now doused-in-awards Claudia Perlich from m6d?
Perlich, the chief data scientist at Media6Degrees, was recently honored with multiple 2013 awards. She was selected as a member of the Bellagio/PopTech Fellows Class of 2013, was announced as the general chair of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, and was named a winner of the American Marketing Association's 4 Under 40 award. She also won the Advertising Research Foundation's Grand Innovation Award.
RTM Daily: Congratulations on your recent achievements. m6d's CEO Tom Phillips called data science-driven advertising a "reinvention" of the market. In your opinion, what does that mean?
Perlich: Advertising has historically been a combination of creativity and instinct, trial and error, and some comparatively small scale market research. Very little of it was validated – we were stuck in Wanamaker’s world where we did not know which half worked, and which half was wasted. The change I see is comparable with the industrial revolution – and I am trying to be entirely value free about this. Today we are notably better at measuring effectiveness at a large scale and have the optimization – what works best – done by machines in a matter of seconds and hours rather than months.
With the immense granularity of data about consumer preferences, activities, etc. at our disposal, we have the opportunity to reach even more targeted consumers, who are truly interested in specific products right now. I am not a generic 35-45 year old soccer mom and as such, should not be targeted with baking products. I am much more than that. I am somebody else at each phase of the day: in the morning when I get my son ready for school, during my lunch break and in the evening when I plan my next conference trip. This is the promise and potential of a reinvention of advertising: it will be much more precise and effective, much more subject to success metrics, it will cross all devices and be honestly willing to engage on topics like privacy considerations and advertising fraud.
RTMD: Can you define data scientist? I always imagine someone that's extra good at Excel and understanding what the algorithms are finding.
Perlich: Data scientists come in many flavors and it is hard to find a “one size fits all” description. Somebody who is extra good at Excel and understanding algorithms could be that, but there is the ‘big’ in big data that demands more computer science skills as well. The most relevant characteristics to me are (aside from the tooling that I will speak to in a moment) an extreme curiosity, deep skepticism and ‘technical’ creativity.
A data scientist has to be skeptical and like Popper, understand that data proves nothing. Nothing can be more misleading than data because we associate it with truth. I see a large part of my day as a mix of a problem solver and a data detective that pokes around to identify inconsistencies that need to be resolved before the magic can happen. The other vital contribution of a good data scientist is to understand what the problem really is and whether it can be solved with the data at hand.
Consider a trivial example: I might be asked “what is the average age of our cookies?” The reality is that this question is entirely meaningless. And unless I understand what you want to use my answer for, I refuse to give you one. The truth is that the vast majority of cookies live for 0 seconds. They never get written on the device (which is something I do not know). Of course I can just take the average of that number – it is probably around hours. But mind you that of all the cookies that I do see for a second time, the average is closer to 90 days. So as a data scientist, I know that summarizing highly skewed distributions into single numbers is meaningless. So I have to help you find out why you need this information to find the ‘correct’ answer.
This is a really simple case of what I generally mean by using our data and algorithmic understanding to help shape the solutions and the tasks. That is a craft, as well as a science, that goes well beyond being able to program Excel macros – data science lives in the intersection of understanding not just the results of the algorithms, but also the subtle caveats of their applicability and the problem that should be solved. I sometimes feel like a new breed of matchmaker.
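To make the cookie-age example concrete, here is a minimal Python sketch. The lifetimes are invented for illustration and are not m6d data: the assumption is 1,000 cookies, of which 990 are never seen again (lifetime 0) and 10 return, living about 90 days on average. The names `lifetimes_days` and `summarize` are hypothetical.

```python
import statistics

# Invented data: 990 cookies never seen a second time (0 days),
# plus 10 returning cookies whose lifetimes average 90 days.
lifetimes_days = [0.0] * 990 + [
    60.0, 70.0, 80.0, 85.0, 90.0, 90.0, 95.0, 100.0, 110.0, 120.0
]


def summarize(lifetimes):
    """Return several single-number summaries of one skewed distribution."""
    returning = [d for d in lifetimes if d > 0]
    return {
        # Average over every cookie, including those that never came back.
        "mean_all": statistics.mean(lifetimes),
        # Middle value of the full distribution.
        "median_all": statistics.median(lifetimes),
        # Fraction of cookies with a lifetime of exactly zero.
        "share_zero": (len(lifetimes) - len(returning)) / len(lifetimes),
        # Average over only the cookies that were seen a second time.
        "mean_returning": statistics.mean(returning) if returning else None,
    }


summary = summarize(lifetimes_days)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these made-up numbers the overall mean comes out under one day, the median is zero, and the mean among returning cookies is 90 days. All three are "the average age" in some sense, which is why the right number depends on what the answer will be used for.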
RTMD: Data scientist is to 2013 as ______ was to 2003. Fill in the blank.
Perlich: ___geek____
RTMD: Data scientist is to 2013 as ______ will be to 2023. Fill in the blank.
Perlich: ___???___
RTMD: How does one go about becoming a "data scientist"?
Perlich: Ultimately, you need to have curiosity and skepticism. What you can learn is the skill and to some extent the intuition for data. The only way to acquire the intuition is to get your hands dirty with many different data sets and ‘experience’ data. In terms of skill, there is some clear tooling that you need to learn, and it requires some CS background. The tools have to touch on data extraction (SQL, APIs, etc.), basic environment and manipulation (UNIX), scripting (Perl, Python), analytics (R, standalones) and visualization (R, Pajek, D3, Cytoscape). But beyond that, the tooling is the experience. For the ‘craft’ part of being a data scientist, you need a good apprenticeship – somebody who will teach you the practice of it.
RTMD: Data scientist vs. Old-school marketer vs. Algorithms. What unique thing does each bring to the table?
Perlich: First and foremost, the algorithms are just tools of the data scientist, as a hammer is to the blacksmith. They may have become incredibly powerful, but they are just tools nevertheless.
The other thing people often overlook in the algorithm discussion is the interplay of data and algorithm. The result is NOT determined by the algorithm alone; it is determined by both the data going in and the transformation with which the algorithm shapes the answer. So let’s keep the data scientist and the algorithm in the same bucket and compare them to the Old School marketer.
Ultimately, there is a part in advertising that will forever remain the realm of human ingenuity: creativity, crafting a message, captivating the audience in that short opportunity of interaction. That role will always be owned by the marketer. The data can help compare the effectiveness of different creative/videos/etc., but it will never be able to substitute for human creativity. What I envision is a much more tightly connected collaboration where the algorithm can give feedback about different consumer groups to guide the different ways of crafting the message. I still envision the message being created by the marketer.
RTMD: Do people in your position replace anyone/anything at a company, or are you simply filling a role that wasn't required before?
Perlich: Typically, I am interested in building things that never existed. That does not necessarily mean that it was not required – it just did not exist because it was either too expensive, had not been invented or the necessary things to build it weren’t available.
Imagine the cruise control in your car. It is not exactly required but it is a really nice feature, and once one of your competitors has it, you’d better get on board or you will lose market share. A great example is the GPS. The issue was for the longest time twofold: cheap access to satellite data and cheap computational power to calculate a graph algorithm. Occasionally data-driven solutions can take over roles that used to be manual. But the majority of the gain from data is in scale – supporting decisions that happen millions of times per day, not replacing human decisions that are done once per week.
RTMD: Are there data science conventions? Where do you learn and get better?
Perlich: Yes, there are a number of academic conferences that have excellent analytical content and serve as an exchange of ideas on both the methodology and application of data science. My favorite one is KDD (Knowledge Discovery and Data Mining) – in fact, I will be organizing the 20th KDD next August here in NYC. Other academic conferences are ICML, ECML, AAAI, SDM and NIPS, and there are some more industry-driven events like DataGotham, STRATA and PAW.
RTMD: If you had to describe Big Data in just one word, what would that word be?
Perlich: Challenge.
RTMD: Thank you for your time.
Re: "A data scientist has to be skeptical and like Popper, understand that data proves nothing". Not IMHO. What Popper actually believed was that something is scientific if it can be proved true/false by data (pedantically, by showing convincingly that it's false, or failing convincingly to do so), whereas something that can't be proved true/false is not scientific. Perhaps the quote was taken without its context.
A corollary is, as our ability to process data increases, more non-scientific issues become scientific. I'll create a new law: "Science expands to fill the data available".