In my last post, I took a shot at defining what a data science role actually is or might be. My goal wasn’t to try to figure out from the hugely varied, largely confused and often contradictory public discourse a definition based on consensus. Instead, I wanted to lay out my own vision for the actual roles required in an organization to do analytics and see if any of them fit the broader, fuzzy data science term reasonably well. Given that approach, I had to define all the roles involved in enterprise analytics efforts (outside of IT and program management type stuff), which was perhaps a little bit ambitious. Here in snapshot form is what I came up with:
Based on this spectrum of roles, I advocated for a definition of the data scientist role focused on the “Statistical ETL and Analytics Foundation”. This role uses advanced analytics to describe the fundamental patterns in the data which are then standardized in a higher-level data model and consumed and used to drive subsequent business analysis. In digital, this type of work would be the creation of the visit-types (what we call a Two-Tiered Segmentation) or units-of-work that describe what the low-level web or mobile behaviors represent. With something like the IoT, this would be the core interpretation of what the usage pattern meant (e.g. for a fitness app whether a pattern represented somebody climbing stairs or riding an escalator).
What I like about this description is that it identifies a somewhat new type of advanced analytics activity that’s deeply related to big data and isn’t well done or represented in the enterprise. In other words, it helps fill a real gap in the analytics organization and isn’t just a fancy and expensive rename for your existing analysts. I also like this because it removes much of the ambiguity about what you need when you hire a data scientist. You don’t need a data journalist, a database architect, an ETL specialist or a full-stack analyst. Ideally, you need a pretty strong statistical modeler with knowledge of pattern-matching or algorithmic techniques appropriate to your data and enough sense or experience to understand the types of patterns that might be important. I hope that sounds a little less impossible to find.
Surprisingly, I didn’t get a huge amount of push-back on this definition though there was a fair amount of debate about the use of the term data scientist at all. What I did get, though, was an interesting question around one of the other roles: Taxonomy and Meta Data Systems.
Here’s the query:
“I was wondering if you could expand upon the "Taxonomy & Meta-Data Systems" section a bit for me as I'm not sure I understand exactly what you are referring to. I'm not really sure what is being classifying here and what the meta-data is referring to. How does it differ from the ETL in the previous step? Would it be possible to provide concrete examples relating to this step?”
This role wasn’t central to my thesis and, on re-reading the post, I have to agree that I didn’t do a great job explicating it.
Before I take a crack at explaining the Taxonomy role, I want to caveat that I’m not sure that meta-data and taxonomy are everywhere as important or complex as they are in digital. I’m not sure they aren’t either – but much of what I’m going to talk about is highly specific to digital.
So here goes.
There’s a presentation I developed for X Change that delves into the foundation of digital analytics. At the most fundamental level, digital analytics works by assuming that the content people consume was CHOSEN by them and is indicative of their preference. This assumption is profoundly important. To illustrate how this assumption changes the metrics and approach we use to understand behavior, I use a little example I call Conan the Librarian (with thanks to imghumour.com).
Before Conan can read (some librarian!), here’s his view of the world.
This is the equivalent in digital analytics of metrics based on page views, time spent, and visits. Using these simple metrics, we can get a certain kind of insight into our patrons. But mostly, we are just fooling ourselves.
Once Conan learns to read, he gets a totally different view of the world:
In a world where content consumption is driven by user choice, understanding the CONTENT is critical. Conan’s understanding of his patrons after he learns to read is like night and day. He understands what users want, what else they might like, and much, much more about who they are. Things like whether they are in school, have kids, are checking-out for someone else, or are going on vacation just fall out of the content understanding. What’s more, he understands that many of his conclusions about patrons based on the simple consumption metrics were simply mistaken. The “heaviest” reader is actually checking out books for their child. The “light” reader is consuming hard, serious literature.
Learning to read is, here, just a proxy for understanding the content that’s self-selected and then consumed. Building and developing that content understanding is what I mean when I describe a role for meta-data and taxonomy in the analytics organization.
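To make the Conan contrast a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the checkout log, the patron names, the tiny taxonomy table, and the function names `volume_view` and `content_view`. It simply shows how the same log looks when you only count activity versus when you join it to content meta-data.

```python
from collections import Counter, defaultdict

# Hypothetical checkout log of (patron, title) pairs; all data is invented.
checkouts = [
    ("patron_a", "Goodnight Moon"),
    ("patron_a", "The Very Hungry Caterpillar"),
    ("patron_a", "Where the Wild Things Are"),
    ("patron_a", "Green Eggs and Ham"),
    ("patron_b", "Ulysses"),
]

# Hypothetical taxonomy: title -> (audience, difficulty).
taxonomy = {
    "Goodnight Moon": ("children", "easy"),
    "The Very Hungry Caterpillar": ("children", "easy"),
    "Where the Wild Things Are": ("children", "easy"),
    "Green Eggs and Ham": ("children", "easy"),
    "Ulysses": ("adult", "hard"),
}


def volume_view(log):
    """Conan before he can read: count checkouts per patron."""
    return Counter(patron for patron, _ in log)


def content_view(log, meta):
    """Conan after he can read: tally content attributes per patron."""
    profile = defaultdict(Counter)
    for patron, title in log:
        audience, difficulty = meta[title]
        profile[patron][audience] += 1
        profile[patron][difficulty] += 1
    return profile


if __name__ == "__main__":
    print(volume_view(checkouts))
    for patron, attrs in content_view(checkouts, taxonomy).items():
        print(patron, dict(attrs))
```

By volume alone, patron_a looks like the heavy reader. Once the taxonomy is joined in, the same log suggests a parent checking out picture books, while the "light" patron_b is the one reading hard literature.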
In the digital world, this role is especially important because of the amount of content we have to classify and the potential complexity of that content. We frequently analyze sites that push hundreds or even thousands of pieces of new content live each week. Most of the sites we analyze have tens of thousands of pages that can be consumed or hundreds of thousands of products. If you’re going to analyze and understand users of those sites, you need to understand lots and lots of content.
Now I know what a bunch of you are thinking. I have this done already. Or (even worse), this is just data governance. No you don’t (unless you’re a Netflix equivalent) and no it isn’t (no matter who you are). You may indeed have a basic content classification on your site: topic and sub-topic or category and sub-category. That isn’t nearly enough for good analysis. In a post I wrote some time back, I outlined a bunch of different approaches to taxonomy that most people ignore. Give it a read.
But keep in mind that even this only scratches the surface. For most Websites, content is the engine that does the work. You should never stop thinking of ways to classify it. Do you use enough verbs? Is your reader just getting started on a topic? Is your content written at the right reader level (this post comes in at a paltry 10th grade level)? Does the model in the picture have the right color hair? Is the column length right? Does the user care more about brand or price? Is the number of off-links appropriate? Is the ratio of images to text right? Is the product appropriate for gifts? Is the emotional tone appropriate for patients? Is a buyer a luxury shopper? Should a call to action be persuasive or exclamatory?
If you want to answer these questions…if you want to answer almost any question about users and content effectiveness via either analytics or controlled experimentation, you will need to explore and develop taxonomies. That’s not a job for data governance. It’s a job for analytics. Its purpose is analytic. Its understanding is analytic. And, since you mostly won’t be doing manual classifications on large sites, its methods are analytic.
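As one small illustration of what an analytic (rather than manual) classification might look like, here is a hedged Python sketch that tags content with a reader level. It uses the Flesch-Kincaid grade formula, which is my choice for the example and not necessarily the measure behind the grade-level figure quoted above. The syllable counter is a crude vowel-group heuristic, and the function names and the "basic / intermediate / advanced" cut-offs are purely hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Estimate the U.S. school grade level of a piece of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def reader_level(text):
    """Map a grade estimate to a hypothetical three-bucket taxonomy."""
    grade = flesch_kincaid_grade(text)
    if grade < 6:
        return "basic"
    if grade < 10:
        return "intermediate"
    return "advanced"


if __name__ == "__main__":
    for sample in (
        "The cat sat. The dog ran.",
        "Organizational taxonomies necessitate continuous analytical experimentation.",
    ):
        print(reader_level(sample), "-", sample)
```

A real taxonomy effort would want a better syllable counter and far more dimensions than reading level, but the point stands: once a classification is expressed as code, it can be run across tens of thousands of pages and handed to the analysts as one more attribute of the content.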
That’s why I’ve carved out an explicit role in the analytics team for meta-data and taxonomies. At least when it comes to digital, I want someone whose sole job is thinking about how to classify content, who is constantly testing for new and innovative ways to do that, and who will make those categorizations available to the data scientists and business analysts who can use them.
If the data science role I outlined around statistical ETL and pattern-building is a significant gap in most analytics teams, this role around meta-data is even less well served and yet, in digital, is every bit as fundamental.
By the way, I couldn’t end this piece without highlighting a comment by EY’s own Micah Greer on the data scientist. “It's a bit of a cheap shot, but even the title of the role is poorly formed, as no proficiency in the enterprise of science is required. Unusually Smart Data Engineer maybe?”
May we all be (like Micah, I might add), unusually smart data engineers!