Since I'm about to pick fault with one small piece of it, I want to emphasize how much I enjoyed the presentation by Ryan Caplan, President of ColdLight Solutions, and how thought-provoking it really was. Not only did I appreciate the relaxed style, but I found the content unusually stimulating. What I at first thought might be another reprise of "what makes a good analyst" turned out to be a subtle approach to the question of the analyst's role in the world of big data, machine learning, and massive integration.
There are many different paths to analysis, but the classic path is probably to begin with a question, pick a method of analysis, and then determine which variables to assess within that method. It's a perfectly good procedure and it works just fine. As Ryan pointed out, however, it has problems of scale. It turns out that our big data isn't just long, it's wide too. As we integrate more data points and more sources, the number of variables available for analysis grows and can quickly exceed what any analyst or team can work through in this traditional iterative fashion.
Fortunately, high-performance systems serve double duty for us here as well. They can work on lots of data (length) and they can process lots of columns (breadth) and figure out which ones matter. There is, in fact, a set of data analysis techniques specifically designed to look at the complex inter-relationships of many variables in an efficient manner. With these techniques, you can analyze hundreds or even thousands of independent variables to develop a model. An analyst simply can't do this iteratively.
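To make that concrete, here's a minimal sketch of what a wide variable screen can look like in practice. This is my illustration, not anything ColdLight has described: the file, column names, and choice of a random forest are all hypothetical stand-ins, and it assumes the candidate columns are already numeric.

```python
# A hedged sketch: let an ensemble model rank hundreds of candidate columns
# at once instead of testing them one at a time. File and column names here
# are hypothetical stand-ins, and the columns are assumed to be numeric.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("movies.csv")              # wide table: one row per movie
X = df.drop(columns=["vod_orders"])         # hundreds of candidate predictors
y = df["vod_orders"]                        # the outcome we care about

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Rank every column by how much it contributed to the fitted model.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(20))
```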
So the advent of high-performance analysis systems offers a significant variation on the traditional role of analyst. With standardized analysis methods and automated model-building across very large numbers of variables, the analyst is left with only "question generation" as a key task. Even that function may be partially subsumed in an automated, data-driven analysis. By finding unexpected relationships and influence patterns between variables, this type of analysis can actually generate questions (why does variable X correlate to variable Y?) - changing the analyst's role from someone who starts with a question and works toward an answer to someone who starts with an answer and works toward an explanation.
I'm a believer in the technologies and methods involved here. We've been doing neural network analysis and segmentation at Semphonic since our inception. There are, however, some important limitations of machine learning and automated analysis that I think are easily missed or poorly understood, and I want to talk about them in light of one of the examples that Ryan gave.
I wasn't taking notes during the presentations, so I'm going to do my best to reconstruct this from memory. If I get it wrong, I'm thinking Ryan will probably correct me!
Comcast was looking to understand the performance of movies in Video on Demand (VOD). In particular, they were testing a hypothesis that box office performance would be the key correlate of VOD performance. Traditionally, an analyst would have tested this hypothesis directly. By running the analysis with these techniques for handling large numbers of variables, however, it was discovered that the first letter of the movie's name was quite significant - and that starting with "A" was fairly predictive of VOD success.
It should take this audience only a moment to realize why that relationship exists: the VOD interface is a guided system organized alphabetically. In effect, a movie starting with "A" gets a significant nudge, since "A" is the first category of movies and its titles are the first set of movies you'll see. It's a common and well-understood UI phenomenon of the sort I described in this post on the intellectual foundation of Web analytics.
On the other hand, it's exactly the type of relationship that an analyst might easily miss and that an automated data analysis system might manage to call out.
Or is it? What bothered me about the example is the somewhat implicit representation that these types of analysis systems will identify ANY pattern in the data - even the first letter of a movie name. That isn't really the case and it's important to understand why it isn't the case if you're really going to think about the role of the analyst when working with this type of technology.
Because here's the thing: it's impossible for any system to generate and check every possible pattern in the data.
Don't believe me? Then let's consider just the variable "Movie Name". Let's assume it's a sixty-character string field. Now consider that a pattern inside that field might involve the first character, but it might also involve the first two characters. Perhaps movies that begin with AZ perform better than movies that begin with AB. Or perhaps the pattern involves the first three characters. Or the first and third characters. And so on. It should quickly be apparent that finding all the possible relationships for the single field Movie Name would involve something like 6 vigintillion (look it up) different combinations. Even God does not have time for this.
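For the curious, here's rough arithmetic behind a number of that magnitude. The exact figure depends on what you count as a "pattern"; under one simple convention of my own choosing - fix some subset of the sixty positions to specific letters A through Z - the count comes out, if anything, even larger than 6 vigintillion:

```python
from math import comb

LENGTH, ALPHABET = 60, 26   # a 60-character field over a 26-letter alphabet

# Count patterns of the form "these k positions hold these specific letters":
# choose which positions to constrain, then a letter for each chosen position.
total = sum(comb(LENGTH, k) * ALPHABET ** k for k in range(1, LENGTH + 1))
print(f"{total:.2e}")       # ~7.6e85 patterns -- hopeless to enumerate
```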
And, of course, this represents only a small fraction (!) of the possible patterns within a single field. It might be that vowels do better than consonants, and that movies with many vowels in their title do better than otherwise expected. Or perhaps personal pronouns or soft phonemes or movies with colors in their title do better. The meta-data possibilities are, quite literally, infinite. Exhaustive examination of the potential patterns is impossible. Not difficult. Impossible.
So here's the thing: somebody created something that generated a variable (perhaps an order of presentation, or just a row order that happened to be in the data) which called out the significance of the first character of the movie name. Because no analytics system that has ever existed or will ever exist could have discovered it by brute force.
Nor is this a trivial example or a calling out of arcana. While systems capable of massive variable analysis change the role of the analyst when it comes to identifying variables for inclusion, they don't really remove that function and, in some respects, they make it harder. In the old days, we would have just tested the obvious variable - Box Office Performance - and been done. Now we have to comb through our vast reams of data to figure out what should be included. As my example above shows, it's impossible to pick everything. Sure, you can pick every single field you have available to you, but that's nothing like EVERYTHING you have available to you.
Think I'm stretching the point?
Here's a real-world example of my own.
A few years back we built a customer segmentation for an online travel aggregator.
The data points we got from them were search and transaction detail data, by customer. We had the date and time of every search and transaction, the type of search and transaction, the DMA location of the searcher and the search itself, the dollar amounts involved, and the dates of any stay or trip leg.
As we built the segmentation, however, we found that nearly every variable that was interesting and predictive turned out to be a form of meta-data or a transformation of the existing fields. The date and time of a search was largely meaningless until it was subtracted from the date and time of the first trip leg to yield a variable called "Days till Search." That variable didn't exist in the initial set. The location of the searcher wasn't too interesting until we paired it with the location of the search, geo-located both, and derived a distance between the two for a variable called "Distance of Trip." The date and time of each trip leg weren't very interesting until, you guessed it, we subtracted the first from the last to get a "Trip Duration." The location of the search itself wasn't too interesting until we categorized destinations as "Business" or "Leisure." The dates of the trip selected weren't too interesting until we grouped searches by visit and categorized searchers that had "Multiple Days" for a single destination or "Multiple Destinations" for a single day. That last set of categorizations turned out to be the single most interesting variable in the whole analysis.
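For readers who want to see the shape of this work, here's a condensed sketch of those derivations. The column names are hypothetical stand-ins for the aggregator's actual fields, and it assumes the searcher and destination locations have already been geo-coded to latitude/longitude:

```python
# A hedged sketch of the derived variables described above, using pandas.
# All file and column names are illustrative, not the client's real schema.
import numpy as np
import pandas as pd

df = pd.read_csv("searches.csv",
                 parse_dates=["search_ts", "first_leg_date", "last_leg_date"])

# "Days till Search": gap between when someone searched and when the trip starts.
df["days_till_search"] = (df["first_leg_date"] - df["search_ts"]).dt.days

# "Trip Duration": last trip leg minus first trip leg.
df["trip_duration"] = (df["last_leg_date"] - df["first_leg_date"]).dt.days

# "Distance of Trip": great-circle distance between searcher and destination,
# assuming both locations are already geo-coded to lat/lon columns.
def haversine_miles(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

df["trip_distance"] = haversine_miles(df["searcher_lat"], df["searcher_lon"],
                                      df["dest_lat"], df["dest_lon"])
```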
Almost every interesting variable turned out to require this type of analyst identification. These variables didn't exist natively in the data, and they would not likely have been discovered by turning a massive processing system loose on the underlying variables.
The role of the analyst in identifying important variables is and remains critical to the process. Perhaps someday machine-learning systems will have evolved to the point where they can intelligently identify a wide range of potentially interesting meta-data points. That would be quite close to thinking. Such systems do not exist today. Yes, some types of algorithms might recover by brute force some of the meta-data relationships described above (the classification of destinations, for example), and might even improve on them in some respects.
This, however, is by no means certain.
It is far more likely that they would be drowned out by the impact of variables that are more concisely aggregated. A machine-learning system might begin to work out a pattern of significance between origin and destination (for example) that implicitly captured the distance factor. On the other hand, it might not, or it might capture it only in a few very popular cases. By creating the meta-data distance variable, we greatly magnify the ability of the analysis to model the actual factor.
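A toy experiment makes the point. The data here is entirely invented for illustration: one model sees only raw origin/destination IDs, the other sees the explicit derived distance, and the outcome is constructed to be driven by distance. I would expect the second model to score far higher:

```python
# Illustrative only: compare a model given raw origin/destination IDs with
# one given the analyst-derived distance, on an outcome driven by distance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, n_cities = 5000, 200
coords = rng.uniform(0, 100, size=(n_cities, 2))     # fake city coordinates
orig = rng.integers(0, n_cities, n)
dest = rng.integers(0, n_cities, n)
distance = np.linalg.norm(coords[orig] - coords[dest], axis=1)
y = distance + rng.normal(0, 5, n)                   # outcome = distance + noise

raw = np.column_stack([orig, dest])                  # what the machine sees natively
engineered = distance.reshape(-1, 1)                 # the analyst-derived variable

for name, X in [("raw IDs", raw), ("derived distance", engineered)]:
    score = cross_val_score(RandomForestRegressor(n_estimators=50), X, y, cv=3).mean()
    print(f"{name}: R^2 = {score:.2f}")
```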
The exploration of relevant variables has been and remains one of the core functions of an analyst. It is a step in which great art resides and it is the key to good analysis. The advent of systems that can analyze very large numbers of variables might, at first glance, appear to greatly diminish the importance of this step. That they do not is a tribute to the vast complexity of the world and the impossibility of exhaustive search or unguided exploration. If machine-learning systems ever do subsume this function, it will not be by brute force but by the creation of processes for data categorization that mimic the intuition of the analyst.
Until then, it is our art which must prevail.
Wonderful post!
An analogy comes to mind. We have wonderful tools and models to do hypothesis testing, but we don't have any great models for generating hypotheses in the first place - at least not from statistics.
Posted by: Michael Whitaker | November 09, 2011 at 08:36 AM
Really enjoyed your posting and I'm flattered that I was able to get a good discussion going on what I find to be a fascinating subject. And, thanks for the kind words about the talk. I agree that there are real-world challenges that are not easily solved by AI or machine-assisted learning alone.
Of course there are very good examples of where this technology CAN complement the analysis process, but it can't simply replace it. Finding non-obvious patterns in other structures of data using AI and machine learning (beyond free text, in the case of the example) is both possible and practical in many cases.
In my opinion, it is a combination of using machines for what they are good at... calculations across very large sets of information... and recognizing people for what they are good at: hypothesizing, intuition, creativity, validation and investigative instinct. The example you point out illustrates a key reason why machine learning is not THE final answer, but possibly a part of the dialogue.
For the foreseeable future, I agree, we humans are still in control.
Posted by: Ryan Caplan | November 15, 2011 at 11:39 AM
Thanks for posting this, Gary - it's very thought-provoking. I agree with you that the role of the "analyst" is more than just asking great questions. I also agree with Ryan's comment above. In fact, when thinking about a standard approach to statistical modeling, there are three areas where the role of the analyst stands out for me:
1. Defining the question.
3. Identifying the meta-data (or semantic relationships) - what you mention above.
5. Selecting the best model - based on the combination of statistical indicators (R² and VIF in a regression), contextual explainability (is that a word?), and the ability to act on the model's findings. (A quick sketch of these indicator checks follows the list below.)
The other steps are better candidates for automation and ML, IMHO:
2. Collecting (and some cleaning of) the data.
4. Building and refining multiple models.
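As promised in step 5 above, here's a quick sketch of those indicator checks. It assumes statsmodels, and the data and variable names are purely illustrative:

```python
# Hedged sketch: compute R^2 and per-predictor VIF for a toy regression.
# Variable names and data are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["box_office", "marketing_spend", "screen_count"])
y = 2 * X["box_office"] + rng.normal(size=200)

Xc = sm.add_constant(X)
fit = sm.OLS(y, Xc).fit()
print(f"R^2 = {fit.rsquared:.3f}")

# VIF flags multicollinearity among predictors (rule of thumb: worry above ~5-10).
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))
```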
Posted by: Matthew_wakeman | November 21, 2011 at 04:55 PM
Great post (and blog overall). I am obviously very late to the discussion but I don't think the relevance of the topic has expired since November :-).
Here is another topic that this blog touches on. Despite the limitations of machine analysis that you very eloquently pointed out, many companies with a large web presence are still using only a small portion of the data they could potentially tap to understand what is going on with the online portion of their business. We are only at the dawn of BigData (even if for some firms out there it is already afternoon in that respect), but we can already pinpoint some of the big opportunities that the ability to sift through large amounts of data more cheaply than ever is opening up. Nevertheless, a large number of companies are still grappling with how to incorporate BigData into their existing organizational architecture (both technical and human). While, as you mention above, the human is still (and will remain for the foreseeable future) a central point of this architecture, I'd like to argue that a medium-to-large company with a significant portion (or all) of its revenue coming through its online presence cannot survive without setting up a BigData shop and incorporating it firmly into its organizational structure. One of the big benefits of BigData is the ability to bring together multiple data sources more cheaply (web analytics data, bid management tool data, advertising server data, logged data, etc.) and enable deep analytics of the type that none of the individual tools can do by themselves.
If one accepts the view above (which is relatively easy nowadays), then here are some questions that I think are worth answering:
- What are the best practices for introducing BigData into an organization that is a traditional RDBMS and web-tools shop (e.g., should it sit in the technology organization, the analytics organization, or somewhere else)?
- What is the best way to integrate the traditional analysis currently done through web tools (Omniture, Google Analytics, etc.) with the deep analysis (what you call "machine analysis") that can be done using a BigData framework, in an organization that is mostly attuned to using traditional tools only?
I am aware that the answers to these questions may be obvious or apparent to many of the readers and posters here, but I think there is a large audience out there that may be interested in hearing some answers.
Thanks again for the insightful post.
Posted by: milorad.sucur | December 19, 2011 at 08:27 AM
This is one of the most useful blog posts I've read--thanks!
I'm new to ML, and your ideas on the need for wisdom in analysis have helped me make sense of how ML can be put to practical use.
Some machine learning (ML) writers seem to take pride in the mistaken belief that ML algorithms only have to be provided with minimal guidance, and they will then magically find great correlations. However, nowhere else in life would one expect such a lackadaisical approach to work well.
FYI, the philosopher Karl Popper provides some additional insight into the learning challenges described in your post. Popper showed that all observations must be preceded by, and guided by, a hypothesis. He illustrated this to his college students by telling them, "observe". They would invariably reply, "observe what?" Observation must always be guided by a hypothesis and a purpose - for one thing, because otherwise there is too much information to process. A corollary of Popper's principle here is that there is no such thing as a completely objective or unbiased observation.
An application of Popper's principle, here, is that ML searches for correlations must be guided by well-crafted hypotheses about where correlations might be found.
Thanks again,
Jim Yuill, PhD
Lockheed Martin Corp.
Posted by: Jim Yuill | November 29, 2012 at 09:56 PM