So even though this is a digression (from my posts on Forecasting and Dashboards) inside a digression (my larger series on re-engineering Voice of Customer at the enterprise level), I wanted to recap and expand on a couple of the central themes of that presentation.
First, a quick table-set. I’ve taken a crack in the past at defining what “big data” is all about. It’s always hard when a concept takes off to keep control over it – the Hype Cycle (Gartner's lovely concept) can take anything, from Presidential candidates to epistemology to piano-playing cats, and so over-expose it as to make the underlying reality nearly impossible to discern.
Naturally, this creates a kind of backlash. Plenty of web analytics pundits are more than willing to describe big data as just hype. Not only is there a certain cachet in running counter to a popular trend, there’s a certain self-interest here too. Web analytics companies nearly all come from Web analytics tool backgrounds – said tools being, in most respects, the antithesis of big data tools. I imagine that it’s always safest to assume that anything you don’t understand must be unimportant!
I’m not a fan of the industry standard definition of big data (probably best exemplified by the four Vs: Volume, Variety, Velocity, Veracity) in part because I think it’s vulnerable to the criticism of skeptics that we've always had these exact same factors.
The four Vs do describe most big data situations; they just don’t get to the heart of what big data is all about. Back in the early ‘90s when I was doing credit-card work, we had volume most of today’s big data companies would still consider massive. We had plenty of velocity too, and veracity was pretty darn important when clearing card transactions. We didn’t have variety, but it’s implausible to argue that variety is essential to every big data application. There are plenty of big data applications that are single source. If I’m trying to mine CNN’s digital data stream, I don’t need variety to be in the big data universe.
So were all of us in credit-card working on big data in the early ‘90s? Some might say yes, but I don’t think so.
Instead, I’ve proposed a simpler, more basic, and more fundamental definition of what uniquely defines big data. Big data happens when you drive your data capture and analysis down from the traditional levels of analysis (like customer or transaction) to a level where the meaning of each event can only be interpreted in relationship to the stream of events. Digital is a paradigm case for this. Web site page events are not, in and of themselves, meaningful. The meaningful level of aggregation is somewhere in the sequence of events and that's what you have to interpret.
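To make that concrete, here is a minimal Python sketch of the idea, not a description of any particular tool. The events, the 30-minute inactivity cutoff, and the labeling rules in `describe` are all made up for illustration. The point is only that no single row carries the meaning; the interpretation attaches to the ordered sequence.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical page-view events: (visitor_id, timestamp, page). Made-up data.
EVENTS = [
    ("v1", "2013-05-01 10:00:00", "/home"),
    ("v1", "2013-05-01 10:01:10", "/search"),
    ("v1", "2013-05-01 10:01:40", "/search"),
    ("v1", "2013-05-01 10:02:05", "/search"),
    ("v2", "2013-05-01 10:00:30", "/home"),
    ("v2", "2013-05-01 10:01:00", "/product/42"),
    ("v2", "2013-05-01 10:03:00", "/cart"),
    ("v1", "2013-05-01 15:00:00", "/product/42"),
]

# Assumed inactivity cutoff for splitting sessions (an illustrative choice).
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Group each visitor's events into sessions, split on inactivity gaps."""
    by_visitor = defaultdict(list)
    for visitor, ts, page in events:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        by_visitor[visitor].append((when, page))

    sessions = []
    for visitor, rows in by_visitor.items():
        rows.sort()  # order by timestamp
        current = [rows[0]]
        for prev, row in zip(rows, rows[1:]):
            if row[0] - prev[0] > gap:
                # Gap too long: close the current session and start a new one.
                sessions.append((visitor, [page for _, page in current]))
                current = []
            current.append(row)
        sessions.append((visitor, [page for _, page in current]))
    return sessions


def describe(pages):
    """Toy interpretation of a page sequence; the rules are illustrative only."""
    if "/cart" in pages:
        return "purchase intent"
    if pages.count("/search") >= 3:
        return "struggling to find something"
    return "browsing"


for visitor, pages in sessionize(EVENTS):
    print(visitor, pages, "->", describe(pages))
```

A single `/search` row tells you almost nothing. Three of them back to back inside one session is a different story, and that story only exists once the events are put in order and read together.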
It’s not too different in utilities. When you move from a once-a-period meter reading per customer to constant collection, you change the nature of the analysis and data capture problem. No single meter reading is, in and of itself, important. It’s in the flow and pattern of the readings that meaning emerges. This is a different type of analysis.
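A small Python sketch of the meter case, with invented numbers: two hypothetical customers whose once-a-day totals are identical but whose hourly readings look nothing alike. The helper names and the choice of peak-to-average ratio as the pattern measure are my own assumptions, picked only to show the contrast.

```python
# Hypothetical hourly meter readings (kWh) for one day; values are made up.
steady = [1.0] * 24               # flat load all day
peaky = [0.5] * 18 + [2.5] * 6    # low load, then an evening spike


def daily_total(readings):
    """The traditional once-a-period view: one number per customer."""
    return sum(readings)


def peak_to_average(readings):
    """A simple pattern measure that only exists at the stream level."""
    average = sum(readings) / len(readings)
    return max(readings) / average


def peak_hour(readings):
    """Index of the first hour with the highest reading."""
    return readings.index(max(readings))


for name, readings in [("steady", steady), ("peaky", peaky)]:
    print(name, daily_total(readings), peak_to_average(readings), peak_hour(readings))
```

At the traditional level of analysis these two customers are the same customer. The difference between them shows up only in the shape of the readings over time.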
It should also be clear from this why the Four V’s look like a reasonable definition of big data to those in the field. When you drive your unit of analysis down a level, you increase by one or more orders of magnitude the volume of your data capture and the velocity of your data. You place additional demands on data collection that can result in poor data quality. And while variety isn’t necessarily wrapped up in the concept, you have created a whole new set of challenges around joining data that lives at the stream level – making multiple sources (variety) far more difficult to handle.
But the beauty of the definition I’ve provided is that it makes it clear why my ‘90s credit card work – despite hitting the Vs pretty well – wasn’t necessarily big data. No amount of the four Vs makes for big data if you're just scaling up the same exact types of data and analysis as you've always done. It also explains much of why today’s big data technologies are built the way they are and why they provide unique advantages that traditional transactional systems don’t. Those traditional systems sure-enough handled lots of volume, just not in the ways we need it handled now.
It also explains another aspect of big data that is particularly important and represents one of the biggest risks if you’re building a big data system. The nature of the analysis and the methods necessary to join, process, and understand the data all change at the stream level.
Traditional analysis techniques, from joining methods to SQL queries to aggregations to statistical techniques like correlation, regression and clustering, all work differently, if they work at all, when applied to this type of detailed stream data.
I’ve seen this cast as a debate between machine learning and traditional analysis; it isn’t.
Machine learning may (though I think it’s debatable) be particularly useful in big data situations because of the symptoms (the four Vs) that spring from detailed stream-level analysis. As far as I can tell, there is nothing about detailed stream-level analysis itself that makes machine learning particularly suitable. The really important point isn't about machine learning; it's that your standard analytics toolbox is mostly out the window.
So while defining big data by the four Vs may miss the mark, it's far, far more misleading to suggest that big data is just "more of the same", perhaps with a bit of an emphasis on the "more". If proponents of the more-of-the-same view are claiming that we still need to decide how to structure data, how to join data, how to query data, and how to analyze data, then their claim is merely empty. Of course we do. But if they mean to claim that we should use the same methods to join, structure, query and analyze the data as we always have in traditional transactional or BI systems, then they are flat-out wrong.
It's nearly always a safe bet that opposing the hype cycle will make you at least half-right. But if the big data skeptics are half-right about what's wrong with the hype, they are wholly wrong about the alternative.
[I should mention that I'm going to be speaking on a Big Data panel at the DAA Symposium in Washington DC on June 4th. The Symposiums are probably the single best thing the DAA has created. I've attended a fair number and been consistently impressed. If you're in the area, do come out!]