I got several emails in the last few days asking my opinion of a recent Quantivo blog post responding to a piece Avinash wrote a couple of months back, “10 Fundamental Truths About Web Analytics.” I hadn’t read either till I got the emails, and I ended up reading them backwards (critique first), which isn’t always ideal.
I don’t suppose my overall opinion on the issue in question (is data warehousing web analytics data useful?) is any mystery, since I’ve been a longtime and consistent advocate of warehousing behavioral data. But I think there’s enough substance in the exchange to warrant some thought.
Read both posts and I think you'll agree that, at least tonally, Quantivo may have missed some of the sly humor inherent in Avinash’s writing. When he writes a title like “10 Fundamental Truths” what he generally means is “The 10 Most Controversial and Entertaining things I can think of that have at least a grain of truth in them and will make you think a bit.”
But if he titled it that way, where would the fun be?
Reading Quantivo's post, I don’t think they quite get the joke. When your main goal is to be interesting and thought-provoking, it’s hard to be simultaneously nuanced and thoughtful. You don’t get Bill O’Reilly and Robert Stavins in one package.
Now it’s probably no surprise that the “truth” that Quantivo takes most exception to is #7: “A majority of web analytics data warehousing efforts fail. Miserably.”
It might seem that Quantivo would have little to complain about in this thesis, since they make (and admit to) a pretty similar claim. After all, one of the great virtues of the Quantivo solution is that you aren’t mired in the fearful prospect of a large IT project using old-world warehousing technology.
But what got the guys at Quantivo upset isn’t so much the headline as the sub-arguments Avinash provides for why warehousing online data is a bad idea. They can be briefly summed up as:
- There is too much data
- The data is thin and inaccurate
- There is little structure or relation in the data
- It’s hard to integrate with offline data even if you have the keys
- BI Tools are worse than web analytics tools for clickstream data
- Web sites are too dynamic to be managed effectively at the warehouse level
Here’s the full elaboration of that first claim:
“There is too much granular data! Yes yes I have purchased the Netezza appliance, yes other promise "massively parallel processing data warehousing appliances". The problem is not the hardware or the hardware company, the problem is the amount and type of data (most of it is actually worthless, even if you can get much of it into the warehouse). Things of course get worse when you think of warehousing in traditional software only solutions.”
Avinash seems to be arguing a simple performance issue, since he says that using traditional software warehousing solutions (as opposed to an appliance like Netezza) makes the problem worse. But the argument here is more than a bit of a muddle, because the real gist seems to be that most of the data is worthless (a point repeated in his second bullet – the data is thin).
This seems contradictory. If most of the data is worthless, why not throw most of it away and then warehouse the good stuff? Then you’d have the worthwhile data and a lot less of it to deal with. If too much data is a barrier to warehousing and most of the data isn’t useful, it seems like you could kill two birds with one bullet. Perhaps Avinash thinks it would be impossible to tell which data is useful and which is useless, but he nowhere makes that claim and, in fact, it would be a hard claim to take seriously.
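Just to make that concrete, here’s a minimal sketch of what “keep the good stuff” might look like as a load step. Every table, column, and event-type name here is invented for illustration:

```sql
-- Hypothetical sketch: filter the noise out of the raw feed before warehousing.
INSERT INTO warehouse.page_events_curated
SELECT visitor_id,
       session_id,
       event_time,
       event_type,
       page_id,
       campaign_id
FROM   staging.raw_clickstream
WHERE  event_type IN ('page_view', 'purchase', 'form_submit')  -- the good stuff
  AND  is_known_bot = 0;                                       -- drop the worthless part
```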
More likely he thinks that’s what Google Analytics has already done (take the good stuff and get rid of the bad) – and some of his later arguments seem to suggest that’s ultimately what he’s getting at.
The Quantivo guys sure don’t think that’s right. Here’s their take:
“…unwavering focus on simple metrics and aggregated data is the prime source of inaction and analysis paralysis in companies today. Businesses need INSIGHTS not more high-level statistics or dashboards. Yes, metrics are required to measure progress, but they do little or nothing to tell you HOW to move the needle in your business”
I'm with them on this one, of course. Aggregations always sacrifice analytic opportunities and the extreme aggregations in tools like GA are definitely no exception to this rule.
Quantivo also takes strong exception to his third point – that web analytics data lacks structure. Here, I think they’re only partially right. There’s some structure in web analytics data (as they point out, visits and visitors are a type of structure – and, in fact, the Quantivo solution does a much better job of providing interesting structure to online data than their blog post might indicate). But Avinash is partly right here too.
The most interesting structures in online behavioral data aren’t obviously resident in the data. I’ve written (here and here and here and here) and I’ve taught (I do a Think Tank course on warehousing analytics) that if you simply load event-level web behavioral data into your warehouse, you’re going to have significant problems. And you’ll have those problems for many of the reasons Avinash enumerates – the data structures aren’t those that interest the business, the data is so large that creating on-the-fly structures is nearly impossible, and BI tools don’t handle streamed records very well. These are all legitimate concerns, and they highlight the necessity of building a good data model when warehousing online data – no matter what technology you use.
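To give a flavor of what that modeling work involves, here’s the simplest possible version of de-streaming: rolling raw events up into one row per visit, a structure the business actually thinks in. This is a sketch only, built on the invented names from the filtering example above:

```sql
-- Hypothetical sketch: turn the event stream into a session-level fact table.
CREATE TABLE warehouse.session_facts AS
SELECT visitor_id,
       session_id,
       MIN(event_time) AS session_start,
       MAX(event_time) AS session_end,
       COUNT(*)        AS page_views,
       MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS converted
FROM   warehouse.page_events_curated
GROUP BY visitor_id, session_id;
```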
But just because it’s possible to do something badly doesn't mean you shouldn’t do it at all. Start thinking that way and you'll end up being more like a plant than a person.
What about the other points?
I’m not sure what to make of Avinash's “data is thin and inaccurate” claim. It’s unclear to me why thin and inaccurate data somehow becomes useful when aggregated in tools like GA. Does it get thicker and more accurate? Perhaps it’s like baking a cake? And if GA can aggregate the data in ways that are useful, why can’t a warehouse be used to deliver different aggregations that are also useful? Isn’t GA really just a nice interface into a single, generic warehouse?
When you get right down to it, the debate isn’t about warehousing data – every web analytics tool does that. The real debate is about customizing and building your own warehouse versus using a generic, vendor-provided solution with a narrow and focused interface.
This makes the next point about the challenge of offline integration particularly important – since merging online with offline data is one of the obvious and immediate benefits of building your own solution. Avinash writes, “It is worse than extracting all your teeth with a toothpick to try and get your offline data merged with your online data (even if, and it is a BIG IF, you can get the requisite primary keys).”
This is a great polemic, and while I’d love to see O’Reilly pound a table and shout “it’s like pulling teeth with a toothpick!” I have to admit that I think the polemic gets in the way of the point.
It’s true about the BIG IF. That is a big if. But if you have the requisite primary keys, why is it still hard to do the join? I’ve done those sorts of joins countless times, and I can assure you that when you have the keys it’s pretty much trivial.
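Here’s roughly what that merge looks like once the key exists – a sketch that assumes the online feed captured a customer_id matching the CRM key (names invented):

```sql
-- Hypothetical sketch: the online/offline merge when a shared key exists.
SELECT c.customer_id,
       c.lifetime_value,   -- offline (CRM) side
       s.page_views,
       s.converted         -- online (sessionized) side
FROM   crm.customers           AS c
JOIN   warehouse.session_facts AS s
  ON   s.visitor_id = c.customer_id;
```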
What about the idea that BI tools are worse than web analytics tools for handling clickstream data? Well, once again, this is a mixture of insight and confusion. It’s a fact that most BI tools don’t handle stream data very well – they weren’t designed for it and they often provide abysmal performance and poor access paths to streams of data. Here’s the funny thing: the same is true for the web analytics tools we all use. Web analytics vendors take the stream data and aggregate it into cubes before they expose it to us. All the interesting work is done by highly-optimized stream processors. These processors de-stream the data and then build aggregations from it. So when you use GA or SiteCatalyst, you’re using cubed data (except for certain special cases like Omniture’s Full-Pathing tool).
In a good warehouse, you’ll need the stream data for building views and for certain types of analysis. It's the real beauty of systems like Netezza that they have the horsepower to build these aggregations inside the warehouse. But you’ll do the vast majority of your analysis work at an aggregate level. The aggregate level you choose is likely to be much more detailed than what’s in the web analytics cubes that vendors provide – but you’ll almost certainly have de-streamed the data. Or, of course, you could also consider a solution like Truviso that’s specially designed to process streams.
Yes, BI tools do have a tough time with pathing and that can be a real issue – not that it’s impossible, but it’s definitely not as convenient as using a really good pathing tool. But web analytics tools – especially ones like GA – miss out on a whole range of visitor-level aggregations and statistical analysis techniques (including even very basic ones) that are almost always more interesting and more important than page-level pathing.
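To show what I mean, here’s the kind of basic visitor-level rollup that’s a one-liner in a warehouse but simply unavailable in most web analytics interfaces – again a sketch, building on the invented session table above:

```sql
-- Hypothetical sketch: visitor-level aggregates from the session fact table.
SELECT visitor_id,
       COUNT(*)        AS visits,
       SUM(page_views) AS total_page_views,
       AVG(page_views) AS pages_per_visit,
       MAX(converted)  AS ever_converted
FROM   warehouse.session_facts
GROUP BY visitor_id;
```

From a table like that, segmentation, distributional analysis, or a basic predictive model is a short step in any stats tool.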
There is a whole range of nuance here that makes all the difference in the world. Traditional BI, DB and Stats tools don't do everything that web analytics tools do. But web analytics tools are missing a HUGE amount of what those tools do - and much of it is clearly relevant to the analysis of online behavior.
Which brings us to the last claim – web sites are too dynamic to manage the meta-data effectively in a warehouse environment: “Campaigns, tags, links, meta data (if any that might exist), data relationships, metrics, website url structures etc cause there to be a constant demand to make changes to the underlying structure of your data warehouse every single day. Yet no dw team is organized to execute on a daily schedule, you'll be lucky to get monthly. All of the aforementioned is not a problem for your web analytics tools.”
Hmm. Why isn’t this a problem for web analytics tools? It sure seems like it is to me. And anyway, isn’t the most common way to get online data into a warehouse to use a web analytics solution to collect it? So if the web analytics solution is doing all that stuff before you generate a feed, why is it still a problem? And what exactly do web analytics solutions do about things like links and campaigns that any warehouse structure can’t do at least as easily? Besides, I thought an earlier claim said there weren’t a bunch of data relationships in web analytics data? What metrics can you add to your web analytics solution if it doesn’t already support them? And do all web analytic tools provide the ability to easily upload your meta-data and join it to tag-collected data?
I can’t make head nor tail of this argument.
Web analytics tools are among the least flexible, least powerful data manipulation tools available. Even the true enterprise tools provide extremely limited ability to add additional data sources, transform data, delete bad data, join data, and update or maintain meta-data. If web analytics tools have any maintenance advantage over a warehouse, it’s that they do so little you don’t have to worry much about it. And GA is the absolute leader of the pack in this regard.
Yes, maintenance is indeed an issue, but try as I might I can’t unearth a real argument against warehousing here.
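If anything, a dimensional model absorbs exactly this kind of churn more gracefully than a tagging schema does: a new or renamed campaign is a row-level change to a dimension table, not a restructuring of the fact data. A minimal sketch, with invented names:

```sql
-- Hypothetical sketch: site churn lands in dimension tables, not fact tables.
INSERT INTO warehouse.dim_campaign (campaign_id, campaign_name, channel, start_date)
VALUES ('SPR10-EM-07', 'Spring Sale Email', 'email', DATE '2010-04-01');

-- Renaming the campaign later touches one row of metadata and nothing else.
UPDATE warehouse.dim_campaign
SET    campaign_name = 'Spring Clearance Email'
WHERE  campaign_id = 'SPR10-EM-07';
```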
So what this all boils down to is that no one really disputes the “fundamental truth” that lots of data warehousing efforts do fail. You don’t even need to put “web analytics” in front of “data warehousing” for the claim to be true. You don’t even need the words “data warehousing” at all. Lots of IT efforts fail. Lots of business efforts fail. Lots of all sorts of things fail.
But this failure is neither inherent in the project nor in any way, shape, or form some fundamental truth about web analytics. When you look closely at the claims about why warehousing is a poor idea, you see why the story is so confused.
Many of the points are perfectly valid. There is a lot of data. Clickstream data is anonymous and error-prone. A useful data model for clickstream is non-obvious, and many warehousing efforts don’t pay adequate attention to this problem. It’s often hard to get join keys into your online data stream. BI tools are poor at handling streams of data. Web sites are dynamic and do make governance challenging.
But every single one of these issues is a challenge to existing web analytics solutions – which, as I’ve pointed out above, are just generic, vendor-provided warehousing solutions often built on worse technology than what you can easily purchase. And there is simply no reason to think that an organization willing to invest in good technology (of which there is an abundance, including solutions like Quantivo or Netezza) and in the expertise necessary to effectively model its data can’t dramatically raise the bar over what’s provided in those generic tools.
Ultimately, I don’t think anyone could take seriously the claim that web analytics data is so unique that tools like SQL, SAS, R, and a host of other database, BI and statistical analysis tools won’t provide value.
Like so many “fundamental truths,” the claim that warehousing analytics data is a bad idea is more thought-provoking than action-guiding – more polemic than policy. There are plenty of useful warnings here about what can go wrong and why data warehousing isn’t easy. What’s missing is an understanding of how to solve those problems, a recognition that a range of intellectual and technical solutions do exist to solve these problems, and a realization that the majority of the problems described also exist in dramatically worse fashion in the existing web analytics solutions being advocated as the better alternative.
Thank you Gary,
Great and reasonable account of both points of view. Extreme, blurry, or confusing claims are often a smoke-and-mirrors disguise for an overly simplified dog and pony show or a blatant commercial interest. The danger lies in people taking those claims as obvious and universal truth, as if they came from God himself.
Your analysis brings excellent points. In fact, anywhere from 60 to 90% of IT-related projects fail (depending on the source), and this is obviously true of BI projects. But I wouldn't be surprised if it's the same for web analytics projects!
Posted by: Stéphane Hamel | June 06, 2010 at 11:09 PM
This post is great in so many ways. Thanks!
Posted by: Jonathan Mendez | June 07, 2010 at 06:33 AM