There is no problem as consistent, ubiquitous and challenging in the life of an analyst as bad data. As one member of my Huddle at X Change remarked, all data is bad; it's just a question of degree. I think that's true, but it's also true that the degrees matter. Some data is too poor to squeeze any value from. Making lemonade from lemons is all very well, but what do you do if your lemons are rotten? Understanding the degree to which your data is bad, the respects in which it might be useful or flawed, and knowing the techniques both for isolating problems and taking advantage of good data are all essential elements of the analyst's toolkit.
Due to the unexpected, late cancellation by a Huddle Leader, I found myself leading a Huddle on "Dealing with Less than Perfect Data" - a duty for which I had little time to prepare but PLENTY of practical experience (as who among us does not?).
I had a simple plan for the discussion: start with some "horror" stories that people had experienced, then extract from these some common themes about what causes bad data. If an analyst is sensitive to the causes of bad data, they are more likely to be attuned to it and recognize it before they use it. From there, I wanted to cover techniques for identifying when your data is bad. What are the common warning signals and tests you can use to see whether some piece of your data is interesting or simply erroneous? Finally, I wanted to cover techniques for working with bad data, including some discussion of trending and when it is or isn't appropriate as a method of handling data problems.
As we did the initial "ice-breaker", however, several people also mentioned problems in communication - how to talk about data quality. I added that to the end of the list since it seemed like a good closer.
It's probably no surprise that our group (rich in actual practitioners) had plenty of horror stories. You don't have to be a Web analyst very long to have a rich fund of these stories. As we dissected them, several common types of problems emerged. Problems in what I would call Meta-Data Maintenance (SAINT tables out-of-date, CMS files not updated, etc.) were extremely common. Tagging of dynamic data and interactive experiences was another frequent culprit for bad data. Several horror stories were driven by breakages in teams - where a new resource or team was plugged into a process very late or without adequate training or experience in measurement. Multiple systems demanding reconciliation also showed up in several accounts. Reconciliations are a bear and can be one of the most frustrating tasks an analyst faces. Untracked changes were another big source of headaches: new pages, new campaigns, site modifications that nobody knew about or had ever heard of. We've all been the victims of lack of information.
For an analyst, kvetching about data quality can be as therapeutic as complaining about an ex (spouse/girlfriend/fill-in-the-blank). But I wasn't shooting for therapy. The key step is to distill those horror stories into a deeper sense of when your data is likely to betray you. You should know that when the technical team is swapped out midway through an implementation, your chances of clean data take a significant hit, that data from Flash applications needs to be much more carefully vetted than HTML pages, and that if you aren't being told about site changes, then sudden drops in site traffic are more likely to be measurement artifacts than real-world phenomena.
Certainly one of the big process recommendations that almost everyone in the Huddle shared is to make sure that the measurement teams are "in-the-loop" when it comes to both marketing and site changes. Every organization should make sure that ALL web stakeholders know when new campaigns are launched, have a complete list of every online marketing effort readily accessible, and know the release schedules for ANY change.
Knowing whether you are being "paranoid enough" about your data is the first step in controlling bad data. The second step is even more important and probably more challenging - figuring out when your data is actually bad.
Any good analyst will regularly apply the "sniff" test to identify bad data. But while we all know what the "sniff" test means when it comes to sour milk, it's not so obvious how it applies to data. Here are a few common sniff tests our group discussed. The first and most obvious is the zero case. If you're getting no data on a campaign or a page or a conversion, then something is broken. A parallel, though less common, case is the 100% case. If your funnel is showing 100% conversion, or a page has 100% click-through, or your month-over-month change is exactly zero, then your measurement is broken. Perfect failure and perfect success are equally unlikely in this world of ours. Dramatic changes, too, are a common sniff test - particularly when they exceed normal measures of variability. Site traffic down 50% in the last week with no known cause? Measurement is a more likely culprit than a dramatic change in your business.
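You can even automate some of these sniff tests. Here's a minimal sketch (my own illustration, not something from the Huddle) that checks a daily metric series for zeros, "perfect" values, and swings far outside normal variability; the column names and thresholds are hypothetical.

```python
# Hypothetical sniff tests over a daily metric series.
# Column names ("date", "visits", "conversion_rate") and thresholds are
# illustrative, not tied to any particular tool's export.
import pandas as pd

def sniff_test(df: pd.DataFrame, z_threshold: float = 3.0) -> list:
    """Flag the obvious failure modes: zeros, 'perfect' values, and
    week-over-week swings far outside normal variability."""
    warnings = []

    # The zero case: no data at all usually means broken collection.
    if (df["visits"] == 0).any():
        warnings.append("Days with zero visits - check tagging/collection.")

    # The 100% case: perfect success is as suspicious as perfect failure.
    if (df["conversion_rate"] >= 1.0).any():
        warnings.append("Conversion rate at 100% - measurement is likely broken.")

    # Dramatic changes: compare each week-over-week change to the
    # series' own historical variability.
    wow = df["visits"].pct_change(periods=7).dropna()
    z = (wow - wow.mean()) / wow.std()
    if (z.abs() > z_threshold).any():
        warnings.append("Week-over-week change well beyond normal variability.")

    return warnings
```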
Trusting your stakeholders was another important lesson when it comes to finding bad data. The sniff test is, ultimately, a business faculty. You have to both know about and care about the business to make valid "sniff" test judgments in cases where the data problem isn't absolutely obvious. That's one of the main reasons why offshoring report generation and analysis can be so dangerous. In my experience (and I think this was widely shared in the group), offshore resources simply aren't sensitive to bad data.
On the other hand, your stakeholders often know the business extremely well. If numbers don't feel right to them, you should take careful notice. It's a mistake to simply insist that it's "what the numbers show." When stakeholders believe data might be bad, it's ALWAYS a good idea to give it careful consideration. A stubborn analyst is every bit as dangerous as a stubborn executive.
Okay, but what do you do when data smells a little bit off but might still be right? How can an analyst check to see if data is bad? After all, the line between interesting data and bad data can be extremely fine.
The group had a whole host of ideas about this. Many of these ideas boiled down to a single, essential technique - segmentation. By segmenting along dimensions like time, browser, geography, and visit number, you can quickly see whether data issues are driven by particular factors. If conversion rates are zero for a specific browser type, you've likely identified the source of your problem. By segmenting, you're pushing the data into a more granular form where you may be able to discover an absolute (0 or 100%) problem.
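To make that concrete, here's a rough sketch of segmenting conversion by browser to surface the absolute cases. The DataFrame layout and column names ("browser", "converted") are made up for the illustration.

```python
# Illustrative only: segment conversion by a dimension to surface 0% / 100% cases.
# Assumes a visit-level DataFrame with a "browser" column and a 0/1 "converted" flag.
import pandas as pd

def conversion_by_segment(df: pd.DataFrame, dimension: str = "browser") -> pd.DataFrame:
    summary = (
        df.groupby(dimension)["converted"]
        .agg(visits="count", conversion_rate="mean")
        .reset_index()
    )
    # The absolute cases (0% or 100%) are where bad data tends to hide.
    summary["suspect"] = summary["conversion_rate"].isin([0.0, 1.0])
    return summary.sort_values("conversion_rate")
```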
But while segmentation is the analyst's closest friend when it comes to investigating data, it may not be the friend you talk to first. Many in our group suggested that the first step when you suspect bad data is to return to the site itself with another analyst buddy, the HTTP debugger. You can use Charles or Firebug or whatever tool you like, but if you suspect bad data from a Website, it NEVER hurts to run through the actual pages and see what happens when the tags fire. Too many analysts are unwilling or unable to do this simple unit testing of tags, and it's a fatal deficiency when it comes to data analysis. The broad consensus in our group was that it's the FIRST avenue of exploration when bad data seems likely. If you can't find anything wrong on a basic tag inspection, then it's time to start segmenting your data.
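If you want a crude automated complement to the manual walk-through, something like the sketch below flags pages whose HTML never includes the expected tag snippet at all. It won't catch tags injected at runtime - that's what the debugger is for - and the URLs and snippet signature here are hypothetical.

```python
# A crude, illustrative first pass: fetch a list of pages and flag any whose
# HTML doesn't contain the expected tag snippet. Tags injected at runtime
# won't be caught this way; use an HTTP debugger (Charles, Firebug) for that.
import requests

PAGES = ["https://www.example.com/", "https://www.example.com/products"]  # hypothetical
TAG_SIGNATURE = "s_code.js"  # hypothetical: whatever library your pages should load

def pages_missing_tag(urls, signature):
    missing = []
    for url in urls:
        html = requests.get(url, timeout=10).text
        if signature not in html:
            missing.append(url)
    return missing

print(pages_missing_tag(PAGES, TAG_SIGNATURE))
```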
Here's another simple yet oft-ignored tactic - ask around. It's surprising how often people know of significant changes or factors that may be driving your data. Getting analysts to "ask around" can be harder than you think, but it's critical if they're to avoid spinning their wheels on data problems that could be readily explained with a little extra knowledge.
And remember that list of site changes and campaigns I talked about earlier? Well, it can come in handy here. Matching your data trends to that list can be illuminating. Did traffic drop suddenly in the hours after you pushed a "maintenance" release? When data changes tie to a significant site release or marketing event, you can be pretty confident that you're hot on the heels of the true culprit.
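Here's a hedged sketch of what that matching might look like in code - lining up sudden drops against a change log. The frames, column names, and thresholds are all hypothetical.

```python
# Illustrative sketch: line up sudden traffic changes against a change log.
# Assumes a daily metrics frame ("date", "visits") and a change log
# ("date", "change") kept by the team, with "date" columns as datetimes.
import pandas as pd

def changes_near_anomalies(metrics: pd.DataFrame, changelog: pd.DataFrame,
                           drop_threshold: float = -0.3,
                           window_days: int = 2) -> pd.DataFrame:
    m = metrics.sort_values("date").copy()
    m["pct_change"] = m["visits"].pct_change()
    anomalies = m[m["pct_change"] <= drop_threshold]

    # For each anomaly, list releases/campaign changes in the days just before it.
    rows = []
    for _, a in anomalies.iterrows():
        nearby = changelog[
            (changelog["date"] <= a["date"])
            & (changelog["date"] >= a["date"] - pd.Timedelta(days=window_days))
        ]
        for _, c in nearby.iterrows():
            rows.append({"anomaly_date": a["date"], "drop": a["pct_change"],
                         "change_date": c["date"], "change": c["change"]})
    return pd.DataFrame(rows)
```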
So what if the data is bad - are you stuck? Not really. As I said at the start, all data is bad to some degree. That doesn't give you license to use just any data, though. Some data is too faulty to be used. Is the level of variation in your data larger than the relationships you're finding in it? If so, then you probably can't use the data. In general, we agreed that understanding the level of natural variation in your data, as well as the extent of the bad data, is essential before deciding whether you can use the data in any given fashion.
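One rough way to frame that comparison is to put the effect you think you've found next to the natural week-to-week variation in the same metric. The sketch below is just an illustration of the idea; the column names are hypothetical and the two-to-one rule of thumb is arbitrary, not something the group endorsed.

```python
# Rough illustration of the "is the effect bigger than the noise?" question:
# compare an observed lift to the natural week-over-week variation in the
# baseline series. The threshold is a made-up rule of thumb.
import pandas as pd

def effect_vs_noise(baseline_weekly: pd.Series, observed_lift: float) -> dict:
    noise = baseline_weekly.pct_change().std()  # natural week-over-week variation
    ratio = abs(observed_lift) / noise if noise else float("inf")
    return {
        "natural_variation": noise,
        "observed_lift": observed_lift,
        "usable": ratio > 2.0,  # arbitrary: the effect should clearly exceed the noise
    }
```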
This is hardly cut and dried. You may be able, for example, to use your Web analytics data for Repeat Visitors but not your data for New Visitors. Repeat Visitors are almost always repeat visitors, though not the full set of repeat visitors. "New" Visitors, on the other hand, are often returning to the site. Understanding the difference in quality is critical in understanding how you can use a metric - even a single metric like Visit Number. It may be much more usable for some values than others.
We also took up the topic of "trending" as a protection against data quality issues. As I've written before, the idea that trending protects against data quality issues presumes that those issues are distributed randomly. That may be the case, but it often isn't. A "trend" toward more first-time visitors, for example, is quite as likely to be a measurement artifact of the shift to browsers that block 3rd-party cookies as it is to be a true site phenomenon. Trends can just as easily be evidence of data quality problems as protection against them.
Which brings me, finally, to the question of communication. Given that all data is at least somewhat bad and some data is too bad to use, how do we communicate potential issues around data quality without paralyzing the web analytics effort?
This probably could have been a Huddle topic unto itself. Throughout this post I've summarized the discussion and left out a great deal that is valuable, but nowhere more so than here.
I thought one of the most important (and surprising) insights was that an analyst has to be careful not to focus on data quality issues when first working with a team. In talking about this, many in the group felt that an analyst has to show that data could be used before being given license to abuse it (verbally). We had some true hard-core practitioners in our group, and I think their advice is relevant not just to Web analysts but to ANY new member of a team.
Demonstrate your value before you start criticizing. Believe me, it makes a difference.
I shared one example of this from our practice. At Semphonic we almost always include a set of data quality and measurement infrastructure improvements in our analysis presentations. We put them at the end of the deck (we start with Findings & Recommendations). So before we make any data complaints or suggestions, we demonstrate real value from using the data. I think this goes a long way toward making the criticisms both credible and palatable.
Another key point was how fragile stakeholder trust can be. Presenting one obviously bad number can ruin an entire report or even an entire measurement effort. It doesn't matter if the number isn't your fault or isn't in your control. I've seen reports wrecked by the inclusion of, for example, panel data that simply wasn't reasonable for a given business. When stakeholders see one metric they know is wrong, they lose confidence in all the rest of the data. It's too late to start explaining the difference between panel data, opinion research data, and Web analytics data. As an analyst, your job is to understand that difference before you present and to decide whether the panel data (in this instance) is reflective of the actual business. If it isn't, it shouldn't have been in the report from the start. You can blame executives for not understanding these explanations, but when you've put obviously incorrect data in front of someone, the loss of their confidence is both deserved and difficult to repair. Believe me, we've all made this kind of mistake; I know I've done it more often than I'd like to admit.
All in all, it was a great discussion of the sort that is absolutely unique to X Change. Mix together a roomful of real practitioners, give them a topic to sink their teeth into, and you can hardly help having a great discussion. The group helped crystallize some of my thinking around data quality control from an analytical perspective, and I hope that my brief summary of the discussion will be equally beneficial.
We all live with less than perfect data. Sometimes we can find it, sometimes we can fix it, sometimes we can even use it. But as any analyst knows full well, it's all too easy to be deceived, to miss what's really important or to "find" what isn't there. By knowing where data quality issues are most likely to occur, understanding key sniff tests, knowing how to check data quality, and knowing when and when not to use bad data, we have a shot at finding something close to the truth!
Our biggest challenge with data integrity is lack of proper testing when data points are introduced or maintenance is done to a page/functionality. I stress repeatedly (with limited success) the importance of testing - and it needs to be proper testing done by someone who understands what it is they are doing, not interns or temp help with little to no competent direction. I've seen far too many 'testers' assume that if data is showing up then it's working; visible data DOES NOT equal correct data. Just prior to reading this email I was asked for some data on some promo spots on our site, and a cursory glance at Omnibug showed that each clickthru was being recorded twice. That would have been easy to see if it had been tested beforehand instead of after the fact. Now I'm in the situation of trying to figure out if this data is at all usable: Can I simply cut the numbers in half? When did this data duplication start? Is it happening in isolated enough instances that it won't distort the results enough to matter, since the question being asked by the Marketing group is very high-level trending with no real definition of success measures?
It is so much easier and more time-efficient to test beforehand than to try to find and fix problems later.
Posted by: Cleve Young | October 03, 2011 at 08:18 AM
Cleve,
Great point. Testing in Web Analytics is a real challenge - and one for which it's hard to find a single good solution. This is particularly problematic for tagging solutions (which is pretty much all any of us use these days) since by the time you find a problem during an analysis, the data is gone. I think you're right that testing is not the domain of automatons, though I think there is a role for automation. We've seen (and tried) everything from tools like Observepoint, to using the Data Feed to set up regression tests, to formal testing scripts and semi-automated tag capture. Can't say I have a clear winner. I do think it's important to train implementers in basic unit testing and important, beyond that, to have a comprehensive testing strategy that uses one or more of these types of techniques.
BTW - the double-firing tag is such a common problem that I wonder if the vendors shouldn't trap it (I don't think it would be hard). That would sure be nice. I'm fairly certain I could write a simple script that detected that condition in the data feed.
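Something along these lines would probably do it - a rough sketch only, with hypothetical column names, since real data feeds label these fields differently:

```python
# Hypothetical sketch of detecting double-fired hits in a row-per-hit data feed.
# Column names (visitor_id, hit_time, page, event) are illustrative; assumes
# hit_time is a datetime column.
import pandas as pd

def double_fire_rate(hits: pd.DataFrame, within_seconds: int = 2) -> float:
    """Share of hits that repeat the same visitor/page/event within a couple
    of seconds - the classic signature of a double-firing tag."""
    h = hits.sort_values(["visitor_id", "page", "event", "hit_time"]).copy()
    gap = h.groupby(["visitor_id", "page", "event"])["hit_time"].diff()
    dupes = gap <= pd.Timedelta(seconds=within_seconds)
    return dupes.mean()
```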
Anyway, thanks for the great comment!
Gary
Posted by: Gary Angel | October 04, 2011 at 01:24 PM