A Different Take on the AOL Search Data
I thought I’d take a quick break from Functionalism (I’ll have a new web post this weekend) to lay out some thoughts on the recent AOL search data madness. As a disclaimer, though AOL has been a client of ours, we’ve never worked for the Search Group and weren’t involved in this in any way – so my thoughts are an outsider’s, based solely on the various articles I read…
The firestorm of publicity surrounding the release of AOL search data underscores the risks that companies take in keeping and analyzing data about user behavior. An incident like this will have repercussions that most consumers and journalists never hear about, think about or, probably, care about. It will make it that much harder for managers in any organization to collect, analyze and deploy solutions based on tracking and understanding consumer behavior – no matter how reasonable the use or beneficial the application.
For most people, that may seem like a good thing. But is it?
There’s a clear difference between the AOL case and other recent "Data Loss" cases. I imagine everyone agrees that when a company or governmental agency loses private data, has it stolen, or exposes it on the internet, that’s a bad thing. But that isn’t what happened here. AOL released data scrubbed of personal identity so that researchers could better understand the way searchers actually work. This is an important usability issue on the Internet – and it’s a complicated problem on which many independent researchers might well be able to make a significant contribution. So, in many ways, what AOL did was laudable – not a case of abuse but legitimate good use.
Now I think users probably realize and expect that companies will do this kind of thing internally. There is no significant web enterprise running today that doesn’t analyze both clickstream and search data to better understand user behavior on its site. The U.S. government spends millions of dollars a year collecting and publishing data on a much wider range of activities – and while that data is also scrubbed and aggregated, it’s often quite possible to reconstruct personal information. So the collective outrage about AOL’s decision to release this data reflects a deep misunderstanding of what the data is and how it can be used to benefit everyone from consumers to businesses.
Is there anything questionable in the AOL release? For thoughtful commentators, the biggest concern has been that even scrubbed data often has pointers to particular people. That’s true, and the fact that the search data is non-aggregated makes it potentially more vulnerable. But it’s also fair to suggest that actually identifying anyone is extremely difficult (does anyone search "My Name is Gary Angel"?) and that the use of this data was never going to harm anyone.
OK, I know that people were able to identify at least one woman from the data. That's not a good thing – but I thought we were in the middle of an identity fraud epidemic! How is this such big news? Here's a litmus test for you:
There is a simple, easy and nearly foolproof way to clean up this data. Suppose that AOL had used a unique integer to thread the data and then replaced any search term that appeared in fewer than five different threads with a code indicating that it was an "exotic" search. It's not like researchers need to know that some idiot in Paducah entered a search string like "my ssn is 818-111-1112!"
It's very difficult to imagine (but again, not impossible) that the same personally identifying information would appear in five different threads. If it did, you'd have to wonder how private it was to begin with!
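A minimal sketch of that suppression rule in Python – the record layout, function names and "<EXOTIC>" code are my own illustration, not AOL's actual format; only the five-thread cutoff comes from the rule above:

    from collections import defaultdict

    MIN_THREADS = 5          # the five-thread cutoff described above
    EXOTIC = "<EXOTIC>"      # placeholder code for suppressed rare terms

    def scrub(records):
        """records: (thread_id, search_term) pairs, where thread_id is the
        unique integer that stands in for any real user identifier."""
        # Pass 1: count the distinct threads each term appears in.
        threads_per_term = defaultdict(set)
        for thread_id, term in records:
            threads_per_term[term].add(thread_id)
        # Pass 2: suppress any term seen in fewer than MIN_THREADS threads.
        return [(thread_id,
                 term if len(threads_per_term[term]) >= MIN_THREADS else EXOTIC)
                for thread_id, term in records]

    data = [(t, "weather") for t in range(1, 6)]    # common: 5 threads, kept
    data.append((6, "my ssn is 818-111-1112"))      # rare: 1 thread, dropped
    print(scrub(data)[-1])                          # (6, '<EXOTIC>')

Two passes over the file and the problem largely disappears – and no researcher loses anything worth having.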
Suppose AOL had done this. Would you feel differently about it? Even more germane: do you think it would have made the slightest bit of difference to the way the story was reported, or to the "outrage" that was manufactured, or to the jobs at AOL that were lost? I don't think so.
In a world of very real security and personal identity concerns, this one seems banal.
And the real problem is that for search engines and web sites to make themselves better, this is precisely the kind of thing they have to study and even use. Web sites have improved their usability and functionality dramatically in the past five years. And features like Amazon’s "Suggestions" are accepted as de facto best practice. That doesn’t happen by magic. It happens because companies study how users behave on their web sites and try to figure out how to make them better.
Would it surprise people to know that department stores and grocery chains do the same thing? That retailers analyze how shoppers flow through their stores and group the products they purchase into "baskets" that tell them which items are purchased together and should be adjacent on shelves?
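For the curious, the core of that "basket" analysis is just co-occurrence counting. A toy sketch in Python, with invented item names and checkout data:

    from collections import Counter
    from itertools import combinations

    def basket_pairs(baskets):
        """Count how often each pair of items shows up in the same basket."""
        pair_counts = Counter()
        for basket in baskets:
            # sorted(set(...)) so each unordered pair counts once per basket
            for pair in combinations(sorted(set(basket)), 2):
                pair_counts[pair] += 1
        return pair_counts

    baskets = [["bread", "butter", "milk"],
               ["bread", "butter"],
               ["milk", "cereal"]]
    print(basket_pairs(baskets).most_common(2))
    # [(('bread', 'butter'), 2), (('bread', 'milk'), 1)]

The pairs that turn up together most often are candidates for adjacent shelf placement – exactly the reasoning web sites apply to clicks and searches.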
Should the data AOL released have been kept under corporate control? I’d say no – I think it would be better if more companies shared their de-personalized and/or aggregated data. Though I do think they should have cleaned up the "onesies" using something like the method above.
Worse, the reaction has been so disproportionate to the decision that it raises the stakes in the minds of web marketers everywhere. It won’t protect individuals’ personal data. It may even divert resources from meaningful efforts to protect that data. However, it will probably ensure that your web experiences are less productive than they really should be. In the end, it’s hard to believe that these occasional paroxysms of public hysteria about data privacy are really good for anybody. The protections they encourage are usually knee-jerk reactions – poorly considered and probably unproductive.
The problem with cases like the AOL data release is that web marketers are just as likely to react irrationally as anyone else. Here’s the choice they face: do some form of data analysis to make their site better and gain a small increase in profit, or risk their job by ending up in the public spotlight because what they were doing somehow became a "story." Nobody wants to lose their job!
There’s a big difference between a company that exposes or releases data of direct personal consequence out of carelessness or cupidity and one that releases scrubbed, non-personal data in a legitimate attempt to improve its service. Pretending that the difference doesn’t exist makes it harder for everyone – the people charged with protecting your data and the people who need to use it.