Sampling is one of the core techniques in primary research. In traditional opinion research, the sample is the thing. It's not just that a valid sample is critical to useful research; it's also that a good sample is surprisingly difficult to get. But for Social Media measurement, most people don't think of "sampling" as part of the problem. Social Media is generally considered to be more like Web analytics, where we analyze the complete set of ALL behaviors. After all, tools like Radian6 and BuzzMetrics are designed to capture everything, aren't they?
Well no, actually they aren't.
Sampling is surprisingly important at several levels of Social Media Measurement, and limitations on samples have a powerful impact on what types of analysis are appropriate with social data. So it may be no surprise that the role of sampling in Social Media was one of the most interesting parts of my panel this past week at eMetrics.
Let's start at the top level and work down. Listening tools in social media measurement DON'T collect everything. Collecting everything is impossible. Every tool vendor makes a distinct set of decisions about what (and how often) to collect. They decide which sites to scan and how often to scan them. They also make decisions about what data to collect. Not every vendor, for example, will collect comments on blog posts. Some do, some don't. All vendors are further limited by the closed nature of communities like Facebook that restrict what can be collected.
So at the very top of the social measurement process there's a sample. It's a sample that's not designed to be representative. Instead, it's designed to be as comprehensive as is practical. Most vendors choose to collect what they view as the largest feasible collection of information. For some sources, this comes very close to being comprehensive, but for other sources, not so much.
There's really not a whole lot you can do about this type of sampling except determine whether the differences by vendor matter to you. Still, it's useful to be aware that you aren't starting with either a comprehensive collection or a representative sample when it comes to social measurement. This has real implications for how you can use the data and how you should think about your findings.
What about the next step in the process of social media measurement - the step we describe as "culling"? In the culling phase, you subset your data to find the verbatims of interest to you. In most listening tools on the market, you do this by creating keyword profiles that select any verbatims matching the keyword logic you specify (using a combination of Boolean operators and text operators like "NEAR").
This approach holds true EVEN for the vast majority of machine-learning tools. These tools can classify (taxonomically and by sentiment) your verbatims, but they work on a subset of the verbatims that is chosen using keyword profiles.
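To make the culling step concrete, here's a rough sketch in Python of what a keyword profile boils down to. The brand terms, the NEAR window, and the exclusion rule are all invented for illustration (this isn't any vendor's actual syntax), but the basic logic is the same: whatever the profile doesn't match simply never enters your dataset.

```python
import re

def near(text, term_a, term_b, window=5):
    """Crude NEAR operator: true if term_a and term_b occur within
    `window` words of each other (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

def matches_profile(verbatim):
    """Hypothetical keyword profile: a brand mention OR a sub-brand
    NEAR a product word, minus an exclusion aimed at sales chatter."""
    text = verbatim.lower()
    include = ("acmeco" in text) or near(text, "acme", "widget")
    exclude = ("coupon" in text) or ("% off" in text)
    return include and not exclude

verbatims = [
    "Love my new Acme widget, battery life is great",
    "AcmeCo support took three days to reply",
    "20% off every Acme widget this weekend - coupon inside",
]
culled = [v for v in verbatims if matches_profile(v)]
print(culled)  # the coupon post is silently dropped from the subset
```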
As with the top of the funnel (collection by the vendor), this subset is usually chosen to be comprehensive, not representative. However, the way you configure this subset has profound implications for all of your subsequent research.
Suppose, for example, that you create a profile based on every case where your "brand" is mentioned. If you then try to use this profile to do product/feature research, you have a subset biased heavily toward your own products. That may be fine or it may be fatal depending on the type of analysis you want to do.
If you want to do competitive analysis, you need to be sure that you set up EVERY competitor in exactly the same way. If you build a rich profile containing your product names and sub-brands, then you need to match that profile exactly for your competitors. If you don't, you've biased all your results with a poor sample.
Or imagine that you want to understand the share of mentions by key topics such as Customer Support, Price Comparisons and Feature Mentions. If you've eliminated a significant number of Price Comparisons by setting up exclusionary rules to weed out "sales" posts, you've biased your sample.
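To see how much damage an innocent-looking exclusion rule can do, here's a toy calculation (the counts are invented, purely for illustration):

```python
# Toy illustration with invented counts: an exclusion rule aimed at
# "sales" chatter also removes a big chunk of genuine Price Comparisons.
full_population = {
    "Customer Support":  3000,
    "Price Comparisons": 2000,   # many of these also look like "sales" posts
    "Feature Mentions":  5000,
}

# Suppose the exclusion rule drops 60% of Price Comparisons and almost
# nothing else (assumed numbers).
after_exclusions = {
    "Customer Support":  2950,
    "Price Comparisons":  800,
    "Feature Mentions":  4900,
}

def shares(counts):
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

print("True shares:    ", shares(full_population))
print("Reported shares:", shares(after_exclusions))
# Price Comparisons falls from 20% of mentions to about 9% - and nothing
# in the report tells you the drop came from the profile, not the market.
```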
What's particularly tricky about bias at this level is that there is virtually no way to detect it - particularly when the bias, as in my last example, is exclusionary. The data simply never shows up in the reports.
The vast majority of profiles that we look at in Social Media Measurement introduce significant biases in the subsets they create. Doing so is nearly unavoidable. What's far more worrisome is that almost no one who is using the data understands what's been done or what the implications are.
If you're using machine classification and sentiment analysis, you may be done with sampling after these two levels. On the other hand, if you are using human readers for sentiment analysis and classification, or for the isolation of key verbatims, you've got at least one more sampling problem before you're done.
Many organizations, having realized that the sentiment analysis contained in keyword-based listening systems is somewhat worse than useless, have opted for listening agencies that use human readers to classify sentiment. I thought one of the most surprising aspects of our panel was that both Michael and Christopher were deeply skeptical of the quality (not just the scalability) of this approach. Their objections were concentrated on the problem of human interpretation and the difficulties in achieving consistent sentiment analysis with human readers.
But there's also a sampling issue here. It's impractical for most large enterprises to pay for the reading of every verbatim included by the profiles created in the culling process. If your volume of social chatter is small enough, it may not be an issue, but for a large or socially-oriented brand, comprehensive readership is impossible. So you have to sample the subset produced by "culling".
This sample raises its own set of issues, because it is fully intended to be a representative sample.
But here's the question: representative of what? If your subset contains 10,000 verbatims in a month, it seems that by taking 1 in every 10 you could create a representative sample of verbatims. And so you could. But let's suppose that your 10,000 verbatims represented the following:
9,100 Twitter Mentions
800 Blog Mentions
100 Press Mentions
With a 1 in 10 sample, you'd likely end up with something close to the following:
910 Tweets
80 Blogs
10 Press Mentions
I've sampled everything perfectly, but I've reduced the size of my source populations so that I can no longer draw any conclusions about Press Mentions and only very shaky conclusions about Blogs.
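A quick back-of-the-envelope margin of error makes the point (95% confidence, worst-case proportion of 0.5, ignoring finite-population corrections):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion at worst case p=0.5."""
    return z * math.sqrt(p * (1 - p) / n)

for source, n in [("Twitter", 910), ("Blogs", 80), ("Press", 10)]:
    print(f"{source:7s} n={n:4d}  +/- {margin_of_error(n):.1%}")

# Twitter n= 910  +/- 3.2%
# Blogs   n=  80  +/- 11.0%
# Press   n=  10  +/- 31.0%
```

A sentiment estimate that could be off by 30 points in either direction isn't an estimate at all.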
This implies that if I want to understand sentiment by source, I should oversample the less common sources to make sure I have sufficient volume for analysis. Unfortunately, source is unlikely to be my only interest. If I want to understand sentiment by influencer level, for example, I have a different population I need to oversample.
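One way to handle this - sketched below with assumed numbers - is to set a per-source sample size that hits a target precision, then weight the sources back to their true population shares when you roll up an overall number:

```python
import math

# Oversampling by source: read enough verbatims per source to hit a target
# margin of error, then reweight when reporting an overall figure.
# The 10% target and all of the counts are assumptions for illustration.
population = {"Twitter": 9100, "Blogs": 800, "Press": 100}
target_moe = 0.10                                      # per-source precision goal
needed = math.ceil((1.96 / target_moe) ** 2 * 0.25)    # ~97 verbatims at p=0.5

sample_sizes = {src: min(n, needed) for src, n in population.items()}
print(sample_sizes)   # 97 per source - nearly a census of Press mentions

# Invented per-source positive-sentiment rates measured on those samples:
positive_rate = {"Twitter": 0.40, "Blogs": 0.55, "Press": 0.70}

# The overall number must be weighted by each source's true share of the
# population, not by its (deliberately distorted) share of the sample.
total = sum(population.values())
overall = sum(positive_rate[s] * population[s] / total for s in population)
print(f"Weighted overall positive rate: {overall:.1%}")   # 41.5%
```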
What's more, if I keep drawing fresh samples for every report and then trending them, sooner or later I'm going to get a bad sample. Suppose I have 95% confidence that my sample will be within +/-5% of the real number. If I'm pulling a weekly sample, there's a good chance that sometime during the year my sample is going to be significantly off - setting off either alarm bells or misguided back-patting. Oh, and there's no immediate way to know that the sample is off unless you repeat the whole process several times.
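The arithmetic here isn't reassuring. If each weekly sample independently has a 5% chance of falling outside the tolerance band, then over a year:

```python
# Probability of at least one "bad" weekly sample in a year, assuming each
# weekly sample independently has a 5% chance of landing outside the band.
p_bad_week = 0.05
weeks = 52
p_at_least_one = 1 - (1 - p_bad_week) ** weeks
print(f"{p_at_least_one:.0%}")   # roughly 93%
```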
In short, human readership MAY add to the quality of sentiment analysis even as it lessens the quality of the reporting. The cost/benefit of the trade-off is likely to be determined by the degree to which human readership forces sampling and the extent to which an organization needs to slice its data by sub-categories. Since nearly every meaningful report or analysis involves sub-categories, I suspect that human readership - when it demands sampling - is a poor solution for social measurement.
I'll have more to say in future posts about the whole idea of sentiment analysis. I'm not convinced that social media measurement is the proper channel for measuring either brand awareness or brand sentiment. Much of the reason for my skepticism comes down to the fact that Social Media measurement isn't based on a valid sample at any level. This doesn't mean Social Media measurement isn't interesting or important. It does mean that it can't fulfill every function equally well - and of the functions that are most problematic, brand sentiment may be at the top of the list!
Many thanks to Michael Healy and Christopher Berry for their thoughts at the panel (and Marshall Sponder as well since he and I talked on Friday). We had a great turnout - which was nice to see - and the discussion was lively and interesting.
Unfortunately, my own sample of the Conference was a poor one. I was just crushed by meetings on Wednesday when I was speaking at eMetrics, missed Thursday, and found the IMC Conference pretty much DOA on Friday (I'm thinking that particular Sub-Conference might need a bullet in the head to put it out of its misery). On the plus side, I'll be back on the East Coast in a couple of weeks and there's still plenty of time to register for the WAA Symposium in Philadelphia - it should be a terrific event - and I'm looking forward to my "nouveau" panel!
Gary,
Marshall Sponder sent me your post because we had had a discussion about sampling and data accuracy. While I agree an organization needs to pay attention so as not to unintentionally skew its analysis when analyzing a subset of data, I disagree that all monitoring tools are unable to create a robust and precise sample of data. Yes, if you rely strictly on keyword or boolean expressions, then the process to exclude/include on-topic conversations becomes quite brittle and ineffective. However, at Collective Intellect, we rely on semantic technology to create robust filters to collect and organize our sample datasets. In other words, our technology does not rely on someone knowing every combination or variation on a term; our engine is able to understand context, recognizing the difference between crocs (the reptile) and Crocs (the shoes). We've tested our categorization accuracy and have achieved quite precise results. You can read more here http://www.collectiveintellect.com/blog/social-crm-starts-with-categorization , if you are interested.
Thanks for posting about this topic, it's a good one. If you'd like to chat more, please drop me a line.
Posted by: Jennifer | October 31, 2011 at 11:14 AM
Jennifer,
We've actually used CI with some of our clients, and while I have some reservations about the flexibility of the classification capabilities, it's certainly better than boolean keyword-based systems.
I'm not sure, however, that even extremely precise categorization will solve all the sampling issues I discussed. It's extremely easy to bias a sample even with perfect categorization - simply by missing some concept that's also relevant. Nor does this really address sampling issues at the sourcing level. I remain unconvinced that the proper function of Social Media Measurement is to create an accurate sample for customer research purposes (of the sort implied by a commitment to brand sentiment tracking), and I'm very skeptical that, if such is the intent, it is actually possible.
Of course, some of my remarks on samples were actually directed more toward systems that rely on human readership for classification, and I'm guessing you'd be more inclined to agree with me here. If (and I assume you don't) you believe that human readership is necessary for accurate classification, then it stands to reason that you can't use a machine-learning system to build your sample. If the machine-learning system built an accurate sample, there would be no need for a human reader! In point of fact, most human-reader systems do use a sampling method based on keywords, boolean logic or some other criterion (such as influence) that significantly distorts the population and may sacrifice the attendant benefits of improved classification...
Gary
Posted by: Gary Angel | October 31, 2011 at 05:10 PM