Overview
For many years, marketing professionals have relied on a set of analysis techniques designed to help them understand the demographic and psychographic profiles of their customers and prospects. These traditional segmentations are usually derived from complex clustering techniques that map rich primary research data (usually survey based) into common groups or profiles. These groups are then given highly descriptive business names and rich descriptions and provide a framework for a wide range of marketing activities. Though such segmentations can be (and are) applied to online customers, companies that have tried to map these segmentations down to the individual level (for targeting or reporting) in the online world have mostly been disappointed. In Part I of this series, I described the biggest pitfall in extending these segmentations – the near impossibility of mapping demographic and psychographic profiles to visitors about whom we typically know nothing except their online behavior. In this post, I’ll discuss tackling this problem from the other direction – beginning with a behavioral segmentation and adding demographic and psychographic information.
Behavioral Segmentation
Building a behavioral segmentation is no piece of cake. Unlike traditional segmentation, which is complex but well understood, behavioral segmentation on the web has not been routinized. For traditional segmentations, the mechanics of data collection, the types of variables most likely to be interesting and the tools and techniques to produce a good segmentation are all well understood. The most difficult part of a traditional segmentation tends to be the “art” of naming and describing the resulting segments.
Most of this isn’t true for behavioral segmentation. Data collection isn't too big a problem. There are issues with linking online survey data to web behavioral data, but though this is not a slam-dunk it is more of a nuisance than a genuine difficulty. The problems typically begin when we consider the type of variables most likely to be interesting and the tools and techniques to produce a behavioral segmentation.
Companies have struggled to extend their traditional segmentations into the online space, but they’ve also struggled to build any kind of useful behavioral segmentation – and the problem comes from both variable interpretation and tool limitations.
Web behavioral segmentations tend to use the core web behavioral data points (pages viewed, time on site, visits and, sometimes, poor geo-demographic variables like IP-based DMA).
The problem with web-based geographics isn't accuracy - it's precision. A DMA (or even a Zip-Code) is just too large an area with too diverse a population to be useful in targeting or segmentation. Useful geographics need to be at the census-block or zip+4 level. Down at the block level, you have geographics that are as good as – and in some ways better than – the actual personally identifiable information you might get from a customer file or match-back.
If you segment on these core web behavioral variables what you get, almost invariably, is a segmentation scheme that looks like an Egyptian pyramid. At the base you have a large number of visitors who do almost nothing. As you ascend the pyramid, you have a few additional levels: a group with middling page views and few visits, a group with middling visits and moderate page views, a group with lots of page views and few visits and, at the apex, a group with lots of page views and lots of visits.
This segmentation scheme is about as useful as a real pyramid but much less interesting!
To get a useful behavioral segmentation, you need to look at a different set of variables. We’ve found two types of variables that generally produce interesting segments.
The first is a set of variables around each visitor's propensity to view content based on your business taxonomy. Most pages on your web site are focused on specific topics. These may be products, investment strategies, health conditions, news areas, etc. The most interesting set of behavioral facts about a visitor is which of these topics they consume and how much of each topic they consume.
So the first, and biggest requirement for good behavioral segmentation, is usually to have a good site taxonomy. This shouldn’t be a surprise. Most useful web analysis actually happens at the taxonomy level and there are a host of reasons why you should make sure this information flows through to your web measurement.
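To make the idea of topic consumption concrete, here is a minimal sketch of how page views might be rolled up into per-visitor topic variables. Everything in it is an assumption for illustration: the URL-prefix taxonomy, the `(visitor_id, url)` row format and the function names are hypothetical, and a real site taxonomy would normally come from your content system or your web measurement tags rather than from URL prefixes.

```python
from collections import defaultdict

# Hypothetical taxonomy: URL path prefix -> business topic.
TAXONOMY = {
    "/products/": "products",
    "/investing/": "investing",
    "/news/": "news",
}


def topic_for(url):
    """Return the taxonomy topic for a URL, or 'other' if no prefix matches."""
    for prefix, topic in TAXONOMY.items():
        if url.startswith(prefix):
            return topic
    return "other"


def topic_profile(page_views):
    """Turn (visitor_id, url) rows into per-visitor topic counts and shares."""
    counts = defaultdict(lambda: defaultdict(int))
    for visitor_id, url in page_views:
        counts[visitor_id][topic_for(url)] += 1

    profiles = {}
    for visitor_id, by_topic in counts.items():
        total = sum(by_topic.values())
        profiles[visitor_id] = {
            topic: {"views": n, "share": n / total}
            for topic, n in by_topic.items()
        }
    return profiles


# Made-up sample rows, not real data.
page_views = [
    ("v1", "/products/widget"),
    ("v1", "/products/gadget"),
    ("v1", "/news/today"),
    ("v2", "/investing/bonds"),
]
profiles = topic_profile(page_views)
```

The "views" figure captures how much of a topic a visitor consumes, while "share" captures how much of their attention that topic gets relative to everything else they viewed.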
The second type of variable we’ve found interesting is what we call “session-styles.” Session-styles are designed to capture two salient behavioral facts about visitors – what type of navigational devices (search, directory, link-drives, images, etc.) they use and what type of sessions they typically have. These two questions are intimately related.
Sessions on sites tend to range from highly directed (immediate and specific search) to very unfocused (sideways navigation along the top-navigation). Each site will support a range of session-styles that are quite distinct. Before we begin a visitor segmentation, we typically like to start with a session-based segmentation to identify these styles. The styles then become variables at the visitor-level. It turns out that visitors often split along very interesting fault lines in terms of their types of sessions – even when they share a common topical interest.
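As a rough illustration of how session-styles become visitor-level variables, the sketch below labels each session and then computes each visitor's mix of styles. Note the big assumption: the styles here come from a few hand-written rules with arbitrary thresholds, which only stand in for a proper session-based segmentation. The field names (`used_search`, `pages`, `nav_clicks`) and the style labels are hypothetical.

```python
from collections import Counter, defaultdict


def session_style(session):
    """Label one session with a crude style; thresholds are illustrative only."""
    if session["used_search"] and session["pages"] <= 3:
        return "directed"
    if session["nav_clicks"] >= session["pages"] / 2:
        return "browsing"
    return "mixed"


def style_mix(sessions):
    """Return, per visitor, the share of their sessions falling in each style."""
    counts = defaultdict(Counter)
    for s in sessions:
        counts[s["visitor_id"]][session_style(s)] += 1
    return {
        visitor: {style: n / sum(c.values()) for style, n in c.items()}
        for visitor, c in counts.items()
    }


# Made-up sample sessions, not real data.
sessions = [
    {"visitor_id": "v1", "used_search": True, "pages": 2, "nav_clicks": 0},
    {"visitor_id": "v1", "used_search": False, "pages": 8, "nav_clicks": 6},
    {"visitor_id": "v2", "used_search": False, "pages": 5, "nav_clicks": 1},
]
mix = style_mix(sessions)
```

The output is one row of style shares per visitor, which is the form a visitor-level segmentation needs.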
Combine visitor profiling based on the depth, frequency, mindshare and time-spent by site area and the mixture of session-styles visitors employ, and you will usually end up with quite a rich set of profiles. It will look nothing like the usage pyramid and, in terms of its ability to support rich descriptives, it will rival (but probably not equal) traditional segmentation.
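For readers who want to see the mechanics, here is a minimal sketch of clustering visitors on a combined feature vector. It uses plain k-means purely as a stand-in; I am not claiming it is the right technique for your data, and in practice you would use a real analysis tool. The three features (share of product views, share of news views, share of directed sessions) and the four visitors are invented for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids for repeatability."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical visitor vectors:
# [share of product views, share of news views, share of directed sessions]
visitors = {
    "v1": [0.9, 0.1, 0.8],
    "v2": [0.1, 0.9, 0.2],
    "v3": [0.8, 0.2, 0.9],
    "v4": [0.2, 0.8, 0.1],
}
labels, centroids = kmeans(list(visitors.values()), k=2)
segments = dict(zip(visitors, labels))
```

With this toy data the product-focused, directed visitors (v1, v3) land in one segment and the news-focused browsers (v2, v4) in the other. Real data will be far messier, and the choice of k is exactly the "how many segments" question discussed next.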
But the very success of these variables carries the seeds of a serious problem. Most traditional segmentations tend to deliver a relatively small number of profiles – somewhere between 5 and 10 segments. It’s a good number, because it doesn’t burden the marketer with too much apparatus. Having 20 or 25 segments is simply too much to hold in your mind.
Unfortunately, good behavioral segmentations tend to spin off quite a few more segments. These can all be combined, of course. The analyst has control over the number of spaces mapped and you can always force segments together. But in my experience, behavioral segmentations tend to produce more very-distinct segments – segments that do not easily collapse without significant loss of information. I’ll talk about this problem in more depth later on and show some of the techniques we've used to make large numbers of segments both more palatable and more usable.
The tendency to focus on the wrong variables is only one half of the behavioral segmentation problem. The other half is tool-centric. Web analytic tools simply don’t provide the necessary methods to build data-driven segmentations. There is not a single classic web analytic tool – enterprise or otherwise – that has any of the mathematical techniques typically used to build traditional segmentations. As I mentioned in my first post, what web analytics tools have called visitor segmentation is nothing more than primitive rule-based filtering. Even if you had full SQL access, you couldn’t do real visitor segmentation. And the filtering you can do in the advanced web analytics tools like Discover (even On Premise) or similar enterprise products doesn’t even come close to having full SQL access.
So one way or another, you’ll have to build your segmentations outside the web analytic tool. That’s a major drag and can be a deal killer. Fortunately, we are seeing more clients taking data feeds from their WA tool (often for completely different purposes) – so it’s becoming easier to get your hands on cleaned-up online behavioral data. But even if you have that data, your problems aren’t over. The types of analysis you’ll do to build visitor segments are processing intense. Really intense! You probably won’t be able to run them against your entire web behavioral stream. So now you have to produce a sample (it must be visitor-based not just n-record) and import it into a true analysis tool.
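The difference between a visitor-based sample and an n-record sample is worth spelling out: if you sample rows, you keep fragments of many visitors' histories; if you sample visitors, you keep the complete history of a subset. One common way to do this, sketched below under my own assumptions, is to hash the visitor ID and keep every record whose hash falls under a threshold. The `visitor_id` field name and the 10% default rate are hypothetical.

```python
import hashlib


def in_sample(visitor_id, rate=0.1):
    """Deterministically keep roughly `rate` of visitors, based on a hash of the ID."""
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rate


def sample_records(records, rate=0.1):
    """Keep every record belonging to a sampled visitor, and nothing else."""
    return [r for r in records if in_sample(r["visitor_id"], rate)]


# Made-up feed: 50 visitors with 10 page-view records each.
records = [{"visitor_id": f"v{i % 50}", "page": i} for i in range(500)]
sampled = sample_records(records, rate=0.5)
```

Because the decision depends only on the visitor ID, a visitor is either entirely in or entirely out of the sample, and the same visitors are selected each time you re-run the extract.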
And, since the variables you care about aren’t directly in the data and since most analysis tools will struggle with the general form of web analytics data, you should probably expect to do a goodly chunk of data transformation and aggregation before you ever get to the segment-building.
It’s making me tired just writing about it, so I suppose it’s no wonder that this hasn’t been a very common undertaking. But with data feeds and access to online data becoming much more common, I expect that the data transformation steps will also get easier. As we do more of these types of projects, the types of variables and the transformations necessary to produce them will become well understood. And once they are well understood, the mechanics will become routinized.
But even though doing behavioral segmentation is still bleeding-edge work, there’s a real advantage to be had at the end of your labors, because a behavioral segmentation – particularly when enhanced with survey-based profiling – can provide a rich and fascinating framework for online marketing. And unlike traditional segmentations, it can be blended back into every aspect of your web measurement: from ongoing deep-dive analytics to reporting to CRM.
In my next post on this topic, I’ll talk about the how/why of adding survey data to the behavioral segmentation. After that, I'll drill down into more detail on segmentation variables and discuss some of the challenges of moving your behavioral segmentation back into your web analytics tool.

Hi Gary,
Thanks very much for this series of posts, which I find very, very interesting!
I have a couple of questions. First of all, I am a bit confused about the two approaches to segmentation that you mention. You say there is a difference between (1) starting with behavioral segmentation and (2) starting with survey segmentation. You also say that whereas the former approach works, the latter doesn’t.
Why is that the case? To my mind, at the end of the day you only have ONE data set containing the relations between behavior, demographics and psychographics – namely the survey data enhanced with the respondents’ clickstreams. Surely it must be this data set you use to construct your rich segments, right? And if so, does it matter where you start?
My other question: What exactly is the purpose of your segmentation? Do you want to use your segmentation for real-time targeting? (For example, if behavior X-Y-Z is typically associated with Segment A, then each time behavior X-Y-Z appears, we will show some unique content which we expect Segment A wants). Alternatively, do you want to use the segmentation for evaluating a campaign or to see which content on your site works best for whom?
I think, in the latter case, it isn’t always necessary to predict the demographics/psychographics of unknown visitors. Prediction is only relevant if you want to plan future actions. If, on the other hand, you want to evaluate a campaign or some content, you can always simply run an online survey at the same time, and then integrate the data with click streams. In this way you don’t have to predict; rather you can see directly how many people from this or that segment came from this or that source and saw this or that content. Of course, you may want to take into account that the survey data are not necessarily representative of all visitors, but this can be done by weighting the data.
Perhaps I should mention that I work for a web analytics vendor which offers a survey module that allows customers to build, launch and automatically integrate online surveys with behavioral data. My colleagues and I have carried out many consultancy projects where we have first launched a survey and then used data mining to segment and analyze the relationships between responses and behavior (see an example here:
http://www.netminers.dk/cms.ashx/!lang=en/analysis/webmapping.html ).
Thanks again for a great post!
Posted by: Christian Vermehren | June 23, 2008 at 06:37 AM
Christian,
I've been hearing about your tool - you'll have to give me a demo sometime! Great comment.
Your comment about the difference in starting points would require a very long answer to deal with and I am going to be talking more about that. However, the short answer is that in most cases you are driving the core segmentation with either the survey data or the behavioral data - not both in a consolidated data set. And you are then "coloring" the segments with the second data set. It ends up making a pretty substantial difference which is primary and which is secondary and my experience has been that it ends up being easier to color behavioral segments with demographic and psychographic data than to color traditional segments with behavioral data. I'm not sure I have a complete explanation for why that might be so (though I do have some thoughts).
Why wouldn't you just use both data sets in the initial segmentation creation? You can (and you would if you were doing customer segmentation and routinely had demographic and customer data), but doing so adds risks in terms of applying the segments to all visitors. This combined data set will actually produce the BEST and richest segmentation - but it comes with trade-offs in terms of your ability to extrapolate your segments to all online visitors.
I think a full-on behavioral segmentation (like a traditional segmentation) has myriad uses. I like to integrate it into management reporting, use it for targeting, and for ongoing analysis (and not just of campaigns). Cutting almost any true deep-dive analysis by rich behavioral segments will add significant analytic value. And management reporting that includes the segmentation is often much more interesting and comprehensible.
I do think that surveys are grossly underused for targeted analysis. And I find the lack of flexibility around implementing lots of one-off and targeted surveys a problem with many of our clients. There aren't many analytic deep-dives where I wouldn't love to be able to target a survey and add in highly customized survey data.
Posted by: Gary | June 23, 2008 at 12:31 PM
Hi Gary,
Many thanks for your reply – and interesting that you’ve been hearing about our tool considering our remote location in the northern periphery of Europe! I suppose the Internet has finally turned all of us into McLuhan’s Global Village. Yes, we definitely have to arrange a demo soon…
Anyway, I also wanted to say that I now understand what you mean by “coloring” the segments and using behavior as primary variables for segmentation.
I’m not entirely convinced, though, that this is the best approach. I fear it will collapse the differences in behavior between the psychographic/demographic segments from the survey. This could happen if, for example, there is a big difference between those who participate in the survey and those who do not. This would enable you to see immediately which behavioral segments would be less likely to respond to your survey, but I also fear that too many of the psychographic/demographic segments could end up in the same behavioral segment, if you understand what I mean.
I totally agree that the other approach is problematic when it comes to extrapolating the survey segments to the rest of the visitors... I suppose it is a give-and-take situation.
In any case, I very much look forward to reading your upcoming posts on the issues. Perhaps I will be converted then… :-)
Christian
Posted by: Christian Vermehren | June 25, 2008 at 02:26 PM
Interesting article. I agree that “segmentation” in the web analytics world is a highly over-used term. To summarise the problems of determining a “proper” segmentation :-
• The majority of web analytic tools do not provide the data
• Even if they did, then :-
o You’ll (probably) need some sort of ETL process to get the data into a form amenable to the analysis tools.
o Computing segments is expensive and you may need to sample.
o A specialist analysis tool can then be used to define the segments.
We’ve considered building an analysis tool into our product (we’re a vendor in the web analytics space) but to-date have decided that it would not be a good use of our resources. The reasons for this are :-
• The process of defining the business rules to create the segments is a one-off process. It is not something that will be undertaken on a daily or even weekly basis.
• We provide the data (down to the click-level) in a standard relational database. Further, we can apply flexible business rules to the low-grained data (e.g. to determine the “session style”) and this information can be de-normalised up to the visitor level. This reduces the amount of subsequent manipulation that is needed. It also reduces the amount of data that needs to be handled by an order of magnitude.
• There are companies who specialise in all the weird and wonderful segmentation algorithms.
So we’ve made sure that it’s (relatively) easy to apply a segmentation and left the segmentation definition to the specialists. It may be that in the future we review this (e.g. if we determine a set of segmentation algorithms that behave extraordinarily well with the shape of web analytics data).
As web analytics becomes more and more mainstream I think this approach is the right one. Web analytics is really a form of business intelligence applied to the on-line channel. It makes sense to leverage as much prior art as possible.
I’ll also mention in passing that we allow users to define reports in SQL (full SQL level access). These SQL reports can be saved and scheduled in exactly the same manner as other reports. We also provide a drag-and-drop report wizard.
Guy
Site Intelligence Ltd (www.site-intelligence.co.uk)
Posted by: Guy Evans | July 08, 2008 at 09:26 AM