I’m still working on my blog about reporting – but I saw a post from Omniture’s Matt Belkin (http://www.omniture.com/blog/node/19) that touched on a subject close to my heart and that I wanted to discuss. Matt’s post is on n-dimensional segmentation (it’s from a few posts back but due to the sometimes weird timing of Google Alerts I just happened to see it – do you ever have that experience where Google Alerts drops a listing on you seemingly from out of some time warp?).
Here’s Matt’s description of n-dimensional analysis along with how it might actually look in the real world:
"What is "N-Dimensional"?
To make sure we're all on the same page, let me start with a brief overview of the term "n-dimensional". As I suggested above, n-dimensional refers to the concept of limitless dimensions. When coupled with analysis and segmentation, the basic idea is that you can drill into your data in limitless ways. While this may sound like an academic exercise, n-dimensional analysis and segmentation is incredibly powerful and can quickly deliver significant profit to your business.
For instance, you might start with a keyword analysis for the term "web analytics". Select your most relevant metrics, in this case Searches and Revenue, to understand the overall popularity and contribution of this keyword.
Now let's say you want to drill deeper into this keyword to understand where it is most effective and least effective. Your first "dimension" may be products; in other words, you want to see which products are purchased by visitors from this keyword. You see "Apple iPod" tops the list, so you filter on this product to understand these customers better.
You may now wonder if these are new or repeat customers, so you pull up the "Customer Loyalty" report and notice that most purchasers are New. So you add "New" customers to your filter as well.
Now you're curious if these New Customers are younger or older shoppers. So you bring up your Age Group report, and notice that most shoppers are from 18-24. You add this to your filter criteria as well.
At this point, you'd like to understand where all these people are visiting from. So you pull up the "Geography" report and see that most visitors originate from California and New York.
You'd like to try an email marketing campaign to these folks, so you pull up another Products report and analyze which products these visitors looked at but did not purchase. You see that many of these customers also viewed the iPod extended life battery in the same visit, but didn't buy it. So now you extract the customer IDs for these, export them to your email marketing platform, and send out your remarketing campaign."
This is, obviously, an idealized version of events and Matt has a not-at-all-hidden agenda here to talk about Discover. That’s okay by me. Discover is a pretty terrific product - to my mind, the most valuable part of the SiteCatalyst suite of tools. But as frequently as we use Discover and its "n-dimensional" capabilities, there are some aspects of its approach (and that of most other web tools) that I don’t really care for.
I’ve talked in previous posts about some of the disadvantages of traditional OLAP analysis from a data perspective – and in my initial post on visitor segmentation I briefly discussed why I think tools need to provide data-driven segmentation. I’d like to revisit that idea because it responds directly to what I take to be a fundamental weakness in most n-dimensional reporting systems, and to one of the reasons why web analytics is so much harder than people think.
To get there, however, I want to step back and revisit Matt’s discussion of n-dimensional analysis. I have a particular way of thinking about n-dimensional analysis that is heavily influenced by statistical analysis packages, and I think it makes the issues much clearer. The simplest view we have is one-dimensional: a single variable or, in stats terms, a frequency distribution. A frequency distribution is a simple count (or percentage) of the instances of each value of a single variable.
In Matt’s example, we’d start with a frequency of Search Terms and get a table like this:
Search Term Frequencies

| Term | Count |
| --- | --- |
| Web Analytics | 1,252 |
| web analysis | 872 |
| analytics | 672 |
| visitor segmentation | 512 |
| … | |
Frequency tables are the single most common tool in analysis, but, of course, they aren’t much in the way of segmentation. That comes in with our next step - cross-tabulation. When you cross-tabulate two variables, you produce a grid showing the distribution of behavior in each possible combination. In Matt’s example, he cross-tabulates Search Term with Product Purchased. Usually, in statistical packages, cross-tabulations look like this:
Search Term x Product Purchased

| Term | Ipod | Iphone | Icamera | Imirror | Total |
| --- | --- | --- | --- | --- | --- |
| Web Analytics | 417 | 313 | 313 | 209 | 1,252 |
| web analysis | 291 | 218 | 218 | 145 | 872 |
| analytics | 224 | 168 | 168 | 112 | 672 |
| visitor segmentation | 171 | 128 | 128 | 85 | 512 |
| … | | | | | |
Right away you should start to recognize an interesting aspect of n-dimensional analysis – namely, that what is interesting is always comparative. For example, in the table above, does web analytics drive more to Ipod than analytics or visitor segmentation? This "comparative" requirement also means that a "filter" on a variable is rarely what you want when you are trying to discover (as opposed to prove) a relationship. To see which variables are really significant, you need the full cross-tabulation – not just a filter on one value of a dimension. In our example, for instance, the fact that IPod is number one is completely meaningless - it may be the number one product for every term because it's the lead seller on the site. Indeed, the IPod might be the number one product and actually have a negative correlation with the keyword in question - something we see in the real world all the time. You need to know how the ratio compares by term and product - which means you need to see all of the numbers for all of the search terms and products.
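To make the "comparative" point concrete, here is a minimal Python sketch that divides each term's product share by that product's share across all terms, using the counts from the table above. The function name `product_index` and the idea of calling the ratio an "index" are my own illustrative choices, not something taken from any tool. A value near 1.0 means the term buys that product at about the same rate as everybody else, however large the raw count looks.

```python
# Counts copied from the illustrative Search Term x Product Purchased table.
counts = {
    "Web Analytics":        {"Ipod": 417, "Iphone": 313, "Icamera": 313, "Imirror": 209},
    "web analysis":         {"Ipod": 291, "Iphone": 218, "Icamera": 218, "Imirror": 145},
    "analytics":            {"Ipod": 224, "Iphone": 168, "Icamera": 168, "Imirror": 112},
    "visitor segmentation": {"Ipod": 171, "Iphone": 128, "Icamera": 128, "Imirror": 85},
}


def product_index(counts):
    """Return {term: {product: ratio}} (hypothetical helper for illustration).

    ratio = (product's share of the term's purchases)
            / (product's share of all purchases across every term)
    """
    products = list(next(iter(counts.values())))
    grand_total = sum(sum(row.values()) for row in counts.values())
    # Share of all purchases that each product accounts for, across all terms.
    overall_share = {
        p: sum(row[p] for row in counts.values()) / grand_total for p in products
    }
    index = {}
    for term, row in counts.items():
        term_total = sum(row.values())
        index[term] = {p: (row[p] / term_total) / overall_share[p] for p in products}
    return index


index = product_index(counts)
for term, row in index.items():
    print(term, {p: round(v, 2) for p, v in row.items()})
```

On this sample data every ratio lands within about one percent of 1.0, so Ipod being the top product for "Web Analytics" says nothing about the keyword itself.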
And unless you’re quick in math, it might be hard to tell that I just copied these formulas down. That’s why stats guys almost always show cross-tabulations in terms of percentages – calculated both vertically and horizontally. That technique would give us a cross-tabulation like this:
Search Term x Product Purchased (row % / column %)

| Term | Ipod | Iphone | Icamera | Imirror | Total |
| --- | --- | --- | --- | --- | --- |
| Web Analytics | 33.3% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 16.7% / 37.8% | 100.0% |
| web analysis | 33.3% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 16.7% / 26.4% | 100.0% |
| analytics | 33.3% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 16.7% / 20.3% | 100.0% |
| visitor segmentation | 33.3% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 16.7% / 15.5% | 100.0% |
| … | | | | | |
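For readers who want to see how the two percentages in each cell are derived, here is a small Python sketch, assuming the raw counts from the earlier count table. The helper name `percent_crosstab` is hypothetical. The first number in each pair is the cell as a percentage of its row (term) total, the second is the cell as a percentage of its column (product) total.

```python
# Counts copied from the illustrative Search Term x Product Purchased table.
counts = {
    "Web Analytics":        {"Ipod": 417, "Iphone": 313, "Icamera": 313, "Imirror": 209},
    "web analysis":         {"Ipod": 291, "Iphone": 218, "Icamera": 218, "Imirror": 145},
    "analytics":            {"Ipod": 224, "Iphone": 168, "Icamera": 168, "Imirror": 112},
    "visitor segmentation": {"Ipod": 171, "Iphone": 128, "Icamera": 128, "Imirror": 85},
}


def percent_crosstab(counts):
    """Return {term: {product: (row_pct, col_pct)}} (hypothetical helper)."""
    row_totals = {term: sum(row.values()) for term, row in counts.items()}
    col_totals = {}
    for row in counts.values():
        for product, n in row.items():
            col_totals[product] = col_totals.get(product, 0) + n
    return {
        term: {
            product: (100 * n / row_totals[term], 100 * n / col_totals[product])
            for product, n in row.items()
        }
        for term, row in counts.items()
    }


pct = percent_crosstab(counts)
for term, row in pct.items():
    cells = "  ".join(f"{p}: {r:.1f}% / {c:.1f}%" for p, (r, c) in row.items())
    print(f"{term:22s} {cells}")
```

For the Web Analytics / Ipod cell this gives 33.3% / 37.8%. Because the table above was built by copying formulas, a few of its other cells can differ from the computed values in the last decimal place.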
Before I remark that I’ve greatly simplified these tables, since most companies will have thousands of terms and (at least) dozens if not hundreds or thousands of products, you're probably thinking that this is beginning to look like a bit of a mess. But we’ve hardly begun. This classic cross-tabulation – a staple of any statistical analysis and probably still the most used analytic technique in survey research – is a two-dimensional view. With n-dimensional analysis, we can go much further. In Matt’s example, we add visitor loyalty. Many companies break this out into several sub-groups, but I’m going to keep things simple and show a new 3-way cross-tabulation with visitor loyalty added in:
Search Term x Product Purchased x Visitor Loyalty

| Term | Ipod (New) | Ipod (Returning) | Iphone (New) | Iphone (Returning) | Icamera (New) | Icamera (Returning) | Imirror (New) | Imirror (Returning) | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Web Analytics | 33.3% / 37.8% | 33.3% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 16.7% / 37.8% | 16.7% / 37.8% | 100.0% |
| web analysis | 33.3% / 26.4% | 33.3% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 16.7% / 26.4% | 16.7% / 26.4% | 100.0% |
| Analytics | 33.3% / 20.3% | 33.3% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 16.7% / 20.3% | 16.7% / 20.3% | 100.0% |
| visitor segmentation | 33.3% / 15.5% | 33.3% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 16.7% / 15.5% | 16.7% / 15.5% | 100.0% |
| … | | | | | | | | | |
Thank goodness visitor loyalty has a cardinality of 2 (there are only 2 values possible). But we aren’t going to stop here. The next slice Matt proposes is by age. Let’s assume a cardinality of six. Here’s what our report will look like:
Search Term x Product Purchased x Visitor Loyalty x Age

| Term / Age | Ipod (New) | Ipod (Returning) | Iphone (New) | Iphone (Returning) | Icamera (New) | Icamera (Returning) | Imirror (New) | Imirror (Returning) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Web Analytics | 33.3% / 37.8% | 33.3% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 16.7% / 37.8% | 16.7% / 37.8% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| web analysis | 33.3% / 26.4% | 33.3% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 16.7% / 26.4% | 16.7% / 26.4% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| Analytics | 33.3% / 20.3% | 33.3% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 16.7% / 20.3% | 16.7% / 20.3% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| visitor segmentation | 33.3% / 15.5% | 33.3% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 16.7% / 15.5% | 16.7% / 15.5% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| … | | | | | | | | |
Keep in mind that I’ve artificially truncated both the rows and the columns. If we assume a company with 1000 fairly important Search Terms and 100 products, then our four-dimensional matrix above will have 1000 x 100 x 2 x 6, or 1,200,000, cells in it. I venture to suggest that most analysts will have a difficult time picking out the important information. It’s for this reason that most real human analysts never go beyond 3-dimensional cuts – no matter how many are actually available.
But why stop there? In the example, we add geography next. It looks like this is at the state level in our example. I’m not much of a fan of geographic targeting at the state level – in general, anything higher up than DMA is completely useless. But let’s stick with the relatively low cardinality of state. Our report now becomes a behemoth of which I will show only a tiny slice:
Search Term x Product Purchased x Visitor Loyalty x Age x State

| Term / Age | Ipod (New, Alabama) | Ipod (New, Alaska) | Ipod (New, Arizona) | Ipod (New, Arkansas) | Ipod (New, California) | Ipod (New, Colorado) | Ipod (New, Connecticut) | Ipod (New, Delaware) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Web Analytics | 33.3% / 37.8% | 33.3% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 25.0% / 37.8% | 16.7% / 37.8% | 16.7% / 37.8% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| web analysis | 33.3% / 26.4% | 33.3% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 25.0% / 26.4% | 16.7% / 26.4% | 16.7% / 26.4% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| Analytics | 33.3% / 20.3% | 33.3% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 25.0% / 20.3% | 16.7% / 20.3% | 16.7% / 20.3% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| visitor segmentation | 33.3% / 15.5% | 33.3% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 25.0% / 15.5% | 16.7% / 15.5% | 16.7% / 15.5% |
| <18 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 18-35 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 36-50 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 51-65 | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| 65+ | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% | 25% / 20% |
| … | | | | | | | | |
This new matrix has 60,000,000 cells (I know - the numbers shown make no sense). But the killer is still to come, because all of this is really just one behavior (product purchased) crossed with four dimensions. We often want, as in Matt’s example of evaluating products looked at but not purchased, to cross multiple behaviors. These tend to be higher cardinality. In this case, we’ll have another variable with 100 different values (and that is obviously very modest by most retail standards). This would yield a cross-tabulation with six billion different cells in it.
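The arithmetic behind these cell counts is just a running product of the cardinalities. Here is a short Python sketch of it. Note one assumption: the figure of 50 for state is inferred from the jump from 1,200,000 to 60,000,000 cells rather than stated outright, and the function name `cumulative_cells` is illustrative.

```python
# Cardinalities from the running example, in the order the dimensions are added.
# The 50 for "state" is an assumption inferred from 60,000,000 / 1,200,000.
dimensions = [
    ("search term", 1000),
    ("product purchased", 100),
    ("visitor loyalty", 2),
    ("age group", 6),
    ("state", 50),
    ("product viewed but not purchased", 100),
]


def cumulative_cells(dimensions):
    """Return [(dimension name, cell count once that dimension is added)]."""
    sizes = []
    cells = 1
    for name, cardinality in dimensions:
        cells *= cardinality  # every new dimension multiplies the cell count
        sizes.append((name, cells))
    return sizes


for name, cells in cumulative_cells(dimensions):
    print(f"after adding {name}: {cells:,} cells")
```

Each added dimension multiplies the size of the matrix, which is why even low-cardinality variables like loyalty or age become expensive once a couple of high-cardinality ones are already in play.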
I’ve gone through this exercise because I think it illustrates two key points about web analytics. First, there’s a good reason why analysis sessions never go the way Matt describes. Finding important combinations of variables is ALWAYS like hunting for needles in very big haystacks. I’ve shown how the six-dimensional cross-tab suggested by Matt’s example actually encompasses six billion possible routes (in a conservative example). But the truth is really much worse – because the analyst has many more variables to choose from and some with even higher cardinality. It’s no wonder that interesting stuff hardly ever emerges from mere "exploration" of the data – especially without specialized exploratory techniques (such as some forms of visualization). N-dimensional analysis is useful for testing specific hypotheses about visitors – but it is almost useless for finding them.
I want to pause there because I think this point is important. Web Analytics is hard. It isn’t a simple process of applying a few segmentations to obvious variables and suddenly realizing that you can change your business and dramatically improve your website. And the example I’ve given here begins to suggest (at least in some part) why real-world analysis is so much trickier than picking cherries from a tree. Because in the real world, the analyst is in an orchard where only a few of the cherries actually mean anything (are worth picking) and there are millions and millions of cherries to choose from.
The simple mathematics of combination show that a web analyst is faced with more choices than a chess grandmaster in a complicated mid-game. Stumbling hit or miss upon the best move is about as likely as you or I, playing blind, beating Kasparov or Fischer.
And that brings me to my second point. I’ve written before that I think web analytics tools should provide data-driven segmentation. A computer can (easily) sift through those 6 billion cells and tell me which relationships might have significance. Statistical techniques (like neural nets) have existed for many years that can classify visitors into segments based on many combinations of variables (far more than 6). Not only can these techniques do this, they can do it a lot more intelligently. A basic problem with traditional n-dimensional analysis of this sort is that it forces simple on/off decisions about which cells to place a visitor in. A visitor can reside in only a single cell in the entire matrix – and there is no potential for weighting variables or looking at clusters of behaviors.
Neural network and clustering techniques do both these things – allowing you to form groups of visitors with many like behaviors – even in cases where a visitor doesn’t meet 100% of a behavioral specification. This provides a means of grouping "cells" together in an intelligent fashion – but it is really even more than that since the visitors are being classified individually and not at the cell level. That's really important, because it means you can use scalar variables like total purchases, total revenue, site visits or time in an area as meaningful differentiators between otherwise similar visitors. This makes it by far the best way of solving a common and important business question – find me visitors who are "like" buyers but didn’t buy. Without data-driven segmentation, this question is – in my opinion – almost impossible to answer correctly.
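To give a feel for what "like buyers but didn't buy" can mean in practice, here is a deliberately tiny Python sketch. It uses plain k-means clustering as a stand-in for the neural network and clustering techniques mentioned above, so it is an illustration of the idea and not a recommendation of a specific algorithm. The visitor records, the three scalar variables and every function name are invented for the example.

```python
import math

# Hypothetical visitors: (site visits, minutes in product area, products viewed, bought?)
visitors = {
    "v1": (9, 40.0, 12, True),
    "v2": (8, 35.0, 10, True),
    "v3": (10, 42.0, 11, False),
    "v4": (1, 2.0, 1, False),
    "v5": (2, 3.0, 2, False),
    "v6": (1, 1.0, 1, False),
    "v7": (7, 38.0, 9, True),
    "v8": (2, 4.0, 1, False),
}


def scale(rows):
    """Min-max scale each column to [0, 1] so no single variable dominates."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi))
        for row in rows
    ]


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-first start; returns labels."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next start point: the one farthest from the centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def lookalike_non_buyers(visitors, k=2):
    """Non-buyers who land in the cluster with the highest share of buyers."""
    ids = list(visitors)
    labels = kmeans(scale([visitors[v][:3] for v in ids]), k)
    buyer_rate = {}
    for j in range(k):
        members = [v for v, lab in zip(ids, labels) if lab == j]
        buyers = sum(1 for v in members if visitors[v][3])
        buyer_rate[j] = buyers / len(members) if members else 0.0
    best = max(buyer_rate, key=buyer_rate.get)
    return [v for v, lab in zip(ids, labels) if lab == best and not visitors[v][3]]


print(lookalike_non_buyers(visitors))
```

The point of the sketch is that each visitor is classified individually on scalar variables, with no filter forcing an on/off decision per dimension. A real implementation would need far more variables, more careful scaling and a way to choose the number of clusters.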
Data-driven segmentation not only relieves the analyst of the nearly impossible task of finding needles, it does a vastly better job than any human possibly could. Which is why I’d love to see the web analytics vendors provide this capability. For now, I realize that there is not a lot of pent-up demand for this feature. But as more and more analysts use tools like Discover, there is going to come a time when the limitations of OLAP-based, human-driven n-dimensional analysis are painfully obvious. And the emphasis – in that last sentence – should be on the word pain.

That's a pretty cool post ... I get the gist of it without really understanding all the details (as we discussed - I'm not a programmer).
But, I would say this - in a way, the web analyst, sometimes in a flash of intuition, can pick the needle out of the haystack (I think that's what the Zen masters taught - not that I know how to do that).
Strange as it seems, I believe the human mind has the capacity to pull out significant patterns in vast sets of data (IE: Savants) - I just don't know if it can be turned on and off at will.
OK, so the point of your post - I think - when we deliberately segment (using "n-cubes") we're really stuck in one point in a vast array of data and - that one point we have to figure out based on up to 3 levels of segmentation (as we discussed in NY). But, the other method you mentioned - Neural Net, would allow a visitor to fall into more than one point of data and the programming (computer) would define the variables for us - all we'd have to do is tell the computer "I want to find someone who is like a buyer but did not buy" and the computer, via neural nets, figures out the rest.
That's really cool - I hope one of the Web Analytics vendors picks up on this and does this. VS is the most keyed in - they may have the horsepower to do it.
Thanks again for a great post - and I will post on this today as well.
And thanks again for dinner in NY -
Marshall
Posted by: Marshall Sponder | January 28, 2007 at 12:14 PM