[We closed X Change Registration for vendors early last week - more than a month before the Conference. We've had unprecedented demand and with a hard cap on attendees we have to limit the number of vendors we can take or the Conference becomes unbalanced. I really hate to turn people away, but at this point I don't have much alternative. If you're on the Enterprise-side, registration is still open. However, don't count on that lasting through August. If you are planning on coming out, please - Register NOW!]
Key-Joining, described in my last post, is an essential technique for tying Web behavioral data to your customer records in a warehouse. Key-joining describes the HOW of making the connection, but leaves open the even more fundamental question of WHAT to connect. Deciding what data to move from the Web Behavioral stream into the CDW is the single hardest and most important decision you'll make in creating a Customer Data Model for Digital.
This question of WHAT isn't always so challenging. I've worked on countless data integrations and, in many cases, deciding what data to move is almost trivial. That just isn't the case with Digital data and there are a couple of reasons why.
First, digital data exists as an event stream with many instances per customer. Having lots of records per customer makes analysis and selection MUCH harder in the warehouse. Tools like SQL or SAS can handle visitor grouping - but queries that require complex aggregations are harder to construct and often run very slowly. By far the most natural way to extend data in the customer data warehouse is to append fields on a 1-to-1 basis with the customer. With digital data, that can't be done without transformation.
Fair enough, but many data sources exist as a many-to-one relationship to the customer. The simple solution is to aggregate the data. This is where digital data becomes particularly tough. The key to successful aggregation is consolidation at the visitor-level without losing the interesting detail. Most people are familiar with image compression (which is a form of aggregation) and understand that some techniques are lossier than others. It isn't just the techniques, however. Some types of images are much easier to compress than others.
Digital data is very difficult to aggregate intelligently.
When I first started building data models of web behavior, my initial thought was that I could use a straightforward aggregation of the behavioral stream that simply counted the number of instances:
This:

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 (Success) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | = |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 | 1111111111 | - |

Becomes this:

| Visitor | Visits | Views | Success |
| --- | --- | --- | --- |
| 1111111111 | 2 | 7 | 1 |

Aggregating to Visits, Views, Success Counts, and Total Time is logical and similar aggregations work quite well with many other types of data. Unfortunately, this turns out not to be very useful for Digital. The problem is that the really interesting fact about online behavior isn't how much activity a Customer has - it's what exactly they did. This type of aggregation drops ALL the information about WHAT the Customer is doing.
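To make the loss concrete, here is a minimal Python sketch of the simple counting aggregation applied to the two visits in the example. The tuple layout (visitor, visit number, page, success flag) and the function name are assumptions made for illustration, not an actual warehouse schema.

```python
from collections import defaultdict

# Hypothetical page-view events: (visitor_id, visit_number, page, is_success)
events = [
    ("1111111111", 1, "Page 1", False),
    ("1111111111", 1, "Page 2", False),
    ("1111111111", 1, "Page 3", True),
    ("1111111111", 1, "Page 4", False),
    ("1111111111", 2, "Page 1", False),
    ("1111111111", 2, "Page 2", False),
    ("1111111111", 2, "Page 3", False),
]

def count_aggregate(events):
    """Collapse an event stream to one row of counters per visitor."""
    visits_seen = defaultdict(set)
    rows = defaultdict(lambda: {"visits": 0, "views": 0, "success": 0})
    for visitor, visit_no, _page, is_success in events:
        visits_seen[visitor].add(visit_no)       # distinct visits per visitor
        rows[visitor]["views"] += 1              # every event is one page view
        rows[visitor]["success"] += int(is_success)
    for visitor, seen in visits_seen.items():
        rows[visitor]["visits"] = len(seen)
    return dict(rows)

summary = count_aggregate(events)
```

Notice that the page name is read and then thrown away. Nothing in the resulting row says which pages were viewed, and that is exactly the information a marketer needs.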
From a marketer's perspective, trying to build targeting or segmentations on this type of data is impossible. Nobody is going to build a campaign around Visitors with 10+ page views. Who are they? What do they care about? What type of offer might they respond to? Simple aggregations of page views and visits are silent when it comes to these questions.
Because this type of aggregation is so unrewarding, many architects have chosen to punt on the problem and simply move server call level data into the warehouse. It doesn't work. Even if the performance is acceptable, it's tremendously challenging to squeeze meaning out of the page view stream.
As anyone who has ever tried to use a Path Analysis can testify, the page stream multiplies geometrically. It's incredibly complicated on an ad hoc basis to try and select out specific patterns from that stream that might be interesting. What happens, then, is that marketers tend to focus on a few key pages. With the full event stream available to them, they tend to write selections like these:
"Select all visitors who viewed Page X"
or, at most,
"Select all visitors who viewed Page X but didn't view Page Y"
Where Page X might be shopping cart add and Page Y a final checkout.
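A selection of this kind against the raw page-view stream might look like the following Python sketch. The page names and the (visitor, page) tuple layout are hypothetical stand-ins for whatever the warehouse actually stores.

```python
def viewed_x_not_y(page_views, page_x, page_y):
    """Return visitor IDs that viewed page_x but never viewed page_y."""
    pages_by_visitor = {}
    for visitor, page in page_views:
        pages_by_visitor.setdefault(visitor, set()).add(page)
    return sorted(
        visitor
        for visitor, pages in pages_by_visitor.items()
        if page_x in pages and page_y not in pages
    )

# Hypothetical raw stream: one (visitor_id, page) pair per page view
page_views = [
    ("A", "cart_add"), ("A", "checkout"),
    ("B", "cart_add"), ("B", "home"),
    ("C", "home"),
]

abandoners = viewed_x_not_y(page_views, "cart_add", "checkout")
```

Even this simple question forces a grouping pass over every page view, and the answer depends on only two pages out of the whole stream.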
Sure, this is better than an aggregation based on view and visit counts, but by leaving the data in its raw form it tends to force users to neglect all but a few key milestone pages. Not ideal.
What's needed is a method of aggregating data that is far less lossy than the simple counting method, that captures more than a few key milestone pages, and that truly aggregates the data into a one-to-one relationship with the Customer.
The Two-Tiered Segmentation turns out to be an ideal method.
Two-Tiered Segmentation is our unique approach to Digital Segmentation. The first tier is a traditional Customer Segmentation - a Visitor-Type of the persona or core business relationship sort. The second tier is specifically digital; it captures the visit-type or intent. We describe the first tier as the "Who" and the second tier as the "What": who they are and what they are trying to accomplish.
One of the beauties of the Two-Tiered Segmentation is that it creates a set of natural success metrics in the digital realm. We usually represent the Segmentation as a matrix - and every cell in the matrix has specific (and unique) success measurements.
With a Two-Tiered Segmentation, we can represent a Customer's online experience in a more compact and elegant fashion.
This:

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 (Registration) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | = |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | - |
| Page 2 | 2222222222 | - |
| Page 3 | 2222222222 | - |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | - |
| Page 2 (Login) | 2222222222 | - |
| Page 3 | 2222222222 | 12345678 |

Becomes this:

| Visitor | Early-Stage Shopping Visits | Early-Stage Success | Order Visits | Order Success | Status Visits | Status Success |
| --- | --- | --- | --- | --- | --- | --- |
| 1111111111 | 1 | 1 | 1 | 1 | 1 | 0 |

I think it's obvious how vastly superior this representation is to either of the other strategies. With a data-model based on a Two-Tiered Segmentation, the first tier ties naturally to the Customer Record on a 1-1 basis. The second tier is represented as a set of columns that capture how often a user has had a certain type of visit, how successful they've been (and, obviously, how often they've failed) and so on.
With this method, an entire visit can be intelligently reduced to a few column counters.
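As a rough Python sketch of that reduction: assume each visit has already been assigned a visit type and a success flag by some upstream classification step (not shown here, and the hard part in practice). The visit-type names, column naming scheme, and function name are illustrative assumptions.

```python
# Hypothetical second-tier visit types; a real segmentation would define its own.
VISIT_TYPES = ["early_stage_shopping", "order", "status"]

def two_tier_row(classified_visits):
    """Reduce classified visits to one row of per-visit-type counters."""
    row = {}
    for visit_type in VISIT_TYPES:
        row[f"{visit_type}_visits"] = 0
        row[f"{visit_type}_success"] = 0
    for visit_type, succeeded in classified_visits:
        row[f"{visit_type}_visits"] += 1
        row[f"{visit_type}_success"] += int(succeeded)
    return row

# One (visit_type, succeeded) pair per visit, as assumed output of classification
classified_visits = [
    ("early_stage_shopping", True),
    ("order", True),
    ("status", False),
]

row = two_tier_row(classified_visits)
```

The row has a fixed width no matter how many page views the visits contained, which is what lets it sit 1-to-1 beside the Customer Record. The first tier (the visitor type) would simply be another column on that same record.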
What's lost? The data model I've described is lossy, and the two most serious omissions are time-based and product (or topic) based. With this data model, I couldn't answer questions like:
What product is Customer X most interested in?
or
What did Customer X do most recently?
These are serious omissions. Let's tackle recency first since the model I've developed can be easily extended to include a time component.
Instead of two fields (count and success), you use three sets of fields for each Visit Type (almost a classic RFM model) to represent the entire customer experience on the Web. I've inserted the idea of sets because it's often necessary to use more than one field to fully capture an RFM dimension. For success, I may have multiple success metrics (the M dimension) that I want to track independently. For recency, I'll often want to capture first, most recent, and perhaps velocity or avg. usage. If so, I might need as many as four fields to fully describe the Recency (R) dimension for a visit-type. That's still very compact.
My new model looks like this:

| Visitor | Early-Stage Shopping Visits | Early-Stage Success | Early Stage Most Recent | Early Stage First Time | Early Stage Avg. Per Month | Early Stage Velocity (Index) | Etc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1111111111 | 1 | 1 | 5/15/2011 | 10/7/2010 | .256 | 113 | |

Unlike a typical RFM model (that exists at the customer level), this model uses the Two-tiered segmentation so that you're capturing fine-grained detail about each type of activity on the web. In effect, you're creating an RFM model for Customer Support, for Lead Generation, for pre-Purchase Browsing, and, of course, for Purchasing. You're creating an RFM model for every single visit type that you've identified. It turns out to be a model of considerable beauty.
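Here is one way the per-visit-type fields might be computed, as a minimal Python sketch. The definition of average per month (visits divided by 30-day months since the first visit, with a one-month floor) is my assumption for illustration. The velocity index is left out because its formula isn't defined here.

```python
from datetime import date

def rfm_fields(visits, as_of):
    """Summarize one visit type's visits as recency/frequency/success fields.

    visits: list of (visit_date, succeeded) tuples for a single visit type.
    as_of:  the date the customer record is being built for.
    """
    if not visits:
        return {"visits": 0, "success": 0, "first": None,
                "most_recent": None, "avg_per_month": 0.0}
    dates = [visit_date for visit_date, _ in visits]
    first, most_recent = min(dates), max(dates)
    # Assumption: a month is approximated as 30 days, with a one-month floor.
    months = max((as_of - first).days / 30.0, 1.0)
    return {
        "visits": len(visits),
        "success": sum(1 for _, ok in visits if ok),
        "first": first,
        "most_recent": most_recent,
        "avg_per_month": round(len(visits) / months, 3),
    }

# Hypothetical early-stage shopping visits for one customer
early_stage = [(date(2010, 10, 7), False), (date(2011, 5, 15), True)]
fields = rfm_fields(early_stage, as_of=date(2011, 6, 1))
```

Running the same function once per visit type and prefixing the field names with the visit type yields the wide, one-row-per-customer layout described above.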
With this type of customer record, a range of powerful customer-level targeted marketing selections become available. Now I can easily (with simple, fast queries) answer questions like:
What did Customer X do most recently?
or
How recently did Customer X look at products online?
or
How long was it between Customer X first viewing an article and registering?
That's worth pondering for a second. With a model based on the Two-Tiered Segmentation and RFM fields, you have a highly aggregated data model that can answer extraordinarily interesting and complex questions.
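To show how cheap these questions become, here is a small Python sketch against a single aggregated customer record. All field names and dates are hypothetical; in particular, it assumes the record carries a first-view date for an article visit type and a first-success date for registration.

```python
from datetime import date

# Hypothetical aggregated record: one row per customer, RFM fields per visit type
customer = {
    "early_stage_most_recent": date(2011, 5, 15),
    "order_most_recent": date(2011, 3, 2),
    "status_most_recent": date(2011, 4, 20),
    "article_first": date(2010, 10, 7),
    "registration_first_success": date(2010, 11, 6),
}

def most_recent_visit_type(record):
    """Return the visit type whose most-recent date is the latest."""
    suffix = "_most_recent"
    recency = {
        key[:-len(suffix)]: value
        for key, value in record.items()
        if key.endswith(suffix) and value is not None
    }
    return max(recency, key=recency.get)

def days_between(record, start_field, end_field):
    """Number of days from one date field to another on the same record."""
    return (record[end_field] - record[start_field]).days

latest = most_recent_visit_type(customer)
lag = days_between(customer, "article_first", "registration_first_success")
```

Both answers come from a single row with no scan of the event stream. In a warehouse the equivalent would be a simple column comparison rather than a grouped query over page views.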
This representation is superior to any data model commonly in use for Web behavioral data. It's compact, efficient, and powerful. It doesn't solve every problem, of course. You'll notice that because there is no product overlay, the model still doesn't efficiently answer questions like:
What product is Customer X most interested in?
or
What topics does Customer X read most?
I'm mostly using ecommerce examples here, but the same principle applies to any type of Two-Tiered Segmentation at this level of aggregation. On a media site, it's topic-level data (as per the second question) that would get washed out.
Capturing that level of detail in my data model is tricky enough that I'm going to reserve it for a separate post. After that, I'll show how this same model can be extended beyond the Website (and even beyond Digital) and become the foundational approach for representing the entire customer journey in the warehouse.
People talk endlessly these days about a full 360 degree representation of the customer in the data warehouse. Needless to say, the people doing the talking aren't doing much walking. They don't have to worry about what this mishmash of data means in the context of an actual working warehouse. You do. Throwing together everything you know (and can find out) about a customer in your data warehouse won't create anything except a sticky mess - an uncooked stew of expensive ingredients - unless you bake it together in a reasonable way. The approach I've suggested here (Two-Tiered Segmentation with RFM fields underneath) can be a recipe for transforming that inedible stew into something truly tasty.
[Click here for a summary of this extended series on Digital Analytics and Database Marketing]
