In my last post in this extended series on Web analytics and Database Marketing, I summed up the discussion thus far, covering roughly six months of blogs. I can't keep summing up, so going forward, I'm just going to reference that post and assume that the reader is familiar with what I mean when I talk about Website structure, Two-Tiered Segmentation and Meta-data. I was hoping to get this in over the weekend but the sheer size of these last two posts just defeated me. Though quite sick of sProps, I'm seriously (not) considering leaving Semphonic and finding a job where I get paid by the word!
Central to today's digital analytics is the effort to operationalize digital data by integrating it into Customer data marts. In almost every case, that integration needs to happen at the customer-level - even when the customer-level is just an anonymous cookie. It's at the customer-level that we can make the bridge between the type of information we need to target and the behaviors we measure. It's at the customer-level that virtually any out-bound or dynamic messaging system will work.
For most applications, creating the join between the source channel and the customer record is simple and straightforward. Either a key exists, or it doesn't. Where it exists, it's typically on a one-to-one basis with the customer. In such cases, the biggest decision is usually which data to move from the source channel to the customer record. Most channels collect far more data than is likely to be useful in a centralized customer data record, and, in any case, it's desirable to limit the number of fields in that record. Having too many columns of data makes it hard for users to really grasp the data model - and you can't effectively use the data unless you can understand what's there.
In the online world, however, the join at the customer-level raises significant issues that need to be addressed. Web data is not keyed with a true customer identifier. The key in systems like Omniture or Webtrends is a visitor-id that is stored in a user-cookie. This visitor-id is generated when a visitor arrives at a Website without the cookie.
Most Web analytics folks are deeply familiar with the vagaries of cookies. But modelers on the warehouse side may be less familiar, so here's a brief primer. Cookies are small files stored on the client-side machine accessing a Website. Not only are cookies machine-specific, they are browser-specific. Each browser uses its own cookie files. A cookie is a very simple file (typically it stores one or more name-value pairs in raw text form). For most Web analytics solutions, the relevant cookie stores nothing but a unique visitor-id. When a visitor arrives at a Website, the cookie for that domain is automatically sent to the site's web servers. If no cookie is sent, then a new cookie is created and a visitor-id is dynamically generated. This cookie can be (and typically is) set to persist in perpetuity.
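To make the primer concrete, here's a minimal sketch of that logic in Python. It is not how Omniture or Webtrends actually implement it; the cookie name `visitor_id`, the function name and the ten-year lifetime are all assumptions made for illustration.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"               # hypothetical cookie name
TEN_YEARS = 60 * 60 * 24 * 365 * 10      # assumed "in perpetuity" lifetime, in seconds


def resolve_visitor_id(cookie_header):
    """Return (visitor_id, set_cookie_value).

    cookie_header is the raw Cookie header sent by the browser (or None).
    set_cookie_value is None when the browser already sent a visitor-id.
    """
    jar = SimpleCookie()
    if cookie_header:
        jar.load(cookie_header)

    if COOKIE_NAME in jar:
        # Returning visitor (for this machine and this browser).
        return jar[COOKIE_NAME].value, None

    # No cookie was sent: generate a new visitor-id and a persistent cookie.
    visitor_id = uuid.uuid4().hex
    out = SimpleCookie()
    out[COOKIE_NAME] = visitor_id
    out[COOKIE_NAME]["path"] = "/"
    out[COOKIE_NAME]["max-age"] = TEN_YEARS
    return visitor_id, out[COOKIE_NAME].OutputString()


# First visit: no cookie arrives, so a new visitor-id is issued.
first_id, set_cookie = resolve_visitor_id(None)

# Later visit from the same browser: the cookie comes back and the id is reused.
second_id, nothing_to_set = resolve_visitor_id(f"{COOKIE_NAME}={first_id}")
```

Notice that nothing in this logic knows who the person is. The id identifies a browser on a machine, which is exactly why the join to the customer record takes extra work.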
In theory, therefore, cookies provide a way of tracking visitors (by machine/browser) over time.
The "in theory" is apt, however, because cookies have a host of problems. I've already touched on the fact that cookies are both device and browser dependent. So a customer visiting a Web site from the office and from home will have a different Web visitor-id. A customer using a different browser will have a different Web visitor-id. A customer visiting on their mobile phone will have yet another visitor-id. This channel dependence is just one piece of what makes cookie-based visitor identification and tracking a problem.
Worse is that cookies can be rejected or easily deleted, and it's around this issue that I must introduce the critically important distinction between 1st and 3rd Party cookies. A 1st Party cookie is one that belongs to the domain being visited. A 3rd Party cookie is dropped by a domain different from the one being visited. In the early days of Web tracking, most cookies were 3rd Party - issued by the measurement vendors. This 3rd Party cookie actually facilitated tracking since it made it easy to store a single visitor-id across multiple domains.
As browser technology evolved, however, users became much more likely to block 3rd Party cookies. Unlike 1st Party cookies (which are often essential to site function), 3rd Party cookies rarely have any user benefits and this made them quite prone to rejection. Today, about 15-20% of all browsers simply reject 3rd Party cookies. That means that as many as 1 in 5 visitors are completely untrackable if you are using a 3rd Party cookie. The number is much, much lower for 1st Party cookies - probably under 2%.
Naturally, this places a premium on 1st Party cookies for visitor measurement, but it also introduces a complication. Most enterprises have multiple sites. Using a 1st Party cookie for each will maximize the quality of visitor tracking; however, it will eliminate the ability to track visitors across domains.
It's important, too, not to forget that cookies can be, and periodically are, deleted by the user. This happens for a variety of reasons, but most studies suggest that the average life-span of a 1st Party cookie is probably no more than about 3 months.
What all this means is that the Customer Data Model in the Warehouse has to handle two significant types of key joins. First, it has to extend known keys to unkeyed records. Let's assume you're passing a customer identifier to Omniture (or other system) whenever someone logs in or executes a transaction. Most records in the event stream won't have a Customer identifier. Since you don't want to do key joining dynamically every time you run a query, keying these records with a unified customer identifier should be part of your basic ETL.
This:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | - |
Becomes:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | 12345678 |
| Page 2 | 1111111111 | 12345678 |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | 12345678 |
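The within-session fill shown in the tables above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names (`page`, `omniture_id`, `customer_id`) and the function `key_session` are hypothetical, and a real ETL would run against your data feed rather than in-memory dictionaries.

```python
def key_session(rows):
    """Copy the customer ID seen at log-in onto every row with the same Omniture ID."""
    # Pass 1: find the customer ID attached to each visitor ID, if any.
    known = {}
    for row in rows:
        if row["customer_id"] is not None:
            known.setdefault(row["omniture_id"], row["customer_id"])

    # Pass 2: fill rows that have no customer ID; leave keyed rows untouched.
    keyed = []
    for row in rows:
        customer_id = row["customer_id"]
        if customer_id is None:
            customer_id = known.get(row["omniture_id"])  # stays None if never identified
        keyed.append({**row, "customer_id": customer_id})
    return keyed


session = [
    {"page": "Page 1", "omniture_id": "1111111111", "customer_id": None},
    {"page": "Page 2", "omniture_id": "1111111111", "customer_id": None},
    {"page": "Page 3 (Log-in)", "omniture_id": "1111111111", "customer_id": "12345678"},
    {"page": "Page 4", "omniture_id": "1111111111", "customer_id": None},
]

keyed = key_session(session)
for row in keyed:
    print(row["page"], row["omniture_id"], row["customer_id"])
```

Two passes are needed because the log-in usually happens mid-session: Pages 1 and 2 occur before the key is known, so a single forward pass would leave them blank.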
Adding a customer identifier to records where it's missing is fairly straightforward. A more complicated case is one where an entire session lacks a customer identifier:
This:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | - |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 | 1111111111 | - |
Becomes this:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | 12345678 |
| Page 2 | 1111111111 | 12345678 |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | 12345678 |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | 12345678 |
| Page 2 | 1111111111 | 12345678 |
| Page 3 | 1111111111 | 12345678 |
To accomplish this, you've joined on the Omniture identifier across visits. This implies that your ETL has access to a lookup of Omniture ID to Customer ID. Keep in mind, however, that this isn't a one-to-one relationship. You have to deal with cases where a visitor is first associated with one Omniture ID, then gets associated with a second one after deleting their cookie or logging in from a different machine:
This:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | - |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | - |
| Page 2 | 2222222222 | - |
| Page 3 (Log-in) | 2222222222 | 12345678 |
Becomes this:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | 12345678 |
| Page 2 | 1111111111 | 12345678 |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | 12345678 |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | 12345678 |
| Page 2 | 2222222222 | 12345678 |
| Page 3 (Log-in) | 2222222222 | 12345678 |
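Here's a rough Python sketch of the lookup behind the cross-visit joins above. The function and field names are hypothetical, and it assumes each Omniture ID resolves to at most one customer; the point is simply that the lookup runs many-to-one, with several Omniture IDs rolling up to a single Customer ID.

```python
from collections import defaultdict


def build_lookup(events):
    """Map each Omniture ID to the customer ID observed at log-in or transaction."""
    lookup = {}
    for event in events:
        if event["customer_id"] is not None:
            lookup[event["omniture_id"]] = event["customer_id"]
    return lookup


def apply_lookup(events, lookup):
    """Fill missing customer IDs from the lookup; rows with no match stay unkeyed."""
    filled = []
    for event in events:
        customer_id = event["customer_id"]
        if customer_id is None:
            customer_id = lookup.get(event["omniture_id"])
        filled.append({**event, "customer_id": customer_id})
    return filled


def cookies_per_customer(lookup):
    """Invert the lookup: one customer can own several Omniture IDs."""
    grouped = defaultdict(set)
    for omniture_id, customer_id in lookup.items():
        grouped[customer_id].add(omniture_id)
    return dict(grouped)


def visit(omniture_id, login_page, pages):
    """Build one illustrative visit; only the log-in page carries a customer ID."""
    return [
        {"page": f"Page {n}", "omniture_id": omniture_id,
         "customer_id": "12345678" if n == login_page else None}
        for n in range(1, pages + 1)
    ]


# First visit on the original cookie, second visit after the cookie was deleted.
events = visit("1111111111", login_page=3, pages=4) + visit("2222222222", login_page=3, pages=3)

lookup = build_lookup(events)
keyed = apply_lookup(events, lookup)
by_customer = cookies_per_customer(lookup)
```

After this runs, `by_customer` holds both Omniture IDs under customer 12345678, which is the shape your warehouse key table needs to support.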
The most complicated case you have to deal with is one where an entire session (or several sessions) needs to be re-keyed AFTER processing. This happens when a user deletes their cookie, doesn't re-identify in a session, and then subsequently re-identifies in a later session with the new ID. It also happens when a visitor first arrives anonymously and then, after one or many sessions, finally identifies.
This:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | - |
| Page 2 | 1111111111 | - |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | - |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | - |
| Page 2 | 2222222222 | - |
| Page 3 | 2222222222 | - |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | - |
| Page 2 | 2222222222 | - |
| Page 3 (Log-in) | 2222222222 | 12345678 |
Becomes this:
| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 1111111111 | 12345678 |
| Page 2 | 1111111111 | 12345678 |
| Page 3 (Log-in) | 1111111111 | 12345678 |
| Page 4 | 1111111111 | 12345678 |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | 12345678 |
| Page 2 | 2222222222 | 12345678 |
| Page 3 | 2222222222 | 12345678 |

| Page | Omniture ID | Customer ID |
| --- | --- | --- |
| Page 1 | 2222222222 | 12345678 |
| Page 2 | 2222222222 | 12345678 |
| Page 3 (Log-in) | 2222222222 | 12345678 |
What's challenging about this case is that the 2nd Session (and potentially many intermediate sessions) doesn't have any possible key that could be tied to the customer ID when it is first processed. This means that you'll initially process these sessions as anonymous visitors. Then, when the customer identifies themselves, you'll have to go back, re-attach those sessions to the correct customer identifier and erase the anonymous visitor.
This can be a brutally expensive process - that's the reason that Web analytics vendors don't do it. It will also cause your unique visitor counts to change dynamically. What fun! But if you're serious about customer keying, you'll want to give it a go. At minimum, you should be handling all of the cases that don't require you to back-fill completely empty (of a customer identifier) sessions, but I'd encourage you to try to solve the whole problem.
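To show why the back-fill is expensive and why the counts move, here's a small Python sketch of re-keying after the fact. Everything in it is an assumption made for illustration: the in-memory `store` stands in for your warehouse table, and `load_session` and `unique_visitors` are hypothetical names rather than anything a vendor provides.

```python
def visitor_key(event):
    """Known customers count once; anonymous traffic counts once per cookie ID."""
    if event["customer_id"] is not None:
        return ("customer", event["customer_id"])
    return ("cookie", event["omniture_id"])


def unique_visitors(store):
    return len({visitor_key(event) for event in store})


def load_session(store, session):
    """Append a session, then re-key any stored rows it lets us identify.

    Returns the number of rows re-keyed. The scan over the whole store is
    the costly part: every new identification can touch old history.
    """
    newly_known = {
        e["omniture_id"]: e["customer_id"]
        for e in session if e["customer_id"] is not None
    }
    store.extend(dict(e) for e in session)

    rekeyed = 0
    for event in store:
        if event["customer_id"] is None and event["omniture_id"] in newly_known:
            event["customer_id"] = newly_known[event["omniture_id"]]
            rekeyed += 1
    return rekeyed


def visit(omniture_id, pages, login_page=None):
    """Build one illustrative visit; only the log-in page carries a customer ID."""
    return [
        {"page": f"Page {n}", "omniture_id": omniture_id,
         "customer_id": "12345678" if n == login_page else None}
        for n in range(1, pages + 1)
    ]


store = []
history = []
for session in (
    visit("1111111111", pages=4, login_page=3),  # identified on the old cookie
    visit("2222222222", pages=3),                # new cookie, never logs in
    visit("2222222222", pages=3, login_page=3),  # same new cookie, finally logs in
):
    rekeyed = load_session(store, session)
    history.append((rekeyed, unique_visitors(store)))

print(history)  # (rows re-keyed, unique visitors) after each load
```

In this toy run the unique visitor count goes from 1 to 2 when the anonymous session lands, then back to 1 once the log-in on the new cookie lets you re-key it. That's the dynamic count change in miniature.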
Key joining is that important.
Keep in mind that this type of key joining isn't just a single site and Web issue. It's the same technique you'll use to marry data across domains when using multiple 1st party cookies and it's the technique you'll use to marry mobile data (for example) to fixed web data.
Key joining is just the first step in the ETL necessary to transform raw Web data into a useful Customer Data Model. In my next post, I'll take up the role of Two-Tiered segmentation in solving the single most challenging aspect of modeling Web behavioral data - aggregating event-level data up to the Customer Record without losing all of the interesting and actionable information.

Hi Gary,
Very interesting post and very practical too - that's rare. Most posts on that topics are often very vague.
I am especially interested in that topic - merging WA & Customer data - as it is a process I am working on. While it certainly requires a lot of investments and work, I really believe it is the way to go to gain true customer insights. Especially in a multi-channel world.
I hope to share my own experience on that once I got there - as you did.
Cheers,
Michael
Posted by: Michael Notte | August 03, 2011 at 04:18 AM