[Late last year we ran a survey on the state of digital analytics in the enterprise. Phil Kemelor (who leads that exercise for us) just published some thoughts on the initial findings on CMS Wire. If you’d like to see how companies compare in terms of digital analytics organization and effort, check it out!]
My last post – a pretty technical deep-dive into the challenges around non-lossy visit-level aggregation of digital data - was surprisingly popular. That’s always nice, but I was a bit chagrined. Although I think it was a pretty good post, I realized afterward that there were two aspects of it which weren’t that great. From an overview of big data, it dived right into a discussion on visit-level aggregation. Looking back I see that I didn’t really explain why that might be a logical place to start when thinking about a digital data model. Even worse, I realized that it probably wasn’t the best place to start at all! So while I’m still going to write the promised post on Statistical ETL and how it can be used to provide different and in some ways better visit-based aggregations, I’m going to first take a stab at an even more basic part of the digital data model (and one that people get wrong ALL the time).
When you get a data feed from your digital analytics vendor (and I’m going to focus on Adobe as the most common case but the differences aren’t that great), you get an event-level feed of every piece of data collected by your vendor. It’s typically more than 400 columns wide and includes a mix of disparate events that include page views (which usually dominate), link clicks, video plays and, for some feeds, events. The format is fixed for all records, but the fields you use are often quite different and it’s not unusual for certain fields to be used for more than one purpose.
In a traditional relational database, we’d never use this file except to support the ETL necessary to create a more robust data model that included a page-view table, a visit table and a visitor table as well as a bunch of reference tables. For the relational world, that’s the right strategy.
In the big data world, it’s not at all uncommon to use this file as is and make it the hub of your queries. That’s a mistake. Just because big data systems work on detail data doesn’t mean that every detail table is at the right level and in the right format.
The detail data you get from Adobe (and other vendors) is a strange hodgepodge. Not only does it mix event types, but the data is often stuffed into records in a rather haphazard fashion. Success events, for example, are stored as a string of numbers in a text field attached to a page view or a link click. Referring site is passed on every record, but it’s only meaningful for the first record (simplification here, but illustrative). Simple keys for session and sequence are generally missing. The visitor key is, inexplicably, split into two fields. Significant information like avg. time spent on the page is implicit not explicit in the data. Product cart adds and detailed checkout information are stored in an irregular fashion that you have parse. The list of peculiarities goes on and on.
To make matters worse, the data is mostly stored in very inefficient formats and there are hundreds – literally hundreds – of redundant and unnecessary fields.
So if you keep this detailed file in anything like it’s native format and just start querying it, you’ll find that it’s very, very hard to use. It will take complex logic to build even the simplest queries against the data. Your analysts will likely mis-query or mis-represent the data and they probably take subtly different approaches to filling in the necessary pieces. Plus, every single time you pass the event table, you’re passing 10x more data than is actually necessary and doing lots of extra processing on it to boot.
There’s nothing magical about the source format of data. There just isn’t. The guys who put together the data feed format from your vendor probably didn’t spend minute one thinking about how you might use it.
So the very first thing you should do when you start thinking about a big data model for digital is to re-think the event-level data.
I believe that a thoughtful big data transformation of this event-level view for digital will likely focus on two different kinds of issues and may actually produce two or more separate detail files. The first set of issues is around cleaning up the basic data problems: redundancy, poor formatting and essential missing fields. This stuff isn’t necessarily easy but it isn’t very interesting from a data modelling perspective. You could take the raw stream, keep every record, and just fix these issues. This is essentially non-lossy and can be done as you load the data onto the system so that the only version of the data is the cleaned-up detail. The second set of issues is focused on something more fundamental: what’s the right way to structure the data as a series of events? That’s the issue I’m going to take up here and the answer will likely involve a lossy version of that scrubbed detail file.
The mechanics of digital data collection make the container (the page view or the link) the central unit of data collection. We tend to think of a Web visit as simply a series of page views. I don’t think that’s always optimal – particularly when it comes to representing the beginning of a visit and for capturing key events inside the session.
A typical (hugely simplified) Web visit might look like this in the data feed:
Timestamp, Event Type, Page X, Referring String, Campaign, Events
Timestamp, Event Type, Link to Page Y, Events
Timestamp, Event Type, Page Y, Referring String, Campaign, Events
Timestamp, Event Type, Page Y, Referring String, Campaign, Events including Cart Add
Timestamp, Event Type, Link to Page Z, Events
Timestamp, Event Type, Page Z, Referring String, Campaign, Events including Checkout Start
Timestamp, Event Type, Page AA, Referring String, Campaign, Events including Checkout Complete
Let’s ignore the fact that things like Events are stored as a sequence of comma separated event numbers in a string field with certain types of events (ecommerce) having additional name value pairs inserted into the string. We’ll assume all that’s cleaned up and rationalized.
For most query purposes, it’s still an inconvenient way to store the data. I think it’s better to make the start of the session explicit, break out key events from page views, and make it easier to query on the session flow. In other words, you should actually unpack some of the records into a lower-level of detail to make the data more consistent.
Something like this:
Timestamp, Event Type, Event Name, Event Flags, Event Details, Sequence, Previous Event Data, Time on Event, Time Before Event
With this structure, the session is started with a source (referring site / campaign) that becomes a separate event with its own flags and detail. This is followed by each page view or significant event – with events broken out from page views and represented separately. In addition, we’ve made it super easy to build queries on previous page and to understand the time on page and the time before any event. A sample session might look like this:
012315:18:01:01, Session ID, Session Start, Referring Domain, Event Flags, Campaign Coding, 0, Previous Session Id, Total Session Time, 0
012315:18:01:01, Session ID, Page View, Page X, Event Flags, Page Detail, 1, Referring Domain, 72, 0
012315:18:02:13, Session ID, Page View, Page Y, Event Flags, Page Detail, 2, Page X, 90, 72
012315:18:02:43, Session ID, Cart Add, Page Y, Event Flags, Cart Detail, 3, Page X, 60, 102
012315:18:03:43, Session ID, Page View, Page Z, Event Flags, Page Detail, 4, Page Y, 50, 182
Structuring the data in this fashion makes it much easier to write queries about session level data, queries around simple 1 page paths, queries that want to look at session level data for sessions with particular pages, queries that want to analyze only significant events and queries to do analysis on when things happened or the time between events.
This event-based structuring of the data is also a good fit for mobile data and for ajaxy components like builders, configurators, and games. It's more natural than the page-based model and makes it easier to analyze units of work and what goes on it inside them. It also makes it much easier to query the time between key milestones - indicative of how long it took for someone to complete a unit of work.
In addition to making a bunch of important queries easier to write and fitting a broader range of data sources, I happen to think it also serves to make the data more understandable. That’s no small virtue in a data model.
The above records aren’t meant to be anything like complete (I didn’t show visitor id for example) and there are many additional fields that get attached to the page. I do want to emphasize, however, this is intended to be a lossy view. Not only would I recommend dropping many fields, but I’d be inclined to drop all link data as well. Queries and analysis that include (non-event) links tend to be very different and much less common that queries that are concerned with pages & events. So for most link analytics, I’d rely on either the non-lossy cleaned-up detail or a separate detail file specifically structured to tackle site navigation questions (that might be worth a future post since it’s an interesting problem in its own right).
It isn’t nearly as hard to build a good event-level table as it is to build a good visit-level table. The event-level is much closer in nature to the raw data, the transformations are much simpler and because so little information need be discarded the choices are much less painful. The event table I’ve suggested here isn’t really that far from the source format. However, by creating a structure that is much more compact, that makes time and sequence easier to access and more explicit, and that organizes the data in a more logical and understandable fashion, you can take a great first step in building a big data digital data model that really works.