A Quick Introduction to Big Data
Does the term big data get at something real or is it just a handy catch-phrase for helping IT departments get more budget and IT vendors sell new kinds of boxes? It’s not an easy question to answer, largely because most of the people buying and selling big data have largely missed the real point of it. Most of those buyers and sellers have tended to describe “big data” in terms of things like volume, variety and velocity. We have more, they say, and this somehow changes the problem.
People have struggled to do anything useful with most big data sources. And while those sources do have lots of volume, and sometimes have lots of variety and velocity, these are not the root cause of our struggles. Analysts can, and have, handled cases of too much volume in their data with the simple but perfectly acceptable expedient of sampling. Variety – a multiplicity of data sources – always creates integration challenges. But it doesn’t take a data scientist or a new technology to join data if you’ve done a credible job creating keys. And without keys, no big data system in the world does any better at putting data together. There are, it’s true, some advantages to systems that do late binding of data definitions when you’ve got lots of data sources. But this point would hardly justify new technologies and gains much of its apparent force from a confusion of production systems and analytics sandboxes. Give me a SQL-Server instance where I have complete control over the schema, I can add and drop tables at will, and load whatever data I want, and the practical difference between early and late binding of fields and business rules shrinks considerably. As to velocity, one can test the efficacy of this simply by asking whether the analysis and the data actually need to be done in real-time. Real-time does present many unique challenges to both technology and analytic method. But the overwhelming number of big data systems I see no more address real-time than did their traditional predecessors. Batch is batch.
So if the size of data, the variety of data and the velocity of data aren’t really that interesting (I’m not even going to waste time with the absurd idea that data quality is somehow materially worse than it used to be), it would be tempting to think that big data is just hype. That isn’t my view.
What’s changed in the big data world is the level of data we’re dealing with and the type of analytics method required to understand it. Whether you’re thinking about utilities smart-meter data or website log data, what’s fundamentally different in the big data world is that, at the level of data we collect, each row lacks meaning and that simple aggregations of those rows are either too “lossy” to be useful (digital data) or roll-back to the same aggregations we had before we collected detail data (utilities data).
In our digital data world, each row typically represents a single event on the Website – usually a page view. An individual page view, however, says almost nothing interesting about the visitor who did it. To understand the meaning of that page view (and to decide if it even has a meaning), we need to understand that page view in relation to the rest of the session and sometimes even to events before and after the session.
Because of this, digital data creates a situation where the sequence of events, the time between events, and the pattern of events are all critical to understanding the meaning of the individual records. What’s more, simple aggregations of the rows, if done in traditional fashion, will lose the sequence, time, and pattern aspects of the data.
This simple fact has profound implications for analysis of this data. Our traditional tools in both the relational database world and the statistical analysis world rely on the row to be a critical unit of meaning. And while we often took data in at a lower level and then aggregated it, those aggregations didn’t have to take account of sequence, time, and pattern – they were almost always simple exercises in counting and summing.
The need to handle sequence, time and pattern break traditional SQL (which is awful at each of these things), make many of the most popular statistical analysis methods largely irrelevant, create huge headaches in traditional relational modelling, and make common joining strategies much less useful.
In other words, they challenge, vastly complicate or flat out break the pillars of traditional IT and analysis. That’s why the big data challenge is real.
Indeed, this deceptively simple change in the requirements for aggregation of data is, I believe, almost entirely responsible for the struggles of traditional systems and methods with digital (and other big) data sources.
With digital data, we get a very detailed stream of events. At the lowest level, these events are a combination of page views and intra-page events like link clicks or client-side UI events (faceting, exposures, rotations, plays, etc.). A typical web session may have anywhere from a half-dozen to hundreds of these events. Of course for some types of analysis, including most intra-session site studies, you’ll want the data to be in exactly this form. You can’t analyze site usability unless you can drill down into the low-level components of site navigation. However, if you want to analyze questions about visitor behavior and journey, leaving the data like this is problematic.
In this detailed stream form, the data is extremely hard to use in traditional systems. I should amend that statement: it’s extremely hard to use in ANY system. The unit of meaning – the visitor – is quite removed from the physical data. This makes joining digital data streams to other customer data nearly impossible and makes it very hard to draw understanding out of the granules of data.
So how would you aggregate these events into something more digestible at the visitor level?
The obvious answer is to start with session-level aggregations.
Using only basic counting and summing ETL, I can quite easily create a session record for digital data that looks something like this:
Session ID, Start Timestamp, Total Duration, Total Pages, Referring Site, Referring Campaign, Geo Location DMA
Some of these fields are trivially extractable from the raw data stream. Session ID is either extractable or, in some data feeds, must be generated but a session is easily inferred. Start Timestamp is simply the timestamp of the first event in the session. Likewise, that first event will contain the Referring Site, the Referring Campaign, and the Geo Location. Total pages is just a count of page view events with the same session id. Total Duration is typically calculated as the time between the first session event and the last session event (SiteCatalyst used to do this differently – and worse – but now takes advantage of all events not just page views to calculate times).
There’s nothing very hard about this, but there’s also nothing very good about this. In aggregating the session this way, we’ve stripped almost everything meaningful out of the data. It’s simply impossible, using this aggregation, to understand what the visit was about and whether it was successful. The greatest analyst in the world couldn’t squeeze much meaning out of this aggregation.
To rectify this, the next logical step is to flag sessions with important success events. For an ecommerce site, a session aggregation will almost certainly contain total purchase value and number of products purchased. Other sites might choose to track leads, registrations, ad load value, downloads, and various and sundry other success events.
This gives us a row record something like this:
Session ID, Start Timestamp, Total Duration, Total Pages, Referring Site, Referring Campaign, Geo Location DMA, Success Event Type, Success Event Value, Success Event Details
With a record like this, you can answer certain types of questions. You could, for example, build a report of campaign performance that tracked lands, successes and success value by campaign. You could track success by geography. You could even correlate engagement (total pages and duration) to success – though to do so would be deeply misleading. But if this aggregation begins to be meaningful, there’s still far too much lost. If you are successful with 10% of your sessions, that means the other 90% have exactly the same information they always had. Worse, you have no way of understanding what’s happening in those non-success sessions. Are they failed sessions or something fundamentally different? You’ve also opened up only a very limited window into success. Not only don’t you know much about the success itself, you’ve lost all the interesting data about what might have driven that success.
We’ve used two different techniques for building out a better session record with traditional ETL. One is content category counting and the other is milestoning.
With content category counting, you count the number of views and the duration spent within key page categories on the site. You might, for example, count views and time spent by product category or by site section. Content counting will provide a much richer view into what a session contained.
One challenge to content category counting is that you have some tough decisions to make about how many categories are appropriate and sustainable. You can lots of columns, but the finer your detail, the harder it is to use the data for more generalized analysis and the more cumbersome your record. You also have to make choices about what kind of categorization is appropriate. On a retail or CPG site, you’ll almost certainly think about product categorizations. On many other sites, the way content is divided up by site sections will likely be your first choice for a categorization. But there’s no reason why you have to stick to one categorization nor is it necessary that a categorization be comprehensively represented. You might, for instance, have a bunch of product categorizations but leave some secondary product categories out. You might also supplement a product categorization with some site categorizations (like discounts, registry, or even Functional categories).
At this point, it’s a little harder to say what a session record will look like because content categorizations are necessarily site specific. Here’s a general template:
Session ID, Start Timestamp, Total Duration, Total Pages, Referring Site, Referring Campaign, Geo Location DMA, Success Event Type, Success Event Value, Success Event Details, Category A Views, Category A Time, Category B Views, Category B Time…
This type of aggregation adds lots of color to the profile of a visitor’s online behavior in a session. However, it’s still quite lossy. Though we’ve recovered a great deal about what the visitor saw, we’ve sacrificed any sense of sequence and pattern.
A technique I’ll call milestoning can be used to partially address that loss. The idea behind milestoning is simple – we want to attach some information about where in the visit something happened to the view and success records. One way to do this is to simply add two fields to every sub-group of fields (Success and each Category type). These two fields capture the order of events in which the user accessed that content in the session and the time in the session prior to the first access.
With this type of model, we’d now have four fields for every Category we track:
Category A Views, Category A Time, Category A sequence, Category A Time Before first access
With this data structure, we can not only answer which content a user saw, but we can answer questions like which content users saw before (or after) success, which content users saw just prior to seeing content X, how long on average users spend before seeing content X, and which content users tend to see first. These are very common and very powerful questions and milestoning provides a relatively simple data structure for answering them - and it's quite a bit easier to answer those questions off this data structure than if you have to access the detail data. Having said that, I won’t deny that writing certain types of queries against this structure can be daunting.
There are other very plausible choices for representing milestone data. You might, for example, choose to store an array of codes that represent (in time order) key actions on the site. You could also have each record (row) represent a block of actions (not a row per page view – you already have that). This, of course, allows you to handle far more granular views of action since you’re no longer constrained by the columnar representation but it has its own fairly obvious cost.
Creating the ETL to capture milestoning is no longer a trivial endeavor. While we’re still in the realm of counting and summing (and not statistical ETL), we’ve greatly complicated the problem because the counting and summing must be held in arrays distinguished by page type. To illustrate what I mean, consider a session like this:
- Home Page
- Category B View
- Category B View
- Category C View
- Category B View
- Category C View
- Category D View
- Category D View
- Category B View
For this session, we need to be able to sum views by category and we need to be able to recognize when we have a new category we haven’t already encountered. That means the ETL needs to hold an array of some form (probably a hashtable) of all the category types it’s encountered as it processes a section and it has to be able to classify pages into categories. Neither of these things is very hard programmatically, but things like governance around the categorization can often present difficult real-world challenges.
It’s an interesting question to consider just how lossy this model is. As the example stream of pages I’ve given above illustrates, we’ve still sacrificed lots of pattern information and a certain amount of time information. That’s problematic for some types of analysis, but not critical for others. And, of course, we’ve also sacrificed the page detail information on all of these aggregations.
Keep in mind, though, that what we've created isn't a cube. It's a table with 1 record per session.
I also happen to believe that this representation of data still makes it very trying to work through certain types of queries and questions. To show why that is, I’ll next outline how our concepts of 2-tiered segmentation lead to more statistically-based ETL and a potentially cleaner data model for many types of queries.