More Thoughts on Analyzing the User Journey
I saw an article this past week in the local rag touting a bold new initiative in social media to localize advertising by country. In the featured case study, a company had targeted different messages by gender and country. Imagine: personalization at the 50%-of-the-country-population level! Wow. That's kind of like walking around at a conference, shaking people's hands and saying, "Hello girl" or "Hello boy."
This is 2015, right? I didn’t fall asleep last night and wake up in 1995? Are we really bragging about digital (digital!) media being targeted at the gender level?
I'm afraid so. The sad truth is that this level of personalization is still better than the median for digital targeting and is apparently good enough to get you front-of-the-business-page coverage.
You can – you should – do better. And doing better starts with the recognition that segmentation and personalization in digital are largely behavioral, not demographic. This whole series on digital data models may seem technical and far from the concerns of targeting and personalization. It isn't. Building a powerful, usable model of digital data is the key to opening up real opportunities for better targeting and personalization.
In a little more than a week I’m going to be out at eMetrics speaking on big data and data modeling. The first part of my presentation is one I’ve given before – an overview of what makes big data different. It isn’t the four V’s and it isn’t all hype. The key to understanding big data is to realize that when you drop to a certain level of detail, you change the analytics paradigm. At a highly granular level, individual records no longer have specific meaning. This makes certain kinds of analysis harder, but it opens up new types of analysis that cannot be done at all when data is collected at a higher level. These new forms of analysis nearly always involve combining individual records to understand how they fit together – and this necessarily involves keeping track of and analyzing the sequence, time between, and pattern that joins individual records into more meaningful aggregations.
What makes this challenging is that our traditional database and statistical tools aren’t focused on this type of operation. Using SQL, it’s insanely hard to understand sequence, time, and pattern. The language just isn’t built for solving those types of problems. Statistical analysis tools are better, but they too are heavily geared toward analytic techniques where the unit-of-meaning is the individual record. What works for analyzing customer records just isn’t appropriate for analyzing machine-generated data from a Website, a wearable or a smart-meter.
Which brings me to the second part of my eMetrics presentation and one of the keys to this series – how can you efficiently capture sequence, time-between and pattern in data structures that support analysis?
In my last post, I described a method of using an abstract journey map to create a data model that captured where a user was in the journey. That's a powerful technique, but it's a technique that's better suited to delivering a ready-to-use targeting data set than a data set appropriate for further analysis. If you're charged with creating a data set to be used by analysts, not marketers, are there ways to preserve sequence, time and pattern at some level above the detail and in a fashion that makes analysis with tools like SQL easier?
In my eMetrics presentation, I outline four such methods. One of those (chaining) I’ve already covered in my discussion of real-time data structures. Here are three more alternatives, ranging from very simple to quite complex:
Milestoning is simple in both concept and execution. The basic idea is that you timestamp every time a user passes a significant milestone in a journey – kind of like a coach with a stopwatch clicking a time at every lap. In a typical conversion funnel, a user might have a page sequence that looks like this: Page 1, Page 2, Page 1, Page 2, Page 3, Page 4, Page 3, Page 2, Page 3, Page 4, Page 5. Instead of recording every page view and the time between, you simply record when a user first hits each milestone. With a milestone approach, you can represent a very complex journey with a sequence of timestamps. What's more, you can fold those timestamps into a single record. The data model for a conversion funnel with five steps can be captured with five datetime fields per user. In each field, you simply hold the timestamp for when the user first reached that step. When a user completes the process, you record the conversion in RFM (Recency, Frequency, Monetary) fields and reset all the timestamp fields to null. Milestoning isn't a solution for every problem, but here are some of the types of queries it can trivially support in a SQL-like language with very high performance:
- Where is a user right now in their journey?
- How long has the user been in that state?
- How long has the user been in the journey?
- What’s the average time it takes to complete the journey?
- What’s the average time at each step for successful journeys?
- What’s the average time at a step for failed journeys?
- What’s the drop rate for every step?
Not bad for a ridiculously simple structure that's easy to ETL and is very space efficient. Milestoning isn't limited to page funnels; you can use this same technique to capture an omni-channel customer journey very effectively.
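To make the structure concrete, here's a minimal Python sketch of a milestone record. The step names and the one-record-per-user layout are my own illustrative choices, not anything from the presentation; in practice this would be five datetime columns in a table.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical five-step funnel; names are illustrative only.
STEPS = ["step1", "step2", "step3", "step4", "step5"]

@dataclass
class UserMilestones:
    """One record per user: the first time each funnel step was reached."""
    user_id: str
    timestamps: dict = field(default_factory=lambda: {s: None for s in STEPS})

    def record_hit(self, step: str, when: datetime) -> None:
        # Only the FIRST hit on a milestone is kept; revisits are ignored.
        if self.timestamps[step] is None:
            self.timestamps[step] = when

    def current_step(self) -> Optional[str]:
        # The furthest milestone reached so far.
        reached = [s for s in STEPS if self.timestamps[s] is not None]
        return reached[-1] if reached else None

    def time_in_journey(self, now: datetime) -> Optional[float]:
        # Seconds since the user first entered the funnel.
        start = self.timestamps[STEPS[0]]
        return (now - start).total_seconds() if start else None
```

Queries like "where is the user now?" or "how long have they been in the journey?" become trivial lookups against the timestamp fields, which is exactly what makes this structure so fast.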
My next technique is more often the bane of a good data model than a solution – I don’t have a great name for it but I’ll call it vectorization. Vectorization uses a sequential, delimited list of events and times saved as a varchar (a string of variable length). For example, the sequence of pages in a conversion funnel would be stored as a string with a set of formatted pairs of page names and datetimes. You can vectorize almost anything – pages, purchases, campaigns, journey steps, etc. The advantage of vectorization is that you can capture both simple and very complex journeys in a single field inside a single record. It’s a very common technique for storing highly variable data and making it available to a relational system.
Of course, it’s also a really crappy solution. If you’ve ever tried to use vectorized data stored in a varchar, you know how challenging it is. Indeed, it’s fair to say that this technique is a dodge not a solution at all. But I’m NOT suggesting it as a strawman. I actually think it can be useful.
The key to making a vector solution useful isn’t the way you store the data – it’s the way you expose it. If you vectorized a set of behaviors into a field, you should create a set of user-defined functions (UDFs) that provide rich access to the data. User-defined functions are typically written in a full programming language (C++, Java, etc.) and they allow a SQL programmer to take advantage of high-performance and complex manipulations of data fields.
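As a sketch of what vectorization looks like in practice, here's one hypothetical encoding in Python. The delimiters and the page/timestamp pair format are my own assumptions, not a standard; any consistent, escape-safe format would do.

```python
from datetime import datetime

# Illustrative format: "page|ISO-timestamp" pairs joined by ";"
# into one varchar-sized string.
PAIR_SEP, FIELD_SEP = ";", "|"

def vectorize(events):
    """Pack a list of (page, datetime) events into a single delimited string."""
    return PAIR_SEP.join(
        f"{page}{FIELD_SEP}{when.isoformat()}" for page, when in events
    )

events = [("Page 1", datetime(2015, 3, 1, 10, 0)),
          ("Page 2", datetime(2015, 3, 1, 10, 4))]
# One field now holds the whole journey:
# "Page 1|2015-03-01T10:00:00;Page 2|2015-03-01T10:04:00"
```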
A typical UDF library for a vectorized field should provide functions that allow a SQL programmer to seamlessly use the field to identify things like:
- Contains a Page
- Time in Process
- Process Completed
- Last Step Completed
- Step Times for Step X
- Page Before X
- Page After X
- Success After X
- First Position of X
- Last Position of X
- Time Before X
- Time After X
- Time Between X and Completion
Writing these UDFs isn’t really very hard. Given consistent and reasonable formatting of the elements in the string, a competent programmer should be able to build a full UDF library in a few days. Once that’s done, your SQL programmers suddenly have easy, very fast, very rich access to a wide array of queries on vectorized fields.
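Here's a rough Python sketch of what a few of those functions might look like, assuming events are stored as "page|ISO-timestamp" pairs joined by semicolons (a hypothetical encoding). A real UDF library would be compiled C++ or Java registered with the database, but the logic is the same.

```python
from datetime import datetime

def parse(vec):
    """Decode 'page|ISO-timestamp;...' into a list of (page, datetime) pairs."""
    if not vec:
        return []
    return [(page, datetime.fromisoformat(ts))
            for page, ts in (pair.split("|") for pair in vec.split(";"))]

def contains_page(vec, page):
    return any(p == page for p, _ in parse(vec))

def first_position(vec, page):
    for i, (p, _) in enumerate(parse(vec)):
        if p == page:
            return i
    return -1  # not present

def page_after(vec, page):
    events = parse(vec)
    for i, (p, _) in enumerate(events[:-1]):
        if p == page:
            return events[i + 1][0]
    return None

def time_in_process(vec):
    """Seconds between the first and last event in the vector."""
    events = parse(vec)
    return (events[-1][1] - events[0][1]).total_seconds()
```

With a library like this registered as UDFs, a SQL programmer can write `WHERE contains_page(journey, 'Page 3')` without ever touching the string format directly.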
The two techniques I’ve discussed so far are so simple they are almost mundane. Not so my third technique – graphing. The idea behind graphing is that the patterns that underlie a user journey may be easier to cull out and analyze visually than with traditional numeric techniques.
To support a graph, you have to create a set of rules that map the user journey into visual components. For now, let’s assume you want to represent a user-journey as a colored bar. Here’s how graphing can work.
Each user’s bar will represent their journey and be made up of a set of segments. Each segment will be defined by the time they spent in that part of the journey – which will represent the length of that segment of the bar. The color of the segment will be based on the journey step type. And the transparency or brightness of the color will be determined by their measured intensity of engagement in that step.
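A minimal Python sketch of those bar-building rules – the step types, colors, and engagement scores below are invented for illustration, not taken from any real journey model:

```python
# Hypothetical mapping from journey step type to segment color.
STEP_COLORS = {"browse": "blue", "compare": "green", "checkout": "red"}

def journey_to_bar(steps):
    """steps: list of (step_type, seconds_in_step, engagement in 0.0-1.0).
    Returns bar segments as (length, color, opacity) tuples:
    length from time spent, color from step type, opacity from engagement."""
    return [(seconds, STEP_COLORS[step_type], engagement)
            for step_type, seconds, engagement in steps]

journey = [("browse", 120, 0.3), ("compare", 300, 0.8), ("checkout", 60, 1.0)]
bar = journey_to_bar(journey)
```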
What’s the point of graphing journeys? There’s a rich set of analytic techniques designed to find matching images. That’s essentially a look-alike tool and look-alikes are one of the most common and powerful targeting techniques around. But creating look-alikes of journey structures isn’t easy. Is someone who stopped on the same step a look-alike? Not necessarily.
If you’ve graphed user journeys, you can take advantage of those image matching techniques to find similar user-journeys and identify patterns that might be almost impossible to find using traditional techniques.
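One crude way to see why the graphed form helps – nothing like a real image-matching pipeline, just a sketch of the idea – is to rasterize each bar to a fixed width and count matching pixels. Here a bar is simplified to a list of (segment_length, color) pairs, a stand-in for the colored bars described above.

```python
def rasterize(bar, width=100):
    """Sample a bar – a list of (segment_length, color) pairs – into a
    fixed-width row of color 'pixels' so bars of any total length compare."""
    total = sum(length for length, _ in bar)
    pixels = []
    for length, color in bar:
        pixels.extend([color] * round(width * length / total))
    return pixels[:width]

def bar_similarity(bar_a, bar_b, width=100):
    """Fraction of aligned pixels whose colors match."""
    a, b = rasterize(bar_a, width), rasterize(bar_b, width)
    n = min(len(a), len(b))
    return sum(1 for i in range(n) if a[i] == b[i]) / n
```

Two users who spent similar proportions of time in similar steps score high even if their raw page sequences look nothing alike – which is the look-alike property traditional record-level techniques struggle to capture.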
I’ve outlined a very simple visualization of a user-journey, but it seems to me this technique might support more complex visualizations and styles as well. Map-based techniques, heat-maps and area plots might all prove interesting ways to visualize and visually analyze a journey.
Make no mistake, graphing is just another type of aggregation technique and it’s actually a pretty lossy one. However, it’s really good for capturing sequence, time and pattern and it provides support for a type of analytics (image processing) that is heavily geared toward the identification of patterns in data.
If you’re going to be at eMetrics on the 31st, I hope you’ll stop by, catch my presentation, and say "hello boy".
See you there!