In my last post on digital data models, I covered some alternative approaches to thinking about visits and units of work. Today, I’m going to jump to something rather more arcane – a potential structure for real-time personalization support. You might well ask, “Is that the logical next step after sort of figuring out the visit?”
No, not really…not even close.
But when I introduced this series I pretty much admitted that I didn’t intend to approach it as a perfectly logical and organized set of posts. I’ve been thinking about real-time personalization and so that’s what I’m going to write about. I promise, I’ll circle back and talk about some more plausible intermediate digital data structures around things like customer journey, merchandising, page optimization, and attribution. But for today, it’s data structures for real-time personalization.
The key to real-time is performance. Because real-time is really hard. You have to be able – in sub-seconds – to integrate what you know about a visitor (known or anonymous) with what they are doing right now to drive near-instantaneous decisions about what to say next. The most common types of real-time personalization ignore at least half this equation – they use pre-built models and segmentations to drive personalization with no regard to what the user just did. In the digital world, that just isn’t optimal. What a customer just did – the last page(s) they viewed – is often a critical component of knowing what to say or offer right now.
Of course, some personalization systems take the opposite approach. They use what the customer just did but ignore everything else when building a recommendation. Do I have to say how inopportune that approach is?
That’s not to say these half-way approaches never work. For certain types of personalization they can produce good results. Most of the time, though, real-time personalization demands the integration of who a customer is with what they appear to be doing right now. Creating that integration in a manner that supports near-instantaneous access is the challenge.
In-memory systems are the best way to answer that challenge. In-memory capacities have grown incredibly in the last few years, right alongside MPP (I was going to say traditional MPP – how hilarious is that) systems. For ETL (especially) and for analytics across huge chunks of detailed data, MPP systems are still the right and maybe the only answer. But when it comes to real-time personalization, you’re trying to support a very different type of task. You want to be able to instantly look up a specific visitor and combine their information with real-time streams of data to create personalization decisions. To do that, in-memory is ALWAYS going to be much, much faster than on disk.
When a visitor shows up on your Website, the only thing you are guaranteed to know about them is the visitor cookie (which may be new and may not correctly tie them to past sessions). So the first structure you need in-memory is a cookie pool.
The cookie pool contains three fields. The first field is the cookie itself – the pool holds an entry for every persistent cookie you’ve issued, and the cookie is the key field for this structure. The second field is a pointer to the current known customer data structure (more on this in a second). The final field is a pointer to the current event record; it may be null if no current session exists.
That’s it – three fields. You want this structure to be as small and efficient as possible.
Let’s say you use a 25-character alphanumeric cookie. Each pointer will be an additional 8 characters. That means each record in this structure will occupy 41 characters of space. If you have 60 million extant cookies, that gives you about 2.5 gigs of data. However, you’ll probably want to organize this as a hash table because that will be the most efficient way to find each cookie record as you process. A hash table needs quite a bit of open storage to work effectively (you’ll have fairly exact knowledge of the number of entries, so you can pick an appropriate load factor and total bucket count to optimize performance), and it also requires some extra storage to resolve collisions (when two entries hash to the same value), but it means you will be able to access any given record in the structure with almost no lookups or scans. You simply hash the cookie value and look up the resulting index in the hash table to retrieve the record. That’s super fast.
Now let’s talk about those other two fields – the pointers. The first pointer is to the current visitor record for that ID. A pointer is a variable that points to a specific location in memory – in this case a structure or object that encapsulates the visitor data. Because we are storing a pointer, we can access the underlying data with ZERO scans or lookups. Access to the visitor data is essentially instantaneous once you’ve accessed the cookie pool record.
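To make this concrete, here’s a minimal sketch (in Go) of what the cookie pool might look like. The names are mine, not a prescription, and Go’s built-in map stands in for the hand-rolled hash table – it handles the hashing, load factor, and collision resolution internally.

```go
package main

// Stubs for the two structures the pointers reference; they're fleshed out
// in the sketches further down.
type VisitorProfile struct{ /* demographics, scores, flags, journey RFM */ }
type Event struct{ /* page, timestamp, pointer to previous event */ }

// CookiePoolEntry holds the two pointer fields; the cookie value itself is
// the map key, so each entry stays tiny.
type CookiePoolEntry struct {
	Visitor      *VisitorProfile // pointer to the current known customer record
	CurrentEvent *Event          // pointer to the most recent event; nil if no open session
}

// The cookie pool: a hash table keyed by the ~25-character cookie value.
// Pre-sizing it for the expected number of extant cookies avoids rehashing.
var cookiePool = make(map[string]*CookiePoolEntry, 60_000_000)

// lookupCookie hashes the cookie value and returns its entry, or nil if
// we've never seen this cookie before.
func lookupCookie(cookie string) *CookiePoolEntry {
	return cookiePool[cookie]
}
```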
Of course, you can’t load all your detail data in-memory and you wouldn’t want to anyway. What you need to have is a trimmed down customer level record that contains the core information you want to integrate into your personalization decisions. Typically, this will include demographics, segmentation and model codes and scores, key behavioral flags, key relationship flags, and high-level RFM information about their journey. This data is semi-static, but it gets refreshed in two ways. First, as you process real-time data, you may want to update key fields and flags in the data. Second, you’ll want to have a procedure for updating the static data with periodic refreshes from the customer data master source.
How much memory will this data take? We use pointers to access the data directly, so we don’t need a hash and there’s no load factor. Plus we only need records for extant customers not cookies (there may be many records in the cookie pool that point to the same visitor data). Let’s say you keep 20 demographic fields at 2 characters each, 20 model scores at 4 characters each, 30 additional flags at 1 character each, and RFM journey information of four fields (at 4 characters each) per use-case with 12 different use cases. That’s a heckuva lot of information about the customer and it adds up to about 350 characters per customer. If you have 20 million extant customers that means you’ll need around 7 gigs of data for your customer profile data in memory.
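If it helps to see that spelled out, here’s a sketch of the trimmed-down profile record, with field counts and widths matching the back-of-envelope sizing above. The specific field names are placeholders, not a recommendation.

```go
// JourneyRFM holds the high-level journey fields for one personalization
// use case: four fields at 4 bytes each.
type JourneyRFM struct {
	Recency   int32
	Frequency int32
	Monetary  int32
	LastStage int32
}

// VisitorProfile is the semi-static, in-memory customer record.
type VisitorProfile struct {
	Demographics [20][2]byte    // 20 demographic fields, 2 bytes each =  40 bytes
	ModelScores  [20]float32    // 20 model scores, 4 bytes each       =  80 bytes
	Flags        [30]byte       // 30 behavioral/relationship flags    =  30 bytes
	Journey      [12]JourneyRFM // 12 use cases x 4 fields x 4 bytes   = 192 bytes
} //                                                              total ≈ 342 bytes
```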
This combination of a hash-table cookie pool with pointers to customer records solves half of the data model problem for personalization. We’ve got super-fast access to who a customer is, where they are in their journey (the use case RFM data), and what types of scores our modelers have created for them. However, we still don’t know what they've just done!
We fix that with our third field in the cookie pool – the current event pointer for this cookie. Let’s say this is a new session. When a visitor arrives on site and you look up their cookie in the cookie hash table, the current event pointer will be null. You create a new in-memory data structure or object for this action. This structure is quite simple. It should contain the current action and key details (page name, for example), the timestamp, and a pointer to the previous action. For the first record in a session, this pointer will be empty. Depending on how much detail about an event you want to store, this data structure will vary in size. Let’s give it a good round 200 characters. You’ll need one of these for every action in every extant session. Since we’re storing pointers, there’s no load factor or extra overhead. So if you have 1 million open sessions at peak times and those sessions average 11 pages, you’ll need about 2 gigs for all your detailed current event data. Note, too, that there is absolutely no reason here to stick to that arbitrary 30-minute session window we kicked around in the last post – you can keep a visitor extant for as long as you think makes sense for personalization purposes.
Here’s the beauty of this structure. When you first get a session, you update the in-memory record in the cookie pool so that the current event pointer points to a new event record. That event record contains the current page (the homepage, for example), additional detail like the campaign code, the timestamp, and a pointer to the previous record. For our first event, that pointer to the previous event record is null or empty.
When the visitor goes to a subsequent page, you simply create a new event record and update the cookie pool to point to it. The pointer to that new event record is stored in the cookie pool’s third field (the current event pointer), and the new event record itself contains a pointer to the event record that was created last – the one previously stored in the cookie pool. This creates a perfect chain from the most current event all the way back to the very first event in a session.
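Here’s a small sketch of that chaining logic, continuing the Go sketch of the cookie pool above (CookiePoolEntry is assumed from there; the event fields are just examples):

```go
import "time"

// Event is one action in the current session. The Previous pointer is what
// chains events together; it's nil for the first event in a session.
type Event struct {
	Page      string    // page name for this action
	Campaign  string    // example of additional detail worth keeping
	Timestamp time.Time
	Previous  *Event    // prior event in this session; nil for the first one
}

// recordEvent creates a new event record and makes it the cookie's current
// event. The new record points back at whatever was current before, so the
// chain always runs from the newest event back to the session's first event.
func recordEvent(entry *CookiePoolEntry, page, campaign string) {
	entry.CurrentEvent = &Event{
		Page:      page,
		Campaign:  campaign,
		Timestamp: time.Now(),
		Previous:  entry.CurrentEvent, // nil when this is the session's first event
	}
}
```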
When you need to make a personalization decision, you have access to the visitor record and you can chain through the stack of recent events directly from the most current event pointer in the cookie pool. You start with the most current event, and it contains a pointer to the previous event, which contains a pointer to the previous event, and so on until you hit an event where the previous-event pointer is null. Because these are pointers, there is no lookup time, and because this is all in-memory, it will take an infinitesimal amount of time to walk through even a complex chain of fifty or one hundred pages.
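Walking that chain is nothing more than pointer-following – something like this, using the same Event type as the sketch above:

```go
// walkSession visits every event in the current session, newest first,
// by following Previous pointers until it hits nil. No lookups, no scans –
// just dereferencing in-memory pointers.
func walkSession(entry *CookiePoolEntry, visit func(*Event)) {
	for e := entry.CurrentEvent; e != nil; e = e.Previous {
		visit(e)
	}
}

// For example, collecting the pages viewed this session, most recent first:
//   var pages []string
//   walkSession(entry, func(e *Event) { pages = append(pages, e.Page) })
```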
If we budget 10 gigs for our cookie pool (the raw records are only about 2.5 gigs, but the hash table’s load factor and collision handling need headroom), that means we require about 20 gigs of memory total to handle 20 million customers, 60 million extant cookies, and 1 million peak sessions with an average of 11 events per session – and we can deliver nearly instantaneous lookup of both customer profile data and a complete chain of current session events to drive personalization.
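If you want to check my arithmetic, here’s the back-of-envelope math as a tiny runnable program, using exactly the assumptions from this post:

```go
package main

import "fmt"

func main() {
	const gb = 1e9
	cookieRecord := 25 + 8 + 8 // cookie value plus two pointers ≈ 41 bytes per record
	cookiePool := float64(cookieRecord*60_000_000) / gb // ≈ 2.5 GB raw; budget ~10 GB for hash-table headroom
	profiles := float64(350*20_000_000) / gb            // ≈ 7 GB for 20 million customer profiles
	events := float64(200*1_000_000*11) / gb            // ≈ 2.2 GB for 1 million open sessions averaging 11 events
	fmt.Printf("cookie pool %.1f GB, profiles %.1f GB, events %.1f GB\n",
		cookiePool, profiles, events)
}
```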
20 Gigs – it’s nothing!
Of course, I’ve simplified some things for the discussion here. I haven’t considered session closing, and there is considerable complexity that must be solved in ID lookup and the cookie pool. One of the huge advantages of doing everything in-memory is that you can do writes every bit as fast as you can do reads. So it’s perfectly possible to update visitor records in real time and combine visitor records together when cookies can be resolved. You can even chain web and mobile records together in real time to drive personalization, and the way I’ve laid out the cookie pool makes this easy to do whenever you can actually resolve keys across devices. How cool is that!
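For what it’s worth, here’s a rough sketch of what that cookie/device resolution could look like on top of the structures above – mergeProfiles is a hypothetical placeholder, not a real library call:

```go
// mergeProfiles is a placeholder for whatever reconciliation logic you'd
// actually want (preferring known demographics, newest scores, and so on).
func mergeProfiles(a, b *VisitorProfile) *VisitorProfile {
	if a == nil {
		return b
	}
	return a
}

// resolveCookies is called once you've determined that two cookies (say, a
// web cookie and a mobile ID) belong to the same person: merge the profiles
// and point both cookie pool entries at the surviving record. Each entry
// keeps its own event chain, but decisions now see one unified customer.
func resolveCookies(a, b *CookiePoolEntry) {
	merged := mergeProfiles(a.Visitor, b.Visitor)
	a.Visitor = merged
	b.Visitor = merged
}
```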
The stripped-down version I’ve laid out here is the way a programmer might model a custom real-time personalization engine. If you use a full in-memory database you’ll gain many advantages (and probably sacrifice both some speed and some memory) over this bare-bones approach. I’ve laid this out simply to illustrate that with today’s technologies and access to systems with huge amounts of RAM, it’s not really that hard to create in-memory systems that can deliver integrated customer profiles and complete event data to a personalization engine in the microseconds necessary to support true real-time personalization.
I know this has been a super-technical post. I have to admit that it feels kind of good to be down in the weeds again thinking like a programmer. But if you didn’t follow every part of the logic behind the data model here, don’t despair (and if you have better, cleaner approaches then have at it). The important points are these:
- Real-time personalization should be driven by both visitor profile and current event data.
- The likeliest way to support that need is by keeping your visitor profile and current event data in memory.
- With today’s huge amounts of in-memory storage, you can handle pretty massive volumes of data and still get nearly instantaneous access to large amounts of profile data and complete event histories to shape your personalization decisions.
- The basic structures I’ve proposed here – the cookie pool, the visitor profile, and the current events chain – will be largely applicable regardless of how you choose to instantiate your in-memory system.
Real-time personalization is not a pipe dream. We have the technologies to deliver intelligent customization at massive scale with no Website performance implications. Let’s use them!