Overview
Sophisticated organizations are increasingly finding good reasons to move data from their web analytics tools to other data processing and analysis platforms. In this series, I’ll be discussing the “why” and “how” of taking data from your web analytics solution and moving it into other platforms. This second installment will cover interest profiling.
In the first installment of this series, I covered some of the reasons why companies increasingly want to move data out of their web analytics solution. When they do so, the biggest challenge they face is finding the right data models to let them use the data effectively. Web analytics data is much too large to use conveniently in just any format, so you almost always have to do some significant aggregation on it. This is as true inside the web analytics tool as it is outside. But the types of aggregations you’ll probably want outside the tool are very different from the choices vendors make to support reporting inside the tool.
The aggregation methods (or data models) I’m going to
discuss today are relevant to several different uses of web analytics data:
joining web data to secure customer data, driving actionable systems and
supporting analysis in more sophisticated statistical tools.
Interest Profiling
With interest profiling, the goal is to capture the depth and distribution of your visitors’ interests. For most web sites, this will be “content” interest. For media/publishing sites, forums, public sector sites, public financial services sites, health and pharma sites, and many other types, one of the most interesting facts about your visitors is how much and what type of content they looked at.
In capturing this information, you often have to make a
trade-off between granularity and convenience of data usage.
On a large media site, for instance, the top-level site may be broken up into four or five basic categories. For a really general purpose site the categories might be things like news, sports, and entertainment. For a sports site, these top-level categories might be football, baseball, basketball, etc. In each case, however, these categories will have interesting sub-categories. And these sub-categories will usually have interesting sub-sub categories.
The farther down you drill, the more precise your categorization of interest becomes, and the more data and categories you have to contend with.
In general, what you are looking to capture is the depth and extent of content interest. The most obvious ways to represent that are by total views or total time per content area:
News: 22 Views
Sports: 38 Views
Entertainment: 7 Views
Science: 9 Views
I’ve found that when using this data, you often want to understand issues around mindshare: questions like ‘Which content area is a visitor most interested in?’ are mindshare questions. But you also want to understand actual levels of usage: ‘Which visitors had more than 20 views in sports?’
If your data model captures all content categories, then saving the raw numbers allows you to answer either type of question. If you want to save the data as percentages (News is 27% of Views), you’ll also need to save a number like "Total Content Views" to be able to answer either type of question.
Though there is some level of redundancy, I'd recommend providing views of the data that capture both ways of thinking. It’s surprising how often users of the data won’t think about mindshare if the only fields presented to them are raw counts.
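As a minimal sketch of keeping both representations side by side, the Python below builds a per-visitor profile from the sample counts shown earlier. The function name `build_profile` and the field names are my own assumptions for illustration, not part of any particular tool.

```python
from collections import Counter

# Hypothetical view log for one visitor: one content category per page view.
views = ["News"] * 22 + ["Sports"] * 38 + ["Entertainment"] * 7 + ["Science"] * 9


def build_profile(category_views):
    """Return raw counts, total content views, and each category's share."""
    counts = Counter(category_views)
    total = sum(counts.values())
    # Shares are only defined when the visitor has at least one content view.
    shares = {cat: n / total for cat, n in counts.items()} if total else {}
    return {
        "counts": dict(counts),
        "total_content_views": total,
        "shares": shares,
    }


profile = build_profile(views)

# Mindshare question: which content area holds the largest share of views?
top_area = max(profile["shares"], key=profile["shares"].get)

# Usage question: did this visitor have more than 20 views in sports?
heavy_sports = profile["counts"].get("Sports", 0) > 20
```

Because the total is stored alongside the shares, a raw count can always be recovered as share times total, so either representation can answer both kinds of question.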
It’s usually easy enough (and appropriate) to allocate a column per visitor for each high-level content area on your site. But that approach won’t work if you want to make the data model for content interest more granular or if you happen to have an unusually high number of top-level categories.
If your interest is heavily focused on what a visitor is
most interested in, you can flip the model around a little bit and make your
columns generic. In other words, you can have columns like “Top Interest,” “Second
Interest,” etc. These columns will contain the “content type” that fits the
description – so for one visitor it might be “news” and for another the data
value in “Top Interest” might be “sports.” This approach doesn’t support questions
like ‘Which visitors had more than 20 views in sports?’ but it does support any
number of highly granular interest categories. If you are taking this approach,
you’ll probably want to capture the usage in each area as well. So your data might
look like this:
Top Interest/% of Content Views: Sports, 45%
2nd Interest/% of Content Views: News, 27%
3rd Interest/% of Content Views: Science, 11%
With this model, you’ll also need to capture total content
views. One nice thing about this approach is that it can be replicated at
multiple content levels.
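The generic-column model can be sketched roughly as follows. The counts are hypothetical values chosen to reproduce the percentages in the example above, and the column names (`interest_1`, `interest_1_pct`, and so on) are assumptions for illustration.

```python
# Hypothetical per-category view counts for one visitor (they sum to 100).
counts = {"Sports": 45, "News": 27, "Science": 11, "Entertainment": 9, "Weather": 8}


def top_interests(category_counts, n=3):
    """Build one row of generic 'Nth interest' columns plus total content views."""
    total = sum(category_counts.values())
    # Rank by views descending; break ties alphabetically so output is stable.
    ranked = sorted(category_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    row = {"total_content_views": total}
    for rank, (category, views) in enumerate(ranked, start=1):
        row[f"interest_{rank}"] = category
        row[f"interest_{rank}_pct"] = round(100 * views / total)
    return row


row = top_interests(counts)
```

The same function could be run once per level of the content hierarchy, which is one way to replicate the model at multiple content levels.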
A problem with this method is that, in and of itself, it doesn’t support a running tally. If you have a database with “Top Interest” in it, and you are updating your data, you have no way of knowing when a category that wasn’t in the list might now belong.
It’s probably essential that somewhere in your model you keep the data in a fashion that allows you to build these running tallies. This may not be how you load the data into SPSS or how you give it to a user in a view, but it really does have to exist.
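One way to picture this, as a sketch only: keep a raw per-visitor, per-category tally as the underlying store and derive “Top Interest” from it after every update. The names `apply_batch` and `top_interest`, and the batch layout of (visitor id, category, views), are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Underlying store: visitor id -> running count of views per content category.
tallies = defaultdict(Counter)


def apply_batch(store, batch):
    """Add a batch of (visitor_id, category, views) records to the running tally."""
    for visitor_id, category, views in batch:
        store[visitor_id][category] += views


def top_interest(store, visitor_id):
    """Derive the top interest from the raw tally; None if the visitor is unknown."""
    counts = store.get(visitor_id)
    if not counts:
        return None
    # Highest count wins; ties are broken alphabetically.
    return min(counts, key=lambda cat: (-counts[cat], cat))


# First load: Sports leads for this hypothetical visitor.
apply_batch(tallies, [("v1", "Sports", 5), ("v1", "News", 3), ("v1", "Science", 2)])
before = top_interest(tallies, "v1")

# Later update: Science, which was not the top interest, overtakes Sports.
apply_batch(tallies, [("v1", "Science", 6)])
after = top_interest(tallies, "v1")
```

Had only the “Top Interest” column been stored, the second batch would give no way to tell that Science now belongs at the top.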
A different (and possibly complementary) approach is to completely pivot the data and create a separate data structure that is based on the content categorization hierarchy. To do this, you’d build a table where the primary key is the lowest-level content categorization you’re interested in plus a visitor id. You’ll also need to capture the hierarchical structure of the content. Each data row of the content categorization would have the content category, the visitor id, and a count. This structure isn’t nearly so easy to deal with as a visitor-based aggregation, but it can support very powerful content/visitor analysis.
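A rough sketch of this pivoted structure, using plain Python dictionaries in place of database tables: the hierarchy, the category names, and the helper names `ancestors` and `roll_up` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical content hierarchy: category -> parent (None for top level).
PARENT = {
    "Sports": None,
    "Football": "Sports",
    "Baseball": "Sports",
    "News": None,
    "Politics": "News",
}

# Pivoted table: (lowest-level category, visitor id) -> view count.
leaf_counts = {
    ("Football", "v1"): 12,
    ("Baseball", "v1"): 4,
    ("Politics", "v1"): 3,
    ("Football", "v2"): 1,
    ("Politics", "v2"): 9,
}


def ancestors(category):
    """Yield the category itself, then each ancestor up to the top level."""
    while category is not None:
        yield category
        category = PARENT[category]


def roll_up(counts):
    """Aggregate leaf-level counts to every level of the hierarchy."""
    totals = defaultdict(int)
    for (leaf, visitor_id), views in counts.items():
        for category in ancestors(leaf):
            totals[(category, visitor_id)] += views
    return dict(totals)


rolled = roll_up(leaf_counts)

# Which visitors had more than 10 views anywhere under Sports?
heavy_sports_visitors = sorted(
    visitor for (category, visitor), views in rolled.items()
    if category == "Sports" and views > 10
)
```

Storing counts only at the lowest level and rolling up on demand keeps the table compact, at the cost of needing the hierarchy at query time.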
Not every site is interested in the “content” of a page. For
some types of sites, it’s much more important to understand the function of a
page. Many operational sites are this way. If you're measuring a trading site,
an online banking site, or a public-sector service site, then your primary
interest in user profiling will be to capture data about the quantity and type
of functions a user is performing (e.g. trading, portfolio management,
planning, researching, etc.). This type of profiling is essentially the same exercise
but with a different type of overlay.
For some sites, both categorizations may be interesting. When you do have both, they should usually be kept separate in your data model, though content categorizations are sometimes a sub-category of functional classifications (“retirement” and “college planning” might be interest categories under a functional category of “planning”).
It’s also important to realize that the type of Behavioral
Segmentation I discussed at length in a previous series is one of the best data
aggregation tools you can possibly have. By applying a behavioral segmentation
to a visitor, you create a single column of data (the segment code) that
captures a combination of interest, functional and even non-behavioral
characteristics.
Capturing a reasonable snapshot of what a visitor has looked at by content and function can support a wide range of business applications. Interest profiling is useful for driving offline marketing, personalization, campaign targeting, sales support, call-center operations support and analysis, and much more. It's a simple, efficient way to capture a great deal of interesting visitor information in a very compact and usable form.
So far, however, none of these models capture one of the most important facts about a visitor’s online behavior – how a visitor’s behavior is changing over time. In the next post, I’ll take up a discussion of magic moments and some simple techniques for capturing a short-hand view of key facts like “most recent interest,” “new interest,” and “changes in interest.”
