Warehousing Web Analytics Data - Interest Profiling

Overview

Sophisticated organizations are increasingly finding good reasons to move data from their web analytics tools to other data processing and analysis platforms. In this series, I’ll be discussing the “why” and “how” of taking data from your web analytics solution and moving it into other platforms. This second installment will cover interest profiling.

In the first installment of this series, I covered some of the reasons why companies increasingly want to move data out of their web analytics solution. When they do so, the biggest challenge they face is finding the right data models to let them use the data effectively. Web analytics data is much too large to use conveniently in just any format – so you almost always have to do some significant aggregation on it. This is as true inside the web analytics tool as it is outside. But the types of aggregation you’ll probably want outside the tool are very different from the choices vendors make to support reporting inside the tool.

The aggregation methods (or data models) I’m going to discuss today are relevant to several different uses of web analytics data: joining web data to secure customer data, driving actionable systems and supporting analysis in more sophisticated statistical tools.

Interest Profiling

With interest profiling, the goal is to capture the depth and distribution of your visitors’ interests. For most web sites, this will be “content” interest. For media/publishing sites, forums, public sector sites, public financial services sites, health and pharma sites, and many other types, one of the most interesting facts about your visitors is how much and what type of content they looked at.

In capturing this information, you often have to make a trade-off between granularity and convenience of data usage.

On a large media site, for instance, the top-level site may be broken up into four or five basic categories. For a really general purpose site the categories might be things like news, sports, and entertainment. For a sports site, these top-level categories might be football, baseball, basketball, etc. In each case, however, these categories will have interesting sub-categories. And these sub-categories will usually have interesting sub-sub categories.

The farther down you drill, the more precise is your categorization of interest and the more data and categories you have to contend with.

In general, what you are looking to capture is the depth and extent of content interest. The most obvious ways to represent that are by total views or total time per content area:

News:  22 Views

Sports: 38 Views

Entertainment: 7 Views

Science: 9 Views

I’ve found that when using this data, you often want to understand issues around mindshare: questions like ‘Which content area is a visitor most interested in?’ are mindshare questions. But you also want to understand actual levels of usage: ‘Which visitors had more than 20 views in sports?’

If your data model captures all content categories, then saving the raw numbers allows you to answer either type of question. If you want to save the data as percentages (News is 27% of Views), you’ll also need to save a number like “Total Content Views” to be able to answer either type of question.

Though there is some level of redundancy, I'd recommend providing views to the data that capture both ways of thinking. It’s surprising how often users of the data won’t think about mindshare if the only fields represented to them are raw counts.
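As a rough illustration of keeping both views of the data, here is a minimal Python sketch. It uses the view counts from the example above; the names `views` and `with_shares` are hypothetical, and a real warehouse would do the same thing in whatever platform holds the visitor table.

```python
# Hypothetical per-visitor row: raw view counts per top-level content area.
views = {"News": 22, "Sports": 38, "Entertainment": 7, "Science": 9}

def with_shares(counts):
    """Return total content views plus each area's share of that total."""
    total = sum(counts.values())
    shares = {area: (n / total if total else 0.0) for area, n in counts.items()}
    return total, shares

total_views, shares = with_shares(views)

# Mindshare question: which content area is this visitor most interested in?
top_area = max(shares, key=shares.get)

# Usage question: did this visitor have more than 20 views in Sports?
heavy_sports = views["Sports"] > 20

# Raw counts can be rebuilt from shares only because the total was kept too.
recovered = {area: round(share * total_views) for area, share in shares.items()}
```

Storing the total next to the shares is what keeps the percentage form from losing information.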

It’s usually easy enough (and appropriate) to allocate a column per visitor for each high-level content area on your site. But that approach won’t work if you want to make the data model for content interest more granular or if you happen to have an unusually high number of top-level categories.

If your interest is heavily focused on what a visitor is most interested in, you can flip the model around a little bit and make your columns generic. In other words, you can have columns like “Top Interest,” “Second Interest,” etc. These columns will contain the “content type” that fits the description – so for one visitor it might be “news” and for another the data value in “Top Interest” might be “sports.” This approach doesn’t support questions like ‘Which visitors had more than 20 views in sports?’ but it does support any number of highly granular interest categories. If you are taking this approach, you’ll probably want to capture the usage in each area as well. So your data might look like this:

Top Interest/% of Content Views: Sports,  45%

2nd Interest/% of Content Views: News, 27%

3rd Interest/% of Content Views: Science, 11%

With this model, you’ll also need to capture total content views. One nice thing about this approach is that it can be replicated at multiple content levels.
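A minimal Python sketch of deriving these generic columns from raw counts follows. The function and column names (`top_interest_columns`, `interest_1`, `interest_1_pct`, `total_content_views`) are assumptions made for illustration. It reuses the view counts from the earlier example, so the percentages differ from the illustrative figures just above.

```python
def top_interest_columns(counts, n=3):
    """Build generic 'Nth interest' columns from per-category view counts."""
    total = sum(counts.values())
    # Rank by views descending; break ties alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    row = {"total_content_views": total}
    for rank, (area, views) in enumerate(ranked[:n], start=1):
        row[f"interest_{rank}"] = area
        row[f"interest_{rank}_pct"] = round(100 * views / total) if total else 0
    return row

# Hypothetical visitor with the counts used earlier in this post.
row = top_interest_columns(
    {"News": 22, "Sports": 38, "Entertainment": 7, "Science": 9}
)
```

Because the column names are generic, the same function could be run once per level of the content hierarchy without changing the table layout.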

A problem with this method is that, in and of itself, it doesn’t support a running tally. If you have a database with “Top Interest” in it, and you are updating your data, you have no way of knowing when a category that wasn’t in the list might now belong.

It’s probably essential that somewhere in your model you keep the data in a fashion that allows you to build these running tallies. This may not be how you load the data into a tool like SPSS or how you present it to a user in a view, but it really does have to exist.
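One way to picture this is sketched below in Python, under the assumption that full per-category counts are kept per visitor and the “top interest” list is derived from them after each load. The names `tallies`, `apply_batch`, and `top_interests`, and the visitor id `v1`, are hypothetical.

```python
from collections import Counter

# Full per-visitor tallies: every category is kept, not just the top few.
tallies = {
    "v1": Counter({"Sports": 38, "News": 22, "Science": 9, "Entertainment": 7})
}

def apply_batch(tallies, batch):
    """Add a batch of (visitor_id, category, views) rows to the running tallies."""
    for visitor_id, category, views in batch:
        tallies.setdefault(visitor_id, Counter())[category] += views

def top_interests(tallies, visitor_id, n=3):
    """Derive the ranked interest list from the full tally."""
    return [category for category, _ in tallies[visitor_id].most_common(n)]

before = top_interests(tallies, "v1")

# A new load arrives in which Entertainment, previously outside the top three,
# picks up enough views to move into the list.
apply_batch(tallies, [("v1", "Entertainment", 20), ("v1", "Science", 1)])
after = top_interests(tallies, "v1")
```

Had only the top three categories been stored, the 7 earlier Entertainment views would have been lost, and Entertainment (20 new views) would not have overtaken News (22).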

A different (and possibly complementary) approach is to completely pivot the data and create a separate data structure that is based on the content categorization hierarchy. To do this, you’d build a table where the primary key is the lowest-level content categorization you’re interested in plus a visitor id. You’ll also need to capture the hierarchical structure of the content. Each data row of the content categorization would have the content category, the visitor id, and a count. This structure isn’t nearly as easy to deal with as a visitor-based aggregation, but it can support very powerful content/visitor analysis.
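The pivoted structure might be sketched as follows, using Python’s built-in `sqlite3` module purely for illustration. The table and column names (`content_category`, `visitor_content`, `parent`, `views`), the two-level hierarchy, and the sample rows are all assumptions; a deeper hierarchy would need a recursive roll-up rather than the single join shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content_category (
    category TEXT PRIMARY KEY,
    parent   TEXT REFERENCES content_category(category)  -- NULL for top level
);
CREATE TABLE visitor_content (
    category   TEXT NOT NULL REFERENCES content_category(category),
    visitor_id TEXT NOT NULL,
    views      INTEGER NOT NULL,
    PRIMARY KEY (category, visitor_id)  -- lowest-level category plus visitor id
);
""")
conn.executemany("INSERT INTO content_category VALUES (?, ?)", [
    ("Sports", None), ("News", None),
    ("Football", "Sports"), ("Baseball", "Sports"), ("Politics", "News"),
])
conn.executemany("INSERT INTO visitor_content VALUES (?, ?, ?)", [
    ("Football", "v1", 25), ("Baseball", "v1", 13), ("Politics", "v1", 22),
    ("Football", "v2", 4),
])

# Content-centric question: which visitors had more than 20 Football views?
football_fans = conn.execute(
    "SELECT visitor_id FROM visitor_content WHERE category = ? AND views > ?",
    ("Football", 20),
).fetchall()

# Roll lowest-level counts up one level using the stored hierarchy.
rollup = conn.execute("""
    SELECT c.parent, vc.visitor_id, SUM(vc.views)
    FROM visitor_content vc
    JOIN content_category c ON c.category = vc.category
    GROUP BY c.parent, vc.visitor_id
    ORDER BY c.parent, vc.visitor_id
""").fetchall()
```

Keeping the hierarchy in its own table means the counts are stored once at the lowest level and higher-level totals are derived on demand.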

Not every site is interested in the “content” of a page. For some types of sites, it’s much more important to understand the function of a page. Many operational sites are this way. If you're measuring a trading site, an online banking site, or a public-sector service site, then your primary interest in user profiling will be to capture data about the quantity and type of functions a user is performing (e.g. trading, portfolio management, planning, researching, etc.). This type of profiling is essentially the same exercise but with a different type of overlay.

For some sites, both categorizations may be interesting. When you do have both, they should usually be kept separate in your data model, though content categorizations are sometimes a sub-category of functional classifications (“retirement” and “college planning” might be interest categories under a functional category of “planning”).

It’s also important to realize that the type of Behavioral Segmentation I discussed at length in a previous series is one of the best data aggregation tools you can possibly have. By applying a behavioral segmentation to a visitor, you create a single column of data (the segment code) that captures a combination of interest, functional and even non-behavioral characteristics.

Capturing a reasonable snap-shot of what a visitor has looked at by content and function can support a wide range of business applications. Interest profiling is useful for driving offline marketing, personalization, campaign targeting, sales support, call-center operations support and analysis, and much more. It's a simple, efficient way to capture a great deal of interesting visitor information in a very compact and usable form.

So far, however, none of these models capture one of the most important facts about a visitor’s online behavior – how a visitor’s behavior is changing over time. In the next post, I’ll take up a discussion of magic moments and some simple techniques for capturing a short-hand view of key facts like “most recent interest,” “new interest,” and “changes in interest.”
