IBM and Semphonic just partnered on a new Whitepaper
tackling one of the hottest and most challenging topics in digital analytics –
choosing the right big data technology stack. I finished it a couple of weeks back and it’s now gone into
general release. In addition, I’m going to be doing a
webinar about it with IBM’s CTO of
Big Data Solutions, Krishnan Parasuraman.
I’m very excited about both.
In the Whitepaper, I got to combine some of the big themes
that have been emerging in our practice: the unique challenges of digital
analytics for traditional statistical and database methods, the impact of those challenges on the selection
of a technology stack, and the best ways to structure a digital analytics
technology initiative to address the issues and build an effective digital big
data solution.
Over the last twelve months, Semphonic has been incredibly
active in this area. We never used to focus that much on strategic
measurement engagements. But the confluence of Big Data and Digital Analytics
has changed that. With our extensive background in database marketing, we’re
comfortable with, and indeed eager to get our hands on, the detailed customer data and
the database, BI, and statistical tools that support that deep access. We’ve
had fifteen hard years trying to figure out how to measure, segment, and use
digital data effectively. We’ve also seen first-hand how easy it is to break
traditional technology stacks with digital data, having done it repeatedly! That combination of big data technology and digital measurement chops is unusual, and I think that’s why we’ve been getting asked so
often to help large enterprises craft a strategy that blends these elements
effectively.
In the Whitepaper, I've tried to distill that experience down
into a useful framework for thinking about digital marketing analytics
in a big data world.
So Just What is Big Data?
The Whitepaper starts with a pretty deep discussion of the
challenges of digital and why digital is a paradigm case of big data. I know
people are already starting to hate the term big data, and I don’t really blame
them. In the broader market, it doesn’t have a specific meaning. It’s lots of data. We get
that. But how much data is big data? And
why does having lots of data really change anything?
I try to tackle this definitional morass in the Whitepaper.
At Semphonic we’ve come to have a pretty specific view about what big data
means and why it really is somewhat different – not just “more rows than
normal.” We believe that big data is really about a drive to “detail” data and
to algorithmic analytics techniques that don’t work off of aggregates. Yes, volume does
count. But big data isn’t just big, it’s big because we’ve shifted the level of
analysis.
This shift to detail-level analysis has a much bigger impact
than you might suppose. From a technology standpoint, it does drive more row
volume. But from an analysis perspective, it makes many traditional BI
techniques (that depend on cube-based aggregates) impossible or irrelevant.
In digital, it has even deeper implications. Which brings me
to the part of the Whitepaper that I think is the most interesting and
important.
The Challenge of Stream Data
You’ll often hear digital data described as “unstructured.”
I think that’s wrong (at least in part). Yes, social media data is truly
unstructured. But analytics data collected from the Web and Mobile channels is
certainly structured. The SiteCatalyst data-feed (our most common source of
this information) is just a classic, big, comma-delimited flat file with 400 or
so fields per row. Structure!
In fact, almost every digital data source except social is
structured data.
So why this persistent description of digital data as
unstructured?
Well, digital data does drive IT folks and data architects
crazy. But it’s not the lack of structure that does it, it’s the level of
meaning.
In most digital data, there’s no meaning inherent in a
single detailed row. The server call (or page view) is not, on its own, the unit of analysis.
Worse, digital data doesn't aggregate cleanly. Adding up server calls to create page view counts or time on site isn't, in most cases, the path to meaning. Meaning comes from interpreting a stream of server calls (on the Web this is a
Visit or Path). So digital data is (mostly) semi-structured. Each row is structured just fine, but getting to anything interesting requires interpretation (effectively the addition of structure).
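To make that concrete, here is a minimal sketch of turning individual server calls into Visits. It is an illustration only, not code from the Whitepaper: the three-field row layout, the `build_visits` name, and the 30-minute inactivity timeout are all assumptions (a real data feed has hundreds of fields and its own visit rules).

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity gap that ends a visit. Thirty minutes is a common
# convention, not something the data feed itself dictates.
VISIT_TIMEOUT = timedelta(minutes=30)

def build_visits(rows, timeout=VISIT_TIMEOUT):
    """Group (visitor_id, timestamp, page) server calls into ordered visit paths."""
    by_visitor = defaultdict(list)
    for visitor_id, ts, page in rows:
        by_visitor[visitor_id].append((ts, page))

    visits = []
    for visitor_id, calls in by_visitor.items():
        calls.sort()  # order each visitor's calls by timestamp
        path, last_ts = [], None
        for ts, page in calls:
            # A long enough gap closes the current visit and starts a new one.
            if last_ts is not None and ts - last_ts > timeout:
                visits.append((visitor_id, path))
                path = []
            path.append(page)
            last_ts = ts
        visits.append((visitor_id, path))
    return visits

# Hypothetical server calls: no single row says much on its own.
rows = [
    ("v1", datetime(2013, 1, 1, 9, 0), "home"),
    ("v2", datetime(2013, 1, 1, 9, 1), "home"),
    ("v1", datetime(2013, 1, 1, 9, 2), "search"),
    ("v1", datetime(2013, 1, 1, 9, 5), "product"),
    ("v1", datetime(2013, 1, 1, 14, 0), "home"),
]
visits = build_visits(rows)
for visitor_id, path in visits:
    print(visitor_id, " > ".join(path))
```

The five rows become three visits, and it is the path (home > search > product) that carries the meaning, not any individual row or a simple count of rows.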
Why is this important?
The vast majority of ETL, query and statistical analysis
techniques have been built to operate on individual rows. That doesn’t work in
digital. In digital, meaning exists only in the combination of multiple rows
(paths) and that combination isn't a straightforward aggregation.
Stream data creates a second big problem. Stream data defeats classic
join strategies. One-to-One and One-to-Many joins are almost the only types of
joins ever used in classic database work. With streams, you get Many-to-Many
joins. Many-to-Many joins don’t work well: every row on one side matches many rows on the other, so the result multiplies rather than lining up.
We’ve seen a number of cases where our clients dump digital
data streams into a warehouse, find join keys, and think they are done. In a
traditional world, putting two data sources on the same box with a join key
makes it easy for an analyst to put them together. In a stream world, it doesn’t quite solve
the problem.
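A small sketch shows why a join key alone isn't enough. Everything here is hypothetical (the two streams, the field layout, and the `join_on_key` and `roll_up` helpers); it only illustrates the general Many-to-Many behavior, and the roll-up at the end is one common workaround rather than necessarily the approach the Whitepaper recommends.

```python
from collections import defaultdict

def join_on_key(left, right):
    """Naive equi-join on the first field of each row."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lval, rval) for key, lval in left for rval in index[key]]

def roll_up(rows):
    """Collapse a stream to one entry per key, keeping the ordered events."""
    summary = defaultdict(list)
    for key, value in rows:
        summary[key].append(value)
    return dict(summary)

# Two hypothetical streams that share a visitor key.
web_calls = [("v1", "home"), ("v1", "search"), ("v1", "product"), ("v2", "home")]
email_events = [("v1", "open"), ("v1", "click"), ("v2", "open")]

# Joining stream to stream multiplies rows:
# 3 web rows x 2 email rows for v1, plus 1 x 1 for v2.
joined = join_on_key(web_calls, email_events)
print(len(joined))  # 7

# Rolling each stream up to the visitor level first gives one row per visitor.
web, email = roll_up(web_calls), roll_up(email_events)
per_visitor = {key: (web[key], email.get(key, [])) for key in web}
print(len(per_visitor))  # 2
```

Four web rows and three email rows produce seven joined rows, none of which corresponds to anything an analyst would want to count. With real streams of hundreds of events per visitor, the multiplication gets out of hand quickly.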
In the Whitepaper, I take a real deep dive into this topic
because I think it is, quite simply, the key to understanding the challenge of
digital big data warehousing.
Translating Problems into Solutions
It’s nice to have a good definition of big data. It’s
certainly interesting to know why digital data is such a challenge. But how
does that knowledge translate into a useful framework for moving forward?
Well, that’s the third part of the Whitepaper. Because once
you understand some of the unique challenges of big data analysis and digital,
you can start to map different applications of digital to specific attributes
of different technology stacks.
In the Whitepaper, I look at a whole set of different
decision factors (from handling very large row counts, to supporting
algorithmic queries, to real-time analytics, to the availability of expertise) and
match them to another set of digital marketing use-cases (things like email
Targeting, Personalization, Customer Analytics and Attribution).
Not every digital marketing application has the same
requirements or puts the same stress on the technology decision factors. So if you know what
types of digital marketing applications you have, the Whitepaper gives you a
great framework for evaluating what types of technology capabilities you need.
If you’re at the point a lot of our clients are, you know that the
range of new technologies and big data capabilities, while welcome, makes
choosing the right approach harder, not easier. There can simply be too many choices.
Without a way to think about which trade-offs are appropriate (and believe me,
EVERY technology has trade-offs), making a decision can feel random.
Yes, IBM has put together a pretty comprehensive big data
solution set. It will probably be on just about any enterprise short-list for
big data. But our goal (both Semphonic’s and IBM’s) in this Whitepaper wasn’t to
evaluate the IBM solution or even to talk much about it. It was to lay out a
way for ANY organization evaluating ANY big data technology stack to think more
clearly about what’s needed and why.
Download it here! Register for the webinar here!