I can’t hope to answer all of those questions in a single post nor is it reasonable to think there’s just one answer for every situation, but I did want to outline what I think were some of the most interesting discussions and conclusions that I took away from X Change around the shape of analytics technology.
Lesson #1: Stay out of my Sandbox
When an organization builds out a technology stack around analytics, one of the key decision points revolves around the role of the primary analytics platform. On the one hand, there’s a strong belief in the analytics community that the analytics platform needs to be a “sandbox” – by which we mean a place entirely under the control of the analytics team, unrestricted by formal schemas, change requirements, limitations on queries, and all the other maddening apparatus of IT governance. Many of our clients have struggled as they operationalized analytics and began to build reports or production jobs on the analytics platform. Not only do those production jobs often go awry (because the environment is poorly controlled), but the problems that arise often lead to crippling restrictions on the use of the analytics platform. So the need for a sandbox is real.

On the other hand, the whole point of doing analytics is to operationalize the results. If your analysts come up with a great segmentation, you need to be able to use that segmentation. This conflict between the need for a sandbox and the need to operationalize analytics is a big deal.

This is one issue where I feel like I'm in the mainstream of the analytics community: the need for a “sandbox” environment, with all that implies, is real. So is the need to operationalize out of (but also OUTSIDE of) that environment. Put the two together, and there’s a clear need for a well-defined process for moving analytics into reporting and operational systems and, in most cases, a clear case for systems both below and beside the analytics platform to support reporting and personalization.
Lesson #2: Programmers Wanted
As I discussed in my post on data science, the analysis of digital data (and other big data analytic areas) requires a different kind of analytics – one where the order of events, the time between events, and the pattern of events are the significant components of analysis. Where order, time-between, and pattern are significant, tools like SQL are poor performers. While SQL can accomplish (almost) anything, certain types of queries are very challenging to code in it. I’ve seen skilled SQL programmers spend days constructing a massive query that could be duplicated in about 30 minutes, and run much faster, in C#. That isn’t because the SQL guys aren’t sharp, and it certainly isn’t because SQL is a poor tool. But like any tool, SQL has its strengths and weaknesses. It isn’t the best tool for every job: it isn’t a good choice for building custom analytics, and it isn’t a good choice for stream ETL.
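To make the contrast concrete, here's a minimal sketch (the data, names, and ten-minute window are all hypothetical) of an order- and time-between-aware question: which visitors purchased within ten minutes of a search? In procedural code this is a single linear pass; the equivalent self-join in SQL gets ugly fast.

```python
from datetime import datetime, timedelta

# Hypothetical event stream: (visitor_id, timestamp, event_type),
# assumed already sorted by time -- the kind of ordered, time-between
# question that is painful to express in SQL.
events = [
    ("v1", datetime(2012, 10, 1, 9, 0), "search"),
    ("v1", datetime(2012, 10, 1, 9, 4), "purchase"),
    ("v2", datetime(2012, 10, 1, 9, 0), "search"),
    ("v2", datetime(2012, 10, 1, 9, 30), "purchase"),
    ("v3", datetime(2012, 10, 1, 9, 5), "purchase"),
]

def purchasers_after_search(events, window=timedelta(minutes=10)):
    last_search = {}   # visitor -> time of most recent search
    matched = set()
    for visitor, ts, etype in events:   # one order-aware linear pass
        if etype == "search":
            last_search[visitor] = ts
        elif etype == "purchase":
            if visitor in last_search and ts - last_search[visitor] <= window:
                matched.add(visitor)
    return matched

print(purchasers_after_search(events))  # {'v1'}
```

Only v1 qualifies: the purchase follows a search by four minutes, while v2's gap is thirty minutes and v3 never searched at all.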
SQL and standard ETL aren’t the only casualties. Big data analytics will often demand significant modeling that isn’t available as canned routines in SAS or SPSS. Oh, and I don’t think that most machine-learning algorithms are right for the job either. This is really important. It means that much of the interesting analytics you’re going to want to do on a big data system will have to be built from the ground up.
So what are the good options for all this ground-up work? Frankly, I’m not sure there is an alternative to full-on programming languages.
I know that isn’t great news. Programming is expensive, error-prone and difficult to maintain. It’s also a fairly uncommon skill among analysts. All bad things.
But we have to deal with it. Programming is going to be a critical in-team (non-offshore) skill. Big data applications routinely need programming skills in both ETL and analysis, and I don’t see that changing any time soon.
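As an illustration of what that in-team programming looks like on the ETL side, here's a minimal stream-ETL sketch in Python (the log format and field names are invented for the example): parse raw hit lines one at a time, drop malformed records, and emit cleaned rows without ever holding the full file in memory.

```python
import json

# Hypothetical raw hit log: one JSON record per line, some of it garbage.
RAW_LINES = [
    '{"visitor": "v1", "url": "/home", "status": 200}',
    'not json at all',
    '{"visitor": "v2", "url": "/cart", "status": 500}',
    '{"visitor": "v3", "url": "/checkout", "status": 200}',
]

def parse(lines):
    """Yield parsed records, silently dropping malformed lines."""
    for line in lines:
        try:
            yield json.loads(line)
        except ValueError:
            continue

def clean(records):
    """Keep only successful hits and project the fields we need."""
    for rec in records:
        if rec.get("status") == 200:
            yield {"visitor": rec["visitor"], "url": rec["url"]}

# Generators chain into a streaming pipeline: nothing is materialized
# until a consumer pulls rows through it.
rows = list(clean(parse(RAW_LINES)))
print(rows)
```

The same generator-chaining pattern scales to reading from a file or a message queue instead of an in-memory list; each stage stays small, testable, and composable.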
If you’re thinking about technology platforms, it’s worth noting that not every language is supported on every platform and, depending on your history and preferences, C++ may look better or worse than C# or Java or Python.
Lesson #3: What, my massively scalable big data system isn’t good for reporting and every kind of analytics?
It seemed like almost every enterprise at X Change that had invested heavily in Hadoop systems in the last 12-18 months had come to a very similar conclusion: some kinds of analytics and ETL are great in these systems, but other kinds not so much. And when it comes to driving reporting and visualization off of these systems, it has been difficult or flat-out impossible. A lot of this, in my opinion, comes back to what makes big data challenging in the first place. When you’re doing the right kind of analytics (order-, sequence-, and pattern-based), these are the right kind of systems. When you’re not, they aren’t necessarily an improvement. Not every problem is a big data problem, and not every kind of analytics works better on a big data platform. Particularly when it comes to reporting, significant aggregation is nearly always essential to deliver adequate performance, and that hasn’t yet changed. Most shops seem to have ended up using their big data environment as a new, highly flexible way to generate cubes for reporting. That’s a bit disappointing, but welcome or not, that seems to be the place we’re at.
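That cube-generation pattern is simple to sketch. Here's a hypothetical example in Python: roll raw hit detail up to (day, page) counts, producing a small pre-aggregated summary that a reporting or visualization tool can query quickly instead of scanning the detail store.

```python
from collections import Counter

# Hypothetical raw hit detail: (day, page) per hit, as it might come
# back from a big data detail store.
hits = [
    ("2012-10-01", "/home"),
    ("2012-10-01", "/home"),
    ("2012-10-01", "/cart"),
    ("2012-10-02", "/home"),
]

# The "cube": pre-aggregated counts keyed by the reporting dimensions.
cube = Counter(hits)

print(cube[("2012-10-01", "/home")])  # 2
```

In practice the detail scan would run inside the big data environment and only the small cube would be exported, but the shape of the job is the same: group by the reporting dimensions, aggregate, ship the summary downstream.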
And speaking of big data, the Conference season comes to a close for me in early November at IBM’s Information On Demand Conference where I’ll be presenting my theory of “big data” (in 20 minutes or less). I love this presentation. If you’re there, check it out!