If you translate IT analytics professions into their Hollywood counterparts, analysts are the stars and big data architects are the directors. It’s great to be in the leading roles. But you can’t make a movie with just stars and a director. Somebody has to build sets, manage the lighting and roll film. Our IT analytics equivalent is ETL – the difficult but totally essential process of moving data and manipulating it into the highly usable forms specified by the architects and demanded by the analysts. In this extended series on big data, I’ve so far concentrated on the essential role of data models, even in a big data world, and on what those models should be. This is, I’ll admit, far more interesting than talking about the manipulations of the data necessary to produce them. But I’ve seen more than one organization get hung up on the ETL – especially in this new big data, Hadoop-based world.
ETL has always been an important and largely unsung part of the broader data warehousing world. But there are several reasons why its role is larger and more challenging these days. The centerpiece of the enterprise analytics platform is typically a data lake – a Hadoop-based repository for traditional enterprise data, machine-generated sources like digital data, and truly unstructured data from social media, call-center notes, user feedback and other text-oriented systems.
When you drop all that data into one place, you create an unprecedented set of ETL challenges. Not only have you landed more data sources in one place and with higher integration requirements than ever before, you’ve also introduced several fundamentally new types of data into the warehouse that (as this whole series has argued) require very complex and new forms of data transformation to be useful.
Statistical ETL and manipulations like the graphing techniques I described last week are anything but traditional ETL. Similarly, text-oriented sources require advanced language processing to create structured meaning that can be used in subsequent analysis. That’s ETL of a deeply analytic sort. So modern enterprise analytics platforms dramatically increase both the quantity and the complexity of the required ETL.
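To make the idea of "analytic ETL" on text concrete, here is a minimal sketch of turning free-text call-center notes into structured rows. The notes, the keyword lexicon, and the field names are all hypothetical stand-ins – real language processing (entity extraction, sentiment scoring, topic modeling) is far richer than a few regular expressions:

```python
import re
from collections import Counter

# Hypothetical call-center notes; in practice these would come from a
# ticketing system or a text export.
notes = [
    "Customer upset about late delivery. Promised refund.",
    "Asked about upgrade pricing. Happy with service.",
    "Late delivery again; customer requested refund.",
]

# A toy keyword lexicon standing in for real language processing.
TOPICS = {
    "delivery": r"\bdelivery\b",
    "refund": r"\brefund\b",
    "pricing": r"\bpricing\b",
}

def structure(note):
    """Turn free text into a structured row: one flag per topic,
    plus a crude negativity flag."""
    row = {t: bool(re.search(p, note, re.I)) for t, p in TOPICS.items()}
    row["negative"] = bool(re.search(r"\bupset\b|\blate\b", note, re.I))
    return row

rows = [structure(n) for n in notes]

# Once structured, the text can feed ordinary aggregation and analysis.
topic_counts = Counter(t for r in rows
                       for t, hit in r.items() if hit and t in TOPICS)
```

The point is the shape of the work, not the technique: unstructured text goes in, analysis-ready rows come out, and that transformation is itself a form of ETL.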
But that’s not all.
In a traditional database system, ETL is part of a set of production processes that place data in a fixed and highly optimized form. While the data models I’ve been exploring are clearly similar in scope, one of the key differences in the modern analytics paradigm is that the data model is never fixed. All the models I’ve suggested are merely convenient building blocks for analysts to work from. They are not intended to be a fully described model or a static system. If we’ve learned anything in the past ten years, it’s that analytics requires highly customized data structures to answer particular questions.
The implication of this is clear: in the modern analytics platform, there will be production ETL to support the type of intermediate data structures I’ve been talking about, but most deep-dive analysis tasks will require additional ETL by the analyst. Here’s where Hadoop systems make life even harder (though they have some striking advantages here too). Not too long ago, when database appliances were the high-end analytics platform of choice, people evolved a paradigm of ETL as ELT – where the transformation of the data happens after load. ELT worked because database appliances are very well suited to doing massive transformations on full files using SQL. And the fact that SQL was the transform language made this paradigm much easier for most analysts.
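The ELT pattern is worth seeing in miniature: land the raw data first, then do the transformation inside the engine with SQL. The sketch below uses Python’s built-in sqlite3 as a stand-in for a warehouse appliance or Hive – the pattern, not the engine, is the point, and all table and column names are illustrative:

```python
import sqlite3

# ELT: Extract + Load first, Transform afterward, inside the engine.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")

# Load: dump source rows as-is, messy types and all.
con.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", "10.50"), ("u1", "4.25"), ("u2", "7.00")],
)

# Transform: cast, aggregate, and shape the data *after* it has landed,
# using SQL the engine can run at scale.
con.execute("""
    CREATE TABLE user_spend AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total_spend
    FROM raw_events
    GROUP BY user_id
""")

result = dict(con.execute("SELECT user_id, total_spend FROM user_spend"))
```

Because the transform step is just SQL, it stays within reach of most analysts – which is exactly why ELT caught on in the appliance world.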
ELT is still the dominant paradigm in the big data world. In fact, it’s almost the only paradigm. But while SQL is available on Hadoop platforms, Hive is neither as robust as the full, standard SQL available on a warehouse appliance nor as performant relative to the capabilities of the system.
If there is anything real to the idea of a data scientist, it resides here in the ability of a single analyst to handle both the data manipulation and analytics chores necessary to do productive work on a big data platform.
Assuming you don’t have a huge team of data scientists who are simultaneously capable of doing deep analytics while writing Java code to whip the data into shape, the need for high-powered ETL tools that are usable by analysts seems evident.
Systems intended to do just that are starting to emerge. Pickfire (created by an entrepreneur with years of experience dredging through digital data at the large-enterprise level) is a good example and illustrates both the need and the ways in which modern software is evolving. Pickfire lives in the cloud and has a clean, web-based UI, but it generates transformation code that runs natively on your own systems. Even in beta, the generated code supports a wide range of systems, and it is open, modifiable, and owned by the enterprise – so Pickfire is a zero-footprint system. You could stop using it at any time, and your only loss would be the inability to generate new code.
As a system, it’s designed to address both of the needs that I outlined above. It can be used to create or help bootstrap production ETL (though it doesn’t cover everything you’d need to construct all the types of data structures I’ve discussed) and it can support analysts who aren’t full-stack data scientists but who need to join, process, and modify data at a fairly high-level of sophistication.
Systems like Pickfire give analysts the ability to easily profile and browse source data stored in all sorts of systems and formats:
Once "loaded", the analyst can explore join strategies, scrub and filter data, and transform fields to create an analytics data set:
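Those analyst-side steps – join, scrub, filter, and transform fields into an analysis-ready set – can be sketched in plain Python. A tool like Pickfire would generate equivalent code from a graphical UI; every field name and value here is illustrative:

```python
# Raw source rows: messy strings, some unparseable values.
orders = [
    {"order_id": 1, "cust_id": "c1", "amount": " 120.00 "},
    {"order_id": 2, "cust_id": "c2", "amount": "bad"},
    {"order_id": 3, "cust_id": "c1", "amount": "35.50"},
]
customers = {"c1": {"segment": "retail"}, "c2": {"segment": "wholesale"}}

def clean_amount(raw):
    """Scrub: strip whitespace, drop unparseable values."""
    try:
        return float(raw.strip())
    except ValueError:
        return None

analytic = []
for o in orders:
    amt = clean_amount(o["amount"])
    if amt is None:              # filter: discard rows that fail scrubbing
        continue
    cust = customers.get(o["cust_id"], {})
    analytic.append({            # join + transform into the analytic shape
        "order_id": o["order_id"],
        "segment": cust.get("segment", "unknown"),
        "amount_usd": round(amt, 2),
    })
```

Nothing here is exotic – which is the argument: the value of analyst-facing ETL tools is putting exactly this kind of routine joining, scrubbing, and reshaping within reach of people who aren’t programmers.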
This ability for analysts to self-serve ETL in a natural, non-programmatic fashion can help the non-technical analyst navigate this brave new world of big data. This new generation of graphical, analyst-focused big data ETL systems still has a way to go before it is fully mature, and even further before it can handle all of the ETL tasks I’ve been describing in this series. But most of the ETL I’ve been describing is extra complex because it’s designed to support a more powerful middle-level data model than exists naturally in the detail data. Once that model is in place, it’s much easier to work with the data and do additional ETL on it. That’s critical.
Because when you get right down to it, a data scientist is little more than an analyst who can do ETL. If you can deploy powerful, easy-to-use ETL systems into the hands of your analysts, you may have the surest, easiest (and least expensive) path to building a true data science team.