We’ve lived with enterprise business intelligence (BI) for several decades now, and it’s no secret that many of us are fed up with the inability of such expensive technology to deliver the business value we need. Some vendors, such as Platfora, have keyed into this frustration by pushing the “panic button” and declaring the end of data warehouses and of processes such as Extract, Transform, and Load (ETL). But such histrionics won’t really help us. Just because the ETL process is one of the reasons enterprise BI is slow doesn’t mean it’s unimportant. The idea that we won’t have to clean and disambiguate data is ludicrous.
Similarly, the swan song many vendors want to sing for data warehouses is also a paean to Hadoop, which does afford the capability of performing ETL without constructing the monolithic models of the past. It’s true that info cubes and star schemas may be on their way out, but some degree of programming is still necessary to make Hadoop useful as anything other than mass storage.
Such searing proclamations do clear the air for more sober analysis. As desktop data analysis tools become more sophisticated and easier to use, it’s worth asking which parts of BI should stay in the big data era and which should go. It turns out that most of it stays, even if it changes clothes.
ETL Is Not Going Anywhere
First of all, ETL is not going anywhere, according to Scott Yara, co-founder and VP of products at EMC Greenplum.
“Is needing to extract data out of operational systems going to go away? No,” he says. “Is transforming raw unstructured data into formats that can be efficiently queried going to change? No. Is loading data into multiple, geographically distributed big data systems going to change? No.”
What might change is the old guard that has dominated the ETL function for so long. It’s not guaranteed that Informatica will continue to be the tool people use to process big data workloads. But then, if Informatica and other established vendors reassert how they fit in the new architectural landscape, they may earn a second or third life after all, Yara says.
Some kind of relational set-based model will continue to materialize around data, even if star schemas and snowflakes go away. It may not be Oracle or IBM DB2. It may be in the form of Hive or another meta-store on top of Hadoop. But someone will need to create some kind of table or relational model of data.
When it comes time to perform the analysis, at some point the analyst must declare a perspective, and that perspective will have an inherent hierarchy. The data then must be explored within that hierarchy. That hasn’t changed, though some of the platforms for doing so have become more accessible.
“Multidimensional materialization of datasets for interactive data exploration, what some might have called MOLAP [Multidimensional Online Analytical Processing] 20 years ago, which then became ROLAP [Relational Online Analytical Processing], still exists,” Yara says. “Greenplum and others are starting to demonstrate interesting new ways to do this on a Hadoop Distributed File System (HDFS) infrastructure.”
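To make the idea of materializing data along a declared hierarchy concrete, here is a minimal sketch in Python. It is illustrative only: the region > country > city hierarchy, the sample sales figures, and the function names `materialize_rollups` and `drill_down` are all assumptions invented for this example, not anything Greenplum or any other vendor ships. It precomputes totals at every level of the hierarchy so an analyst can drill down interactively without rescanning the raw rows.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, country, city, sales). Illustrative data only.
FACTS = [
    ("EMEA", "France", "Paris", 120),
    ("EMEA", "France", "Lyon", 80),
    ("EMEA", "Germany", "Berlin", 150),
    ("APAC", "Japan", "Tokyo", 200),
]


def materialize_rollups(facts):
    """Precompute sales totals at every level of the declared hierarchy."""
    totals = defaultdict(int)
    for *path, sales in facts:
        # Each prefix of the path is one level of the hierarchy;
        # the empty prefix () holds the grand total.
        for depth in range(len(path) + 1):
            totals[tuple(path[:depth])] += sales
    return dict(totals)


def drill_down(totals, prefix):
    """Return the immediate children of a hierarchy node with their totals."""
    prefix = tuple(prefix)
    depth = len(prefix) + 1
    return {
        key[-1]: value
        for key, value in totals.items()
        if len(key) == depth and key[:-1] == prefix
    }


CUBE = materialize_rollups(FACTS)
```

Calling `drill_down(CUBE, ("EMEA",))` returns the per-country totals under that region. The trade-off is the classic one: more storage and load-time work in exchange for fast exploration, which is the same bargain regardless of which platform holds the data.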
All the steps of BI will remain, including sorting, ETL, and Master Data Management (MDM). What is still to be seen is which technologies will be used to perform those steps. Will Cognos, Hyperion, OBIEE, Platfora, or QlikView be the new persistence layer for interactive data exploration? It’s unclear. What’s certain is that the threshold for cheaper and easier ways of performing the same tasks is always moving down. As a consequence, all of these vendors will need to redefine themselves as the process moves forward.
Enterprise BI performed a great service to the industry when it was developed, Yara says. But some of the speed and procedural compromises those systems made, based on previous generations of architecture, can now be discarded.
“Workloads in an enterprise are like water or gas,” Yara says. “They move where there’s the most open capacity. They flow downstream. So, as systems become cheaper and more flexible, and as they become more powerful to use, workloads move off the previous generations of systems and they move to these new systems.”
A more durable approach for vendors, CIOs, and CTOs alike is to separate the functions of the value chain for making use of data, which are timeless, from the mechanisms we happen to use right now, which are evolving. With such a perspective, we can have a clear discussion of new capabilities and new ways of making the value chain work. In other words, we can spend our time thinking about how to get work done and avoid the confusion caused by poorly thought-out marketing claims.