Big Data Problems
Wal-Mart had a big data problem long before anyone called it "big data." So did McDonald's and any number of other massive companies. These companies also had a big data edge, because they had the resources to develop purpose-built solutions that could navigate their data stores and generate vital business insights. In fact, some companies have gotten almost too good at identifying customer behavior; witness the 16-year-old girl whose pregnancy was "outed" by Target's direct marketing efforts.
Today, you don’t have to be Wal-Mart to have a big data problem or the tools to begin attacking it. Even small businesses now generate and have access to a tremendous amount of data. However, many companies still struggle to find an approach to big data analysis that is both affordable and scalable.
Tools or Obstacles?
Enterprise business intelligence (BI) suites such as SAP BusinessObjects, Oracle Hyperion, IBM Cognos, and SAS dominate the upper end of the market. Once implementation is complete, these solutions can cost hundreds of thousands, even millions, of dollars. Even when operational, queries into these systems can become extensive IT projects, and self-service is difficult. The result is delays in getting questions answered.
Lighter, more business-oriented tools such as QlikView and Tableau are easily downloaded and deployed and cut analysis down to seconds. They are popular with their users because they are empowering and support a do-it-yourself culture of analysis. These solutions can aggregate data from many sources, but they are focused on the last mile. In most companies, the data pipeline feeding these tools still requires IT intervention to aggregate data or break it into digestible chunks. These solutions flourish when the needed data is available, but the more IT intervention required, the greater the chance of bottlenecks.
The same can apply to companies that use Hadoop. It's great for storing data without any predetermined schema or designated use, and it relies on easily procured commodity hardware in a grid. But it is a batch system, and coding in MapReduce is necessary to retrieve data in a usable format for most kinds of analysis. The Hadoop community has attacked this problem head-on with the latest version, which expands the methods for analyzing data beyond MapReduce. MapR Technologies has created a proprietary version of HDFS that allows access to data in Hadoop through many traditional methods. At its worst, when access to data is restricted and intermediated, Hadoop can become a "data attic" where data objects are tossed and forgotten rather than a vital "data lake" whose streams feed organizational decision-making.
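To see why MapReduce puts a programmer between the business user and the answer, consider a toy map/reduce pass written in plain Python (a sketch of the programming model, not Hadoop itself, with made-up sales records). Even a simple question such as "total sales per region" has to be expressed as a map phase that emits key/value pairs and a reduce phase that aggregates them:

```python
# Toy records: (region, sale_amount). In Hadoop these would be lines in HDFS.
records = [("east", 120.0), ("west", 80.0), ("east", 45.5), ("north", 60.0)]

def map_phase(records):
    # Map: emit one (key, value) pair per input record.
    for region, amount in records:
        yield (region, amount)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key, then aggregate each group.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals

print(reduce_phase(map_phase(records)))
# → {'east': 165.5, 'west': 80.0, 'north': 60.0}
```

Each new question (an average instead of a sum, a top-N list) means writing, deploying, and waiting on another batch job, which is the bottleneck the post-MapReduce Hadoop tooling aims to remove.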
Big Data Analytics for the Rest of Us
SiSense calls itself "The Big Data Analytics Company" and says it focuses on making a "Big Data Solution for the rest of us." The product is called Prism, a software-as-a-service (SaaS) offering designed to work on commodity hardware. Using a columnar database and disk compression, it can crunch data on a sub-$1,000 laptop with only 8 GB of RAM, for a "fraction" of the cost of SAP HANA or Hyperion, according to Amit Bendov, CEO of SiSense. At one customer, Prism manages 5 billion records.
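Why does a columnar layout let so much data fit on modest hardware? Values stored together in one column tend to repeat, so even a trivial compression scheme shrinks them dramatically. A minimal sketch (illustrative only, not Prism's actual storage format) using run-length encoding on a repetitive column:

```python
def rle_encode(column):
    """Run-length encode a column of values as [[value, count], ...]."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1  # extend the current run
        else:
            encoded.append([value, 1])  # start a new run
    return encoded

# A column of 1,000 rows with only a few distinct values (e.g. country codes)
# collapses to three entries instead of a thousand cells.
country_column = ["US"] * 500 + ["PL"] * 300 + ["DE"] * 200
print(rle_encode(country_column))
# → [['US', 500], ['PL', 300], ['DE', 200]]
```

Row-oriented stores interleave dissimilar values (names, dates, amounts) in each record, so runs like this never form; that difference is a large part of the columnar cost advantage the vendor is claiming.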
“It doesn’t cost an arm and a leg and doesn’t require dedicated hardware or IT team,” Bendov says of Prism. “I’m not suggesting you run a billion records on a laptop, but you actually could. We can run on a $3,000 Dell laptop what the larger BI vendors could only do with machines and software costing $150,000 or even higher.”
Further, Prism condenses into one solution all the parts of the big data value chain—acquisition, database, cleansing/ETL, management, analysis, and visualization—some of which still must be acquired separately, even with more expensive BI solutions. As a result, SiSense claims over 350 customers across 48 countries, varying widely in size and characteristics, from Target and Merck to fast-growing online companies and a public library network in Poland.
For companies with some infrastructure already in place, Prism can supply the missing pieces of a big data value chain. It can perform ETL on data being pumped into Hadoop or other repositories. It can accept data coming out of those repositories and perform analytics, reporting, and visualization. Or it can deliver distilled data to QlikView, Tableau, or other tools already in place. For companies without analytical infrastructure, Prism can be an end-to-end solution.
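The ETL role described above reduces to three steps: pull raw data out of a source, cleanse and normalize it, and load it into a store a reporting tool can query. A minimal, self-contained sketch in Python (the CSV snippet, field names, and "warehouse" dict are invented for illustration; real pipelines use far richer sources and targets):

```python
import csv
import io

# Raw export with inconsistent casing, stray whitespace, and a missing value.
raw_csv = "name, amount\nAlice, 10\nBob, \nalice , 5\n"

def extract(text):
    # Extract: parse the source into rows of raw strings.
    return list(csv.DictReader(io.StringIO(text), skipinitialspace=True))

def transform(rows):
    # Transform/cleanse: normalize names, type the amounts, drop bad rows.
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # discard rows missing an amount
        clean.append({"name": row["name"].strip().lower(),
                      "amount": float(row["amount"])})
    return clean

def load(rows):
    # Load: aggregate into a stand-in "warehouse" ready for a BI front end.
    warehouse = {}
    for row in rows:
        warehouse[row["name"]] = warehouse.get(row["name"], 0.0) + row["amount"]
    return warehouse

print(load(transform(extract(raw_csv))))
# → {'alice': 15.0}  (Bob's row dropped; "Alice"/"alice " merged)
```

The point of the sketch is the division of labor: when one product handles all three stages, there is no hand-off to IT between the raw source and the dashboard.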
A Vision for Democratizing Data Science
Bendov is on a mission to democratize data science and make big data analysis easier for small companies. He's not just taking on the enterprise BI vendors; he's also challenging the hiring psychology of analytics-hungry organizations. He believes that the obsession with obtaining highly skilled data scientists is misplaced and risks creating a new priesthood within the organization [see the Forbes article "Will Data Science Become the New Bottleneck?"]. What needs to happen, he argues, is better access to data: easier acquisition from sources, and direct access for business users who want to ask questions and get answers in their own language.
“We half-jokingly say we’re ‘stealing fire from the gods,’ offering tools that were once available to only the top companies, and now everybody has the same weapons,” Bendov says. “Even if you’re a one-person operation, you can still analyze big data in a very simple way. You don’t need to learn data science, R, or predictive machine learning or have a deep understanding of database tables. You can just start analyzing, and, within a matter of hours, get the same insights big companies are getting.”
The key is being cost-effective while letting users ask questions iteratively: each answer generates more questions, which can be answered in rapid succession without intervention from IT or data scientists.
Advice for Future Data Science Democrats
SiSense won’t solve all big data problems by itself, of course. In some cases, deep expertise and massively parallel processing will be necessary, Bendov admits.
But he does have some general advice for companies that wish to make the most of their data, irrespective of the technology they use.
- First, Bendov says, meaningful analysis won't happen in a company of any size without the right investments in data culture. "I always tell my people: 'I'm interested in your opinion on whether we should do A or B, but I'm much more interested in the facts,'" he says. "Even if the data is only 80% correct, at least we know there's a process. Businesses have to instill a culture of asking for supporting data." A senior-level commitment to expanding the use of data at all levels of the organization is the healthy start the software needs to finish the job of democratizing big data.
- Second, access must be provided to everyone who has a use for data rather than locking it up with experts. “If you make it hard, people will not use data,” he says.
- Third, invest in cleaning and improving data quality. A single source of truth with common definitions is very important. “Data becomes worthless if people don’t trust it anymore,” Bendov says.
- Last, there is a place for data scientists if your organization is lucky enough to hire them. It's the data scientist's job to proactively mine for insights and patterns, and to seek out new data sources. The quest for new and valuable data sources should be based on a formally defined program, created at the executive level.
As in government, data science democracy doesn’t just happen with bountiful resources. It must also have a guiding philosophy and a set of governing principles in order to flourish.