Going back to the gist of my last post, one of the pillars that underpinned de-regulation was the idea that companies would behave ‘correctly’ and regulate themselves. The truth is that this worked, and still works, very well for 95% of companies, but there are always bad pennies committing fraud or simply being careless in their accounting practices. Thanks to a few well-known financial disasters, even before the global meltdown, the concept of re-regulation loomed large across many industries. Many sets of rules are now in place to bring governance to company business – some of the better known include Sarbanes-Oxley and Basel II and III, which have been around for a little while now. We might ask what they have in common, and the answer is that these and many more such initiatives demand that very accurate, accountable numbers be produced quickly from very complex underlying data – the need for Business Intelligence rears its head once again, and the term ‘Big Data’ can certainly be applied to some of these initiatives.
Re-regulation demands that some very complex numbers are delivered.
Throw into the pot that the data needed as often as not comes from tens or even hundreds of operational systems distributed across the world, and that some of these initiatives need very complex predictive modelling and detailed segmentation, and we see a new class of Big Data applications.
There is a lot of buzz these days about Big Data and Hadoop, and their potential for replacing the Enterprise Data Warehouse (EDW). The promise of Hadoop has been the ability to store and process massive amounts of data on commodity hardware that scales extremely well at very low cost. Hadoop is well suited to batch-oriented work, but not to OLTP workloads.
The logical question, then, is: do enterprises still need the EDW? Why not simply get rid of the expensive warehouse and deploy a Hadoop cluster with HBase and Hive? After all, you never hear about Google or Facebook using data warehouse systems from Oracle, Teradata or Greenplum.
Before we get into that, a little overview of how Hadoop stores data. Hadoop comprises two components: the Hadoop Distributed File System (HDFS) and the MapReduce engine. HDFS enables you to store all kinds of data (structured as well as unstructured) on commodity servers. Data is divided into blocks and distributed across data nodes. The data itself is processed using MapReduce programs, which are typically written in Java. Layers on top of HDFS such as HBase (a NoSQL database) and Hive (a SQL-like query engine) let end users query the data without writing MapReduce code by hand. In addition, BI reporting, visualization and analytical tools like Cognos, Business Objects, Tableau, SPSS and R can now connect to Hadoop/Hive.
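To make the MapReduce model concrete, here is a minimal word-count sketch in Python, in the style of a Hadoop Streaming job. In a real cluster the mapper and reducer would run as separate scripts reading stdin on different nodes; here the shuffle step is simulated with an in-process sort, purely as an illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in an input line,
    # as a Streaming mapper would write to stdout.
    for word in line.lower().split():
        yield (word, 1)

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # sorting here simulates that shuffle step.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data needs big clusters", "hadoop stores big data"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(pairs))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The point of the model is that both phases operate on independent chunks, which is what lets Hadoop parallelise the work across many commodity nodes.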
A traditional EDW stores structured data from OLTP and back-office ERP systems in a relational database using expensive storage arrays with RAID disks. Examples of this structured data might be your customer orders, data from your financial systems, sales orders, invoices and so on. Reporting tools like Cognos, Business Objects and SPSS are used to run reports and perform analysis on the data.
So are we ready to dump the EDW and move to Hadoop for all our warehouse needs? There are some things the EDW does very well that Hadoop is still not very good at:
Hadoop and HBase/Hive are still very IT-focused. They need people with a lot of expertise in writing MapReduce programs in Java, Pig and so on. The business users who actually need the data are not in a position to run ad-hoc queries and analytics easily without involving IT. Hadoop is still maturing and needs a lot of IT hand-holding to make it work.
The EDW is well suited to many common business processes, such as monitoring sales by geography, product or channel; extracting insight from customer surveys; and cost and profitability analysis. The data is loaded into pre-defined schemas/data marts, and business users can use familiar tools to perform analysis and run ad-hoc SQL queries.
Most EDWs come with pre-built adaptors for various ERP systems and databases. Companies have built complex ETL, data marts, analytics, reports and more on top of these warehouses. It would be extremely expensive, time-consuming and risky to recode all of that for a new Hadoop environment, and people with Hadoop/MapReduce expertise are not readily available and are in short supply.
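The “pre-defined schema plus ad-hoc SQL” pattern mentioned above can be sketched with a toy star schema. The table and column names here are hypothetical, and SQLite stands in for the warehouse database, purely to show the kind of query a business user runs against a sales-by-geography data mart.

```python
import sqlite3

# A toy star schema: one fact table and one dimension table.
# All names are invented for illustration; a real EDW schema
# would be far larger and populated by ETL jobs.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_geography (geo_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY,
                         geo_id INTEGER, amount REAL);
INSERT INTO dim_geography VALUES (1, 'EMEA'), (2, 'APAC');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 250.0), (3, 2, 75.0);
""")

# The kind of ad-hoc query a business user might run: sales by region.
rows = con.execute("""
    SELECT g.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_geography g ON f.geo_id = g.geo_id
    GROUP BY g.region ORDER BY g.region
""").fetchall()
print(rows)  # [('APAC', 75.0), ('EMEA', 350.0)]
```

Because the schema is fixed and well understood, familiar reporting tools can generate this sort of SQL automatically, which is exactly the ease of use Hadoop has lacked.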
Augment your EDW with Hadoop to add new capabilities and insight. For the next couple of years, as the Hadoop/Big Data landscape evolves, augment and enhance your EDW with a Hadoop/Big Data cluster as follows:
Continue to store summary structured data from your OLTP and back office systems into the EDW.
Store in Hadoop the unstructured data that does not fit nicely into “tables”. This means all the communication with your customers – phone logs, customer feedback, GPS locations, photos, tweets, emails, text messages and so on – can be kept in Hadoop, and kept far more cost-effectively than in the warehouse.
Correlate the data in your EDW with the data in your Hadoop cluster to get better insight into your customers, products, equipment and so on. You can now use this data for computation-intensive analytics such as clustering and targeting. Run ad-hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
Do not build Hadoop capabilities within your enterprise in a silo. Big Data/Hadoop technologies should work in tandem with, and extend the value of, your existing data warehouse and analytics technologies.
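The clustering point above can be sketched in miniature. This is a minimal k-means pass over hypothetical per-customer features – annual spend from the EDW joined with a support-call count mined from logs in Hadoop; the data, feature names and fixed starting centroids are all invented for illustration.

```python
def kmeans(points, centroids, iters=10):
    # Simple k-means: alternate between assigning points to their
    # nearest centroid and recomputing centroids as cluster means.
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            i = min(range(len(centroids)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else cen
                     for cl, cen in zip(clusters, centroids)]
    return centroids, clusters

# Hypothetical combined features: (annual spend, support calls).
customers = [(100.0, 1), (120.0, 2), (900.0, 30), (950.0, 28)]
centroids, clusters = kmeans(customers, centroids=[(0.0, 0.0), (1000.0, 50.0)])
print(len(clusters[0]), len(clusters[1]))  # 2 2
```

At real scale this assignment step is exactly the kind of embarrassingly parallel work that runs well as a MapReduce job over data sitting in HDFS.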
Data warehouse vendors are adding Hadoop and MapReduce capabilities to their offerings. When adding Hadoop capabilities, I would recommend going with a vendor that supports and enhances the open source Hadoop distribution.
In a few years, as newer and better analytical and reporting capabilities develop on top of Hadoop, it may eventually become a good platform for all your warehousing needs. Solutions like IBM’s Big SQL and Cloudera’s Impala will make it easier for business users to move more of their warehousing needs to Hadoop by improving query performance and SQL capabilities.