Will Hadoop replace or augment your Enterprise Data Warehouse?
There is a lot of buzz these days about Big Data and Hadoop and their potential for replacing the Enterprise Data Warehouse (EDW). The promise of Hadoop is the ability to store and process massive amounts of data on commodity hardware that scales extremely well at very low cost. Hadoop is good for batch-oriented work and not really good at OLTP workloads.
The logical question, then, is whether enterprises still need the EDW. Why not simply get rid of the expensive warehouse and deploy a Hadoop cluster with HBase and Hive? After all, you never hear about Google or Facebook using data warehouse systems from Oracle, Teradata or Greenplum.
Before we get into that, here is a brief overview of how Hadoop stores data. Hadoop comprises two components: the Hadoop Distributed File System (HDFS) and the MapReduce engine. HDFS enables you to store all kinds of data (structured as well as unstructured) on commodity servers. Data is divided into blocks and distributed across data nodes. The data itself is processed by MapReduce programs, which are typically written in Java. Layers on top of HDFS make the data easier to reach: HBase is a NoSQL database for random reads and writes, and Hive lets end users query the data with a SQL-like language (HiveQL). In addition, BI reporting, visualization and analytical tools like Cognos, Business Objects, Tableau, SPSS and R can now connect to Hadoop through Hive.
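To make the block-and-MapReduce idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop at all; it only imitates the flow on a single machine. The function names (split_into_blocks, map_phase, shuffle, reduce_phase, word_count) and the sample log lines are made up for illustration, and real HDFS blocks are split by size in bytes, not by line count.

```python
from collections import defaultdict


def split_into_blocks(records, block_size):
    """Divide records into fixed-size blocks, loosely imitating how HDFS splits a file."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text lines."""
    for line in block:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, block_size=2):
    """Run the whole pipeline: split, map each block, shuffle, reduce."""
    blocks = split_into_blocks(lines, block_size)
    # On a real cluster each block would be mapped on the node that stores it.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical call-centre log lines, used only as sample input.
    logs = [
        "customer called about late delivery",
        "customer asked about invoice",
        "delivery arrived late",
    ]
    print(word_count(logs))
```

On a cluster the map step runs in parallel on the nodes that hold each block, which is what makes the approach a natural fit for batch work rather than OLTP.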
A traditional EDW stores structured data from OLTP and back office ERP systems in a relational database, using expensive storage arrays with RAID disks. Examples of this structured data are customer orders, data from your financial systems, sales orders and invoices. Reporting tools like Cognos, Business Objects and SPSS are used to run reports and perform analysis on the data.
So are we ready to dump the EDW and move to Hadoop for all our warehouse needs? There are some things the EDW does very well that Hadoop is still not very good at:
- Hadoop and HBase/Hive are all still very IT focused. They need people with a lot of expertise in writing MapReduce programs in Java, Pig scripts and the like. Business users who actually need the data are not in a position to run ad-hoc queries and analytics easily without involving IT. Hadoop is still maturing and needs a lot of IT hand-holding to make it work.
- The EDW is well suited for many common business processes, such as monitoring sales by geography, product or channel, extracting insight from customer surveys, and running cost and profitability analyses. The data is loaded into pre-defined schemas and data marts, and business users can use familiar tools to perform analysis and run ad-hoc SQL queries.
- Most EDWs come with pre-built adapters for various ERP systems and databases. Companies have built complex ETL, data marts, analytics and reports on top of these warehouses. It would be extremely expensive, time consuming and risky to recode all of that for a new Hadoop environment. People with Hadoop/MapReduce expertise are in short supply.
Augment your EDW with Hadoop to add new capabilities and insight. For the next couple of years, as the Hadoop/Big Data landscape evolves, augment and enhance your EDW with a Hadoop/Big Data cluster as follows:
- Continue to store summary structured data from your OLTP and back office systems in the EDW.
- Store unstructured data that does not fit nicely into "tables" in Hadoop. This means all the communication with your customers, such as phone logs, customer feedback, GPS locations, photos, tweets, emails and text messages, can be stored in Hadoop. You can store this data a lot more cost effectively in Hadoop than in the EDW.
- Correlate data in your EDW with the data in your Hadoop cluster to get better insight about your customers, products, equipment and so on. You can now use this data for computation-intensive analytics like clustering and targeting. Run ad-hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
- Do not build Hadoop capabilities within your enterprise in a silo. Big Data/Hadoop technologies should work in tandem with and extend the value of your existing data warehouse and analytics technologies.
- Data warehouse vendors are adding Hadoop and MapReduce capabilities to their offerings. When adding Hadoop capabilities, I would recommend going with a vendor that supports and enhances the open source Hadoop distribution.
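As a rough illustration of the correlation step in the list above, the sketch below joins summary rows of the kind an EDW might hold with counts derived from raw feedback events of the kind you might keep in Hadoop. It is plain Python on in-memory data, not a real EDW or Hadoop integration, and every field name (customer_id, total_orders, sentiment, complaints) is a hypothetical chosen for the example.

```python
from collections import Counter


def summarize_feedback(events):
    """Count negative feedback events per customer.

    Stands in for the output of a batch job over raw data stored in Hadoop.
    """
    return Counter(
        event["customer_id"] for event in events if event["sentiment"] == "negative"
    )


def correlate(edw_rows, complaint_counts):
    """Join EDW summary rows with complaint counts on customer_id."""
    return [
        {**row, "complaints": complaint_counts.get(row["customer_id"], 0)}
        for row in edw_rows
    ]


if __name__ == "__main__":
    # Hypothetical summary rows, as they might come out of the warehouse.
    edw_rows = [
        {"customer_id": 1, "total_orders": 5},
        {"customer_id": 2, "total_orders": 1},
    ]
    # Hypothetical raw feedback events, as they might be stored in Hadoop.
    events = [
        {"customer_id": 1, "sentiment": "negative"},
        {"customer_id": 1, "sentiment": "positive"},
        {"customer_id": 1, "sentiment": "negative"},
    ]
    for row in correlate(edw_rows, summarize_feedback(events)):
        print(row)
```

The point of the design is that the heavy lifting (scanning raw events) happens on the Hadoop side, and only a small per-customer summary needs to meet the warehouse data.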
In a few years, as newer and better analytical and reporting capabilities develop on top of Hadoop, it may eventually become a good platform for all your warehousing needs. Solutions like IBM's Big SQL and Cloudera's Impala will make it easier for business users to move more of their warehousing needs to Hadoop by improving query performance and SQL capabilities.