The Hadoop Data Warehouse!

The Hadoop Data Warehouse – Wake up call for Traditional EDW!

 

The era of ‘big data’ presents new challenges to businesses. Incoming data is exploding in complexity, variety, speed and volume, while legacy tools have not kept pace. In recent years a new tool, Apache Hadoop, has appeared on the scene. With the rise of big data has come the rise of the analytic database platform. A few years ago, a company could leverage a traditional DBMS for a data warehouse. However, the EDW was built in a time when databases rarely exceeded a few TB in size.

 

According to a new market report published by Transparency Market Research (http://www.transparencymarketresearch.com), “Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends, and Forecast, 2012 - 2018,” the global Hadoop market was worth USD 1.5 billion in 2012 and is expected to reach USD 20.9 billion in 2018, growing at a CAGR of 54.7% from 2012 to 2018. North America was the largest market for Hadoop in 2012 due to the huge amounts of data generated in the region and the growing need to store and process the accumulated data.
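
As a rough sanity check on those figures, the growth rate implied by the two rounded market sizes can be recomputed with the standard CAGR formula. This is only a sketch: the function name `cagr` is made up for illustration, and because the inputs are the rounded values quoted above, the result lands at roughly 55% rather than exactly on the report's 54.7%.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Rounded figures quoted from the report (USD billions); 2012 to 2018 is 6 years.
rate = cagr(1.5, 20.9, 2018 - 2012)
print(f"Implied CAGR: {rate:.1%}")
```

The small gap to the published figure is presumably down to rounding in the quoted market sizes.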

 

A big data solution is not a single product but an architecture suited to today’s business needs. IBM is moving from a consolidated architecture to a zone architecture. The zone architecture is much more modular: instead of one large data repository, data can be stored and analyzed in smaller, more specialized systems that are built for specific functions.

 IBM Big Data

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. The downside of many EDWs based on the RDBMS approach is that they become rigid, and it takes weeks or months for the IT/IS organization to add a new data source or change existing rules. Hadoop is a popular open-source MapReduce implementation which many companies now use to store and process extremely large data sets on commodity hardware. In the new big data framework, you can do truly agile and iterative development of analytical and business intelligence solutions. Hadoop allows storing data on a massive scale at low cost, and it handles variety, complexity and change much more easily because you don’t have to conform all the data to a predefined schema such as a star schema or snowflake schema.
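
The idea of not conforming data to a predefined schema can be sketched in a few lines of plain Python. This is an illustration only, not Hadoop code: the records, field names and the helper `read_with_schema` are all hypothetical. Raw records are kept as they arrived, and each query decides which fields it cares about when it reads them.

```python
import json

# Raw events stored exactly as they arrived; the fields vary between records.
raw_lines = [
    '{"customer": "c1", "type": "order", "amount": 120.0}',
    '{"customer": "c2", "type": "tweet", "text": "great service"}',
    '{"customer": "c1", "type": "order", "amount": 80.0, "channel": "web"}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto the requested fields; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


# The "schema" is chosen at read time for one question: order totals per customer.
totals = {}
for row in read_with_schema(raw_lines, ["customer", "type", "amount"]):
    if row["type"] == "order":
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]

print(totals)
```

A new kind of record (the tweet, or the extra `channel` field) can be stored without any change to existing tables or load rules; only the queries that need it have to know about it.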

The change to augment Traditional EDW is happening gradually in the industry as big data technology matures and more solutions get added to fill the gaps that exist today. There are now several Hadoop appliances on the market.

 

 Big data tools

 

Make sure you have all the tools to do the job: Solving big data analytic challenges requires a complete ecosystem

 

Making the most of what you have

Many, many companies have built very sophisticated Data Warehouses. They should start using what they’ve got a little more effectively before moving on to tougher things!

So there I was in an ICA store in Stockholm, with a huge trolley of goods for the weekend and dead pleased that I had eventually got to the front of the queue. It was Saturday, and everyone was in a hurry to get home after queuing for ages on the Stockholm motorways. My partner was diligently packing the goods because it was my turn to pay, so imagine my horror when my debit card was rejected, not once but three times. Crikey, everyone was looking at me as if I was some sort of crook. Well, luckily my partner’s AMEX card came to the rescue, but imagine my concern. I kept thinking of the £20k balance in my account and wondering what had happened to it.

In a panic on the way home I missed an incoming SMS but got the second when I got back, and was horrified to see the number of my bank come up. Well, I assumed this, as in fact it was actually some random call centre somewhere on planet Earth. I answered it (at my cost, as I was roaming) to be told that this was a routine security check because the behavior on my card had proved concerning (to whom and why is a mystery, as you will see). I was asked to confirm the last few transactions on my card to verify that these were correct and not fraudulent. They were:

  • Currency exchange (at Heathrow)
  • A purchase at Heathrow of around £30 (two bottles of champers)
  • Purchase of an airline ticket, UK to Sweden

Well I confirmed all of this and was simply informed that my card would now start working again – no explanation, no nothing – unbelievable. My card had been refused at a grocery but imagine what could have happened!

Now you might ask yourself, why is this guy moaning about this? Well, I’m moaning because for the two years previous to this incident I had been travelling to Sweden at least once every six weeks. I invariably change money, always buy champagne and always buy an air ticket, so why did my bank see this as unusual? Why weren’t they using some system to check that this was in fact quite a usual style of activity, with nothing unusual here? Why has this bank got the authority to arbitrarily stop me using my own money, and in such a preposterous manner, no less?

Well, the bank I am talking about was a pioneer in Data Warehousing so I’m just wondering why this event happened when I know that they diligently record all my transactions and store them in a DW whilst apparently failing to understand their meaning. No need for Hadoop here!!!!

Natural Selection in Business – Does using Big Data provide a sustainable advantage?

 

In nature, when resources are plentiful, species live together quite amicably. Even predator and prey reach a satisfactory balance whereby there is always food for both. However, when resources are scarce, species that were once happy together often turn into bitter enemies. The strong, big guys fight each other, determined to completely obliterate their competitor, often resulting in mortal damage being inflicted on both. Whilst this is happening, the intelligent guys, who are inevitably smaller and physically weaker, get to work. Firstly, they take advantage of the preoccupation of the others by amassing their basic requirements quickly. They then diversify and find a niche for themselves, knowing that competition will come, but being determined to foresee it and avoid it where possible.

 

Most people accept that this is the way of the natural world and business dynamics tend to follow the same basic rules. Intelligent companies will not measure themselves by numbers of employees, amount of real estate or revenue alone, but will instead increasingly judge themselves on different values:

 

  • The average lifetime value of their key customers
  • The elapsed time for a new customer to become profitable
  • Public image
  • Customer retention
  • Knowledge, expertise and willingness of the work force
  • Brand awareness and flexibility
  • Environmental friendliness
  • Efficient and focused work practices
  • Customer satisfaction

 

Note: be aware that the little guys don’t always have to take on the big guys directly, and in fact it’s usually best not to. Those of you who know the story of David and Goliath should be clear that this was not a simple big guy versus little guy competition in which David shows the world not to be afraid of a ‘larger’ opponent. The fact is that Goliath, although big, had no noticeable weaponry, whilst David had the equivalent, in those days, of a sawn-off shotgun. My guess is that if the two guys had met with equal weapons the result would have been rather less romantic, but David showed some real common sense here. He knew that if he wasn’t prepared for the fight he had no chance, so he fought the battle very much on his own terms.

 

Will Hadoop replace or augment your Enterprise Data Warehouse?

There is a lot of buzz about Big Data and Hadoop these days and their potential for replacing the Enterprise Data Warehouse (EDW). The promise of Hadoop has been the ability to store and process massive amounts of data using commodity hardware that scales extremely well and at very low cost. Hadoop is good for batch-oriented work and not really good at OLTP workloads.

The logical question then is: do enterprises still need the EDW? Why not simply get rid of the expensive warehouse and deploy a Hadoop cluster with HBase and Hive? After all, you never hear about Google or Facebook using data warehouse systems from Oracle, Teradata or Greenplum.

Before we get into that, a little overview of how Hadoop stores data. Hadoop comprises two components: the Hadoop Distributed File System (HDFS) and the MapReduce engine. HDFS enables you to store all kinds of data (structured as well as unstructured) on commodity servers. Data is divided into blocks and distributed across data nodes. The data itself is processed by MapReduce programs, which are typically written in Java. On top of HDFS storage, HBase provides a NoSQL database and Hive provides a query layer that lets end users work in a SQL-like language. In addition, BI reporting, visualization and analytical tools like Cognos, Business Objects, Tableau, SPSS and R can now connect to Hadoop/Hive.
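
To make the split-then-process idea concrete, here is a minimal single-process sketch of the MapReduce pattern as a word count. It does not use any Hadoop API and is written in Python rather than Java; the function names and the tiny `block_size` are assumptions for illustration (real HDFS blocks are large byte ranges spread over many machines, not two lines of text in a list).

```python
from collections import defaultdict


def split_into_blocks(lines, block_size):
    """Mimic HDFS-style splitting by cutting the input into fixed-size chunks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["Hadoop stores data", "Hadoop processes data", "EDW stores structured data"]
blocks = split_into_blocks(lines, block_size=2)

# Each block could be mapped on a different node; here they are mapped one after another.
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each block is mapped independently, adding more nodes lets more blocks be processed at the same time, which is where the scaling on commodity hardware comes from.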

A traditional EDW stores structured data from OLTP and back office ERP systems in a relational database using expensive storage arrays with RAID disks. Examples of this structured data may be your customer orders, data from your financial systems, sales orders, invoices, etc. Reporting tools like Cognos, Business Objects, SPSS, etc. are used to run reports and perform analysis on the data.

So are we ready to dump the EDW and move to Hadoop for all our warehouse needs? There are some things the EDW does very well that Hadoop is still not very good at:

  • Hadoop and HBase/Hive are all still very IT focused. They need people with a lot of expertise in writing MapReduce programs in Java, Pig, etc. Business users who actually need the data are not in a position to run ad-hoc queries and analytics easily without involving IT. Hadoop is still maturing and needs a lot of IT hand-holding to make it work.
  • The EDW is well suited to many common business processes, such as monitoring sales by geography, product or channel; extracting insight from customer surveys; and running cost and profitability analyses. The data is loaded into pre-defined schemas/data marts, and business users can use familiar tools to perform analysis and run ad-hoc SQL queries.
  • Most EDWs come with pre-built adaptors for various ERP systems and databases. Companies have built complex ETL, data marts, analytics, reports, etc. on top of these warehouses. It would be extremely expensive, time consuming and risky to recode that into a new Hadoop environment. People with Hadoop/MapReduce expertise are in short supply.

Augment your EDW with Hadoop to add new capabilities and insight. For the next couple of years, as the Hadoop/Big Data landscape evolves, augment and enhance your EDW with a Hadoop/Big Data cluster as follows:

  • Continue to store summary structured data from your OLTP and back office systems in the EDW.
  • Store unstructured data that does not fit nicely into “tables” in Hadoop. This means all the communication with your customers, from phone logs, customer feedback, GPS locations, photos, tweets, emails, text messages, etc., can be stored in Hadoop. You can store this a lot more cost effectively in Hadoop.
  • Correlate data in your EDW with the data in your Hadoop cluster to get better insight about your customers, products, equipment, etc. You can now use this data for analytics that are computation intensive, like clustering and targeting. Run ad-hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
  • Do not build Hadoop capabilities within your enterprise in a silo. Big Data/Hadoop technologies should work in tandem with and extend the value of your existing data warehouse and analytics technologies.
  • Data Warehouse vendors are adding capabilities of Hadoop and MapReduce into their offerings. When adding Hadoop capabilities, I would recommend going with a vendor that supports and enhances the open source Hadoop distribution.
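
The correlation step in the list above could look something like the following sketch. Everything in it is hypothetical: the customer IDs, the field names and the `correlate` helper stand in for whatever export or connector a real EDW and Hadoop cluster would provide. The point is only that summary rows from the warehouse and raw interaction records from Hadoop can be brought together on a shared key.

```python
from collections import Counter

# Hypothetical summary rows, as they might be exported from the EDW.
edw_customers = {
    "c1": {"segment": "premium", "yearly_revenue": 12000},
    "c2": {"segment": "standard", "yearly_revenue": 900},
}

# Hypothetical raw interaction records, as they might sit in the Hadoop cluster.
hadoop_interactions = [
    {"customer": "c1", "channel": "email", "text": "invoice question"},
    {"customer": "c2", "channel": "tweet", "text": "delivery was late"},
    {"customer": "c2", "channel": "phone", "text": "complaint about delivery"},
]


def correlate(customers, interactions):
    """Attach the number of recorded interactions to each EDW customer summary."""
    contacts = Counter(item["customer"] for item in interactions)
    return {
        customer_id: {**summary, "contacts": contacts.get(customer_id, 0)}
        for customer_id, summary in customers.items()
    }


combined = correlate(edw_customers, hadoop_interactions)
for customer_id, row in combined.items():
    print(customer_id, row)
```

In this toy data the low-revenue customer is the one generating most of the contact, which is the kind of insight neither system would show on its own.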

In a few years, as newer and better analytical and reporting capabilities develop on top of Hadoop, it may eventually be a good platform for all your warehousing needs. Solutions like IBM’s Big SQL and Cloudera’s Impala will make it easier for business users to move more of their warehousing needs to Hadoop by improving query performance and SQL capabilities.