Running Hadoop in the Cloud

With the growing popularity of cloud computing, enterprises are seriously looking at moving workloads to the cloud. Issues around multi-tenancy, data security, software licensing, data integration and the like have to be considered before enterprises can make this shift, and even then not all workloads can be easily moved. In recent years, Hadoop has gained a lot of interest as a big data technology that can help enterprises cost-effectively store and analyze massive amounts of data. As enterprises start evaluating Hadoop, one of the questions frequently asked is “Can we run Hadoop in the cloud?”

To answer this, it is important to understand the following key aspects of the Hadoop infrastructure:

1. Hadoop runs best on physical servers. A Hadoop cluster comprises a master node called the Name Node and multiple worker nodes called Data Nodes. These data nodes are separate physical servers with dedicated storage (like your PC hard drive), rather than common shared storage.

2. Hadoop is “rack aware” – Hadoop data nodes (servers) are installed in racks. Each rack typically contains multiple data node servers with a top-of-rack switch for network communication. “Rack awareness” means that the Name Node knows where each data node server is and in which rack. This ensures that Hadoop can write data to three (by default) different data nodes that are not all on the same physical rack, preventing data loss due to data node or rack failure. When a Map-Reduce job needs access to data blocks, the Name Node ensures that the job is assigned to the closest data node that contains the data, thereby reducing network traffic. The Hadoop system admin manually maintains this rack awareness information for the cluster. Since a Hadoop cluster generates a lot of network traffic, it is recommended that the nodes be isolated into their own network instead of using a VLAN (refer to Brad Hedlund’s article on Hadoop rack awareness).
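As a rough illustration of the mechanism, rack awareness is usually fed to the Name Node through a small topology script (configured via the `topology.script.file.name` property in Hadoop 1.x) that maps node addresses to rack paths. A minimal sketch – the subnet-to-rack mapping here is purely hypothetical:

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop rack-topology script. Hadoop passes one
# or more node IPs/hostnames as arguments and expects one rack path
# per node on stdout. The subnet-to-rack table is illustrative only.
import sys

RACK_BY_SUBNET = {
    "10.1.1": "/dc1/rack1",  # hypothetical rack assignments
    "10.1.2": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(node):
    """Return the rack path for a node based on its /24 subnet."""
    subnet = ".".join(node.split(".")[:3])
    return RACK_BY_SUBNET.get(subnet, DEFAULT_RACK)

if __name__ == "__main__":
    # Hadoop expects the rack paths echoed back in argument order.
    print(" ".join(rack_for(n) for n in sys.argv[1:]))
```

In a public cloud you generally cannot write a meaningful version of this script, because you do not know which physical rack a VM lands on – which is exactly the limitation discussed below.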

Options for running Hadoop in the Cloud

  • Hadoop as a Service in the public cloud: Hadoop distributions (Cloudera CDH, IBM BigInsights, MapR, Hortonworks) can be launched and run on public clouds like AWS, Rackspace, MS Azure and IBM SmartCloud that offer Infrastructure as a Service (IaaS). In a public cloud you share the infrastructure with other customers, so you have very limited control over which server a VM is spun up on and what other VMs (yours or other customers’) are running on the same physical server. There is no “rack awareness” that you can access and configure in the Name Node, and the performance and availability of the cluster may be affected because you are running on VMs. Enterprises can use and pay for these Hadoop clusters on demand. There are options for creating your own private network using a VLAN, but the performance recommendation for Hadoop is a separate isolated network because of the high network traffic between nodes. In all cases except AWS EMR, you have to install and configure the Hadoop cluster on the cloud yourself.
  • Map-Reduce as a Service: Amazon’s EMR (Elastic Map Reduce) provides a quick and easy way to run Map-Reduce jobs without having to install a Hadoop cluster on its cloud. This can be a good way to develop Hadoop programming expertise within your organization, or a good fit if your workloads only need Map-Reduce jobs.
  • Hadoop on S3: You can run Hadoop using Amazon’s S3 instead of HDFS to store data. S3 is slower than HDFS, but it provides other features such as bucket versioning and elasticity, as well as its own data-loss protection schemes. This may be an option if your business data is already stored in S3 (e.g. Netflix runs a Hadoop cluster backed by S3).
  • Hadoop in a private cloud: The same set of considerations applies to a private cloud deployment of Hadoop. However, in a private cloud you may have more control over your infrastructure, enabling you to provision bare-metal servers or create a separate isolated network for your Hadoop clusters. Some private cloud solutions also provide a PaaS layer with pre-built patterns for deploying Hadoop clusters easily (e.g. IBM offers patterns for deploying BigInsights on its SmartCloud Enterprise). You also have the option of deploying a “cloud in a box” like the IBM PureData/PureApplication systems, which are Hadoop-ready, in your own data center. The big reasons for a private cloud deployment are data security and access control for your data, as well as better visibility into and control of your Hadoop infrastructure.
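To build the Map-Reduce expertise the EMR option above suggests, the programming model can be prototyped locally before paying for any cluster. A hedged sketch of the classic word count in the map → shuffle → reduce shape – a real Hadoop Streaming or EMR job would read stdin and run distributed across data nodes, but the phases are the same:

```python
# Local sketch of the map -> shuffle -> reduce phases that a
# Hadoop/EMR job distributes across data nodes. Word count is the
# canonical example; the input lines here are illustrative.
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, as a Streaming mapper would print them."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Group values by key; Hadoop performs this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word, as the reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}

if __name__ == "__main__":
    sample = ["big data big clusters", "Big insight"]
    print(reduce_phase(shuffle_phase(map_phase(sample))))
```

Once the logic works locally, the same mapper/reducer pair can be submitted to EMR as a Streaming job without managing any Hadoop installation.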

Key things to consider before deploying a Hadoop cluster in the cloud:

  • Enterprises should evaluate their security criteria for deploying workloads in a public cloud before moving any data into the Hadoop cluster. Hadoop cluster security is very limited; there is no native data security that will satisfy enterprise requirements around SOX, PII, HIPAA, etc.
  • Evaluate the Hadoop distributions you want to use against the operating system standards of your enterprise. Preferably go with distributions that stay close to the open source Apache distribution. Hadoop distributions typically run on Linux; Hortonworks provides a Hadoop distribution for Windows that is currently available on the MS Azure cloud.
  • When using AWS, be aware that using Hadoop with S3 ties you to Amazon’s cloud. For open standards, look at OpenStack-based cloud providers like Rackspace, IBM SmartCloud, HP, etc.
  • Look at the entire Hadoop ecosystem, not just the basic Hadoop cluster. The value of Hadoop lies in the analytics and data visualization that can be applied to large data sets. Ensure that the tools you want to use for analytics (e.g. Tableau, R, SPSS) are available on the cloud provider.
  • Understand where the data to be loaded into Hadoop comes from. Will you load data from internal systems that are not on the cloud, or is the data already in the cloud? Most public cloud providers charge data transmission fees if you move data back and forth.
  • Hadoop clusters on VMs will be slow, though you may be able to use them for dev and test clusters. VMware’s Project Serengeti is trying to address deploying Hadoop clusters on virtual machines without a performance hit. However, with this approach you will be tied to VMware’s hypervisor, which should be a criterion to consider when selecting a cloud provider.

I wonder whether exploiting Big Data will enable big companies to grow even bigger, or whether it will enable smaller companies to compete with them and level the playing field.

As companies move forward, whilst it will undoubtedly remain an advantage to be rich and powerful, size in itself may not be such an important plus point. Size most certainly brings coverage and reach, but it also breeds cost and inflexibility, and we will instead see the proliferation of many smaller companies that have replaced the advantages of size with the advantages of intelligence.


What will intelligence bring to a company that might give it sustainable market value?

Well it might enable it to:


  • Sell more diverse products to its customer base thereby increasing margin and perhaps even loyalty.
  • Acquire only those customers who will likely be low risk and high value.
  • Only execute marketing campaigns in geographies where the ability to provide service and product actually exists.
  • Remove the need for inventory completely by direct collaboration with suppliers.
  • Reduce the cash to cash cycle by getting customers to pay for goods prior to manufacturing them.
  • Eliminate the need for a direct sales force altogether.
  • Make fraud so unprofitable for the fraudster that they give up.


So what is the major business driver that is set to change our ways of doing business? It can be summed up in one phrase – natural selection.


Note: Now I fancy myself as something of a biologist and there are several points in Darwin’s theories of evolution that concern me but maybe we can save that discussion till later?



Our Favorite 40+ Big Data Use Cases. What’s Yours?

One of the key best practices for successfully implementing a big data analytics solution is to validate the business use case for big data. This helps the organization with two important aspects of success:

1. Keeping the scope limited

2. Helping to measure the success of a solution that addresses a key business problem

If the same data set addresses multiple use cases, an organization may need to prioritize its use cases and apply an iterative, phased approach. It’s the theory of getting the biggest bang for the buck, both tactically and strategically. Think big and act small!

While there are extensive industry-specific use cases, here are some for handy reference:

EDW Use Cases

  • Augment EDW by offloading processing and storage
  • Support as preprocessing hub before getting to EDW

Retail/Consumer Use Cases

Financial Services Use Cases

  • Compliance and regulatory reporting
  • Risk analysis and management
  • Fraud detection and security analytics
  • CRM and customer loyalty programs
  • Credit risk, scoring and analysis
  • High speed arbitrage trading
  • Trade surveillance
  • Abnormal trading pattern analysis

Web & Digital Media Services Use Cases

  • Large-scale clickstream analytics
  • Ad targeting, analysis, forecasting and optimization
  • Abuse and click-fraud prevention
  • Social graph analysis and profile segmentation
  • Campaign management and loyalty programs

Health & Life Sciences Use Cases

  • Clinical trials data analysis
  • Disease pattern analysis
  • Campaign and sales program optimization
  • Patient care quality and program analysis
  • Medical device and pharma supply-chain management
  • Drug discovery and development analysis

Telecommunications Use Cases

  • Revenue assurance and price optimization
  • Customer churn prevention
  • Campaign management and customer loyalty
  • Call detail record (CDR) analysis
  • Network performance and optimization
  • Mobile user location analysis

Government Use Cases

  • Fraud detection
  • Threat detection
  • Cybersecurity
  • Compliance and regulatory analysis

New Application Use Cases

  • Online dating
  • Social gaming

Fraud Use Cases

  • Credit and debit payment card fraud
  • Deposit account fraud
  • Technical fraud and bad debt
  • Healthcare fraud
  • Medicaid and Medicare fraud
  • Property and casualty (P&C) insurance fraud
  • Workers’ compensation fraud

E-Commerce and Customer Service Use Cases

  • Cross-channel analytics
  • Event analytics
  • Recommendation engines using predictive analytics
  • Right offer at the right time
  • Next best offer or next best action

These are some of my favorites, the ones I have come across most often. Please add your favorites in the comment section. I would like to know from readers what they are seeing in their organizations.


Invitation to The Big Data Institute (TBDI) Partnership program

We are contacting you as a follower of the TBDI Blog and LinkedIn group to explore your organization’s interest in becoming a Partner with The Big Data Institute (TBDI).

TBDI is a California-registered non-profit organization for Big Data Analytics and Data Science executives and professionals worldwide. It offers Free and Premium memberships and provides access to sponsored content such as research, Webinars, and white papers.

By being a corporate partner, your organization has the opportunity to participate in this leading Big Data Analytics & Data Science organization worldwide. As a TBDI Partner, you join an exclusive group of companies that share the TBDI commitment to quality education, content, and knowledge transfer to Big Data Analytics and Data Science professionals worldwide. TBDI Partners receive special benefits for maximum exposure to quality audiences. This includes exclusive exposure at our U.S., EMEA, AsiaPacific events, on our highly trafficked website, and within our well-respected publications.

Partners have access to proven marketing solutions that increase brand awareness, deliver superior lead generation, and ensure Big Data Analytics and Data Science professionals worldwide are familiar with your products and services. Becoming a TBDI Partner provides a cost-effective way to participate in a variety of marketing opportunities, including events, online exposure, research, Webinars, and publications.

I would like to recommend that your organization become a (Titanium, Platinum, Gold or Silver) Partner and receive the following benefits:

Print and Online Exposure
• Exclusive! Partner logo for your website, collateral, and journal.
• Your own content page in the Partner area of
• Two white papers featured in the TBDI White Paper Library for a three-month period (includes lead generation)
• Your logo displayed in an exclusive full-page Partner ad in the quarterly publication – The Big Data & Data Science Journal
• 5% discount on publication advertising and sponsorships, research, and Webinars
• Inclusion of links to all related Partner content on
– Sponsored Webinars
– Checklist Reports
– Currently listed white papers
– Solutions Gateways
– Videos

Professional Development
• TBDI Enterprise Team Membership for up to 10 people in your organization

One More Benefit
• Access to our Social Media Groups – LinkedIn, Twitter, Facebook, Google+.

To become a TBDI Partner, please contact us at Should you need more information than is presented, please feel free to contact us and we will assist you in any way we can.

Best regards,


A new type of Blog? What Big Data might mean to the business

Hi, it’s a great pleasure to be posting my first entry on the TBDI blog. This won’t be a collection of fairly random observations or comments, and it won’t be too technical because, in truth, there are just too many such blogs around today. Instead I’m going to examine data issues very much from a business perspective.


In the last couple of centuries or so there has been a prescribed order in the world of business. Big subsumed small, the powerful overcame the weak, and the rich made times very tough for the poor. In broad terms, the outcome has been a world dominated by old, traditional companies that have become rich and powerful whilst amassing unbelievable wealth. This situation has been fine for many, many years, but as we shall see, what was fine before may be unsustainable now.


This will be my theme for a while, during which I’ll examine the role Big Data might take in shaping business, based on many lessons I hoped we had learned over the last 20 years or so. Next entry… next week sometime; I’m going for a schedule of one per week!

Jon Page, Thought Leader EMEA, TBDI

The TBDI Definition of Big Data

Big Data is a term applied to voluminous data objects that are varied in nature – structured, semi-structured or unstructured, from sources internal or external to an organization, and generated at a high velocity with an uncertain pattern – that do not fit neatly into traditional, structured, relational data stores, and that require a sophisticated information ecosystem with a high-performance computing platform and analytical capabilities to capture, process, transform, discover and derive business insights and value within a reasonable elapsed time.

Will Hadoop replace or augment your Enterprise Data Warehouse?

There is a lot of buzz about Big Data and Hadoop these days, and about Hadoop’s potential for replacing the Enterprise Data Warehouse (EDW). The promise of Hadoop has been the ability to store and process massive amounts of data using commodity hardware that scales extremely well at very low cost. Hadoop is good for batch-oriented work, not OLTP workloads.

The logical question, then, is whether enterprises still need the EDW. Why not simply get rid of the expensive warehouse and deploy a Hadoop cluster with HBase and Hive? After all, you never hear about Google or Facebook using data warehouse systems from Oracle, Teradata or Greenplum.

Before we get into that, a little background on how Hadoop stores data. Hadoop comprises two components: the Hadoop Distributed File System (HDFS) and the Map-Reduce engine. HDFS enables you to store all kinds of data (structured as well as unstructured) on commodity servers. Data is divided into blocks and distributed across data nodes, and the data itself is processed by Map-Reduce programs, which are typically written in Java. Layers on top of HDFS such as HBase (a NoSQL database) and Hive let end users work with SQL-like languages. In addition, BI reporting, visualization and analytical tools like Cognos, Business Objects, Tableau, SPSS and R can now connect to Hadoop/Hive.
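The block storage just described can be sketched in miniature. The toy code below splits data into blocks and picks replica locations spanning more than one rack; the tiny block size and node layout are illustrative only (real HDFS blocks default to 64 MB in Hadoop 1.x, and the real placement policy also prefers a node local to the writer):

```python
# Toy sketch of HDFS block splitting and rack-aware replica placement.
# The 8-byte block size and the node/rack layout are illustrative;
# real HDFS uses 64 MB blocks (Hadoop 1.x default) and a richer policy.

BLOCK_SIZE = 8  # bytes, for illustration only

NODES = {"dn1": "rack1", "dn2": "rack1",  # hypothetical cluster layout
         "dn3": "rack2", "dn4": "rack2"}

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks, as HDFS does to files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(nodes, replication=3):
    """Pick `replication` nodes spanning at least two racks, mimicking
    the Name Node's rack-aware placement in simplified form."""
    chosen, racks = [], set()
    for node, rack in nodes.items():     # first pass: one node per rack
        if rack not in racks:
            chosen.append(node)
            racks.add(rack)
        if len(chosen) == replication:
            return chosen
    for node in nodes:                   # fill any remaining slots
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication:
            break
    return chosen

if __name__ == "__main__":
    for block in split_into_blocks(b"hello hadoop world"):
        print(block, "->", place_replicas(NODES))
```

The point of the sketch is the shape of the design: the Name Node tracks only block-to-node metadata, while the blocks themselves live on the data nodes’ local disks.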

A traditional EDW stores structured data from OLTP and back-office ERP systems in a relational database using expensive storage arrays with RAID disks. Examples of this structured data are customer orders, data from your financial systems, sales orders, invoices, etc. Reporting tools like Cognos, Business Objects and SPSS are used to run reports and perform analysis on the data.

So are we ready to dump the EDW and move to Hadoop for all our warehouse needs? There are some things the EDW does very well that Hadoop is still not very good at:

  • Hadoop, HBase and Hive are all still very IT-focused. They need people with a lot of expertise in writing Map-Reduce programs in Java, Pig, etc. The business users who actually need the data are not in a position to run ad-hoc queries and analytics easily without involving IT. Hadoop is still maturing and needs a lot of IT hand-holding to make it work.
  • The EDW is well suited to many common business processes, such as monitoring sales by geography, product or channel; extracting insight from customer surveys; and cost and profitability analyses. The data is loaded into pre-defined schemas/data marts, and business users can use familiar tools to perform analysis and run ad-hoc SQL queries.
  • Most EDWs come with pre-built adaptors for various ERP systems and databases. Companies have built complex ETL, data marts, analytics, reports, etc. on top of these warehouses. It would be extremely expensive, time-consuming and risky to recode all of that for a new Hadoop environment, and people with Hadoop/Map-Reduce expertise are in short supply.
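The ad-hoc SQL analysis described above – say, monitoring sales by geography – is exactly the kind of query business users run comfortably against a pre-defined schema. A small sketch with SQLite standing in for the warehouse; the table and column names are hypothetical:

```python
# Hypothetical sketch of the ad-hoc SQL analysis an EDW serves well.
# SQLite stands in for the warehouse; schema and rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widget", 1200.0),
     ("EMEA", "gadget", 300.0),
     ("APAC", "widget", 800.0)],
)

def sales_by_region(conn):
    """The kind of ad-hoc GROUP BY query a business user runs daily."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cur.fetchall())

if __name__ == "__main__":
    print(sales_by_region(conn))
```

Answering the same question in 2013-era Hadoop means writing and scheduling a Map-Reduce or Hive job, which is why the EDW still wins for this class of workload.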

Instead, augment your EDW with Hadoop to add new capabilities and insight. For the next couple of years, as the Hadoop/Big Data landscape evolves, augment and enhance your EDW with a Hadoop/Big Data cluster as follows:

  • Continue to store summary structured data from your OLTP and back-office systems in the EDW.
  • Store unstructured data that does not fit nicely into “tables” in Hadoop. This means all the communication with your customers – phone logs, customer feedback, GPS locations, photos, tweets, emails, text messages – can be stored in Hadoop, far more cost-effectively.
  • Correlate the data in your EDW with the data in your Hadoop cluster to get better insight about your customers, products, equipment, etc. You can then use this data for computation-intensive analytics like clustering and targeting, and run ad-hoc analytics and models against your data in Hadoop while you are still transforming and loading your EDW.
  • Do not build Hadoop capabilities within your enterprise in a silo. Big Data/Hadoop technologies should work in tandem with, and extend the value of, your existing data warehouse and analytics technologies.
  • Data warehouse vendors are adding Hadoop and Map-Reduce capabilities to their offerings. When adding Hadoop capabilities, I would recommend going with a vendor that supports and enhances the open source Hadoop distribution.
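The correlation idea above can be sketched as a simple join between structured warehouse rows and a score derived from unstructured data in Hadoop. Everything here – fields, customers, sentiment scores – is hypothetical; in practice the join might run in Hive or after an export step:

```python
# Hypothetical sketch: join structured EDW customer rows with sentiment
# scores mined from unstructured call logs in a Hadoop cluster.
# All field names, customers, and scores are illustrative.

edw_customers = [  # structured rows from the warehouse
    {"customer_id": 1, "segment": "premium", "annual_spend": 12000},
    {"customer_id": 2, "segment": "standard", "annual_spend": 1800},
]

hadoop_sentiment = {  # customer_id -> sentiment score from call logs
    1: -0.6,
    2: 0.4,
}

def correlate(customers, sentiment):
    """Attach Hadoop-derived sentiment to each warehouse row and flag
    high-value customers with negative sentiment as churn risks."""
    enriched = []
    for row in customers:
        score = sentiment.get(row["customer_id"])
        enriched.append({
            **row,
            "sentiment": score,
            "churn_risk": score is not None and score < 0
                          and row["annual_spend"] > 5000,
        })
    return enriched

if __name__ == "__main__":
    for row in correlate(edw_customers, hadoop_sentiment):
        print(row)
```

The design point is that neither store replaces the other: the warehouse supplies the trusted structured attributes, while Hadoop supplies signals that would be too bulky or too loosely structured to keep in the EDW.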

In a few years, as newer and better analytical and reporting capabilities develop on top of Hadoop, it may eventually be a good platform for all your warehousing needs. Solutions like IBM’s BigSQL and Cloudera’s Impala will make it easier for business users to move more of their warehousing needs to Hadoop by improving query performance and SQL capabilities.