Hadoop meets SQL

Big data technologies like Hadoop are providing enterprises a cost-effective way to store and analyze data, and many enterprises are looking at Hadoop to augment their traditional data warehouse. Compared to traditional data warehouse solutions, Hadoop can scale out on commodity hardware and can store both structured and unstructured data. Traditional data warehouses based on relational database technologies have been around for a long time and have mature sets of tools for querying and analysis. Business users rely on SQL to run ad-hoc queries against these warehouses, and reporting tools like Cognos, Business Objects, and MicroStrategy depend heavily on SQL as well. The real value of Hadoop is realized when users can run ad-hoc queries directly against data in Hadoop using tools that support SQL.

Querying a Hadoop data store today means knowing MapReduce programming or writing Pig and Hive scripts. Hadoop at its core consists of HDFS storage and a MapReduce engine. MapReduce programs are typically written in Java and are difficult to write; from a business user's perspective they are not easy to use at all. The ability to use SQL to analyze data stored in Hadoop will help Hadoop go mainstream, and it will let business users reuse their existing SQL knowledge to analyze the data stored in Hadoop. Various initiatives are underway, both in open source and at various companies, to solve the problem of enabling SQL on Hadoop. The following are some of the most common ones.
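To make the gap concrete: an aggregation that a business user would write as a one-line SQL GROUP BY becomes a full Java MapReduce program. The sketch below shows just the mapper and reducer for a hypothetical sales-by-customer rollup; the comma-delimited input format, field positions, and table layout are assumptions made up for this illustration, and a separate job driver class is still needed to run it.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Roughly equivalent to: SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id
    public class SalesByCustomer {

        public static class SalesMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumes comma-delimited lines: customer_id,amount,...
                String[] fields = value.toString().split(",");
                context.write(new Text(fields[0]),
                              new DoubleWritable(Double.parseDouble(fields[1])));
            }
        }

        public static class SumReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable v : values) {
                    total += v.get();
                }
                context.write(key, new DoubleWritable(total));
            }
        }
        // A job driver class (input/output paths, formats, etc.) is still needed on top of this.
    }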

Hive: Facebook developed Hive as a way of bringing a SQL-like interface to querying Hadoop. A Hive warehouse is created first, providing a schema on top of the data stored in HDFS. Using HiveQL (the Hive query language), we can use SQL-like syntax to query data stored in HDFS. The SQL support in Hive is very limited at this point: it does not offer full ANSI SQL, it supports only ANSI join syntax, and it supports only equi-joins. Hive also does not support correlated subqueries, which are commonly used in most traditional warehouse queries. Hive is not designed for low-latency queries; it launches MapReduce jobs in the background, so even a query against a small Hive table can take several minutes. It is really designed to run queries against massive amounts of data, where the query may take hours to return results, and it is not suited for real-time querying and analysis.
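For a sense of what "SQL-like" access looks like in practice, here is a minimal sketch that runs an equi-join through the HiveServer2 JDBC driver. The host name, port, credentials, and table and column names are placeholders invented for the example; only the driver class and the jdbc:hive2:// URL form come from Hive itself.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJoinExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; the host/port below are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host.example.com:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Hive supports equi-joins written with ANSI JOIN syntax.
                ResultSet rs = stmt.executeQuery(
                    "SELECT c.customer_id, SUM(o.amount) AS total_amount " +
                    "FROM orders o JOIN customers c ON (o.customer_id = c.customer_id) " +
                    "GROUP BY c.customer_id");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }

Even a simple join like this is compiled into MapReduce jobs behind the scenes, which is why response times are measured in minutes rather than seconds.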

Impala: Cloudera’s Impala provides fast, real-time query capability for data stored in Hadoop using SQL. Impala is based on Google’s Dremel paper. Currently Impala supports a subset of ANSI-92 SQL. There are still limitations on join table sizes: if the join results do not fit into the memory available on the Hadoop cluster, the join fails, so your queries are limited by the amount of memory you have. Impala currently supports hash joins. Cloudera provides some recommendations on memory sizing for data nodes based on their beta customers’ experience. Impala also provides JDBC and ODBC connectivity as well as a command-line tool. Cloudera publishes some interesting performance data for Impala on their site (http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/).
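Because Impala accepts HiveServer2-style connections, the JDBC pattern is essentially the same as the Hive sketch above; typically only the connection URL changes (21050 is Impala's usual JDBC port, but verify the port and authentication settings against your cluster's configuration). The host and table names below are placeholders; the difference you would notice is that the query does not launch MapReduce jobs, so results come back interactively.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaQueryExample {
        public static void main(String[] args) throws Exception {
            // Impala reuses the HiveServer2-compatible JDBC driver.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host.example.com:21050/default;auth=noSasl");
                 Statement stmt = conn.createStatement();
                 // The join's hash table must fit in memory across the cluster.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT c.region, COUNT(*) FROM orders o " +
                     "JOIN customers c ON o.customer_id = c.customer_id GROUP BY c.region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }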

BigSql: BigSql is an enterprise-class SQL query engine from IBM, available in IBM’s Hadoop distribution BigInsights version 2.1. BigSql provides full ANSI SQL support, including correlated subqueries, and there is no memory size limitation on join tables. BigSql runs on top of a Hive warehouse or HBase. BigSql also provides an option to use Adaptive MapReduce, which draws on IBM’s experience with high-performance computing clusters, to improve the performance of MapReduce jobs. BigSql supports standard as well as ANSI join syntax, cross joins, and non-equi joins, and it provides much wider data type support than Hive. Any BI or visualization tool that uses JDBC/ODBC drivers can use BigSql to connect to a BigInsights Hadoop cluster. BigSql also comes with a command-line query tool called JSqsh, which is similar to Oracle SQL*Plus or the MySQL command-line client. IBM has announced a free download of the BigInsights QuickStart VM that comes bundled with BigSql (http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/).
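BigSql is likewise reached over standard JDBC, so the access pattern mirrors the Hive sketch above; the exact driver class and connection URL come from the BigInsights documentation, so the sketch below deliberately leaves them as externally supplied placeholders. What it illustrates is the kind of correlated subquery that BigSql accepts but Hive, at this point, does not; the table and column names are made up.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BigSqlCorrelatedSubqueryExample {
        public static void main(String[] args) throws Exception {
            // Supply the driver class, URL, and credentials from your BigInsights install,
            // e.g. via -Dbigsql.jdbc.driver=... -Dbigsql.jdbc.url=... on the command line.
            Class.forName(System.getProperty("bigsql.jdbc.driver"));
            try (Connection conn = DriverManager.getConnection(
                     System.getProperty("bigsql.jdbc.url"),
                     System.getProperty("bigsql.user"),
                     System.getProperty("bigsql.password"));
                 Statement stmt = conn.createStatement();
                 // Correlated subquery: orders larger than that customer's average order.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT o.order_id, o.customer_id, o.amount " +
                     "FROM orders o " +
                     "WHERE o.amount > (SELECT AVG(o2.amount) FROM orders o2 " +
                     "                  WHERE o2.customer_id = o.customer_id)")) {
                while (rs.next()) {
                    System.out.println(rs.getString("order_id") + "\t" + rs.getDouble("amount"));
                }
            }
        }
    }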

Google BigQuery: Google launched BigQuery, based on its Dremel technology, to enable real-time querying of data using SQL. BigQuery supports both synchronous and asynchronous query execution. However, BigQuery is available only if the data is loaded into Google’s cloud storage. Google also provides a set of RESTful APIs for running queries. BigQuery supports joins, but there is a table size limitation on one of the joined tables. It is a powerful querying tool if you are using the Google cloud to store your data.

HAWQ: Greenplum announced its HAWQ query engine, which runs on top of Pivotal HD (Greenplum’s Hadoop distribution) and can execute SQL queries against Hadoop. With HAWQ, users can query data stored in HBase, Hive, or HDFS. HAWQ uses the same query optimizer used by Greenplum DB. It also uses dynamic pipelining, a combination of several Greenplum technologies built for the parallel relational database; dynamic pipelining acts as a job scheduler for queries (distinct from the JobTracker and NameNode used by Hadoop). This is a good option for customers who already use Greenplum DB as their warehouse, since they can run the same queries against the Greenplum warehouse as well as the Pivotal HD Hadoop cluster. HAWQ is a proprietary solution from Greenplum.

What is coming next?

Other open source initiatives are underway to provide real-time query capabilities on Hadoop.

Apache Drill: This is a new open source initiative based on Google’s Dremel paper. The aim is to provide near real-time query capabilities on Hadoop, similar to Google BigQuery.

Stinger: The Stinger Initiative is an Apache project driven by Hortonworks and Microsoft. The aim is to leverage Hadoop 2.0 and YARN to improve the performance and SQL capabilities of Hive. The stated goal is for Hive queries to run up to 100x faster than they do today, and for Hive to support subqueries and align more closely with ANSI SQL.

Key things to consider when looking for SQL capabilities on Hadoop:

  • Deploying a Hadoop distribution that is open source or has open source support will be important. Hadoop has been an open source initiative, and there are a lot of contributors adding and building capabilities into the platform. Companies like IBM and EMC Greenplum are now adding enterprise-class features to it and enabling integration with other enterprise data stores.
  • These query tools are designed to be used by small groups of users. Many of these SQL-style query tools run MapReduce jobs in the background, and MapReduce inherently scales up very well but does not scale down. Plan to enable a limited set of business users to run SQL queries against Hadoop using these tools.
  • Enterprise-class SQL features are available through IBM BigSql and Greenplum HAWQ. These two query engines are both relatively new, and they are proprietary solutions tied to their own Hadoop distributions. Both aim to support full ANSI SQL and enable enterprises to port and reuse existing queries. If you have an existing warehouse running on DB2, Teradata, etc. that you want to augment while reusing queries quickly, BigSql running on a BigInsights Hadoop cluster would be a logical choice.
  • All of these query engines have limitations and are not as robust or mature as the standard SQL query tools available on traditional warehouses. Expecting business users to pick these tools up quickly is still a stretch; they will need help from technical experts who understand Hadoop.

Running Hadoop in the Cloud

With the growing popularity of cloud computing, enterprises are seriously looking at moving workloads to the cloud. Issues around multi-tenancy, data security, software licensing, data integration, and so on have to be considered before enterprises can make this shift, and even then, not all workloads can be easily moved to the cloud. In recent years, Hadoop has gained a lot of interest as a big data technology that can help enterprises cost-effectively store and analyze massive amounts of data. As enterprises start evaluating Hadoop, one of the questions frequently asked is “Can we run Hadoop in the cloud?”

To answer this, it is important to understand the following key aspects of the Hadoop infrastructure:

1. Hadoop runs best on physical servers. A Hadoop cluster comprises a master node called the NameNode and multiple child nodes called DataNodes. These data nodes are separate physical servers with dedicated storage (like your PC’s hard drive), instead of a common shared storage.

2. Hadoop is “rack aware” – Hadoop data nodes (servers) are installed in racks. Each rack typically contains multiple data node servers with a top-of-rack switch for network communication. “Rack awareness” means that the NameNode knows where each data node server is and in which rack. This ensures that Hadoop can write the 3 (default) copies of each data block to data nodes that are not all on the same physical rack, which prevents data loss due to data node or rack failure. When a MapReduce job needs access to data blocks, the NameNode ensures that the job is assigned to the closest data node that contains the data, thereby reducing network traffic. The Hadoop system admin manually maintains this rack awareness information for the cluster. Since a Hadoop cluster has a lot of network traffic, it is recommended that it be isolated into its own network instead of using a VLAN (refer to Brad Hedlund’s article on Hadoop rack awareness – http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/).
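For readers who want to see where this rack information plugs in: the NameNode resolves each data node to a rack path such as /rack1, usually through an admin-maintained topology script, or alternatively through a Java class implementing Hadoop's DNSToSwitchMapping interface. Below is a minimal, hard-coded sketch of the latter; real clusters typically load the mapping from a file, the host names and rack paths here are invented, and the exact configuration property that registers the class differs between Hadoop versions.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Illustrative only: maps data node host names to rack paths so the NameNode
    // can spread block replicas across racks.
    public class StaticRackMapping implements DNSToSwitchMapping {

        private static final Map<String, String> RACKS = new HashMap<String, String>();
        static {
            // Hypothetical hosts and racks; a real admin would load these from a file.
            RACKS.put("datanode01.example.com", "/rack1");
            RACKS.put("datanode02.example.com", "/rack1");
            RACKS.put("datanode03.example.com", "/rack2");
        }

        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>(names.size());
            for (String name : names) {
                // Unknown nodes fall back to the default rack.
                String rack = RACKS.get(name);
                racks.add(rack != null ? rack : "/default-rack");
            }
            return racks;
        }

        // Required by the interface in newer Hadoop versions; no cached state to reload here.
        public void reloadCachedMappings() { }
        public void reloadCachedMappings(List<String> names) { }
    }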

Options for running Hadoop in the Cloud

  • Hadoop as a Service in the public cloud: Hadoop distributions (Cloudera CDH, IBM BigInsights, MapR, Hortonworks) can be launched and run on public clouds like AWS, Rackspace, MS Azure, IBM SmartCloud, etc., which offer Infrastructure as a Service (IaaS). In a public cloud you are sharing the infrastructure with other customers. As a result, you have very limited control over which physical server a VM is spun up on and what other VMs (yours or other customers’) are running on the same physical server. There is no “rack awareness” that you have access to and can configure in the NameNode, and the performance and availability of the cluster may be affected because you are running on VMs. Enterprises can use and pay for these Hadoop clusters on demand. There are options for creating your own private network using a VLAN, but the Hadoop performance recommendation is to have a separate, isolated network because of the high network traffic between nodes. In all cases, with the exception of AWS EMR, you have to install and configure the Hadoop cluster on the cloud yourself.
    • Map-Reduce as a Service: Amazon’s EMR (Elastic MapReduce) provides a quick and easy way to run MapReduce jobs without having to install a Hadoop cluster on its cloud. This can be a good way to develop Hadoop programming expertise internally within your organization, or a fit if you only need to run MapReduce jobs in your workloads.
    • Hadoop on S3: You can run Hadoop using Amazon’s S3 instead of HDFS to store data. The performance of S3 is slower than HDFS, but it provides other features like bucket versioning and elasticity as well as its own data loss protection schemes. This may be an option if your data is already stored in S3 for your business (e.g., Netflix runs a Hadoop cluster backed by S3); a sketch of pointing Hadoop at S3 follows this list.
    • Hadoop in a private cloud: The same set of considerations applies to a private cloud deployment of Hadoop. However, in a private cloud you may have more control over your infrastructure, which lets you provision bare metal servers or create a separate, isolated network for your Hadoop clusters. Some of these private cloud solutions also provide a PaaS layer that offers pre-built patterns for deploying Hadoop clusters easily (e.g., IBM offers patterns for deploying BigInsights on its SmartCloud Enterprise). In addition, you also have the option of deploying a “cloud in a box,” like IBM PureData/PureApps, which is Hadoop ready, in your own data center. The big reasons for a private cloud deployment are data security and access control for your data, as well as better visibility into and control of your Hadoop infrastructure.
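As a concrete illustration of the "Hadoop on S3" option above, the sketch below reads a file through Hadoop's FileSystem API using an s3n:// URI (the S3-native connector of this era) instead of an hdfs:// URI. The bucket, object path, and credential property names are assumptions for the example and should be checked against your Hadoop version's documentation; the AWS credentials are assumed to be present as environment variables.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3ReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credential property names for the s3n connector; verify against your distribution.
            conf.set("fs.s3n.awsAccessKeyId", System.getenv("AWS_ACCESS_KEY_ID"));
            conf.set("fs.s3n.awsSecretAccessKey", System.getenv("AWS_SECRET_ACCESS_KEY"));

            // Hypothetical bucket and object; the same code works with an hdfs:// URI.
            Path input = new Path("s3n://example-bucket/logs/access.log");
            FileSystem fs = FileSystem.get(URI.create(input.toString()), conf);

            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(input)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }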

Key things to consider before deploying a Hadoop cluster in the cloud:

  • Enterprises should evaluate their security criteria for deploying workloads in a public cloud before moving any data into the Hadoop cluster. Hadoop cluster security is very limited; there is no native data security that will satisfy enterprise data security requirements around SOX, PII, HIPAA, etc.
  • Evaluate the Hadoop distributions you would want to use against the operating system standards of your enterprise. Preferably go with distributions that are close to the open source Apache distribution. Hadoop distributions typically run on Linux; Hortonworks provides a Hadoop distribution for Windows that is currently available on the MS Azure cloud.
  • When using AWS, be aware that using Hadoop with S3 ties you to Amazon’s cloud. For open standards, look at OpenStack-based cloud providers like Rackspace, IBM SmartCloud, HP, etc.
  • Look at the entire Hadoop ecosystem and not just the basic Hadoop cluster. The value of Hadoop is in the analytics and data visualization that can be applied to large data sets. Ensure that the tools you want to use for analytics (e.g., Tableau, R, SPSS) are available for use with the cloud provider.
  • Get an understanding of where the data to be loaded into Hadoop comes from: are you going to load data from internal systems that are not in the cloud, or is the data already in the cloud? Most public cloud providers charge data transfer fees if you are moving data back and forth.
  • Hadoop clusters on VMs will be slow; you may be able to use these for dev and test clusters. VMware’s Project Serengeti is trying to address the deployment of Hadoop clusters on virtual machines without taking a performance hit. However, with this approach you will be tied to VMware’s hypervisor, which should be a criterion to consider when selecting a cloud provider.