Hadoop meets SQL

Big data technologies like Hadoop give enterprises a cost-effective way to store and analyze data, and many enterprises are looking at Hadoop to augment their traditional data warehouses. Compared with traditional data warehouse solutions, Hadoop scales on commodity hardware and can store both structured and unstructured data. Traditional data warehouses built on relational database technology have been around for a long time and have mature tools for querying and analysis. Business users rely on SQL for ad hoc queries against these warehouses, and reporting tools like Cognos, Business Objects, and MicroStrategy depend heavily on SQL. The real value of Hadoop is realized when users can access it and run ad hoc queries directly against Hadoop using tools that support SQL.

Querying a Hadoop data store today means knowing Map-Reduce programming or writing Pig and Hive scripts. Hadoop at its core consists of HDFS storage and the Map-Reduce engine. Map-Reduce programs, typically written in Java, are difficult to write and far from easy to use from a business user's perspective. The ability to use SQL to analyze data stored in Hadoop will help Hadoop go mainstream, and it will let business users reuse their existing SQL knowledge on data stored in Hadoop. Various initiatives are underway, both in open source and at various companies, to enable SQL on Hadoop. The following are some of the most common ones.

Hive: Facebook developed Hive to bring a SQL-like interface to querying Hadoop. A Hive warehouse must be created first; it provides a schema on top of the data stored in HDFS. Using HiveQL (the Hive Query Language), we can query data stored in HDFS with SQL-like syntax. Hive's SQL support is quite limited at this point: it does not offer full ANSI SQL. For joins, Hive supports only ANSI join syntax, and only equi-joins. Hive also does not support correlated subqueries, which are common in traditional warehouse queries. Nor is Hive designed for low-latency queries: it launches map-reduce jobs in the background, so even a query against a small Hive table takes several minutes. Hive is really designed to run queries against massive amounts of data, where results come back in a few hours; it is not suited for real-time querying and analysis.
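For illustration (the table and column names here are hypothetical), the join forms at issue look like this in HiveQL; only the equality-predicate form is accepted:

```sql
-- Accepted: ANSI join syntax with an equality predicate (an equi-join).
SELECT o.order_id, c.name
FROM orders o JOIN customers c ON (o.customer_id = c.customer_id);

-- Rejected by Hive at this point: a non-equality join predicate.
-- SELECT o.order_id, r.band
-- FROM orders o JOIN rate_bands r ON (o.amount > r.threshold);
```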

Impala: Cloudera's Impala provides fast, real-time SQL query capability over data stored in Hadoop. Impala is based on Google's Dremel paper. Currently Impala supports a subset of ANSI SQL-92. There are still some limits on join table sizes: if the join results do not fit into the memory available on the Hadoop cluster, the join fails, so your queries are bounded by the amount of memory you have. Impala currently supports hash joins. Cloudera provides some recommendations on memory sizing for data nodes based on its beta customers' experience. Impala offers JDBC and ODBC connectivity as well as a command-line tool, and Cloudera publishes some interesting performance data for Impala (http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/).

BigSql: BigSql is an enterprise-class SQL query engine from IBM, available in IBM's Hadoop distribution, BigInsights version 2.1. BigSql provides full ANSI SQL support, including correlated subqueries, and has no memory-size limitation on join tables. BigSql runs on top of a Hive warehouse or HBase. It also offers adaptive Map-Reduce, drawn from IBM's experience with high-performance computing clusters, to improve the performance of Map-Reduce jobs. BigSql supports standard as well as ANSI join, cross join, and non-equi join syntax, and it provides much wider data type support than Hive. Any BI or visualization tool that uses JDBC/ODBC drivers can connect to a BigInsights Hadoop cluster through BigSql. BigSql also ships with a command-line query tool called JSqsh, similar to Oracle SQL*Plus or the MySQL command-line client. IBM has announced a free download of the BigInsights Quick Start VM that comes bundled with BigSql (http://www-01.ibm.com/software/data/infosphere/biginsights/quick-start/).
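To make the difference concrete (again with hypothetical tables), these are the shapes of query that BigSql accepts but Hive currently rejects:

```sql
-- A non-equi join: each order is matched to its rate band by range.
SELECT o.order_id, r.band
FROM orders o JOIN rate_bands r
  ON o.amount BETWEEN r.low_amount AND r.high_amount;

-- A correlated subquery: the inner query references the outer row.
SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o
              WHERE o.customer_id = c.customer_id);
```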

Google BigQuery: Google launched BigQuery, based on its Dremel tool, to enable real-time SQL querying of data. BigQuery supports both synchronous and asynchronous execution of queries. However, BigQuery is available only for data loaded into Google's cloud storage. Google also provides a set of RESTful APIs for running queries. BigQuery supports joins, but there is a table size limitation on one of the joined tables. It is a powerful querying tool if you are using the Google Cloud to store your data.
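As a sketch (the dataset and table names are invented), BigQuery's SQL dialect in this era surfaces the table size limitation in its join syntax: a plain JOIN requires the right-hand table to be small, while the EACH modifier is the escape hatch for joining two large tables, at some cost in speed:

```sql
-- Plain JOIN: the right-hand table must be small enough to fit the limit.
SELECT a.user_id, b.country
FROM [mydata.visits] a
JOIN [mydata.users] b ON a.user_id = b.user_id;

-- JOIN EACH: both sides may be large.
SELECT a.user_id, b.country
FROM [mydata.visits] a
JOIN EACH [mydata.users] b ON a.user_id = b.user_id;
```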

HAWQ: Greenplum announced HAWQ, a query engine that runs on top of Pivotal HD (Greenplum's Hadoop distribution) and can execute SQL queries against Hadoop. With HAWQ, users can query data stored in HBase, Hive, or HDFS. HAWQ uses the same query optimizer as Greenplum DB, together with dynamic pipelining, a combination of Greenplum technologies originally built for its parallel relational database. Dynamic pipelining acts as a job scheduler for queries (distinct from Hadoop's JobTracker and NameNode). HAWQ is a good option for customers who already use Greenplum DB as their warehouse: they can run the same queries against the Greenplum warehouse and the Pivotal HD Hadoop cluster. HAWQ is a proprietary solution from Greenplum.

What is coming next?

Other open source initiatives are underway to provide real-time query capabilities on Hadoop.

Apache Drill: This is a new open source initiative based on Google's Dremel paper. The aim is to provide near real-time query capabilities on Hadoop, similar to Google BigQuery.

Stinger: The Stinger Initiative is an Apache project driven by Hortonworks and Microsoft. The aim is to leverage Hadoop 2.0 and YARN to improve the performance and SQL capabilities of Hive. With Stinger, Hive queries are expected to be up to 100x faster than they are today, and Hive will gain subquery support and better alignment with ANSI SQL.


Key things to consider when looking for SQL capabilities on Hadoop:

  • Deploying a Hadoop distribution that is open source, or has open source support, will be important. Hadoop has been an open source initiative, and many contributors are adding and building capabilities into the platform. Companies like IBM and EMC Greenplum are now adding enterprise-class features and enabling integration with other enterprise data stores.
  • These query tools are designed for small groups of users. All of the SQL-style query tools run map-reduce jobs in the background, and map-reduce inherently scales up very well but does not scale down, so plan to enable only a limited set of business users to run SQL queries against Hadoop with these tools.
  • Enterprise-class SQL features are available through IBM BigSql and Greenplum HAWQ. Both query engines are relatively new, and both are proprietary solutions tied to their vendors' own Hadoop distributions. Both aim to support full ANSI SQL and to let enterprises port and reuse existing queries. If you have an existing warehouse running on DB2, Teradata, etc. that you want to augment while reusing queries quickly, BigSql running on a BigInsights Hadoop cluster would be a logical choice.
  • All of these query engines have limitations and are not as robust or mature as the standard SQL query tools available on traditional warehouses. Expecting business users to pick these tools up quickly is still a stretch; they will need help from technical experts who understand Hadoop.

Making the most of what you have

Many, many companies have built very sophisticated data warehouses. They should start using what they have a little more effectively before moving on to tougher things!

So there I was in an ICA store in Stockholm, a huge trolley of goods for the weekend, dead pleased that I had eventually got to the front of the queue. It was Saturday, and everyone was in a hurry to get home after queuing for ages on the Stockholm motorways. My partner was diligently packing the goods because it was my turn to pay, so imagine my horror when my debit card was rejected – not once, but three times. Crikey, everyone was looking at me as if I was some sort of crook. Luckily my partner's AMEX card came to the rescue, but imagine my concern. I kept thinking of the £20k balance in my account and wondering what had happened to it.

In a panic on the way home I missed an incoming SMS, but got a second when I got back and was horrified to see my bank's number come up – well, so I assumed, as in fact it was some random call centre somewhere on planet Earth. I answered it (at my own cost, as I was roaming) to be told that this was a routine security check because the behavior on my card had proved concerning (to whom and why is a mystery, as you will see). I was asked to confirm the last few transactions on my card to verify that they were correct and not fraudulent. They were:

Currency exchange (at Heathrow)

A purchase at Heathrow of around £30 (two bottles of champers)

Purchase of an airline ticket – UK to Sweden.

Well I confirmed all of this and was simply informed that my card would now start working again – no explanation, no nothing – unbelievable. My card had been refused at a grocery but imagine what could have happened!

Now you might ask yourself a question: why is this guy moaning about this? Well, I'm moaning because for the two years before this incident I had been travelling to Sweden at least once every six weeks – I invariably change money, always buy champagne, and always buy an air ticket, so why did my bank see this as unusual? Why weren't they using some system to check that this was, in fact, quite a usual pattern of activity – nothing unusual here? And why has this bank got the authority to arbitrarily stop me using my own money, no less in such a preposterous manner?

Well, the bank I am talking about was a pioneer in data warehousing, so I'm just wondering why this event happened when I know that they diligently record all my transactions and store them in a DW whilst apparently failing to understand their meaning. No need for Hadoop here!

Welcome to new Partners and Alliances

We are contacting you as a valued follower of the TBDI blog to explore your interest in becoming a Partner of The Big Data Institute (TBDI). TBDI is a premier source of information on Big Data, Data Science, and Advanced Analytics, and a voluntary professional organization for executives and professionals worldwide. It offers Free and Premium memberships and provides access to sponsored content such as research, Webinars, and white papers.

By being a corporate partner, your organization has the opportunity to participate in this leading Big Data Analytics & Data Science organization worldwide. As a TBDI Partner, you join an exclusive group of companies that share the TBDI commitment to quality education, content, and knowledge transfer to Big Data and Data Science professionals worldwide. TBDI Partners receive special benefits for maximum exposure to quality audiences. This includes exclusive exposure at our U.S., EMEA, AsiaPacific events, on our highly trafficked website, and within our well-respected publications.

Partners have access to proven marketing solutions that increase brand awareness, deliver superior lead generation, and ensure Big Data Analytics and Data Science professionals worldwide are familiar with your products and services. Becoming a TBDI Partner provides a cost-effective way to participate in a variety of marketing opportunities, including events, online exposure, research, Webinars, and publications.

I would like to recommend that your organization become a (Titanium, Platinum, Gold or Silver) Partner and receive the following benefits:

Exclusive Exposure at TBDI Events

  • Preferential booth selection and discounts when exhibiting at TBDI conferences
  • Partner logo floor decal in front of your booth
  • Special! Partner-only welcome reception for attendees at Conferences before the exhibit hall opens;
  • Exclusive logo placement:
    • Conference brochures
    • Exhibit hall banners
    • Conference program guide
  • Exposure in the conference brochure, program guide, and on the exhibit hall banner at TBDI Conferences
  • Sponsorship on The-BigDataInstitute.com, the Web portal for all World Conferences, including:
    • Exposure to visitors to the site per event
    • E-mail promoting of the portal to TBDI Customers, including Partner logo and link
    • Partner logo and company description on The-BigDataInstitute.com page
    • Inclusion of text ad or up to four assets in Partner box on The-BigDataInstitute.com

Print and Online Exposure

  • Exclusive! Partner logo for your website, collateral, and exhibit space
  • Your own content page in the Partner area of The-BigDataInstitute.com
  • Two white papers featured in the TBDI White Paper Library for a three-month period (includes lead generation)
  • Your logo displayed in an exclusive full-page Partner ad in the quarterly publication – The Big Data & Data Science Journal
  • 5% discount on publication advertising and sponsorships, research, and Webinars
  • Inclusion of links to all related Partner content on The-BigDataInstitute.com:
    – Sponsored Webinars
    – Checklist Reports
    – Currently listed white papers
    – Solutions Gateways
    – Videos

Professional Development

  • 20% discount on speaking engagements
  • TBDI Enterprise Team Membership for up to 20 people in your organization

One More Benefit

To become a TBDI Partner, please contact Sushil Pramanick at sushil.pramanick@the-bigdatainstitute.com or Sales at sales@the-bigdatainstitute.com

Should you need more information than is presented, please feel free to contact me and I will assist you in any way I can.

A response in the next two weeks would be greatly appreciated, so that I may let The Big Data Institute (TBDI) know of your commitment. Thank you!