Where’s the Start Line and Finish Line for Big Data?
“What, where and how do I start?” These are the questions most often asked by those trying to catch up with technology industry trends and buzzwords. There are numerous conferences, seminars, webinars and forums on big data and cloud computing, and both terms seem overused in day-to-day conversation.
There is still some ambiguity about what exactly comprises big data. Is it just sheer volume, or is it a mix of volume, variety and velocity regardless of the size of the data? Or is big data the voluminous unstructured data coming from social media and machine logs? The definition has evolved from 3 Vs to 5 Vs – volume, velocity, variety, verification and value. I believe the first 3 (volume, variety and velocity) are attributes and characteristics of the data, while the last 2 (verification and value) are part of the process and outcome.
I have been asked whether big data equals unstructured data. I believe the simplest test for whether data qualifies as big data is to apply the 3 basic criteria from the original 3 Vs definition: volume, variety and velocity.
Organizations in retail, healthcare, financial services and technology inherently deal with massive amounts of structured data coming from a variety of sources, generating tremendous velocity, volume and, in some respects, variety. For example, the Visa data warehouse system built on IBM DB2 9 has 400 terabytes of primary data and close to 2,000 tables, with thousands of users and very complex processing.
In one of my previous blogs, I mentioned that big data is complementary to the enterprise data warehouse (EDW) and does not replace it. Processing the information now available as big data adds huge value and brings in new insight that was not tapped previously. In the information management journey, EDW is Union Station and big data is the next grand junction. Over the next few decades, additional junctions will take us to more destinations. For example, artificial intelligence is still in its infancy for day-to-day business operations, and it will use EDW and big data as its foundation before it matures and is embedded in mainstream business applications.
Organizations are now accumulating terabytes and petabytes of data from various devices – machine, mobile, user, web logs and cookies, social media, etc. The challenge is not in storing this information, but in actually using this data for competitive advantage. Organizations are rushing to store this wealth of information fearing missed opportunities. This takes us back to the key questions: what, where and how do I start?
I believe we have addressed the “what” part of the question, so let’s tackle the “where” part. Organizations can now build a big data platform using IBM BigInsights. Commodity hardware components and new techniques for assembling and analyzing large data sets help companies to experiment. In my lab, it took less than 2 business days to stand up cloud-based infrastructure using Amazon EC2, RightScale, IBM BigInsights and Hadoop. There are many choices available. Organizations can now hit the ground running with a POC with very little time and effort, thanks to the cloud offerings – PaaS, IaaS and SaaS.
Lastly, “how” should you start? While the POC description above addresses the technology, you must also consider process and methodology. Organizations are rushing into big data POCs and storing all possible data from a variety of sources, often out of fear of missed opportunities and without understanding what intelligence may be tapped from it.
The key to winning the race to competitive advantage is not storing all or most of the data, but deriving value and insight that can be tied to a business plan driving outcomes, ROI and profitability. Here are the high-level steps I recommend to begin your big data journey:
1. Identify business use cases tied to business outcomes, metrics and your big data roadmap
2. Identify big data champions from both the business and IT sides of your organization
3. Select infrastructure, tools and architecture for your big data POC/implementation
4. Staff the project with the right big data skills or a strategic big data implementation partner
5. Run your project/POC in sprints or short projects with tangible and measurable outcomes
6. Build on small successes and integrate with EDW and operational applications, including web portals