Wondering how to start your Big Data POC / Lab?

Following up on my previous blog, 6 Steps to Start Your Big Data Journey, I want to address how you should start that journey with a Big Data Lab.

What is the Big Data Lab?

 

The Big Data Lab is a dedicated development environment, within your current technology infrastructure, that can be created explicitly for experimentation with emerging technologies and approaches to Big Data and Analytics.

Key Activities within the Big Data Lab:

  • Assemble a selected set of technologies to be evaluated over a 2-3 month period.
  • Test permutations against high value use cases
  • Develop recommendations from the testing scenarios to drive future architecture and usage

What should be the Big Data Lab’s objectives?

  • Deliver 2-3 “Quick Wins” to demonstrate the value of these technologies from both an IT and business perspective
  • Create a “Proof-of-Concept” that shows how these technologies can be integrated into your enterprise’s existing architecture
  • Develop future-state AI architecture recommendations
  • Deliver low-cost, high-performance agile BI and data discovery, with a focus on Big Data technologies
  • Pilot new analytical capabilities and use cases to prove business value and inform long-term roadmap to compete on analytics
  • Establish a permanent “Innovation Hub” within your architecture and center for Big Data and analytics skill-building

What components to consider in your Big Data Lab?

 

Lab components and their functions:
Big Data Storage and Processing
  • Use Hadoop and Big Data tools as a pre-processing platform for structured and unstructured data before loading it into the EDW
  • Use the Hadoop platform for storing and analyzing unstructured and high-volume data
Real-Time Ingestion
  • Ingest data into Hadoop in real time
  • Filter data in real time during collection, then extract, transform, and load high-level data for real-time analysis
Data Virtualization and Federation
  • Enable near-real-time reporting through the ODS and self-service visualization tools
BI, Reporting and Visualization
  • Structured reporting to enable Business Intelligence reporting and self-serve capability
  • Visualization tools to make insights operational
Analytics
  • Predictive analytics and scenario modeling capabilities to improve audience measurement and campaign management
ETL / ELT – Data Integration
  • Custom ETL and data modeling to aggregate high-volume data from multiple sources in disparate formats
Data Discovery and Exploration
  • Discovery environment that allows for the combination of enterprise data with external data sets
Data Governance
  • Establish a data governance and change management model to ensure that analytics are embraced across the organization
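
To make the Real-Time Ingestion component above more concrete, here is a minimal sketch of filtering data during collection and summarizing what remains for real-time analysis. It is plain Python and not tied to any particular ingestion tool. The event fields (`page`, `status`) and the rule of keeping only server errors are my own assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator


def filter_events(events: Iterable[dict], min_status: int = 500) -> Iterator[dict]:
    """Yield only the events worth keeping (assumed here: HTTP server errors)."""
    for event in events:
        if event.get("status", 0) >= min_status:
            yield event


def summarize(events: Iterable[dict]) -> Counter:
    """Aggregate the filtered events into a small per-page count."""
    counts = Counter()
    for event in events:
        counts[event["page"]] += 1
    return counts


# Hypothetical sample of a click/web-log stream.
sample_stream = [
    {"page": "/checkout", "status": 200},
    {"page": "/checkout", "status": 502},
    {"page": "/search", "status": 500},
    {"page": "/checkout", "status": 503},
]

error_summary = summarize(filter_events(sample_stream))
print(error_summary)
```

Because both steps work on iterators, the same functions could sit in front of a real stream rather than a list, so only the filtered, summarized data needs to be loaded for analysis.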

What is the proposed Hadoop infrastructure for the Big Data Lab?

[Image: proposed Hadoop infrastructure for the Big Data Lab]

The Big Data Lab’s research mission is to identify, engineer and evaluate innovative technologies that address current and future data-intensive challenges. To unlock the potential of Big Data you need to overcome a significant number of research challenges, including managing diverse sources of unstructured data with no common schema, removing the complexity of writing auto-scaling algorithms, delivering real-time analytics, and finding suitable visualization techniques for petabyte-scale data sets. The Big Data Lab will give you a platform to test your hypotheses and integrate your big data efforts across your organization.

 

Big Data Analytics – Acquire, Grow and Retain Customers

At the start of this year, in January 2013, I discussed in my blog Is Customer the King? In Retail, Analytics Say “Yes” how the retail industry can leverage big data insights to optimize and personalize customer interactions, increase customer lifetime value, improve customer retention and satisfaction, and improve the accuracy of and response to marketing campaigns. In an article last year, The Wall Street Journal said that Big Data refers to the idea that companies can extract value from collecting, processing and analyzing vast quantities of data about their customer experience. Businesses that can get a better handle on these data will be more likely to outperform competitors who do not. Kimberly Collins, Gartner Research vice-president, stated that big data will be the next major “disruptive technology” to affect the way businesses interact with customers.

 

In this new era of big data, companies need to create teams of customer relationship management experts who understand the psychology and buying behavior of their customers, apply strong analytical skills to internal and external data, and provide a personalized, individualized experience. In addition, companies will need to apply forward-looking insights from predictive and prescriptive models that help steer innovation in the industry. Steve Jobs and his company created a need: nobody knew they needed an iPhone or iPad, but today it is a need for millions of users. Companies need to reorient themselves to 21st-century thinking, which unequivocally involves applying big data analytics to their customers (clients, employees and other stakeholders).

 

Today, companies have access to more data than they have ever had before, from internal systems and external media, both structured and unstructured. They also have access to advanced modeling and visualization tools that provide the insight to understand customers and, even more powerfully, to predict and prescribe behaviors.

 

Ironically, although the retail industry is under tremendous pressure to stay competitive, the industry as a whole lags behind other industries in its use of big data analytics. A report from Ventana Research suggests that only 34% of retail companies are satisfied with the processes they use to create analytics. According to a recent infographic from marketing optimization company Monetate, 32% of retailers don’t know how much data their company stores. And more than 75% don’t know how much of their data is unstructured data like call center notes, online forum comments and other information-rich customer data that can’t be analyzed in a database.

 

In a recent industry case study, the CMO of a retail company convened a group of marketing and product development experts to analyze their leading competitor’s practices. They found that the competitor had made massive investments in its ability to collect, integrate, and analyze data from each store and every sales unit, and had used this ability to run myriad real-world experiments to test its hypotheses before implementing them broadly. At the same time, it had linked this information to suppliers’ databases, making it possible to adjust prices in real time, to reorder hot-selling items automatically, and to shift items from store to store easily. By constantly testing, bundling, synthesizing, and making information instantly available across the organization, from the store floor to the CFO’s office, the rival company had become a different, far nimbler type of business. What this retailer had witnessed was fierce market competition amplified by big data.

Retailers that are taking advantage of Big Data’s potential are reaping the rewards. They’re able to use data to effectively reach consumers through the correct channels and with messages that resonate with a highly targeted audience. Smart retailers are using advanced revenue attribution and customer-level response modeling to optimize their marketing spend. Although there are obvious benefits, many retailers are surprisingly still failing to act on these trends. This delay is largely due to a dependence on siloed information, a lack of executive involvement and a general failure among marketers to understand analytics. Without advancing internal structures, gaining executive support or educating internally, jumping on these Big Data trends is nearly impossible.

 

The new IBM/Kantar Retail Global CPG Study of over 350 top CPG executives revealed that 74 percent of leading CPGs use data analytics to improve decision making in sales compared to just 37 percent of lower performing CPGs. By the same token, the new IBM study of 325 senior retail merchandising executives, conducted by IBM Center for Applied Insights in conjunction with Planet Retail, reports that 65 percent of leading retail merchandisers feel big data analytics is critical to their business compared to just 38 percent of other retail companies.

 

The two independently developed studies found interesting trends:

  • Sixty-three percent of top retail merchandisers have the data they need to conduct meaningful analytics while 33 percent of other retailers do not.
  • Thirty-seven percent of leading CPG companies make decisions predominately on data and sophisticated analytics versus 9 percent of lower performing CPG companies.
  • Eighty-three percent of leading retail merchandisers are focusing more on the consumer, compared to just 47 percent of lower performing retailers.
  • Forty-three percent of leading CPG companies’ sales organizations are highly focused on the consumer versus 28 percent of others.
  • Sixty-nine percent of the marketing departments of top retail merchandisers are highly collaborative vs. 39 percent of other retailers.
  • Forty-four percent of leading CPG companies report a “robust partnership” between marketing, sales and IT versus only 20 percent of their competitors.

For retailers like Macy’s, the big data revolution is seen as a key competitive advantage that can bolster razor-thin margins, streamline operations and move more goods off shelves. Kroger CEO David Dillon has called big data analytics his “secret weapon” in fending off other grocery competitors. Retailers are moving quickly into big data, according to Jeff Kelly, lead big data analyst at Wikibon. Big retail chains such as Sears and Target have already invested heavily in reacting to market demand in real time, he said. That means goods can be priced dynamically as they become hot, or not. Similar products can be cross-sold within seconds to a customer paying at the cash register. Data analysis also allows for tighter control of inventory so items aren’t overstocked.

To stay competitive, retailers must not only understand current consumer behavior but also be able to predict future consumer behavior. Accurate prediction and an understanding of customer behavior can help retailers keep customers, improve sales, and extend the relationship with their customers. In addition to standard business analytics, retailers need to perform churn analysis to estimate the number of customers in danger of being lost, market analysis to show how customers are distributed between high- and low-value segments, and market basket analysis to determine which products customers are more likely to buy together.
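
As a small illustration of the market basket analysis mentioned above, here is a minimal sketch that computes the "lift" of each product pair, a common measure of how much more often two items are bought together than chance alone would suggest. The baskets are invented sample data, and the function name `pair_lift` is my own. A production analysis would normally use a dedicated library and far more transactions.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(item_a, item_b): lift} for every pair that co-occurs at least once."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # de-duplicate and fix pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    lifts = {}
    for (a, b), together in pair_counts.items():
        support_ab = together / n            # share of baskets containing both items
        support_a = item_counts[a] / n
        support_b = item_counts[b] / n
        lifts[(a, b)] = support_ab / (support_a * support_b)
    return lifts


# Invented sample transactions.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

lifts = pair_lift(baskets)
for pair, lift in sorted(lifts.items(), key=lambda kv: -kv[1]):
    print(pair, round(lift, 2))
```

A lift above 1 suggests the two products are bought together more often than expected, which makes them candidates for cross-selling. A lift below 1 suggests the opposite.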

 

Retail banks such as Wells Fargo have gathered electronic data on their customers for decades, but it is only in the past few years that Wells Fargo, the fourth-largest U.S. bank, has learned how to put all that information to work. JPMorgan Chase, Bank of America, Citigroup and Capital One are also taking advantage of the big data opportunity. Big banks are embracing data analysis as a means to pinpoint customer preferences and, as a result, also uncover incremental sources of revenue in a period of stalled revenue growth. Smarter banks will increasingly invest in customer analytics to gain new customer insights and effectively segment their clients. This will help them determine pricing, new products and services, the right customer approaches and marketing methods, which channels customers are most likely to use and how likely customers are to change providers or have more than one provider.

 

Banks, retailers and CPG companies that are applying big data analytics to better understand consumers and adjust to their needs are outperforming their competitors who don’t, according to a pair of studies released by IBM. Advanced Big Data analytical applications leverage a range of techniques to enable deeper dives into customer data, as well as layering this customer data with sales and product information to help retailers segment and market to customers in the ways they find most compelling and relevant. Historically, retailers have only scratched the surface when it comes to making use of the piles of customer data they already possess. Add social media sentiment to the mix, and they can access a virtual treasure trove of insights into customer behaviors and intentions. The timing couldn’t be better, because these days consumers award their tightly held dollars to retailers that best cater to their need for customized offers and better value. The ability to offer just what customers want, when they want it, in the way they want to buy it requires robust customer analytics. The opportunity is now: It’s critical that retailers step up their customer analytics capabilities as they transition to an all-channel approach to business.

 

Addressing Big Data Security

Data security rules have changed in the age of Big Data. The V-Force (Volume, Veracity and Variety) has changed the landscape for data processing and storage in many organizations. Organizations are collecting, analyzing, and making decisions based on massive data sets from various sources, such as web logs, clickstream data and social media content, to gain better insights about their customers. Securing that data, and the business decisions built on it, is becoming increasingly important. IBM estimates that 90 percent of the data that now exists has been created in the past two years.

A recent study conducted by Ponemon Institute LLC in May 2013 showed that the average number of breached records was 23,647. German and US companies had the most costly data breaches ($199 and $188 per record, respectively). These countries also experienced the highest total cost (US at $5.4 million and Germany at $4.8 million). On average, Australian and US companies had data breaches that resulted in the greatest number of exposed or compromised records (34,249 and 28,765 records, respectively).
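
The US figures quoted above are consistent with one another, as a quick back-of-the-envelope check shows. I am assuming here that the total cost of a breach is roughly the per-record cost multiplied by the average number of records exposed. That is my simplification, not necessarily how the study derived its totals.

```python
# Figures quoted above for US companies (Ponemon Institute, May 2013).
us_cost_per_record = 188      # US dollars per breached record
us_avg_records = 28_765       # average records exposed per breach

# Assumption: total cost is approximately per-record cost times records exposed.
us_estimated_total = us_cost_per_record * us_avg_records

print(f"Estimated US total cost per breach: ${us_estimated_total / 1e6:.1f} million")
```

The estimate lands at about $5.4 million, matching the total reported for the US.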

A Forrester report, the “Future of Data Security and Privacy: Controlling Big Data”, observes that security professionals apply most controls at the very edges of the network. However, if attackers penetrate your perimeter, they will have full and unrestricted access to your big data. The report recommends placing controls as close as possible to the data store and the data itself, in order to create a more effective line of defense. Thus, if the priority is data security, then the cluster must be highly secured against attacks.

According to ISACA’s white paper Privacy and Big Data, published in August 2013, enterprises must ask and answer 16 important questions, including these five key questions, which, if ignored, expose the enterprise to greater risk and damage:

  • Can the company trust its sources of Big Data?
  • What information is the company collecting without exposing the enterprise to legal and regulatory battles?
  • How will the company protect its sources, processes and decisions from theft and corruption?
  • What policies are in place to ensure that employees keep stakeholder information confidential during and after employment?
  • What actions is the company taking that create trends that can be exploited by its rivals?

Hadoop, like many open source technologies such as UNIX and TCP/IP, wasn’t originally built with the enterprise in mind, let alone enterprise security. Hadoop’s original purpose was to manage publicly available information such as Web links, and it was designed to process large amounts of unstructured data within a distributed computing environment modeled on Google’s. It was not written to support hardened security, compliance, encryption, policy enablement and risk management.

Here are some specific steps you can take to secure your Big Data:

  • Use Kerberos authentication to validate inter-service communication and application requests for MapReduce (MR) and similar functions.
  • Use file/OS layer encryption to protect data at rest, ensure administrators or other applications cannot gain direct access to files, and prevent leaked information from exposure. File encryption protects against two attacker techniques for circumventing application security controls: it protects data if malicious users or administrators gain access to data nodes and directly inspect files, and it renders stolen files or copied disk images unreadable.
  • Use key/certificate management to store your encryption keys safely and separately from the data you’re trying to protect.
  • Use automation tools like Chef and Puppet to help you validate nodes during deployment and stay on top of patching, application configuration, updating the Hadoop stack, collecting trusted machine images, certificates and platform discrepancies.
  • Log transactions, anomalies, and administrative activity to validate usage and provide forensic system logs.
  • Use SSL or TLS network security to authenticate and ensure privacy of communications between nodes, name servers, and applications. Implement secure communication between nodes, and between nodes and applications. This requires an SSL/TLS implementation that actually protects all network communications rather than just a subset.
  • Anonymize data to remove all data that can be uniquely tied to an individual. Although this technique can protect some personal identification, hence privacy, you need to be really careful about the amount of information you strip out.
  • Use tokenization to protect sensitive data by replacing it with random tokens or alias values that mean nothing to someone who gains unauthorized access to this data.
  • Leverage the Cloud database controls where access controls are built into the database to protect the whole database.
  • Use OS hardening: harden the operating system on which the data is processed to lock down the data. The four main protection focus areas should be users, permissions, services and logging.
  • Use In-Line Remediation to update configuration, restrict applications and devices, restrict network access in response to non-compliance.
  • Use the Knox Gateway (“Gateway” or “Knox”) that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster).
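
To show what the tokenization step in the list above looks like in practice, here is a minimal sketch using only Python's standard library. The `TokenVault` class is hypothetical and keeps its mapping in memory. A real deployment would store the mapping in a hardened, separately secured service, in the same spirit as keeping encryption keys apart from the data they protect.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault that maps random tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        """Return a random token for the value, reusing it if already issued."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)   # random, carries no information
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only callers with vault access can do this."""
        return self._token_to_value[token]


vault = TokenVault()

# Invented sample record; only the sensitive field is replaced before storage.
record = {"customer": "Jane Doe", "card_number": "4111111111111111", "amount": 42.50}
protected = {**record, "card_number": vault.tokenize(record["card_number"])}

print(protected)
```

The protected record can be loaded into the cluster for analysis, and someone who reads it without access to the vault learns nothing about the card number.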

A study conducted by Voltage Security showed that 76% of senior-level IT and security respondents are concerned about the inability to secure data across big data initiatives. More than half (56%) admitted that these security concerns have kept them from starting or finishing cloud or big data projects. The built-in Apache Hadoop security still has significant gaps that keep enterprises from using it as-is. To address them, multiple vendors of Hadoop distributions, including Cloudera, Hortonworks and IBM, have bolstered security in a few powerful ways.

Cloudera’s Hadoop Distribution now offers Sentry, a new role-based security access control project that will enable companies to set rules for data access down to the level of servers, databases, tables, views and even portions of underlying files.

Its new support for role-based authorization, fine-grained authorization, and multi-tenant administration allows Hadoop operators to:

  • Store more sensitive data in Hadoop,
  • Give more end-users access to that data in Hadoop,
  • Create new use cases for Hadoop,
  • Enable multi-user applications, and
  • Comply with regulations (e.g., SOX, PCI, HIPAA, EAL3)
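
To illustrate the general idea behind role-based, fine-grained authorization, here is a minimal sketch in plain Python. It is not Sentry's actual API or policy format. The `Privilege` class, the role and group names, and the `is_allowed` function are all hypothetical, chosen only to show how groups map to roles and roles map to privileges on databases and tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    action: str        # e.g. "select" or "insert"
    database: str
    table: str = "*"   # "*" grants the action on every table in the database


# Hypothetical policy: roles hold privileges, groups hold roles.
ROLE_PRIVILEGES = {
    "analyst": {Privilege("select", "sales")},
    "etl": {
        Privilege("insert", "sales", "orders"),
        Privilege("select", "sales", "orders"),
    },
}
GROUP_ROLES = {"marketing": {"analyst"}, "data_eng": {"etl"}}


def is_allowed(group: str, action: str, database: str, table: str) -> bool:
    """Return True if any role of the group grants the action on the table."""
    for role in GROUP_ROLES.get(group, ()):
        for p in ROLE_PRIVILEGES.get(role, ()):
            if p.action == action and p.database == database and p.table in ("*", table):
                return True
    return False


print(is_allowed("marketing", "select", "sales", "customers"))  # read access via analyst role
print(is_allowed("marketing", "insert", "sales", "orders"))     # no insert privilege granted
```

Because access is granted to roles rather than to individual users, giving more end-users access to the data becomes a matter of adding them to the right group.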

RSA NetWitness and HP ArcSight ESM now serve as weapons against advanced persistent threats that can’t be stopped by traditional defenses such as firewalls or antivirus systems.


Figure 1: Cloudera Sentry Architecture

Hortonworks partner Voltage Security offers data protection solutions that protect data from any source in any format, before it enters Hadoop. Using Voltage Format-Preserving Encryption™ (FPE), structured, semi-structured or unstructured data can be encrypted at source and protected throughout the data life cycle, wherever it resides and however it is used. Protection travels with the data, eliminating security gaps in transmission into and out of Hadoop and other environments. FPE enables data de-identification to provide access to sensitive data while maintaining privacy and confidentiality for certain data fields such as social security numbers that need a degree of privacy while remaining in a format useful for analytics.


Figure 2: Hortonworks Security Architecture

IBM’s BigInsights provides built-in features that can be configured during the installation process. Authorization is supported in BigInsights by defining roles. InfoSphere BigInsights provides four options for authentication: No Authentication, Flat File authentication, LDAP authentication and PAM authentication. In addition, the BigInsights installer provides the option to configure HTTPS to potentially provide more security when a user connects to the BigInsights web console.


Figure 3: IBM BigInsights Security Architecture

Intel, one of the latest entrants to the distribution-vendor category, came out with a wish list for Hadoop security under the name Project Rhino.

Finally, although today the focus is on technology and technical security issues around big data, and they are important, big data security is not just a technical challenge. Many other domains are also involved, such as legal, privacy, operations, and staffing. Not all big data is created equal, and depending on the data security requirements and risk appetite/profile of an organization, different security controls for big data are required.