Introduction to Big Data and Hadoop Ecosystem – For Beginners!

We live in an age where data grows faster with every passing second. Very soon, the growth rate of data volume will outpace Moore’s law. Big Data refers to this exponential explosion of data that can’t be handled by traditional data architectures and solutions. The four key attributes of Big Data are Velocity, Volume, Variety and Value. If you look around, every gadget you use generates data that can shape how you will use it next time.

For example, a cable provider can store your preferences for the genres and attributes of the movies or shows you like and the advertisements you skip or watch, and build a custom channel suited to your needs with your shows and advertisements. As another example, the car you drive will be able to transmit your driving patterns, violations and speed in real time to the DMV and insurance companies, which may affect your insurance rates. Logistics and manufacturing companies capture millions of unstructured data records daily from server logs, machine sensors, RFID and supply chain processes that can be mined to make operations more cost-effective and productive. Imagine living in a sci-fi world where companies get real-time data from your cellphones, iPads and other tablets, laptops, game controllers, social media and electronic channels to learn about you and present you with products just in time, before you ask for one! There are endless possibilities in the future to harness the power of data and turn it into information that provides custom solutions and products to the human race.

It is said that structured data makes up only about 20% of all data and that the remaining 80% is unstructured, arriving in complex formats: everything from web sites, social media and email to videos and presentations. In the past we were overwhelmed with structured data and built big Sun and IBM servers, but given the petabytes of data and logs to process, the industry demands a more scalable, robust and performance-optimized solution. Over a decade ago, Google designed scalable frameworks such as MapReduce and the Google File System. Inspired by these designs, an Apache open source initiative was started under the name Hadoop. Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Apache Hadoop consists of 2 sub-projects – Hadoop MapReduce and Hadoop Distributed File System. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Other Hadoop-related projects at Apache include Cassandra, Chukwa, Hive, HBase, Mahout, Sqoop, ZooKeeper, Jaql, Avro, Pig.


Hadoop Distributed File System (HDFS™) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
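
To make the idea of replicated blocks concrete, here is a minimal sketch in Python. It is not how HDFS is implemented: the function names, the tiny block size and the simple round-robin placement are assumptions chosen for illustration, and real HDFS placement also takes rack topology into account.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an illustrative assumption, not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    for block, replicas in place_replicas(len(blocks), ["n1", "n2", "n3", "n4"]).items():
        print(block, blocks[block], replicas)
```

Because every block lives on several different nodes, losing any single node still leaves copies of each block elsewhere, and computation can be scheduled on whichever node already holds the data.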


Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
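
The programming model is easiest to see with the classic word count. The following is a single-process sketch in plain Python, not Hadoop's Java API; the function names are assumptions made for this example. A real cluster runs many map and reduce tasks in parallel and performs the grouping step across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is big", "hadoop stores big data"]))
```

The map and reduce functions never share state, which is what lets a framework spread them over many nodes.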


The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra’s Column Family data model offers the convenience of column indexes with the performance of log-structured updates, strong support for materialized views, and powerful built-in caching.

Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more companies that have large, active data sets. The largest known Cassandra cluster has over 300 TB of data in over 400 machines.


Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Flume from Cloudera is similar to Chukwa both in architecture and features. Architecturally, Chukwa is a batch system. In contrast, Flume is designed more as a continuous stream processing system.


Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

The main building blocks of Hive are –

  1. Metastore stores the system catalog and metadata about tables, columns, partitions, etc.
  2. Driver manages the lifecycle of a HiveQL statement as it moves through Hive
  3. Query Compiler compiles HiveQL into a directed acyclic graph of MapReduce tasks
  4. Execution Engine executes the tasks produced by the compiler in proper dependency order
  5. HiveServer provides a Thrift interface and a JDBC / ODBC server


HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.

Use HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.


  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
  • Easy to use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Query predicate push down via server side Filters
  • Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
  • Extensible jruby-based (JIRB) shell
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
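
The "versioned, column-oriented" description can be pictured as a map from row key to column to timestamped values. The sketch below is an in-memory toy in Python, not the HBase client API; the class name, method names and the `info:city` column are hypothetical.

```python
from collections import defaultdict


class TinyVersionedTable:
    """In-memory stand-in for a versioned, column-oriented table."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> column name -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda cell: cell[0], reverse=True)
        del cells[self.max_versions:]  # keep only the newest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


if __name__ == "__main__":
    table = TinyVersionedTable(max_versions=2)
    table.put("user1", "info:city", "Austin", timestamp=1)
    table.put("user1", "info:city", "Boston", timestamp=2)
    print(table.get("user1", "info:city"))
```

Only columns that were actually written take up space, which is why a table can be very wide without every row paying for every column.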


The success of companies and individuals in the data age depends on how quickly and efficiently they turn vast amounts of data into actionable information. Whether it’s for processing hundreds or thousands of personal e-mail messages a day or divining user intent from petabytes of weblogs, the need for tools that can organize and enhance data has never been greater. Therein lies the premise and the promise of the field of machine learning.

How do we easily move all these concepts to big data? Welcome Mahout!

Mahout is an open source machine learning library from Apache. It’s highly scalable. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. At the moment, it primarily implements recommender engines (collaborative filtering), clustering, and classification.

Recommender engines try to infer tastes and preferences and identify unknown items that are of interest. Clustering attempts to group a large number of things together into clusters that share some similarity. It’s a way to discover hierarchy and order in a large or hard-to-understand data set. Classification decides how much a thing is or isn’t part of some type or category, or how much it does or doesn’t have some attribute.
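
As a rough illustration of the recommender idea, here is a small co-occurrence recommender in plain Python. It does not use Mahout and is not one of its algorithms as shipped; the data, function names and scoring rule are assumptions for this sketch.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(baskets):
    """Count how often each ordered pair of items appears for the same user."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(user, baskets, top_n=2):
    """Rank items the user has not seen by co-occurrence with items they have."""
    seen = set(baskets[user])
    scores = Counter()
    for (a, b), n in cooccurrence(baskets).items():
        if a in seen and b not in seen:
            scores[b] += n
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    baskets = {
        "ann": ["hadoop", "hive"],
        "bob": ["hadoop", "hive", "pig"],
        "cat": ["hadoop", "pig", "hbase"],
        "dan": ["hive", "pig"],
    }
    print(recommend("ann", baskets))
```

Items that frequently appear alongside what a user already likes float to the top, which is the intuition behind collaborative filtering.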


Loading bulk data into Hadoop from production systems or accessing it from map-reduce applications running on large clusters can be a challenging task. Transferring data using scripts is inefficient and time-consuming.

How do we efficiently move data from an external storage into HDFS or Hive or HBase? Meet Apache Sqoop. Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset.
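
The slicing step can be sketched as dividing a key range among mappers. This is an illustration in Python under the assumption that the table has a numeric key; the function name is hypothetical and Sqoop's own splitting logic may differ.

```python
def split_key_range(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    if total <= 0 or num_mappers <= 0:
        return []
    num_mappers = min(num_mappers, total)  # never create empty slices
    base, extra = divmod(total, num_mappers)
    splits, start = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size - 1))
        start += size
    return splits


if __name__ == "__main__":
    for low, high in split_key_range(1, 10, 4):
        # Each mapper would read only the rows inside its own slice.
        print(f"mapper reads keys {low}..{high}")
```

Since the slices do not overlap, the mappers can transfer their parts independently and no reduce step is needed.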


ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications usually skimp on them initially, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Eclipse is a popular IDE donated by IBM to the open source community.


Lucene is a text search engine library written in Java. Lucene provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.


Solr is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.


Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig’s language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, “embarrassingly parallel” data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
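
The "data flow sequence" idea can be imitated in Python, with each step feeding the next. This is not Pig Latin and does not run on Hadoop; the record layout and function names are assumptions, and the capitalized words in the docstrings only name the kind of operation each step stands for.

```python
from collections import defaultdict


def load(lines):
    """LOAD: parse 'user,url,seconds' records into tuples."""
    for line in lines:
        user, url, seconds = line.strip().split(",")
        yield user, url, int(seconds)


def filter_long_visits(records, min_seconds):
    """FILTER: keep visits that lasted at least min_seconds."""
    return (r for r in records if r[2] >= min_seconds)


def group_by_url(records):
    """GROUP: collect records that share a url."""
    groups = defaultdict(list)
    for record in records:
        groups[record[1]].append(record)
    return groups


def total_time_per_url(groups):
    """Aggregate: produce one total per group."""
    return {url: sum(r[2] for r in rows) for url, rows in groups.items()}


if __name__ == "__main__":
    lines = ["ann,/home,30", "bob,/home,5", "ann,/docs,120", "cat,/docs,60"]
    visits = filter_long_visits(load(lines), min_seconds=10)
    print(total_time_per_url(group_by_url(visits)))
```

Each step only describes what happens to the data, not how it is distributed, which is what leaves room for a system to parallelize and optimize the flow.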


Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.


Jaql (pronounced “jackal”) is a query language for JavaScript Object Notation (JSON).


Avro is a data serialization system. Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

UIMA is an architecture for the development, discovery, composition and deployment of analytics for unstructured data.



Join us this Thursday, May 2nd for our next Big Data Developer Meetup!

This meetup will focus on Real-Time Location Based Analytics and will include a presentation, demo, hands-on session (bring your laptop), and pizza! Doors open at 5:30pm for registration and networking, while the program starts at 6:30pm.

With smartphones and fully instrumented cars, the amount of data we can collect from moving objects is growing at a staggering rate. In a few years, the automotive industry will be the largest producer of data after utilities, bigger than health care. And with this Big Data volume come Big Data challenges. The opportunity for applying all this data in real time to problems in transportation, congestion management, emergency response, microweather prediction, supply chain management, and so on is tremendous. But this requires a real-time analytics platform that can integrate GPS locations, telematics messages and sensor readings, video, and other kinds of information, and scale up to any level.

Join us to learn about how IBM InfoSphere Streams is addressing this Big Data challenge. The event will kick off with a presentation and will be followed by a live demo.  Bring your laptop, because you’ll then have an opportunity to get hands on with InfoSphere Streams and apply this exciting new technology yourself!

Pizza & beverages will be provided!


5:30pm: Registration & Networking

6:30pm: Presentation

7:15pm: Pizza Break

7:30pm: Demo

8:15pm: Hands On with Real-time Analytics


To visit Big Data Developers, go here:

Big Data Use-Cases in Healthcare – Provider, Payer and Care Management

In this part, we will discuss use cases specific to the Healthcare industry. In general, the Healthcare industry has been a late adopter of technology compared to other industry verticals – Banking and Finance, Retail and Insurance. As per the McKinsey report on Big Data of June 2011, “…if US health care could use big data creatively and effectively to drive efficiency and quality, we estimate that the potential value from data in the sector could be more than $300 billion in value every year, two-thirds of which would be in the form of reducing national health care expenditures by about 8 percent…”.

Some of the key use cases for the Provider industry are:

a. Reduce Medicaid Re-admissions – One of the major costs of Medicaid is readmissions caused by insufficient follow-up and a lack of proactive engagement with patients. Follow-up appointments and tests are often documented only as free text in patients’ hospital discharge summaries and notes. This unstructured data can be mined using text analytics so that timely alerts can be sent, appointments scheduled and education materials dispatched. This proactive engagement can potentially reduce readmission rates by over 30%.
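
As a toy illustration of mining follow-up instructions from free text, the Python sketch below pulls "follow up with X in N days/weeks/months" phrases out of a note. The pattern, the function name and the sample note are all made up for this example; real discharge summaries are far messier and would need proper clinical text analytics.

```python
import re

# Hypothetical pattern: it only covers one simple phrasing.
FOLLOW_UP = re.compile(
    r"follow[- ]?up\s+(?:with\s+)?(?P<who>[A-Za-z ]+?)\s+in\s+"
    r"(?P<n>\d+)\s+(?P<unit>day|week|month)s?",
    re.IGNORECASE,
)
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}  # month length approximated


def extract_follow_ups(note):
    """Return a list of {'who', 'days'} dicts found in the free-text note."""
    results = []
    for match in FOLLOW_UP.finditer(note):
        unit = match.group("unit").lower()
        results.append({
            "who": match.group("who").strip().lower(),
            "days": int(match.group("n")) * DAYS_PER_UNIT[unit],
        })
    return results


if __name__ == "__main__":
    note = ("Patient stable at discharge. Follow up with cardiology in 2 weeks. "
            "Follow-up with primary care in 10 days.")
    for item in extract_follow_ups(note):
        print(item)
```

Each extracted item could then drive an alert or a scheduling request, which is the proactive engagement described above.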

b. Patient Monitoring – Inpatient, Out-Patient, Emergency Visits, Intensive Care Units…

With rapid progress in technology, sensors are now embedded in weighing scales, glucose devices, wheelchairs, patient beds and X-ray machines. These large streams of data generated in real time can provide real insights into patient health and behavior. This will improve the accuracy of information and significantly reduce costs for healthcare providers. It will also significantly enhance the patient experience at healthcare facilities by providing proactive risk monitoring, improved quality of care and personalized attention. Big Data can enable complex event processing (CEP), providing real-time insights to doctors and nurses in the control room.

c. Preventive care for ACO

One of the key goals of an Accountable Care Organization (ACO) is to provide preventive care to its members. Disease identification and risk stratification will be crucial business functions. Managing the real-time feeds coming in through the Health Information Exchange (HIE) from pharmacists, providers and payers will be key to applying risk stratification and predictive modeling techniques. In the past, companies were limited to historical claims and HRA/survey data, but with HIE the whole dynamic of data availability for health analytics has changed. Big Data tools can significantly enhance the speed of processing and data mining.

d. Provider Sentiment Analysis 

With social media growing at a rapid pace, members are sharing their experiences with providers through social channels – Facebook, Twitter, and other media. These experiences, shared through comments, Twitter feeds, blogs and surveys, can be mined for rich insights about the quality of services.

e. Epidemiology

Through HIE, most providers, payers and pharmacists will be connected through a network within the next few months. This will allow hospitals and health agencies to track disease outbreaks, patterns and trends in health issues across geographies, and to determine sources and plan containment.

f. Patient care quality and program analysis

Naturally, with the growth of data and insight into new information comes the challenge of processing this voluminous and varied information to produce metrics and KPIs for patient care quality and programs. Big Data provides the architecture, tools and techniques to process terabytes and petabytes of data and to provide deep health care analytics capabilities to stakeholders.

Some of the key use cases for the Payer industry are:

a. Clinical Data analysis for improved predictable outcomes

Payers/health plans and insurance companies can significantly reduce the cost of care by reducing readmissions, improving outcomes and monitoring patients proactively. A huge amount of clinical data already resides within organizations, and myriad unstructured data is arriving at a rapid pace. Big Data is a candidate for processing these complex events and data to provide clinical insights to payer organizations. Some of the areas that can be immediately addressed by Big Data solutions:

  • Longitudinal analysis of care across patients and diagnoses; time sequencing
  • Cluster Analysis around influencers on treatment, physicians, therapist; patient social relationships
  • Analyze clinical notes (multi-structured data); no longer limited by dimensional sentiment of a relational database
  • Analyze click stream data and clinical outcomes; look for patterns/ trends to quality of care delivered.
  • Clinical outcomes can be integrated with financial information to understand performance

b. Claims Fraud Detection

Although no precise dollar amount can be determined, some authorities contend that insurance fraud constitutes a $100-billion-a-year problem. The United States Government Accountability Office (GAO) estimates that $1 out of every $7 spent on Medicare is lost to fraud. Some of the fraud examples are:

  • Billing for services, procedures, and/or supplies that were not provided.
  • Misrepresentation of what was provided; when it was provided; the condition or diagnosis; the charges involved; and/or the identity of the provider recipient.
  • Providing unnecessary services or ordering unnecessary tests
  • Billing separately for procedures that normally are covered by a single fee.
  • Charging more than once for the same service.
  • Upcoding: Charging for a more complex service than was performed. This usually involves billing for longer or more complex office visits
  • Miscoding: Using a code number that does not apply to the procedure.
  • Kickbacks: Receiving payment or other benefit for making a referral.

With Health Information Exchanges playing a pivotal role in real-time information sharing, payer organizations will have the information they need to proactively detect fraud using pattern analysis, graph analysis of cohort networks and social media insights.
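
One of the simplest patterns in the list above, charging more than once for the same service, can be checked with a few lines of Python. This is only a sketch: the claim fields, the sample values and the rule that "same member, provider, code and date" means a duplicate are assumptions, and real fraud detection needs far richer rules and data.

```python
from collections import Counter


def find_duplicate_claims(claims):
    """Return claim keys that occur more than once, with their counts."""
    keys = Counter(
        (c["member"], c["provider"], c["code"], c["date"]) for c in claims
    )
    return {key: n for key, n in keys.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical sample claims; "SVC-100" is a made-up service code.
    claims = [
        {"member": "m1", "provider": "p1", "code": "SVC-100", "date": "2013-04-01"},
        {"member": "m1", "provider": "p1", "code": "SVC-100", "date": "2013-04-01"},
        {"member": "m1", "provider": "p1", "code": "SVC-100", "date": "2013-04-08"},
        {"member": "m2", "provider": "p1", "code": "SVC-100", "date": "2013-04-01"},
    ]
    print(find_duplicate_claims(claims))
```

Flagged claims are candidates for review rather than proof of fraud, since legitimate repeat services also occur.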

c. Member Engagement

Like companies in any industry, payer organizations such as health insurance companies are battling to win members’ business. Companies are monitoring the behavior of members and prospects on their websites and social media.

d. Payer Sentiment Analysis 

Similar to provider sentiment analysis, members are sharing their experiences of insurance benefits and customer service through social channels – Facebook, Twitter, and other media. These experiences, shared through comments, Twitter feeds, blogs and surveys, can be mined for rich insights to improve the quality of services.

e. Call Center Analysis

Payer organizations capture information from call centers using call recording. These call records provide valuable information on:

  • staffing model – by demographic preferences, hours of services
  • member feedback using voice pattern and recognition
  • member experience using metrics – Average speed to answer, abandonment rate, dropped calls, unable to reach member
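
Two of the metrics named above can be computed from a simple call log, as in this Python sketch. The record layout and the exact definitions (average speed to answer as the mean wait of answered calls, abandonment rate as unanswered calls divided by all calls) are assumptions; organizations define these metrics in different ways.

```python
def call_center_metrics(calls):
    """calls: list of dicts with 'wait_seconds' (number) and 'answered' (bool)."""
    answered = [c for c in calls if c["answered"]]
    abandoned = len(calls) - len(answered)
    average_speed_to_answer = (
        sum(c["wait_seconds"] for c in answered) / len(answered) if answered else 0.0
    )
    abandonment_rate = abandoned / len(calls) if calls else 0.0
    return {
        "average_speed_to_answer": average_speed_to_answer,
        "abandonment_rate": abandonment_rate,
    }


if __name__ == "__main__":
    calls = [
        {"wait_seconds": 10, "answered": True},
        {"wait_seconds": 30, "answered": True},
        {"wait_seconds": 90, "answered": False},
        {"wait_seconds": 20, "answered": True},
    ]
    print(call_center_metrics(calls))
```

Grouping the same calculation by hour of day or by member demographic would feed the staffing model mentioned in the first bullet.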

Finally, a few of the use cases for the Care Management industry – Disease Management, Utilization Management and Behavioral Health Management – are:

a. Disease Identification and Risk Stratification

Care management companies constantly collect data from various sources – claims, prior authorizations, biometric screenings, health risk assessments and survey data. Disease identification and risk stratification is a key function that helps organizations with limited resources focus on the top 5-10% of members, the high-risk population that accounts for 60-80% of medical cost. Processing tens of years of historical information combined with real-time information from various sources adds huge complexity and processing challenges. Big Data can alleviate such challenges not only by making unstructured data accessible but also by providing robust and fast processing.
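
Once every member has a risk score, picking the top slice of the population is a small calculation. The Python sketch below assumes the scores and costs already exist as dictionaries; the function names and sample numbers are made up, and producing the scores themselves is the hard part that the statistical models handle.

```python
def top_risk_members(scores, percent=10):
    """Return member ids in the highest-scoring `percent` of the population."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    k = (len(scores) * percent + 99) // 100  # round up using integer math
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:k]


def cost_share(costs, members):
    """Fraction of total cost attributable to the given members."""
    total = sum(costs.values())
    return sum(costs[m] for m in members) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical scores and costs for ten members.
    scores = {f"m{i}": i for i in range(10)}
    costs = {f"m{i}": 25 for i in range(8)}
    costs.update({"m8": 200, "m9": 600})
    high_risk = top_risk_members(scores, percent=20)
    print(high_risk, cost_share(costs, high_risk))
```

Comparing the size of the selected group with its share of cost shows quickly whether the stratification is concentrating resources where the spending is.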

b. Member Sentiment Analysis 

With social media growing at a rapid pace, members are sharing their experiences with providers through social channels – Facebook, Twitter, and other media. These experiences, shared through comments, Twitter feeds, blogs and surveys, can be mined for rich insights about the quality of services.

c. Member care quality and program analysis

Naturally, with the growth of data and insight into new information comes the challenge of processing this voluminous and varied information to produce metrics and KPIs for member care quality and programs. Big Data provides the architecture, tools and techniques to process terabytes and petabytes of data and to provide deep health care analytics capabilities to stakeholders.

While these are just a few of the generic use cases in the Healthcare industry, there are many unique use cases specific to your line of business, organization and department. I will reiterate that assessing and prioritizing the business use cases for Big Data based on value is key to its success and will have a significant impact on your organization in the years to come. Think Big, start Small!

7 steps to Advanced Predictive Analytics!

Predictive analytics encompasses a variety of statistical techniques from modelling, machine learning and data mining that analyze current and historical facts to make predictions about future events. In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions.

Until late 2010, most enterprise business intelligence and analytics focused on structured and some semi-structured data (emails, logs and call records). The buzzwords were ‘data-driven’ business decisions or actions. With the era of big data and the evolution of technology, organizations are now venturing into unstructured data such as sensor data from RFID tags and online web content, and need better analytical capabilities. The new buzzwords rapidly gaining popularity from meetings to conferences are ‘advanced analytics’, ‘analytics-driven’ and ‘big insights’. With big data technology, the power of predictive analytics is getting a lot of coverage, and software vendors are touting the latest and greatest in technology and algorithms. Once seen as a boring, dull job for data geeks, the work of statisticians and data scientists is now in high and rising demand. A recent article in Harvard Business Review (HBR) from October 2012 is titled ‘Data Scientist: The Sexiest Job of the 21st Century’.

As noted in the HBR article, ‘Data are essential, but performance improvements and competitive advantage arise from analytics models that allow managers to predict and optimize outcomes. More important, the most effective approach to building a model rarely starts with the data; instead it originates with identifying the business opportunity and determining how the model can improve performance.’ According to research by Andrew McAfee and Erik Brynjolfsson, of MIT, companies that inject big data and analytics into their operations show productivity rates and profitability that are 5% to 6% higher than those of their peers.

Organizations often look for help getting started with their Advanced Predictive Analytics project. Very few production processes leverage the power of Predictive Analytics embedded in BPM or decision-making processes. In my recent experience at an Analytics Strategy and Assessment workshop for a client, the issue identified was not the tool or the architecture but how to get insights from the myriad data that exists. The questions posed were: a) Are we collecting and storing the right data? b) What insights can be generated from this data? What organizations want to know is not what kind of technology to buy first or what techniques and training they need, but what kind of problem to go looking for. What kind of problem will show the greatest return on an investment in predictive analytics? Where can they apply predictive analytics and get a clear and compelling “win”?

Based on my experience, here are the 7 key steps to get started with Advanced Predictive Analytics:

1. Define the Problem or Pain or Opportunity

2. Identify the key Metrics

3. Identify the Right Data that supports the metrics from step 2

4. Analyze and Enrich the data

5. Build models for Advanced Predictive Analysis

6. Test the model with a test subject or group

7. Embed and Implement the analytics as part of business process or application

Most organizations get lost in identifying or collecting data before documenting step 1. In the recent experience mentioned above, the organization was collecting and storing all possible data without defining what insights it wanted to generate to drive the business at the tactical and strategic levels. Operational decisions align well with predictive analytics models. Most operational and tactical decisions are made by front-line, field-level staff, from call centers to customer service to sales reps. Predictive analysis that is embedded into business applications as part of workflow processes can add tremendous value by providing excellent service to customers and significantly improving results from cross-sell or up-sell opportunities. Many predictive analytics tools support access to a wide range of data sources, including those typically branded “big data,” such as unstructured text or semi-structured Web logs and sensor data. The problem is that organizations are trying to apply these technologies to the wrong problem. Under pressure to prove ROI from investments in Big Data and analytics, organizations focus on large, wide-ranging problems rather than on operational or tactical problems that offer an opportunity to implement and prove the solution.

Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in the future in order to improve marketing effectiveness. This category also encompasses models that seek out subtle data patterns to answer questions about customer performance, such as fraud detection models. Predictive models often perform calculations during live transactions, for example, to evaluate the risk or opportunity of a given customer or transaction, in order to guide a decision. Big data and predictive analytics can be combined together in the operational environment. By focusing on operational decisions, we can put big data to work, using it to drive predictions that improve our ability to make good decisions at the operational level. With advancement in computing speed, individual agent modeling systems can simulate human behavior or reaction to given stimuli or scenarios. The new term for animating data specifically linked to an individual in a simulated environment is Avatar Analytics.

From my recent analysis of social media marketing for a specific brand, we found some interesting statistics: 23% of customers repurchase on the same day, and most repurchases happen within 5 days of the initial purchase. These are key insights for a marketing campaign. Healthcare and financial organizations have huge potential to leverage predictive analytics for fraud detection, cross-sell and up-sell, and interventions. In my experience at one of the leading healthcare companies, where I managed the BI and Analytics team, analytics played a significant role in how members/patients were stratified by risk scores using statistical models. Organizations are gaining momentum in leveraging big data for insights and advanced analytics using statistical and predictive modeling across many industry domains. Whether it is using analytics to predict customer behavior, set pricing strategy, optimize ad spending or manage risk, analytics is moving to the top of the management agenda.
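
For readers who want to compute repurchase statistics of this kind on their own data, here is a small Python sketch. It is not the analysis behind the figures above: the data layout, the function name and the choice of denominators (all customers for the same-day share, all repurchasing customers for the within-window share) are assumptions.

```python
from datetime import date


def repurchase_stats(purchases, window_days=5):
    """purchases: dict of customer -> list of purchase dates (any order)."""
    gaps = []  # days between first and second purchase, per repurchasing customer
    for dates in purchases.values():
        ordered = sorted(dates)
        if len(ordered) >= 2:
            gaps.append((ordered[1] - ordered[0]).days)
    customers = len(purchases)
    same_day = sum(1 for g in gaps if g == 0)
    within = sum(1 for g in gaps if g <= window_days)
    return {
        "same_day_share_of_customers": same_day / customers if customers else 0.0,
        "share_of_repurchases_within_window": within / len(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical purchase history for four customers.
    purchases = {
        "a": [date(2013, 1, 1), date(2013, 1, 1)],
        "b": [date(2013, 1, 1), date(2013, 1, 4)],
        "c": [date(2013, 1, 1), date(2013, 1, 20)],
        "d": [date(2013, 1, 1)],
    }
    print(repurchase_stats(purchases))
```

Stating the denominators explicitly matters, because "share of customers" and "share of repurchases" can differ a lot on the same data.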

Policy for Establishing New TBDI Chapters

1.      Establishment of Chapters

Any TBDI Global Member in good standing may submit an application to form a Chapter to TBDI.

If the application is approved, the Chapter will be granted a status of “In Formation”. During this time applicants recruit members, draft a set of By-laws for the Chapter, and complete other such requirements as set forth by TBDI procedures.

Upon receipt of the By-laws and other application requirements, TBDI will review the application and determine if a Charter is to be granted. If an application is rejected, applicants may appeal to the CEO and President of TBDI by submitting a written request (email is acceptable).

2.      Purpose of Chapters

Chapters of The Big Data Institute are expected to serve the interests of a segment of the global big data analytics community in a manner consistent with the principles of TBDI. Through a presence local to its community of interest, a Chapter focuses on issues and developments important to its community. A Chapter recognizes, honours, and uses the culture, customs, and language of its community. Every Chapter shall have an explicit statement of purpose.

3.      Scope of Chapters

Chapters may be established on a non-exclusive basis to serve the needs of any specific, cohesive community of interest. Multiple Chapters serving overlapping communities are permitted.

4.      Funding of Chapters

Chapters are expected to establish their own source of funding. Permitted sources include the following:

    1. Chapters may establish a fee-based membership model, charging individuals, organizations, or both to participate in their activities. The fees may be structured according to the activities or paid according to a regular renewal schedule.
    2. Chapters may solicit funding or resources from local organizations or other sponsors to support their activities.


5.      Public Positions and Statements

Specific officials of Chapters, acting on behalf of their Chapter, may make public statements and establish public positions as long as they meet the following requirements:

    1. They must advance the purposes of the TBDI, which includes advancing the purposes of a Chapter in good standing.
    2. They must not be contrary to any position of the TBDI.
    3. They must be prepared and presented in a professional manner.
    4. They must be clearly and unambiguously identified as originating from the Chapter of the TBDI.
    5. It should be unlikely they will give rise to any significant legal or juridical liability.

Where there is any question or doubt regarding the appropriateness of a public position or statement, the Chapter is expected to consult with the TBDI at least one week prior to its release or announcement. Chapters must notify the TBDI no later than the same day of the release of any public position or statement.

6.      Members

All individuals and organizations falling within the defined scope of the Chapter shall be eligible for membership without discrimination.

All members of a Chapter shall also be members of TBDI. Membership is not necessary, however, for participation in the activities of the Chapters.

The Chapter shall have at least 10 individual members none of whom are listed as one of the 10 members used by any currently active Chapter to get its Charter.

7.      Liabilities

The TBDI shall not be liable for any act or omission or incurred liability of any kind of any Chapter.

8.      Organization

Chapters will be encouraged but not required to constitute themselves as not-for-profit corporate entities.

Chapters must have a set of By-Laws.

Chapters must have a fixed postal address.

Chapters must have a defined set of leadership roles for which they conduct regular elections to select individuals from their membership to serve. Such roles may be appointed for at most 1 year when the Chapter first receives its Charter. Such roles may have whatever title is customary to the segment of the community being served.

The roles must include at least the following:

    1. One person designated as the most senior of all the leaders. This role frequently has the title of Chairman or President.
    2. One person to be responsible for the administrative duties. This role frequently has the title of Secretary.
    3. If the Chapter has financial resources to manage there must be one person responsible for the financial duties. This role frequently has the title of Treasurer.

Chapters must meet any requirements set forth by TBDI procedures by action of the TBDI Membership Team, including but not limited to at most an annual review of its Charter and the annual submission of a financial report.

9.      Activities

A Chapter may undertake any activity reasonably related to and in furtherance of its purpose and the purpose and mission of TBDI.

Facebook and Twitter Social Media Analytics – From White House to Boardroom!

Facebook and Twitter have become an integral part of most of our lives. Facebook users spent more than 10.5 billion minutes per day on Facebook during January 2012, excluding mobile devices. This comes out to an average of 12 minutes and 26 seconds per user. Facebook crossed 1 billion users in October 2012, while Twitter surpassed 500M users in July 2012, a number that has quadrupled in the last 2 years.
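The per-user figure above is simply total daily minutes divided by the number of users. As a rough sanity check, the two quoted numbers can be turned around to recover the user base they imply. The sketch below uses only those two figures; the function name is illustrative, and the result is an implied value, not a reported one.

```python
def implied_users(total_minutes_per_day: float, minutes: int, seconds: int) -> float:
    """Return the user count implied by a daily total and a per-user average."""
    per_user_minutes = minutes + seconds / 60
    return total_minutes_per_day / per_user_minutes


# 10.5 billion minutes per day, 12 minutes 26 seconds per user (both quoted above).
users = implied_users(10.5e9, 12, 26)
print(f"Implied user base: about {users / 1e6:.0f} million")
```

The two figures imply a base of roughly 845 million users in January 2012, well below the 1 billion mark reached later that year.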

Even during the November 2012 United States Presidential election, social media was used as a communication tool to reach and connect with the masses. As the race tightened in the closing weeks of this year’s campaign, President Obama maintained a substantial lead in both Facebook likes and Twitter followers over Governor Romney. By the end of the campaign, Obama had 22.7 million followers and 32.2 million likes, compared to Romney’s 1.8 million followers and 12.1 million likes. Twitter said it recorded more than 31 million election-related posts on Tuesday during election week, easily making the vote the most-tweeted political event in the social network’s six-year history. At one point, as election results were unfolding, users tweeted at a rate of 327,452 tweets per minute.

While consumers may think of social media sites like Facebook, Twitter and Foursquare as places to post musings and interact with friends, companies like Wal-Mart and Samuel Adams are turning them into extensions of their market research departments. Many companies are beginning to explore how to use the enormous amount of information available over social media to their advantage. For example, Wal-Mart’s social media unit, now called @WalmartLabs, looks at Twitter posts, public Facebook posts and search terms, among other cues, to help Wal-Mart refine what it sells. The enormous volume of data in Facebook and Twitter is constantly mined and has become part of Supply Chain (Inventory Management and Distribution) decision-making processes.

Companies using data from social media said the ability to see what consumers do, want and are talking about on such a big scale, without consumers necessarily knowing the companies are listening in, was unprecedented. According to MerchantCircle, over 70 percent of local businesses use Facebook to market themselves, up from 50 percent one year ago. Facebook and Google are the most widely used marketing methods amongst local merchants, with 37 percent rating Facebook as one of their most effective tools, almost tied with Google search (40 percent). Twitter has also grown in popularity over the past year, with nearly 40 percent of local merchants using the microblogging platform to build awareness and community around their products and services.

The University of Massachusetts Dartmouth recently released its annual survey on the use of social media in “Fortune 500” (F500) companies; results indicate a spike in F500 implementation and activity on social platforms like Facebook, Twitter, and blog postings. According to the research, F500 companies are showing a renewed interest in how social media can be used for engagement, hiring, outreach and corporate advancement. Past survey results show Fortune 500 companies have typically lagged behind Inc. 500 companies in the realm of social media. However, the 2012 report shows, within the past year, F500 companies have increased their Facebook page usage by 8 percent, Twitter usage by 11 percent, and blog postings by 5 percent. At this time last year, 31 percent of F500 companies reported no activity on Twitter or Facebook; according to the survey, that number dropped to 23 percent in 2012.

To complete the research, UMass surveyed the F500 companies, which represent 71 different industries. Of these industries, 54 reported using a blog and posting at least once every 30 days. Eight of the top 10 companies surveyed are not part of the 66 percent of F500 companies that reported using a Facebook page. In the 2011 survey six of the top 10 F500 had an inactive Twitter account, but this year’s survey revealed that all six of those companies are now actively using Twitter.

Social media advertising is only going to grow over the next few years, according to multiple studies. Over the next five years, social media ad spend is expected to reach 19.5 percent of the total marketing budget within organizations, up from 7.4 percent today. According to BIA/Kelsey, social media advertising spending is expected to rise to $9.8B in 2016 from $3.8B last year.
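To put the BIA/Kelsey forecast in annual terms, the two endpoints can be converted into a compound annual growth rate. This is a minimal sketch: it assumes "last year" means 2011 (the article dates from 2012), which gives a five-year span, and the function name is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values, as a fraction."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: the $3.8B base year is 2011, so the span to 2016 is five years.
rate = cagr(3.8, 9.8, 2016 - 2011)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption the forecast works out to growth of about 21 percent a year.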

Other Findings:

  • According to Marin Software, click-through rates (CTRs) for social ads have jumped 50 percent year-over-year.
  • Cost-per-click (CPC) for these ads has risen by an even more dramatic 86 percent year-over-year.
  • Looking at select countries across the globe, the report reveals Facebook ad CTRs to be highest in India (0.13%), with the US and Eurozone countries following (both at 0.06%).
  • Facebook ad CPCs are 50 percent higher in the UK than in the US ($0.75 vs. $0.50).
  • Cost-per-thousand impressions (CPM) is highest in the US ($0.32).
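The three ad metrics in this list are tied together by simple ratios, which the sketch below spells out. The campaign numbers are hypothetical, chosen only so that the CTR and CPC match the US figures quoted above; they are not taken from the Marin Software report.

```python
def ctr(clicks: int, impressions: int) -> float:
    """Click-through rate: share of impressions that produced a click."""
    return clicks / impressions


def cpc(spend: float, clicks: int) -> float:
    """Cost per click: spend divided by clicks."""
    return spend / clicks


def cpm(spend: float, impressions: int) -> float:
    """Cost per thousand impressions."""
    return spend / impressions * 1000


# Hypothetical campaign matching the US CTR (0.06%) and CPC ($0.50) quoted above.
impressions, clicks, spend = 100_000, 60, 30.00

print(f"CTR: {ctr(clicks, impressions):.2%}")
print(f"CPC: ${cpc(spend, clicks):.2f}")
print(f"CPM: ${cpm(spend, impressions):.2f}")

# CPM also follows directly from the other two: CPM = CPC x CTR x 1000.
derived_cpm = cpc(spend, clicks) * ctr(clicks, impressions) * 1000
```

With a 0.06% CTR and a $0.50 CPC, the identity gives a CPM of about $0.30, close to the $0.32 quoted for the US. The small gap may simply reflect averages taken over different sets of ads.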

At a recent Text Analytics Summit West that I attended in San Francisco, one of the important themes was how social media is being used as a tool by companies like eBay, Porter Novelli, American Airlines and many others. eBay and a few other companies in Silicon Valley have set up Social Listening centers that act as command centers, listening in real time to social conversations about their own brands and their competitors’ brands.

As organizations have started using social media as new advertising and marketing channels to connect with customers, it has become critically important for Sales, Advertising & Marketing and Customer Service departments to monitor interactions and communication in real time to avoid missed opportunities and to gain competitive information. It is critical to know who your brand influencers are and who is working for competitors.

Social media analytics solutions can help your business:

  • Capture consumer data from social media to understand attitudes, opinions, trends and manage online reputation
  • Predict customer behavior and improve customer satisfaction by recommending next best actions
  • Create customized campaigns and promotions that resonate with social media participants
  • Identify the primary influencers within specific social network channels

Here are some of the key metrics for Facebook and Twitter that can be generated to track fan activity, product and brand feedback, competitor analysis and more.

Facebook Insights

  • Comment Keyword Analysis – Top 5/10/25
  • Consumption by Post Type – Photo Views, Video, Links
  • Consumption Metrics – Total Photo views vs. Video Plays vs. Clicks – Spread out by Day / Last 30 days
  • Engaged Users as % of Total Reach by Day / Last 30 days
  • Engagement (#) by Post type (Question, Poll, Video, Photo, Link, Status)
  • Engagement Metrics (Total and Average) – Stories vs. Likes vs. Comments vs. Shares vs. Clicks
  • Fan Page Demographic Profile – Gender, Age Group, Interests and Activities, Country
  • Fan Page Engagement and Reach over time – PTAT – Comments vs. Likes by Day / Last 30 days
  • Fan Page Activity over time – Wall Posts vs. Admin Wall Posts vs. Comments and Likes
  • Fan Sources – Shares, Page Profiles, Search, Ads, Recommendation, Mobile
  • Impression Metrics – Organic Impression vs. Viral Impression vs. Paid Impression – Spread out by Day / Last 30 days
  • Impressions & Reach by Post Type (Question, Poll, Video, Photo, Link, Status)
  • Key Metrics – Total Reach vs. Total Consumers vs. PTAT
  • Negative Feedback – Hide Clicks vs. Report as Spam vs. Hide All Clicks vs. Unlike Page
  • Ad Campaign – Admin Wall Posts by Post Type (Question, Poll, Video, Photo, Link, Status) over time
  • Ad Campaign – Admin Wall Posts vs. Fan Likes / Shares / Comments  – Total and Average over time
  • Page Likes and Growth over time – Total and Net Change by weeks/Months
  • Page Likes and Unlikes over time – Total and Net Change by weeks/Months
  • Post Type – Average Engagement vs. Total Engagement over time
  • Potential Reach vs. Total Reach
  • PTAT, Impressions, Comments, Engaged, Reach by Demographic Profiles – Gender, Age Group, Interests and Activities, Country
  • Reach Metrics – Organic Reach vs. Viral Reach vs. Paid Reach – Spread out by Day / Last 30 days
  • Top 10/25 Referrer sites and search key words
  • Total Impression and Total Page Views by Day / last 30 days
  • Total Reach vs. Engaged Users by Day / Last 30 days
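Several of the metrics above are simple ratios of two counts. As one example, "Engaged Users as % of Total Reach by Day" can be sketched as follows. The daily figures and field names are hypothetical and do not reflect the format of any Facebook export.

```python
# Hypothetical daily Page figures; the values are illustrative only.
daily_stats = [
    {"day": "Mon", "total_reach": 12_000, "engaged_users": 540},
    {"day": "Tue", "total_reach": 15_500, "engaged_users": 930},
    {"day": "Wed", "total_reach": 0, "engaged_users": 0},  # nothing reached anyone
]


def engaged_pct(engaged_users: int, total_reach: int) -> float:
    """Engaged users as a percentage of total reach (0.0 when reach is zero)."""
    if total_reach == 0:
        return 0.0
    return 100 * engaged_users / total_reach


for row in daily_stats:
    pct = engaged_pct(row["engaged_users"], row["total_reach"])
    print(f"{row['day']}: {pct:.1f}% of reach engaged")
```

Guarding against a zero denominator matters in practice, since a Page can have days on which nothing was reached.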

Facebook Competition Analysis

All of the metrics below can be broken out by leading competitor brands.

  • Total Fan vs. Engaged Fans
  • Relative Share of Engagement
  • Relative Share of Fans
  • Fan Page Engagement – Comparison over time (days/weeks/months)
  • Fan Page Engagement Breakdown – Fan Likes vs. Fan Comments vs. Fan Posts vs. Fan Unlikes vs. Fan Shares
  • Average Response and Comments by Post
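"Relative Share of Fans" and "Relative Share of Engagement" both express one brand's count as a fraction of the combined total across the brands being compared. A minimal sketch, using made-up brand names and counts:

```python
# Hypothetical fan and engagement counts for three brands.
brands = {
    "Our Brand": {"fans": 50_000, "engagements": 2_400},
    "Competitor A": {"fans": 120_000, "engagements": 3_000},
    "Competitor B": {"fans": 30_000, "engagements": 600},
}


def relative_share(data: dict, metric: str) -> dict:
    """Each brand's share of the combined total for one metric, as a fraction."""
    total = sum(values[metric] for values in data.values())
    return {name: values[metric] / total for name, values in data.items()}


fan_share = relative_share(brands, "fans")
engagement_share = relative_share(brands, "engagements")

for name in brands:
    print(f"{name}: {fan_share[name]:.0%} of fans, "
          f"{engagement_share[name]:.0%} of engagement")
```

Comparing the two shares side by side is the point of the exercise: in this made-up data, "Our Brand" holds a quarter of the fans but 40 percent of the engagement, which suggests a smaller but more active audience.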

Twitter Insights

  • Total Followers vs. Legitimate Followers – Spread over time
  • Followers by Demographic profile – Age, Gender, Time Zone, Country, Profession
  • Users By # of Followers
  • Users By Total Tweets
  • Followers by Date of Last Tweet
  • Followers by # of Followers – Total Potential Followers
  • Top 10/25 Retweeted Tweets
  • Effective Reach vs. Potential Reach over time period
  • Number of clicks on links
  • Website Traffic – Clicks/Traffic back to your site by top referral sources
  • Twitter Counter – Number of Followers, Number of Tweets, Number of Retweets, Mentions
  • Number of Followers – Total and Growth over time period
  • Influential Followers Count by time period and demographics
  • Conversion Count (Sign-ups and Purchases) by time period and demographics
  • Impressions
  • Tweet Density
  • Velocity – Tweets Retweeted, Follower Retweet %
  • Inbound Messages vs. Outbound Messages spread by time and other dimensions

Note: Most of the above metrics can be sliced by different dimensions – Age/Gender/User/Time zone/Market/Day/Time, etc.
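Slicing a metric by a dimension amounts to grouping the underlying records by that dimension and aggregating a measure within each group. The sketch below shows the idea in plain Python. The follower records, field names and values are hypothetical, not a real Twitter data format.

```python
from collections import defaultdict

# Hypothetical per-follower records; field names and values are illustrative.
followers = [
    {"gender": "F", "time_zone": "Pacific", "retweets": 3},
    {"gender": "M", "time_zone": "Eastern", "retweets": 1},
    {"gender": "F", "time_zone": "Eastern", "retweets": 2},
    {"gender": "M", "time_zone": "Pacific", "retweets": 0},
]


def slice_metric(records, dimension, measure):
    """Sum one measure across records, grouped by the chosen dimension."""
    totals = defaultdict(int)
    for record in records:
        totals[record[dimension]] += record[measure]
    return dict(totals)


print(slice_metric(followers, "gender", "retweets"))
print(slice_metric(followers, "time_zone", "retweets"))
```

The same function serves any dimension in the note above, because the dimension is passed in as a field name rather than hard-coded.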

While the list of metrics is evolving with new features and products added by Facebook, Twitter, Tumblr and other social networks, the above list gives us a good starting point for getting control over advertising and budget. Social media analytics offers an entirely new paradigm for measuring interactive marketing by integrating and analyzing social data and enabling organizations to act on the intelligence gained: expanding their reach, increasing retention, and ultimately driving more revenue.