Hi everyone!

This blog post is about understanding big data. We will look at what exactly big data is, what its types and characteristics are, and what the common use cases of big data look like.


So, What Is Big Data?

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.

During the 1990s, when IT organizations were evolving, it was mainly the employees of those organizations who generated data.


Later, in the 2000s, when social networking sites and e-commerce websites came into existence, users too started generating data.

Now, after the 2010s, thanks to emerging smartphone technologies and motion-sensor techniques, even devices have started generating data.

How much data is generated per minute?

  • Facebook users share nearly 2.5 million pieces of content.
  • Twitter users tweet nearly 300,000 times.
  • Instagram users post nearly 220,000 new photos.
  • YouTube users upload 72 hours of new video content.
  • Apple users download nearly 50,000 apps.
  • Email users send over 200 million messages.
  • Amazon generates over $80,000 in online sales.

Data is generated from almost everywhere!

Data is generated from healthcare, multi-channel sales, finance, log analysis, homeland security, traffic control, telecom, search quality, manufacturing, trading analytics, fraud and risk, and retail.



Data generated by the Large Hadron Collider

Types of Data:

[1] Structured Data

[2] Unstructured Data

[3] Semi-Structured Data

[1] Structured Data:

  • Your current data warehouse contains structured data and only structured data.
  • It’s structured because, when you placed it in your relational database system, a structure was enforced on it: we know where it is, what it means, and how it relates to the other pieces of data in there.
  • It may be text (a person’s name) or numerical (their age) but we know that the age value goes with a specific person, hence structured.
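The point about structure enforcement can be sketched with Python's built-in sqlite3 module; the `person` table and its rows here are a hypothetical example, not from the original post:

```python
import sqlite3

# In-memory relational database; the schema below is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT NOT NULL, age INTEGER NOT NULL)")
conn.execute("INSERT INTO person (name, age) VALUES (?, ?)", ("Alice", 34))

# Because the schema is enforced, we always know which age belongs
# to which person: that is what makes the data "structured".
row = conn.execute("SELECT name, age FROM person").fetchone()
print(row)  # ('Alice', 34)
```

Trying to insert a row that violates the schema (say, a missing age) would raise an error instead of silently storing malformed data, which is exactly the enforcement described above.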

[2] Unstructured Data:

  • Essentially everything else that has not been specifically structured is considered unstructured.
  • The list of truly unstructured data includes free text such as documents produced in your company, images and videos, audio files, and some types of social media.
  • If the object to be stored carries no tags (metadata about the data) and has no established schema, ontology, glossary, or consistent organization it is unstructured.
  • However, in the same category as unstructured data there are many types of data that do have at least some organization.

[3] Semi-Structured Data:

  • The line between unstructured data and semi-structured is a little fuzzy.
  • If the data has any organizational structure (a known schema) or carries tags (like XML, the extensible markup language used for documents on the web), then it is somewhat easier to organize and analyze, and that accessibility for analysis can make it more valuable.

Examples: text (XML, email), web server logs and access patterns, sensor data
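A quick sketch of what "tags but no enforced schema" means in practice, using Python's standard-library XML parser; the `order` document below is a hypothetical record invented for illustration:

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured record: the tags label the fields,
# but no database schema forces every record to look like this one.
doc = """
<order id="1042">
  <customer>Bob</customer>
  <note>Deliver after 5pm</note>
</order>
"""

root = ET.fromstring(doc)
print(root.get("id"))               # 1042
print(root.find("customer").text)   # Bob
```

Because the tags are self-describing, tools can still locate fields like `customer` without a predefined table layout, which is what places data like this between structured and unstructured.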

Characterization of Big Data:


The 3 V’s of Big Data: Volume, Velocity, and Variety.


Some call it the 4 V’s, adding Veracity as a fourth characteristic.

Applications and Use Cases of Big Data:


Popular Use Cases:

[1] A 360-degree view of the customer:

  • This use is most popular, according to Gallivan. Online retailers want to find out what shoppers are doing on their sites — what pages they visit, where they linger, how long they stay, and when they leave.
  • “That’s all unstructured clickstream data,” said Gallivan. “Pentaho takes that and blends it with transaction data, which is very structured data that sits in our customers’ ERP [business management] system that says what the customers actually bought.”

[2] Internet of Things :

  • The second most popular use case involves IoT-connected devices managed by hardware, sensor, and information security companies. “These devices are sitting in their customers’ environment, and they phone home with information about the use, health, or security of the device,” said Gallivan.
  • Storage manufacturer NetApp, for instance, uses Pentaho software to collect and organize “tens of millions of messages a week” that arrive from NetApp devices deployed at its customers’ sites. This unstructured machine data is then structured, put into Hadoop, and then pulled out for analysis by NetApp.

[3] Data warehouse optimization :

  • This is an “IT-efficiency play,” Gallivan said. A large company, hoping to boost the efficiency of its enterprise data warehouse, will look for unstructured or “active” archive data that might be stored more cost effectively on a Hadoop platform. “We help customers determine what data is better suited for a lower-cost computing platform.”

[4] Big data service refinery :

  • This means using big-data technologies to break down silos across data stores and sources to increase corporate efficiency.
  • A large global financial institution, for instance, wanted to move from next-day to same-day balance reporting for its corporate banking customers. It brought in Pentaho to take data from multiple sources, process and store it in Hadoop, and then pull it out again. This allowed the bank’s marketing department to examine the data “more on an intra-day than a longer-frequency basis,” Gallivan told us.
  • “It was about driving an efficiency gain that they couldn’t get with their existing relational data infrastructure. They needed big-data technologies to collect this information and change the business process.”

[5] Information security :

  • This last use case involves large enterprises with sophisticated information security architectures, as well as security vendors looking for more efficient ways to store petabytes of event or machine data.
  • In the past, these companies would store this information in relational databases. “These traditional systems weren’t scaling, both from a performance and cost standpoint,” said Gallivan, adding that Hadoop is a better option for storing machine data.

Traditional Databases :

  • The relational database management system (RDBMS) has long been the default solution for all database needs. Oracle, IBM, and Microsoft are the leading RDBMS players.
  • An RDBMS uses structured query language (SQL) to define, query, and update the database.
  • However, the volume and velocity of business data has changed dramatically in the last couple of years. It’s skyrocketing every day.
Limitations of RDBMS in supporting “big data”:
  • First, data sizes have grown tremendously, into the range of petabytes (one petabyte = 1,024 terabytes). An RDBMS finds it challenging to handle such huge data volumes.
  • To address this, more central processing units (CPUs) or more memory were added to the database management system to scale up vertically, but vertical scaling becomes expensive and eventually hits hard limits.
  • Second, the majority of the data comes in a semi-structured or unstructured format from social media, audio, video, texts, and emails.
  • The second problem, unstructured data, is outside the purview of an RDBMS, because relational databases simply can’t categorize unstructured data.
  • They’re designed and structured to accommodate structured data such as weblog, sensor, and financial data.
  • Also, “big data” is generated at very high velocity. An RDBMS struggles with high velocity because it’s designed for steady data retention rather than rapid growth.
  • Even if an RDBMS is used to handle and store “big data,” it turns out to be very expensive.
  • As a result, the inability of relational databases to handle “big data” led to the emergence of new technologies.
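The schema-rigidity problem described above can be illustrated with a toy Python sketch (the `tweet` table and the record below are invented for illustration): a relational table keeps only the columns it was designed for, while a schemaless store keeps whatever arrives.

```python
import json
import sqlite3

# A rigid relational schema, designed before the data arrived.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (user TEXT, text TEXT)")

# A semi-structured record carrying a field the schema never anticipated.
record = {"user": "alice", "text": "hello", "geo": {"lat": 51.5, "lon": -0.1}}

# The relational table can only store the columns it was designed for...
conn.execute("INSERT INTO tweet VALUES (?, ?)", (record["user"], record["text"]))

# ...while a schemaless store (here, just a JSON line) keeps everything,
# which is the flexibility that Hadoop and NoSQL-style systems exploit.
stored = json.dumps(record)
print(json.loads(stored)["geo"]["lat"])  # 51.5
```

This is only a sketch of the idea, not how Hadoop itself stores data, but it shows why fixed relational schemas struggle once most incoming data is semi-structured or unstructured.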