Introduction to Big Data
There is no place where Big Data
does not exist! Curiosity about Big Data has been soaring over
the past few years. Let me tell you some mind-boggling facts! Forbes
reports that every minute, users watch 4.15 million YouTube videos, send 456,000 tweets on Twitter, post 46,740 photos on Instagram, and post 510,000 comments and update 293,000 statuses on Facebook! Just
imagine the huge amount of data produced by such activity.
This constant creation of data through social media, business
applications, telecom, and various other domains is leading to the
formation of Big Data.
To explain what Big Data is, I will cover the following topics:
- Evolution of Big Data
- Big Data Defined
- Characteristics of Big Data
- Big Data Analytics
- Industrial Applications of Big Data
- Scope of Big Data
Evolution of Big Data
Before exploring any further, let me begin by giving some insight into why this technology has gained so much importance.
When was the last time you remember using a floppy disk or a CD to store
your data? Let me guess: you had to go way back to the early 2000s,
right? Manual paper records, files, floppies, and discs have now
become obsolete. The reason is the exponential growth of data.
People began storing their data in relational database systems, but with
the hunger for new inventions, technologies, and applications with quick
response times, and with the introduction of the internet, even that is
insufficient now. This generation of continuous and massive data can be
referred to as Big Data. There are a few other factors that characterize
Big Data, which I will explain later in this blog.
Forbes reports that 2.5 quintillion bytes of data are created each day
at our current pace, and that pace is only accelerating. The Internet of
Things (IoT) is one technology that plays a major role in this
acceleration. 90% of all data today was generated in the last two years.
What is Big Data?
So before I explain what Big Data is, let me also tell you what it is not!
The most common myth is that Big Data is just about the
size or volume of data. But it is not only about the "big"
amounts of data being collected. Big Data refers
to large amounts of data that pour in from various
sources in different formats. Huge volumes of data
were stored in databases previously too, but because of the varied nature
of this data, traditional relational database systems are incapable
of handling it. Big Data is much more than a collection of
datasets with different formats; it is an important asset that can be
used to obtain innumerable benefits.
The three different formats of big data are:
- Structured: Organised data format with a fixed schema. Ex: RDBMS
- Semi-Structured: Partially organised data which does not have a fixed format. Ex: XML, JSON
- Unstructured: Unorganised data with an unknown schema. Ex: Audio, video files etc.
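To make the distinction concrete, here is a minimal Python sketch (the records and field names are invented for illustration) showing why semi-structured data needs more defensive handling than a fixed-schema table:

```python
import json

# Structured: every row follows one fixed schema, as in an RDBMS table.
structured_rows = [
    (1, "Alice", 34),   # (id, name, age) -- same columns for every row
    (2, "Bob", 29),
]

# Semi-structured: JSON records are self-describing but need not share a schema.
semi_structured = [
    '{"id": 1, "name": "Alice", "age": 34}',
    '{"id": 2, "name": "Bob", "city": "Pune"}',  # no "age" field at all
]

for line in semi_structured:
    record = json.loads(line)
    # Missing fields must be handled explicitly -- there is no fixed schema.
    print(record["name"], record.get("age", "unknown"))
```

Unstructured data (audio and video files) goes a step further: there is no field structure to parse at all, which is exactly why traditional relational systems struggle with it.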
Characteristics of Big Data
Following are the characteristics, often described as the "V's" of Big Data:
- Volume: The sheer scale of data being generated.
- Velocity: The speed at which data arrives and must be processed.
- Variety: The different formats of data (structured, semi-structured, unstructured).
- Veracity: The uncertainty and inconsistency present in the data.
- Value: The usefulness that can be extracted from the data.
Big Data Analytics
Now that I have told you what Big Data is and how it is being generated
exponentially, let me present a very interesting example of how Starbucks, one of the leading coffeehouse chains, is making use of Big Data.
I came across an article by Forbes reporting how Starbucks
used this technology to analyse its customers' preferences to
enhance and personalize their experience. They analysed
their members' coffee-buying habits, from their preferred drinks to
what time of day they usually order. So even when people visit a
"new" Starbucks location, that store's point-of-sale system can
identify the customer through their smartphone and give the barista
their preferred order. In addition, based on ordering preferences, the
app will suggest new products that customers might be interested in
trying. This, my friends, is what we call Big Data Analytics.
Big Data Applications
These are some of the domains that Big Data applications have revolutionized:
- Entertainment: Netflix and Amazon use it to make shows and movie recommendations to their users.
- Insurance: Insurers use this technology to predict illness and accidents, and to price their products accordingly.
- Driver-less Cars:
Google’s driver-less cars collect about one gigabyte of data per
second. These experiments require more and more data for their
successful execution.
- Education: Institutions are opting
for big-data-powered technology as a learning tool instead of
traditional lecture methods; it enhances students' learning and
helps teachers track their performance better.
- Automobile:
Rolls-Royce has embraced this technology by fitting hundreds of sensors
into its engines and propulsion systems, which record every tiny detail
about their operation. Changes in the data are reported in real time to
engineers, who decide the best course of action, such as scheduling
maintenance or dispatching engineering teams should the problem require
it.
- Government: A very interesting
use case is in politics, where data is analysed for patterns and used to
influence election results. Cambridge Analytica Ltd. is one such
organisation, driven entirely by data to change audience behaviour,
and it played a major role in electoral processes.
Scope of Big Data
- Numerous job opportunities:
Career opportunities in the field of Big Data include
Big Data Analyst, Big Data Engineer, Big Data Solution Architect, etc.
According to IBM, 59% of all Data Science and Analytics (DSA) job demand
is in Finance and Insurance, Professional Services, and IT.
- Rising demand for analytics professionals: An
article by Forbes reveals that "IBM predicts demand for Data Scientists
will soar by 28%". By 2020, the number of jobs for all US data
professionals will increase by 364,000 openings to 2,720,000, according
to IBM.
- Salary aspects: Forbes
reported that employers are willing to pay a premium of $8,736 above
median bachelor's and graduate-level salaries, with successful
applicants earning a starting salary of $80,265.
- Adoption of Big Data analytics: Immense growth in the usage of big data analysis across the world.
What is Hadoop?
Hadoop
is a framework that allows you to first store Big Data in a distributed
environment, so that you can process it in parallel. There are basically two core components in Hadoop:

The first one is HDFS (Hadoop Distributed File System) for storage,
which allows you to store data of
various formats across a cluster. The second one is YARN, for resource management in Hadoop; it allows parallel processing over the data stored across HDFS.
HDFS
HDFS
creates an abstraction; let me simplify it for you. Similar to
virtualization, you can see HDFS logically as a single unit for storing
Big Data, but you are actually storing your data across multiple nodes
in a distributed fashion. HDFS follows a master-slave architecture: a
NameNode (master) holds the metadata, while DataNodes (slaves) store the
actual blocks.
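As a rough illustration of that abstraction (the node names and the round-robin placement below are invented for the sketch; real HDFS placement is rack-aware), the master's job amounts to mapping each logical file path to the physical locations of its blocks:

```python
from itertools import cycle

# Toy "NameNode": a namespace mapping file paths to (block_id, datanode) pairs.
datanodes = ["dn1", "dn2", "dn3"]
placement = cycle(datanodes)   # simplistic round-robin; real HDFS is rack-aware
namespace = {}

def put(path, n_blocks):
    """Register a file: its blocks are spread across different DataNodes."""
    namespace[path] = [(f"{path}#blk{i}", next(placement)) for i in range(n_blocks)]

put("/logs/2024.txt", 4)
# The client sees one logical file; the blocks actually live on several nodes.
print(namespace["/logs/2024.txt"])
```

The client never needs to know which node holds which block; it asks the master, which is exactly the single-logical-unit illusion described above.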

Hadoop-as-a-Solution
Let’s understand how Hadoop provided the solution to the Big Data problems that we just discussed.
The first problem is storing Big Data.
HDFS
provides a distributed way to store Big Data. Your data is stored in
blocks across the DataNodes, and you can specify the block size.
For example, if you have 512 MB of data and have configured HDFS with a
block size of 128 MB, HDFS will
divide the data into 4 blocks (512/128 = 4) and store them across different
DataNodes; it will also replicate the data blocks on different
DataNodes. And since we are using commodity hardware, storage is not
a challenge.
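That arithmetic generalizes with a ceiling division, since the last block may be only partially filled. A minimal sketch (3 is the default HDFS replication factor):

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=128):
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return math.ceil(file_size_mb / block_size_mb)

blocks = hdfs_block_count(512)   # 512 / 128 = 4 full blocks
replicas = blocks * 3            # default replication factor of 3
print(blocks, replicas)          # 4 blocks, 12 block copies across DataNodes
```

Note that a 500 MB file also occupies 4 blocks: three full 128 MB blocks plus one partial block.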
HDFS also solves the scaling problem. It focuses on horizontal scaling
instead of vertical scaling: you can always add extra DataNodes
to an HDFS cluster as and when required, instead of scaling up the
resources of your existing DataNodes. To summarize: to store 1 TB of data,
you don't need a single 1 TB system. You can instead spread it
across multiple 128 GB systems, or even smaller ones.
The next problem was storing the variety of data.
With HDFS you can store all kinds of data, whether structured, semi-structured, or unstructured, since HDFS performs no schema validation before data is written.
HDFS also follows a write-once, read-many model: you
write the data once and read it multiple times to
find insights.
The third challenge was accessing and processing the data faster.
Yes,
this is one of the major challenges with Big Data. To solve
it, we move the processing to the data, not the data to the processing.
What does that mean? Instead of moving the data to the master node and
processing it there, in MapReduce the processing logic is sent to the
various slave nodes, and the data is processed in parallel across those
nodes. The processed results are then sent to the master node, where they
are merged and the response is sent back to the client.
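The flow above can be sketched as a toy word count (the partitions and their contents are invented; on a real cluster the map step would run in parallel on the nodes already holding each partition):

```python
from collections import Counter
from functools import reduce

# Each partition represents the data already resident on one slave node.
partitions = [
    ["big data is big", "hadoop stores big data"],      # DataNode 1
    ["yarn schedules tasks", "hadoop processes data"],  # DataNode 2
]

def map_partition(lines):
    """Map step: runs where the data lives, counting only its local words."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Map phase (parallel on a real cluster; sequential here for clarity).
partial_counts = [map_partition(p) for p in partitions]

# Reduce phase: the master merges partial results before replying to the client.
merged = reduce(lambda a, b: a + b, partial_counts)
print(merged["big"], merged["data"])  # 3 3
```

Only the small per-partition count tables travel to the master; the raw data never leaves the nodes it is stored on, which is the whole point of moving processing to the data.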
In
the YARN architecture, we have a ResourceManager and NodeManagers. The
ResourceManager may or may not be configured on the same machine as the
NameNode, but NodeManagers should be configured on the same machines where the DataNodes are present.
YARN
YARN performs all your processing activities by allocating resources and scheduling tasks.
Where is Hadoop used?
Hadoop is used for:
- Search – Yahoo, Amazon, Zvents
- Log processing – Facebook, Yahoo
- Data Warehouse – Facebook, AOL
- Video and Image Analysis – New York Times, Eyealike
So far, we have seen how Hadoop has made Big Data handling possible. But
there are some scenarios where implementing Hadoop is not recommended.
When not to use Hadoop?
Following are some of those scenarios:
- Low-latency data access: quick access to small parts of the data.
- Multiple data modifications: Hadoop is a good fit only when we are primarily reading data, not modifying it.
- Lots of small files: Hadoop is better suited to scenarios with a few large files.
After knowing the best suitable use-cases, let us move on and look at a case study where Hadoop has done wonders.
Hadoop-CERN Case Study
The Large Hadron Collider
in Switzerland is one of the largest and most powerful machines in the
world. It is equipped with around 150 million sensors, producing a
petabyte of data every second, and the data is growing continuously.
CERN
researchers said that this data has been scaling up in amount
and complexity, and that one of the important tasks is to serve these
scalable requirements. So they set up a Hadoop cluster. By using Hadoop,
they limited their hardware costs and maintenance complexity.
They integrated Oracle and Hadoop, and got the advantages of both: Oracle optimized their online transactional system, while Hadoop provided a scalable distributed data-processing platform. They
designed a hybrid system: first they moved data from Oracle to
Hadoop; then they executed queries over the Hadoop data from Oracle using
Oracle APIs. They also used Hadoop data formats like Avro and Parquet for high-performance analytics, without needing to change the end-user applications connecting to Oracle.
Here’s How Facebook Manages Big Data
Facebook analytics chief Ken Rudin says that Big Data is crucial to the company’s very being. And he has very particular ideas about how it should be managed.
“Facebook could not be Facebook without Big Data technologies,” Mr. Rudin said in an interview with CIO Journal.
Facebook’s success with Big Data isn’t shared by that many companies, though. As CIO Journal has reported, studies suggest that many companies are frustrated with the technology. Most recently, a Bain & Co. survey found that only 4% of business leaders are satisfied with the results of their Big Data efforts.
“The technology is really powerful, and people sometimes assume that any information they get out of it is valuable. For those people, it doesn’t work. Others are seeing huge benefits,” Mr. Rudin says.
Here’s how Facebook approaches Big Data, a broad term that includes the hardware and software used to analyze massive and often disparate data sets, as well as the data itself.
“I would say investments in Big Data frequently don’t pay off, but to me, that has nothing to do with technology,” Mr. Rudin said. Companies that invest in Big Data often owe their frustration to two mistakes, he said.
Mistake number one: They tend to rely too much on one technology, such as Hadoop. That isn’t to say that the technology isn’t effective, but rather that expectations about its role in an organization need to be reset.
In fact, Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its own hardware for this purpose. Mr. Rudin says Hadoop is just one of many Big Data technologies employed at Facebook. “Hadoop is not enough,” he says.
The analytic process at Facebook begins with a 300 petabyte data analysis warehouse. To answer a specific query, data is often pulled out of the warehouse and placed into a table so that it can be studied, he said. Mr. Rudin’s team also built a search engine that indexes data in the warehouse. These are just some of many technologies that Facebook uses to manage and analyze information.
The other major mistake that people make with Big Data, Mr. Rudin says, is that they often use it to answer meaningless questions. At Facebook, a meaningful question is defined as one that leads to an answer that provides a basis for changing behavior. If you can’t imagine how the answer to a question would lead you to change your business practices, the question isn’t worth asking, Mr. Rudin said. For example, one might use Big Data to determine whether women share more information online than men do. But what’s the point of knowing the answer? “What am I going to do with that? Ask men to become women?” Mr. Rudin asked. “Unless you are looking for answers that people can do something about, it is a waste of time.”
Mr. Rudin judges the success of his analytics team in terms of the business, not in terms of the quality of the information that it produces. It doesn’t matter how brilliant the analysis might be, if the business unit can’t translate it into action. “That means we didn’t explain it well enough. It’s our fault,” he said.