What is Big Data

Big Data is everywhere! Curiosity about what Big Data is has been soaring in the past few years. Let me tell you some mind-boggling facts! Forbes reports that every minute, users watch 4.15 million YouTube videos, send 456,000 tweets on Twitter and post 46,740 photos on Instagram, while 510,000 comments are posted and 293,000 statuses are updated on Facebook!
Just imagine the huge volume of data produced by such activity. This constant creation of data through social media, business applications, telecom and various other domains is what leads to the formation of Big Data.
To explain what Big Data is, I will cover the following topics:
  • Evolution of Big Data
  • Big Data Defined
  • Characteristics of Big Data
  • Big Data Analytics
  • Industrial Applications of Big Data
  • The scope of Big Data
Evolution of Big Data
Before exploring what Big Data is, let me begin by giving some insight into why the term has gained so much importance.
When was the last time you remember using a floppy disk or a CD to store your data? Let me guess: you had to go way back to the early 21st century, right? Manual paper records, files, floppy disks and CDs have now become obsolete. The reason for this is the exponential growth of data. People began storing their data in relational database systems, but with the hunger for new inventions, technologies and applications with quick response times, and with the introduction of the internet, even that is insufficient now. This continuous generation of massive data can be referred to as Big Data. There are a few other factors that characterize Big Data, which I will explain later in this blog.

Forbes reports that 2.5 quintillion bytes of data are created each day at our current pace, and that pace is only accelerating. The Internet of Things (IoT) is one technology that plays a major role in this acceleration. 90% of all data today was generated in the last two years.

Big Data Definition
What is Big Data?
So before I explain what Big Data is, let me also tell you what it is not! The most common myth associated with Big Data is that it is just about the size or volume of data. But actually, it’s not just about the “big” amounts of data being collected. Big Data refers to the large amounts of data pouring in from various data sources in different formats. Huge amounts of data were stored in databases even before, but because of the varied nature of this data, traditional relational database systems are incapable of handling it. Big Data is much more than a collection of datasets with different formats; it is an important asset that can be used to obtain innumerable benefits.
The three different formats of big data are:
  1. Structured: Organized data with a fixed schema. Ex: RDBMS
  2. Semi-Structured: Partially organized data which does not have a fixed format. Ex: XML, JSON
  3. Unstructured: Unorganized data with an unknown schema. Ex: Audio, video files etc.
Characteristics of Big Data
The following characteristics are associated with Big Data:

The above image depicts the five V’s of Big Data (Volume, Velocity, Variety, Veracity and Value), but as data keeps evolving, so will the V’s. Here are five more V’s which have developed gradually over time:
  • Validity: correctness of data
  • Variability: dynamic behavior
  • Volatility: the tendency to change in time
  • Vulnerability: vulnerable to breach or attacks
  • Visualization: visualizing meaningful usage of data
Big Data Analytics
Now that I have told you what Big Data is and how it is being generated exponentially, let me present a very interesting example of how Starbucks, one of the leading coffeehouse chains, is making use of Big Data.
I came across an article by Forbes which reported how Starbucks used Big Data to analyze the preferences of its customers in order to enhance and personalize their experience. They analyzed their members’ coffee-buying habits, from their preferred drinks to the time of day they usually order. So, even when people visit a “new” Starbucks location, that store’s point-of-sale system is able to identify the customer through their smartphone and give the barista their preferred order. In addition, based on ordering preferences, the app will suggest new products that the customer might be interested in trying. This, my friends, is what we call Big Data Analytics.
Basically, Big Data Analytics is widely used by companies to facilitate their growth and development. It mainly involves applying various data mining algorithms to a given set of data, which then aids them in better decision making.
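To make the idea concrete, here is a minimal sketch in Python of the kind of analysis described above: finding each customer's most frequent drink and usual ordering hour from a purchase history. The records, the field layout and the function name preferred_orders are made up for illustration; this is not how Starbucks actually implements it.

```python
from collections import Counter, defaultdict

# Hypothetical purchase history: (customer_id, drink, hour_of_day).
purchases = [
    ("alice", "latte", 8),
    ("alice", "latte", 8),
    ("alice", "mocha", 15),
    ("bob", "espresso", 7),
    ("bob", "espresso", 7),
    ("bob", "cold brew", 13),
]

def preferred_orders(records):
    """Return {customer: (most frequent drink, most frequent hour)}."""
    drinks = defaultdict(Counter)
    hours = defaultdict(Counter)
    for customer, drink, hour in records:
        drinks[customer][drink] += 1
        hours[customer][hour] += 1
    return {
        customer: (
            drinks[customer].most_common(1)[0][0],
            hours[customer].most_common(1)[0][0],
        )
        for customer in drinks
    }

profiles = preferred_orders(purchases)
print(profiles)
```

At real scale the same counting would run over millions of records on a distributed system rather than in a single Python process, but the logic stays the same.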
There are multiple tools for processing Big Data, such as Hadoop, Pig, Hive, Cassandra, Spark and Kafka, and the choice depends on the requirements of the organization.
Big Data Applications
Here are some of the domains that Big Data applications have revolutionized:
  • Entertainment: Netflix and Amazon use Big Data to make show and movie recommendations to their users.
  • Insurance: Insurers use Big Data to predict illnesses and accidents and price their products accordingly.
  • Driver-less Cars: Google’s driver-less cars collect about one gigabyte of data per second. These experiments require more and more data for their successful execution.
  • Education: Institutions are opting for Big Data powered technology as a learning tool instead of traditional lecture methods, which has enhanced students’ learning as well as helped teachers track their performance better.
  • Automobile: Rolls-Royce has embraced Big Data by fitting hundreds of sensors into its engines and propulsion systems, which record every tiny detail about their operation. Changes in the data are reported in real time to engineers, who decide the best course of action, such as scheduling maintenance or dispatching engineering teams should the problem require it.
  • Government: A very interesting use of Big Data is in the field of politics, to analyze patterns and influence election results. Cambridge Analytica Ltd. is one such organization, which is driven entirely by data to change audience behavior and plays a major role in the electoral process.
Scope of Big Data
  • Numerous job opportunities: Career opportunities in the field of Big Data include Big Data Analyst, Big Data Engineer, Big Data Solution Architect etc. According to IBM, 59% of all Data Science and Analytics (DSA) job demand is in Finance and Insurance, Professional Services, and IT.
  • Rising demand for Analytics Professionals: An article by Forbes reveals that “IBM predicts demand for Data Scientists will soar by 28%”. By 2020, the number of jobs for all US data professionals will increase by 364,000 openings to 2,720,000, according to IBM.
  • Salary Aspects: Forbes reported that employers are willing to pay a premium of $8,736 above median bachelor’s and graduate-level salaries, with successful applicants earning a starting salary of $80,265.
  • Adoption of Big Data analytics: There is immense growth in the usage of Big Data analysis across the world.
The above image depicts the growing market revenue of Big Data, in billions of U.S. dollars, from 2011 to 2027. So that was all about what Big Data is, and I hope this blog was helpful.
Big Data Tutorial
Big Data: haven’t you heard this term before? I am sure you have. For the last 4 to 5 years, everyone has been talking about Big Data. But do you really know what exactly Big Data is and how it is making an impact on our lives? In this Big Data Tutorial, I will give you a complete insight into Big Data.
Below are the topics which I will cover in this Big Data Tutorial:
·         Story of Big Data
·         Big Data Driving Factors
·         What is Big Data?
·         Big Data Characteristics
·         Types of Big Data
·         Examples of Big Data
·         Applications of Big Data
·         Challenges with Big Data
Let me start this Big Data Tutorial with a short story.

Story of Big Data
In ancient days, people used to travel from one village to another on a horse-driven cart, but as time passed, villages became towns and people spread out. The distance to travel from one town to the next also increased. So it became a problem to travel between towns, along with the luggage. Out of the blue, one smart fella suggested that we should groom and feed a horse more to solve this problem. When I look at this solution, it is not that bad, but do you think a horse can become an elephant? I don’t think so. Another smart guy said, instead of 1 horse pulling the cart, let us have 4 horses pull the same cart. What do you guys think of this solution? I think it is a fantastic solution. Now people can travel large distances in less time and even carry more luggage.
The same concept applies to Big Data. Until now, we were okay with storing data on our servers because the volume of data was pretty limited and the time needed to process it was acceptable. But in the current technological world, data is growing too fast and people rely on it far more often. At the speed at which data is growing, it is becoming impossible to store it on any single server.
Through this blog on Big Data Tutorial, let us explore the sources of Big Data, which the traditional systems are failing to store and process.

The quantity of data on planet Earth is growing exponentially for many reasons. Various sources and our day-to-day activities generate lots of data. With the invention of the web, the whole world has gone online, and every single thing we do leaves a digital trace. With smart objects going online, the data growth rate has increased rapidly. The major sources of Big Data are social media sites, sensor networks, digital images/videos, cell phones, purchase transaction records, web logs, medical records, archives, military surveillance, eCommerce, complex scientific research and so on. All this information amounts to quintillions of bytes of data. By 2020, data volumes will be around 40 Zettabytes, which is equivalent to every single grain of sand on the planet multiplied by seventy-five.
What is Big Data?
Big Data is a term used for collections of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenges include capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Big Data Characteristics
The five characteristics that define Big Data are: Volume, Velocity, Variety, Veracity and Value.
Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace. The amount of data generated by humans, machines and their interactions on social media alone is massive. Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be generated by 2020, which is an increase of 300 times from 2005.
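Here is a quick back-of-the-envelope check of these figures, written as a small Python sketch. It assumes decimal (SI) units, where 1 Exabyte is 10^18 bytes and 1 Zettabyte is 10^21 bytes; the variable names are only illustrative.

```python
# Decimal (SI) units are assumed here.
EXABYTE = 10**18
ZETTABYTE = 10**21

predicted_2020_bytes = 40 * ZETTABYTE
growth_factor = 300  # "an increase of 300 times from 2005"

# Convert the 2020 prediction to Exabytes, then work backwards to 2005.
predicted_2020_exabytes = predicted_2020_bytes // EXABYTE
implied_2005_exabytes = predicted_2020_exabytes / growth_factor

print(f"2020 prediction: {predicted_2020_exabytes:,} EB")
print(f"Implied 2005 volume: about {implied_2005_exabytes:.0f} EB")
```

So the prediction implies that roughly 133 Exabytes of data existed in 2005.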

Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. There are 1.03 billion Daily Active Users (Facebook DAU) on mobile as of now, which is an increase of 22% year-over-year. This shows how fast the number of users is growing on social media and how fast data is being generated daily. If you are able to handle the velocity, you will be able to generate insights and make decisions based on real-time data.

As there are many sources contributing to Big Data, the types of data they generate differ. Data can be structured, semi-structured or unstructured. Hence, there is a variety of data being generated every day. Earlier, we used to get data from Excel sheets and databases; now data comes in the form of images, audio, video, sensor data etc., as shown in the image below. This variety of unstructured data creates problems in capturing, storing, mining and analyzing the data.

Veracity refers to doubt or uncertainty about the available data caused by inconsistency and incompleteness. In the image below, you can see that a few values are missing from the table. Also, a few values are hard to accept; for example, the minimum value of 15000 in the 3rd row is not possible. This inconsistency and incompleteness is what Veracity is about.
Available data can sometimes get messy and may be difficult to trust. With many forms of Big Data, quality and accuracy are difficult to control, as with Twitter posts full of hashtags, abbreviations, typos and colloquial speech. Volume is often the reason behind the lack of quality and accuracy in the data.

·     Due to the uncertainty of data, 1 in 3 business leaders don’t trust the information they use to make decisions.
·     It was found in a survey that 27% of respondents were unsure of how much of their data was inaccurate.
·     Poor data quality costs the US economy around $3.1 trillion a year.
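Checks like the ones described above (missing values, and values that are hard to accept) can be automated. The following is a minimal sketch in Python. The rows, the field names and the plausible range are invented for illustration and are not the data from the image mentioned earlier.

```python
# Hypothetical rows: one reading per day with a minimum and a maximum value.
rows = [
    {"day": 1, "min": 12.0, "max": 25.0},
    {"day": 2, "min": None, "max": 27.0},     # missing value
    {"day": 3, "min": 15000.0, "max": 26.0},  # implausible value
    {"day": 4, "min": 11.5, "max": 24.0},
]

# Assumed bounds for what counts as a believable reading.
PLAUSIBLE_RANGE = (-90.0, 60.0)

def find_issues(records, fields=("min", "max"), bounds=PLAUSIBLE_RANGE):
    """Return (day, field, problem) tuples for missing or out-of-range values."""
    low, high = bounds
    issues = []
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None:
                issues.append((record["day"], field, "missing"))
            elif not low <= value <= high:
                issues.append((record["day"], field, "out of range"))
    return issues

issues = find_issues(rows)
for day, field, problem in issues:
    print(f"day {day}: {field} is {problem}")
```

A report like this does not fix the data, but it tells you how much of it you can trust before you start analyzing it.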

5.   VALUE
After discussing Volume, Velocity, Variety and Veracity, there is another V that should be taken into account when looking at Big Data, i.e. Value. It is all well and good to have access to Big Data, but unless we can turn it into value, it is useless. By turning it into value, I mean: is it adding to the benefits of the organizations that are analyzing Big Data? Is the organization working on Big Data achieving a high ROI (Return On Investment)? Unless working on Big Data adds to their profits, it is useless.

Types of Big Data
Big Data could be of three types:
·         Structured
·         Semi-Structured
·         Unstructured

1.   Structured
Data that can be stored and processed in a fixed format is called structured data. Data stored in a relational database management system (RDBMS) is one example of ‘structured’ data. It is easy to process structured data as it has a fixed schema. Structured Query Language (SQL) is often used to manage this kind of data.
2.   Semi-Structured
Semi-structured data is data which does not have the formal structure of a data model, such as a table definition in a relational DBMS, but nevertheless has some organizational properties, like tags and other markers that separate semantic elements, which make it easier to analyze. XML files and JSON documents are examples of semi-structured data.
3.   Unstructured
Data which has an unknown form, cannot be stored in an RDBMS and cannot be analyzed unless it is transformed into a structured format is called unstructured data. Text files and multimedia content like images, audio and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that 80 percent of the data in an organization is unstructured.
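The difference between the three types is easiest to see when you try to read each of them in code. Below is a small Python sketch using only the standard library. The sample CSV, JSON and text values are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ben,Leeds\n"
structured = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys describe the data, but records may differ in shape.
json_text = '[{"id": 1, "name": "Asha", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
semi_structured = json.loads(json_text)

# Unstructured: free text with no schema; any structure has to be extracted.
review = "Great coffee, slow service. Would visit again!"
word_count = len(review.split())

print(structured[0]["city"])
print(semi_structured[0].get("tags"), semi_structured[1].get("tags"))
print(word_count)
```

Notice that the structured rows can be queried by column name right away, the JSON records need a check for fields that may be absent, and the free text yields nothing beyond a word count until more processing is applied.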
So far, I have covered just the introduction to Big Data. Next, this Big Data tutorial talks about examples, applications and challenges of Big Data.
Examples of Big Data
Every day we upload millions of bytes of data. 90% of the world’s data has been created in the last two years.

·         Walmart handles more than 1 million customer transactions every hour.
·         Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
·         More than 230 million tweets are created every day.
·         More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
·         YouTube users upload 48 hours of new video every minute of the day.
·         Amazon processes 15 million customer clickstream records per day to recommend products.
·         294 billion emails are sent every day. Email services analyze this data to detect spam.
·         Modern cars have close to 100 sensors which monitor fuel level, tire pressure etc., so each vehicle generates a lot of sensor data.
Applications of Big Data
We cannot talk about data without talking about the people who benefit from Big Data applications. Almost all industries today are leveraging Big Data applications in one way or another.

·  Smarter Healthcare: Making use of petabytes of patient data, an organization can extract meaningful information and then build applications that can predict a patient’s deteriorating condition in advance.
·  Telecom: The telecom sector collects information, analyzes it and provides solutions to different problems. By using Big Data applications, telecom companies have been able to significantly reduce data packet loss, which occurs when networks are overloaded, and thus provide a seamless connection to their customers.
·  Retail: Retail has some of the tightest margins and is one of the greatest beneficiaries of Big Data. The beauty of using Big Data in retail is understanding consumer behavior. Amazon’s recommendation engine provides suggestions based on the browsing history of the consumer.
·  Traffic control: Traffic congestion is a major challenge for many cities globally. Effective use of data and sensors will be key to managing traffic better as cities become increasingly densely populated.
·  Manufacturing: Analyzing Big Data in the manufacturing industry can reduce component defects, improve product quality, increase efficiency, and save time and money.
·  Search Quality: Every time we extract information from Google, we are simultaneously generating data for it. Google stores this data and uses it to improve its search quality.
Someone has rightly said: not everything in the garden is rosy! Till now in this Big Data tutorial, I have shown you only the rosy picture of Big Data. But if it were so easy to leverage Big Data, don’t you think all organizations would invest in it? Let me tell you upfront, that is not the case. There are several challenges that come along when you are working with Big Data.
Now that you are familiar with Big Data and its various features, the next section of this blog on Big Data Tutorial will shed some light on some of the major challenges faced by Big Data.
Challenges with Big Data
Let me tell you a few challenges which come along with Big Data:
1.  Data Quality – The problem here is the 4th V, i.e. Veracity. The data is very messy, inconsistent and incomplete. Dirty data costs companies in the United States $600 billion every year.
2.  Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data with extremely powerful algorithms to find patterns and insights is very difficult.
3.  Storage – The more data an organization has, the more complex the problems of managing it can become. The question that arises here is “Where to store it?”. We need a storage system which can easily scale up or down on demand.
4.  Analytics – In the case of Big Data, most of the time we are unaware of the kind of data we are dealing with, so analyzing that data is even more difficult.
5.  Security – Since the data is huge in size, keeping it secure is another challenge. This includes user authentication, restricting access on a per-user basis, recording data access histories, proper use of data encryption etc.
6.  Lack of Talent – There are a lot of Big Data projects in major organizations, but finding a sophisticated team of developers, data scientists and analysts who also have sufficient domain knowledge is still a challenge.
Hadoop to the Rescue
We have a savior to deal with Big Data challenges: it’s Hadoop. Hadoop is an open source, Java-based programming framework that supports the storage and processing of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
With its distributed processing, Hadoop handles large volumes of structured and unstructured data more efficiently than the traditional enterprise data warehouse. Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes, and to handle thousands of terabytes of data. Organizations are adopting Hadoop because it is open source software and can run on commodity hardware (your personal computer). The initial cost savings are dramatic, as commodity hardware is very cheap. As the organizational data increases, you can add more and more commodity hardware on the fly to store it, and hence Hadoop proves to be economical. Additionally, Hadoop has a robust Apache community behind it that continues to contribute to its advancement.
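To give a feel for what distributed processing means, here is a toy, single-machine Python sketch in the spirit of MapReduce, the processing model Hadoop is known for: each chunk of data is processed independently (map), and the partial results are then combined (reduce). This is not the Hadoop API. The chunks and the function names map_chunk and merge_counts are hypothetical, and on a real cluster each chunk would live on, and be processed by, a different node.

```python
from collections import Counter
from functools import reduce

# Hypothetical input, already split into chunks as if stored on separate nodes.
chunks = [
    "big data is big",
    "data is everywhere",
    "hadoop stores big data",
]

def map_chunk(text):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Each map_chunk call only needs its own chunk, so these could run in parallel.
partial_counts = [map_chunk(chunk) for chunk in chunks]
totals = reduce(merge_counts, partial_counts, Counter())

print(totals)
```

Just like the four horses pulling one cart in the story earlier, the work is shared: adding more chunks only means adding more workers for the map step, not buying a bigger machine.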
