Two of the most important components to learn when you start
interacting with Hadoop are Sqoop and ZooKeeper.
What is Sqoop?
Most businesses store their data in an RDBMS or in other data warehouse
solutions. They need a way to move that data into Hadoop for processing and
then return the results from Hadoop to the RDBMS. The data movement can happen
in real time or in bulk at various intervals. We need a tool that can move
data from SQL to Hadoop and from Hadoop to SQL. Sqoop (SQL to Hadoop) is such
a tool: it extracts data from non-Hadoop data sources, transforms it into a
format Hadoop can use, and then loads it into HDFS. Essentially it is an ETL
tool that Extracts, Transforms and Loads data from SQL to Hadoop. The best part
is that it also works in the other direction, extracting data from Hadoop and
loading it into non-Hadoop data stores such as an RDBMS. Sqoop is a command
line tool: it interprets the commands you give it and creates MapReduce jobs
behind the scenes to import data from an external database into HDFS. It is
very effective and easy to learn, even for nonprogrammers.
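To make the two directions concrete, here is a minimal sketch in Python that only builds the Sqoop command lines for an import (RDBMS to HDFS) and an export (HDFS to RDBMS). It does not run Sqoop. The flags are the commonly used Sqoop 1 import/export arguments, while the JDBC URL, user name, password file, table names and HDFS paths are made-up placeholders, and the helper function names are my own.

```python
import shlex


def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, num_mappers=4):
    """Build the arguments for `sqoop import` (RDBMS table -> HDFS directory)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,   # file holding the DB password
        "--table", table,                    # source table in the RDBMS
        "--target-dir", target_dir,          # destination directory in HDFS
        "--num-mappers", str(num_mappers),   # parallel map tasks for the job
    ]


def sqoop_export_args(jdbc_url, username, password_file, table, export_dir):
    """Build the arguments for `sqoop export` (HDFS directory -> RDBMS table)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,                    # destination table in the RDBMS
        "--export-dir", export_dir,          # source directory in HDFS
    ]


def as_command_line(args):
    """Quote each argument so the result can be pasted into a shell."""
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    url = "jdbc:mysql://db.example.com/sales"
    print(as_command_line(sqoop_import_args(
        url, "etl_user", "/user/etl/.db_password", "orders", "/data/raw/orders")))
    print(as_command_line(sqoop_export_args(
        url, "etl_user", "/user/etl/.db_password", "order_summary", "/data/out/order_summary")))
```

Building the command as a list first, and quoting it only for display, keeps each value as a single argument even if it contains spaces.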
What is ZooKeeper?
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing
group services. In other words, ZooKeeper is a replicated synchronization
service: updates are applied in one agreed order on every server, although a
read served by one server may briefly lag behind the latest write. In simpler
words: a Hadoop cluster has many different nodes, and one of them is the
master. Let us assume that the master node fails for any reason. In this case,
the role of the master node has to be transferred to a different node. The
main role of the master node is managing writes, as that task requires the
order of the writes to be preserved. In this kind of scenario ZooKeeper
coordinates the election of a new master node and makes sure that the Hadoop
cluster keeps running without any glitch. ZooKeeper is Hadoop's way of
coordinating all the elements of these distributed systems. Here are a few of
the tasks ZooKeeper is responsible for.
- ZooKeeper manages the workflow of starting and stopping the various nodes
in the Hadoop cluster.
- When a process in the Hadoop cluster needs certain configuration to
complete its task, ZooKeeper makes sure that the node gets the necessary
configuration consistently.
- If the master node fails, ZooKeeper can coordinate the election of a new
master node and make sure the cluster works as expected.
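The master failover described above can be illustrated with a small in-memory simulation. This is only a sketch of the idea and not the real ZooKeeper client API: the class `ToyCoordinator` and its methods are hypothetical names of my own. It mimics the usual election recipe, where every candidate registers with an increasing sequence number, the lowest number is the master, and a candidate's entry disappears when its session is lost.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (NOT the ZooKeeper API)."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> node name

    def join(self, node_name):
        """Register a candidate, similar to creating an ephemeral sequential entry."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node_name
        return seq

    def session_lost(self, seq):
        """Remove a candidate, as happens when a failed node's session ends."""
        self._candidates.pop(seq, None)

    def master(self):
        """The candidate with the lowest sequence number is the current master."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


if __name__ == "__main__":
    zk = ToyCoordinator()
    a = zk.join("node-a")
    b = zk.join("node-b")
    c = zk.join("node-c")
    print(zk.master())   # node-a holds the lowest sequence number
    zk.session_lost(a)   # the master fails
    print(zk.master())   # node-b takes over
```

Because the master is always derived from the surviving entries, no node has to announce a failure: as soon as the old master's entry is gone, every node that asks gets the same new answer.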
There are many other tasks ZooKeeper performs when it comes to the Hadoop
cluster and its communication. Basically, without the help of a coordination
service like ZooKeeper, it is very difficult to design a new fault-tolerant
distributed application.