Data Mining with Hive – What is Hive? – What is HiveQL (HQL)?

Yahoo started working on PIG (we will understand that in the next blog post) for their application deployment on Hadoop. The goal of Yahoo to manage their unstructured data. Similarly Facebook started deploying their warehouse solutions on Hadoop which has resulted in HIVE. The reason for going with HIVE is because the traditional warehousing solutions are getting very expensive.

What is HIVE?

Hive is a datawarehouseing infrastructure for Hadoop. The primary responsibility is to provide data summarization, query and analysis. It  supports analysis of large datasets stored in Hadoop’s HDFS as well as on the Amazon S3 filesystem. The best part of HIVE is that it supports SQL-Like access to structured data which is known as HiveQL (or HQL) as well as big data analysis with the help of MapReduce. Hive is not built to get a quick response to queries but it it is built for data mining applications. Data mining applications can take from several minutes to several hours to analysis the data and HIVE is primarily used there.

HIVE Organization

The data are organized in three different formats in HIVE.
Tables: They are very similar to RDBMS tables and contains rows and tables. Hive is just layered over the Hadoop File System (HDFS), hence tables are directly mapped to directories of the filesystems. It also supports tables stored in other native file systems.
Partitions: Hive tables can have more than one partition. They are mapped to subdirectories and file systems as well.
Buckets: In Hive data may be divided into buckets. Buckets are stored as files in partition in the underlying file system.
Hive also has metastore which stores all the metadata. It is a relational database containing various information related to Hive Schema (column types, owners, key-value data, statistics etc.). We can use MySQL database over here.

What is HiveSQL (HQL)?

Hive query language provides the basic SQL like operations. Here are few of the tasks which HQL can do easily.
  • Create and manage tables and partitions
  • Support various Relational, Arithmetic and Logical Operators
  • Evaluate functions
  • Download the contents of a table to a local directory or result of queries to HDFS directory
Here is the example of the HQL Query:
SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
SELECT Dept,count(*) FROM employee GROUP BY DEPT;
When you look at the above query, you can see they are very similar to SQL like queries.

No comments:

Post a Comment