Hive

What is Hive?

Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
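
For example, custom scripts can be plugged into a query with the TRANSFORM clause, which streams rows through an external program. A minimal sketch, assuming a hypothetical script url_mapper.py that reads tab-separated (userid, page_url) lines on stdin and writes (userid, domain) lines to stdout:

ADD FILE /home/hive/scripts/url_mapper.py;
SELECT TRANSFORM (userid, page_url)
    USING 'python url_mapper.py'
    AS (userid, domain)
FROM page_view;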

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).


1.       To start the Hive server from the build directory, run:
             cd /home/hive/build/dist/bin
             ./hive --service hiveserver
2.       To start the command line interface:
             ./hive
3.       The command line interface can be used to execute SQL statements (Hive QL).
4.       A Derby database is used to store the metadata; metastore_db is the directory which contains
         all the metadata of the database.
5.       Hive commands:
           a.      SHOW DATABASES;
           b.      SHOW TABLES;
6.       Example of CREATE TABLE (because the table is EXTERNAL, dropping it removes only the metadata; the files under LOCATION are left intact):

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/home/hdfs/page_view';

7.       Load data example:
             LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/t.txt' OVERWRITE INTO TABLE page_view;
8.       SELECT * FROM page_view;
9.       Aggregation example (the pv_users and pv_gender_agg tables it references are sketched just after this list):
             INSERT OVERWRITE TABLE pv_gender_agg
             SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
             FROM pv_users
             GROUP BY pv_users.gender;
10.      ALTER TABLE page_view ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
11.      DROP TABLE page_view;
12.      kinit is the command used to obtain a Kerberos ticket when Kerberos security is enabled on the cluster.
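
The aggregation example in step 9 references pv_users and pv_gender_agg tables that are not created anywhere in this post. A minimal sketch of definitions that would make it runnable (the pv_gender_agg column names here are assumptions; only the column positions have to line up with the SELECT list):

CREATE TABLE pv_users (userid BIGINT, gender STRING);
CREATE TABLE pv_gender_agg (gender STRING, distinct_users BIGINT, total_rows BIGINT, userid_sum BIGINT);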


MapReduce Example:


In this example we will load a file into the profile_log table and then do some analysis on that log table. The output of the analysis is stored in the profile_analysis table:

t.txt:
rama    y       1       1
satys   n       1       1
guru    y       3       3
shayam  n       2       4
guru    y       7       2
rama    y       7       3


hive>  CREATE TABLE profile_log (username STRING, profileflag STRING, profile_num BIGINT, page_id BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

OK
Time taken: 4.402 seconds
hive> LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/t.txt' OVERWRITE INTO TABLE profile_log;
Copying data from file:/home/hive/build/dist/bin/t.txt
Copying file: file:/home/hive/build/dist/bin/t.txt
Loading data to table default.profile_log
Moved to trash: hdfs://localhost/user/hive/warehouse/profile_log
OK
Time taken: 0.549 seconds
hive> select * from profile_log;                                                                                           
rama    y       1       1
satys   n       1       1
guru    y       3       3
shayam  n       2       4
guru    y       7       2
rama    y       7       3
Time taken: 0.296 seconds
hive> CREATE TABLE profile_analysis (username STRING, profileflag STRING, num_times BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

OK
Time taken: 4.402 seconds

hive> INSERT OVERWRITE TABLE profile_analysis SELECT profile_log.username, count(profile_log.profileflag), sum(DISTINCT profile_log.profile_num) FROM profile_log GROUP BY profile_log.username;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208151810_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201208151810_0003
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:50300 -kill job_201208151810_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-08-15 19:27:41,280 Stage-1 map = 0%,  reduce = 0%
2012-08-15 19:27:49,618 Stage-1 map = 100%,  reduce = 0%
2012-08-15 19:28:03,302 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208151810_0003
Loading data to table default.profile_analysis
Moved to trash: hdfs://localhost/user/hive/warehouse/profile_analysis
Table default.profile_analysis stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 40, raw_data_size: 0]
4 Rows loaded to profile_analysis
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   HDFS Read: 277 HDFS Write: 40 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 31.5 seconds
hive> select * from profile_analysis;
OK
guru    2       10
rama    2       8
satys   1       1
shayam  1       2
Time taken: 0.251 seconds

The three SELECT expressions map positionally onto the three columns of profile_analysis (the column names are not matched). For example, guru appears in two profile_log rows with distinct profile_num values 3 and 7, which gives the count of 2 and the sum of 10 above.
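
The reducer-count settings printed in the job output above can be issued from the same CLI session before running the INSERT; the values here are only illustrative:

set hive.exec.reducers.max=4;
set mapred.reduce.tasks=2;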

On the Hadoop JobTracker page the run details can be seen as follows:

[Screenshot: JobTracker job details page]

Job Analysis:

User : hadoop-user
JobName : INSERT OVERWRITE TABL...table.redirect_title(Stage-1)
JobConf : hdfs://localhost/user/hadoop-user/.staging/job_201208151810_0001/job.xml
Submitted At : 15/08 18:30:26
Launched At : 15/08 18:30:26 (0sec)
Finished At : 15/08 18:31:09 (42sec)
Status : SUCCESS

Time taken by best performing Map task task_201208151810_0001_m_000000 : 19sec
Average time taken by Map tasks: 19sec
Worst performing map tasks:
Task Id                              Time taken
task_201208151810_0001_m_000000      19sec
The last Map task task_201208151810_0001_m_000000 finished at (relative to the Job launch time): 15/08 18:30:54 (27sec)

Time taken by best performing Shuffle task_201208151810_0001_r_000000 : 9sec
Average time taken by Shuffle: 9sec
Worst performing Shuffle(s):
Task Id                              Time taken
task_201208151810_0001_r_000000      9sec
The last Shuffle task_201208151810_0001_r_000000 finished at (relative to the Job launch time): 15/08 18:31:04 (37sec)



Example of Hive commands (partitioned table):

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/p1.txt' OVERWRITE INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/p2.txt' OVERWRITE INTO TABLE page_view PARTITION(dt='2008-06-09', country='IN');

SELECT * FROM page_view WHERE country='IN';
SELECT * FROM page_view WHERE dt='2008-06-08';
SELECT * FROM page_view WHERE dt='2008-06-08' AND country='IN';
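
Each LOAD above writes its file into a partition-specific directory under the table's location (for example .../page_view/dt=2008-06-08/country=US/), and a WHERE clause on the partition columns lets Hive scan only the matching directories. Note that no partition matches dt='2008-06-08' AND country='IN', so the third query returns no rows. The partitions created by the two loads can be listed with:

SHOW PARTITIONS page_view;

which should print something like:

dt=2008-06-08/country=US
dt=2008-06-09/country=IN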



