Hive

What is Hive?

Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
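
For example, custom scripts can be plugged into a query with the TRANSFORM clause, which streams rows through an external program. A minimal sketch, assuming a hypothetical script url_mapper.py that reads tab-separated (userid, page_url) lines on stdin and writes (userid, domain) lines to stdout:

ADD FILE /home/hive/scripts/url_mapper.py;
SELECT TRANSFORM (userid, page_url)
    USING 'python url_mapper.py'
    AS (userid, domain)
FROM page_view;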

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).


1.       To start the Hive server from the build directory, run:
             cd /home/hive/build/dist/bin
             ./hive --service hiveserver
2.       To start the command line interface:
             ./hive
3.       The command line interface can be used to execute SQL statements (Hive QL).
4.       A Derby database is used to store the metadata; metastore_db is the directory which contains
         all the metadata of the database.
5.       Hive commands:
           a.      SHOW DATABASES;
           b.      SHOW TABLES;
6.       Example of CREATE TABLE (because the table is EXTERNAL, dropping it removes only the metadata; the files under LOCATION are left intact):

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/home/hdfs/page_view';

7.       Load data example:
             LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/t.txt' OVERWRITE INTO TABLE page_view;
8.       SELECT * FROM page_view;
9.       Aggregation example (the pv_users and pv_gender_agg tables it references are sketched just after this list):
             INSERT OVERWRITE TABLE pv_gender_agg
             SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
             FROM pv_users
             GROUP BY pv_users.gender;
10.      ALTER TABLE page_view ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
11.      DROP TABLE page_view;
12.      kinit is the command used to obtain a Kerberos ticket when Kerberos security is enabled on the cluster.
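
The aggregation example in step 9 references pv_users and pv_gender_agg tables that are not created anywhere in this post. A minimal sketch of definitions that would make it runnable (the pv_gender_agg column names here are assumptions; only the column positions have to line up with the SELECT list):

CREATE TABLE pv_users (userid BIGINT, gender STRING);
CREATE TABLE pv_gender_agg (gender STRING, distinct_users BIGINT, total_rows BIGINT, userid_sum BIGINT);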


MapReduce Example:


In this example we will load a file into the profile_log table and then do some analysis on that log table. The output of the analysis is stored in the profile_analysis table:

t.txt:
rama    y       1       1
satys   n       1       1
guru    y       3       3
shayam  n       2       4
guru    y       7       2
rama    y       7       3


hive>  CREATE TABLE profile_log (username STRING, profileflag STRING, profile_num BIGINT, page_id BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

OK
Time taken: 4.402 seconds
hive> LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/t.txt' OVERWRITE INTO TABLE profile_log;
Copying data from file:/home/hive/build/dist/bin/t.txt
Copying file: file:/home/hive/build/dist/bin/t.txt
Loading data to table default.profile_log
Moved to trash: hdfs://localhost/user/hive/warehouse/profile_log
OK
Time taken: 0.549 seconds
hive> select * from profile_log;                                                                                           
rama    y       1       1
satys   n       1       1
guru    y       3       3
shayam  n       2       4
guru    y       7       2
rama    y       7       3
Time taken: 0.296 seconds
hive> CREATE TABLE profile_analysis (username STRING, profileflag STRING, num_times BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

OK
Time taken: 4.402 seconds

hive> INSERT OVERWRITE TABLE profile_analysis SELECT profile_log.username, count(profile_log.profileflag), sum(DISTINCT profile_log.profile_num) FROM profile_log GROUP BY profile_log.username;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201208151810_0003, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201208151810_0003
Kill Command = /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:50300 -kill job_201208151810_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2012-08-15 19:27:41,280 Stage-1 map = 0%,  reduce = 0%
2012-08-15 19:27:49,618 Stage-1 map = 100%,  reduce = 0%
2012-08-15 19:28:03,302 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208151810_0003
Loading data to table default.profile_analysis
Moved to trash: hdfs://localhost/user/hive/warehouse/profile_analysis
Table default.profile_analysis stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 40, raw_data_size: 0]
4 Rows loaded to profile_analysis
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   HDFS Read: 277 HDFS Write: 40 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 31.5 seconds
hive> select * from profile_analysis;
OK
guru    2       10
rama    2       8
satys   1       1
shayam  1       2
Time taken: 0.251 seconds

The three SELECT expressions map positionally onto the three columns of profile_analysis (the column names are not matched). For example, guru appears in two profile_log rows with distinct profile_num values 3 and 7, which gives the count of 2 and the sum of 10 above.
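
The reducer-count settings printed in the job output above can be issued from the same CLI session before running the INSERT; the values here are only illustrative:

set hive.exec.reducers.max=4;
set mapred.reduce.tasks=2;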

On the Hadoop JobTracker page the run details can be seen as follows:

[Screenshot: JobTracker job details page]

Job Analysis:

User : hadoop-user
JobName : INSERT OVERWRITE TABL...table.redirect_title(Stage-1)
JobConf : hdfs://localhost/user/hadoop-user/.staging/job_201208151810_0001/job.xml
Submitted At : 15/08 18:30:26
Launched At : 15/08 18:30:26 (0sec)
Finished At : 15/08 18:31:09 (42sec)
Status : SUCCESS

Time taken by best performing Map task task_201208151810_0001_m_000000 : 19sec
Average time taken by Map tasks: 19sec
Worst performing map tasks:
Task Id                              Time taken
task_201208151810_0001_m_000000      19sec
The last Map task task_201208151810_0001_m_000000 finished at (relative to the Job launch time): 15/08 18:30:54 (27sec)

Time taken by best performing Shuffle task_201208151810_0001_r_000000 : 9sec
Average time taken by Shuffle: 9sec
Worst performing Shuffle(s):
Task Id                              Time taken
task_201208151810_0001_r_000000      9sec
The last Shuffle task_201208151810_0001_r_000000 finished at (relative to the Job launch time): 15/08 18:31:04 (37sec)



Example of Hive commands (partitioned table):

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/p1.txt' OVERWRITE INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
LOAD DATA LOCAL INPATH '/home/hive/build/dist/bin/p2.txt' OVERWRITE INTO TABLE page_view PARTITION(dt='2008-06-09', country='IN');

SELECT * FROM page_view WHERE country='IN';
SELECT * FROM page_view WHERE dt='2008-06-08';
SELECT * FROM page_view WHERE dt='2008-06-08' AND country='IN';
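
Each LOAD above writes its file into a partition-specific directory under the table's location (for example .../page_view/dt=2008-06-08/country=US/), and a WHERE clause on the partition columns lets Hive scan only the matching directories. Note that no partition matches dt='2008-06-08' AND country='IN', so the third query returns no rows. The partitions created by the two loads can be listed with:

SHOW PARTITIONS page_view;

which should print something like:

dt=2008-06-08/country=US
dt=2008-06-09/country=IN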



