Hadoop Hive Practice Questions

These questions are the kind you will need as a reference in your daily programming.

1) How do you access the underlying Unix box from the Hive shell?
    Ans: Prefix the command with ! .  Example: hive> !pwd   prints the present working directory. 

2) What is a command that has no effect in Hive?
    Ans: TRUNCATE TABLE, when used with an external table. Hive does not manage the data files of an external table, so the records are not removed and the command has no effect.

3) What is the format used for binary input data (e.g. images)?
    Ans: Use SequenceFileInputFormat.
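    For example, a minimal sketch of a table for binary blobs backed by sequence files (the images table and its columns are hypothetical):

    CREATE TABLE images (
      fname STRING,     -- original file name
      data  BINARY      -- raw image bytes
    )
    STORED AS SEQUENCEFILE;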

4) How do you run a Hive query from the command line?
    Ans: hive -e 'select a.Ename from Emp a'

5) How do you set Hive configuration variables?
    Ans: hive -e 'select a.Ename from Emp a' --hiveconf hive.exec.scratchdir=/home/my/hive_scratch --hiveconf mapred.reduce.tasks=16

6) How do you dump data out from a query into a file using silent mode?
    Ans: hive -S -e 'select a.Ename from Emp a' > a.txt

7) How do you improve the performance of ORDER BY, as it goes to a single reducer?
    Ans: The one-word answer is CLUSTER BY; here are the details.
    Notes: 
    ORDER BY x: guarantees global ordering, but does this by pushing all data through 
                       just one reducer. This is basically unacceptable for large datasets. 
                       You end up with one sorted file as output.
    SORT BY x: orders data at each of N reducers, but each reducer can receive 
                       overlapping ranges of data.
                       You end up with N or more sorted files with overlapping ranges.
    DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, 
                       but doesn't sort the output of each reducer.
                       You end up with N or more unsorted files with non-overlapping ranges.
    CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts 
                       by those ranges at the reducers.
                       This gives you global ordering, and is the same as doing 
                       (DISTRIBUTE BY x and SORT BY x).
                       You end up with N or more sorted files with non-overlapping ranges.
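    A quick sketch, reusing the Emp table from the earlier questions (Sal is an assumed salary column):

    -- single reducer, one globally sorted file (slow on large data):
    SELECT Ename, Sal FROM Emp ORDER BY Sal;

    -- N reducers, N sorted files with non-overlapping ranges:
    SELECT Ename, Sal FROM Emp CLUSTER BY Sal;

    -- equivalent to the CLUSTER BY form above:
    SELECT Ename, Sal FROM Emp DISTRIBUTE BY Sal SORT BY Sal;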

8) How do you improve the performance of Hive queries?
    Ans: Different techniques can be used as required; here are some details.
    Notes:
    Execution Engine: Improves the performance dramatically.
                                Caution: may need some additional hardware.
                                Set up: set hive.execution.engine=tez;
                                Reference: Hortonworks notes
    ORC Files:    ORC-based tables will be faster, with inline indexes and compression.
                       The compression ratio is high and compression is quick.
                        Ex:
                        CREATE TABLE A_ORC ( customerID int, name string, address string )
                        STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
                        Reference: Hortonworks notes
   Vectorization: Vectorized query execution improves performance of operations like 
                         scans, aggregations, filters and joins, by performing them in batches of
                         1024 rows at once instead of a single row at a time.
                         Set up:     set hive.vectorized.execution.enabled = true;
                                        set hive.vectorized.execution.reduce.enabled = true;
    Cost-based Query Optimization: Query performance improves once table statistics are computed.
                         Set up:    set hive.cbo.enable=true; 
                                       set hive.compute.query.using.stats=true;
                                       set hive.stats.fetch.column.stats=true;
                                       set hive.stats.fetch.partition.stats=true;
                         Usage: analyze table Tweets compute statistics;
                                   analyze table Tweets compute statistics for columns sender, topic;
                                   (in recent versions the column list can be omitted:)
                                   analyze table Tweets compute statistics for columns;
     Use Analytic Functions: It's often better to use analytic functions rather than going for joins.
                          Ex: Each record represents a click event in a clicks table, and 
                                we would like to find the latest URL for each sessionID.

                          General Query:
                                SELECT clicks.* FROM clicks INNER JOIN (SELECT sessionID,
                                       max(timestamp) AS max_ts FROM clicks GROUP BY sessionID) latest
                                 ON clicks.sessionID = latest.sessionID AND
                                       clicks.timestamp = latest.max_ts; 

                          Analytical Query: 
                                  SELECT * FROM (SELECT *, RANK() OVER (PARTITION BY sessionID
                                                          ORDER BY timestamp DESC) AS rank
                                                           FROM clicks) ranked_clicks
                                   WHERE ranked_clicks.rank = 1;
                       
9) What is strict mode in Hive?
    Ans: Strict mode is related to partitioned tables.
           If hive.exec.dynamic.partition.mode is set to strict, then you need at least
           one static partition. In non-strict mode, all partitions are allowed to be dynamic.
           If your partitioned table is very large, you can block any full-table-scan queries
           by putting Hive into strict mode using the set hive.mapred.mode=strict command. 
           In this mode, when users submit a query that would result in a full table scan 
           (i.e. a query without a filter on any partitioned column) an error is issued.
           Also, in strict mode ORDER BY must be used with a LIMIT clause.
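           A short sketch of both behaviours, assuming a hypothetical sales table partitioned by dt (amount is also an assumed column):

           hive> set hive.mapred.mode=strict;
           hive> select * from sales;                          -- error: full table scan, no partition filter
           hive> select * from sales where dt = '2021-01-01';  -- OK: partition predicate present
           hive> select * from sales where dt = '2021-01-01'
               >  order by amount limit 10;                    -- OK: in strict mode ORDER BY requires LIMIT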


10) How do you create a compressed table in Hive, and how do you view its content?
    Ans: In a session you can use the CREATE TABLE command to create a table after the following
            commands are issued.
            hive> set hive.exec.compress.output=true;
            hive> set mapreduce.output.fileoutputformat.compress=true;
            hive> set mapreduce.output.fileoutputformat.compress.codec=
                          org.apache.hadoop.io.compress.GzipCodec;
           hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
           hive> set hive.exec.compress.intermediate=true;

            Now create the table using CREATE TABLE ... AS SELECT:
            hive> CREATE TABLE zip_emp ROW FORMAT DELIMITED FIELDS 
                     TERMINATED BY '\t' AS SELECT * FROM emp;

           Garbage is shown if you cat the file directly:
           hive> dfs -cat /user/hive/warehouse/zip_emp/000000_0.gz;

           To see the content, use text mode:
           hive> dfs -text /user/hive/warehouse/zip_emp/000000_0.gz;
         
11) How do you use DISTINCT in a tricky way?
    Ans: Instead of 
                 SELECT count(DISTINCT field) FROM tab;
           you can use the following, so that the distinct work is spread across more reducers
           (a plain count(DISTINCT ...) is funneled through a single reducer):
                 SELECT count(1) FROM ( SELECT DISTINCT field FROM tab ) t;

12) Why use the high-cardinality column first in GROUP BY?
    Ans: Instead of 
                 SELECT .....GROUP BY country, eid;
           you can use the following to improve performance (eid has far more distinct
           values than country, so it goes first):
                 SELECT .....GROUP BY eid, country;
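           For illustration, a complete form of the same query shape (employee is the table used later in these notes; country is an assumed extra column):

                 SELECT eid, country, count(*) AS cnt
                 FROM employee
                 GROUP BY eid, country;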

13) How do you create a user-defined function (UDF) in Hive?
    Ans: Create a Java class extending org.apache.hadoop.hive.ql.exec.UDF.
            Create a jar (or add the class to your own util jar) and upload it to the Hive cluster.
           Now, from the Hive command line:
           hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
           hive> create temporary function to_upper as 'com.org.hive.udf.ToUpper';
           hive> select to_upper(name) from emp;

14) How do you remove duplicate records from a Hive table?
    Ans: CREATE TABLE temp AS SELECT DISTINCT * FROM original_tab;
           TRUNCATE TABLE original_tab;
           INSERT INTO original_tab SELECT * FROM temp;
           DROP TABLE temp;

15) Compare and contrast static and dynamic partitions in Hive?

    Ans: (a short sketch contrasting the two follows the lists below)
    Static Partition in Hive

  • Inserting input data files individually into a partition table is a static partition
  • Usually, when loading big files into Hive tables, static partitions are preferred
  • Static partitioning saves you time in loading data compared to dynamic partitioning
  • You "statically" add a partition to the table and move the file into that partition of the table
  • We can alter a partition in a static partition

    Dynamic Partition in Hive

  • A single insert into a partition table is known as a dynamic partition
  • Usually a dynamic partition loads the data from a non-partitioned table
  • Dynamic partitioning takes more time in loading data compared to static partitioning
  • If you want to partition by a column but you don't know in advance how many partition values it has, dynamic partitioning is suitable (see the sketch below)
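
A minimal sketch contrasting the two, assuming a hypothetical emp_part table partitioned by country and a non-partitioned staging table emp_stage:

  -- static: the partition value is spelled out by hand
  INSERT INTO TABLE emp_part PARTITION (country='US')
  SELECT eid, ename FROM emp_stage WHERE country='US';

  -- dynamic: Hive derives the partition value from the data itself
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  INSERT INTO TABLE emp_part PARTITION (country)
  SELECT eid, ename, country FROM emp_stage;
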
----------------------------   Quick Queries ---------------------
hive> show tables;
OK
categories
customers
---------------------------
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false
hive> set hive.cli.print.current.db=true;
hive (default)> desc formatted movie;
OK
# Detailed Table Information

hive (default)> create database if not exists vgatst;
OK
Time taken: 0.132 seconds
hive (default)> use vgatst;
-------------------------
hive (vgatst)>
             > create table if not exists employee(
             >  eid int,
             >  ename string,
             >  age int
             >  )
             >  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
             >  with SERDEPROPERTIES (
             >  "separatorChar" = ",",
             >  "quoteChar"     = "\"",
             >  "escapeChar"    = "\\"
             >  )
             >  Stored as textfile;
hive (vgatst)> show tables;
OK
employee
Time taken: 0.015 seconds, Fetched: 1 row(s)
hive (vgatst)> LOAD DATA LOCAL INPATH '/home/abc/mukesh/employee.csv'
             >  INTO TABLE employee;


hive (vgatst)> select * from employee;  -- observe that MapReduce is not run for a simple SELECT
OK
1201    satish  25
1202    krishna 28
1203    amith   39
1204    javed   23
1205    prudvi  23
1594    mukesh  38
1595    mahesh  35
Time taken: 0.156 seconds, Fetched: 7 row(s)


hive> add FILE /home/abc/mukesh/pyhive.py;
Added resources: [/home/abc/mukesh/pyhive.py] 

hive> select TRANSFORM(eid,ename,age) USING 'python pyhive.py' AS (ename,val1,eid,val2,age) FROM employee;  -- 5 columns in return

Total MapReduce CPU Time Spent: 1 seconds 330 msec
OK
SATISH  with eid        1201     is of  25
KRISHNA with eid        1202     is of  28
AMITH   with eid        1203     is of  39
JAVED   with eid        1204     is of  23
PRUDVI  with eid        1205     is of  23
MUKESH  with eid        1594     is of  38
MAHESH  with eid        1595     is of  35

hive> !cat /home/abc/mukesh/pyhive.py ;
import sys
# Read tab-separated rows from Hive, upper-case the name,
# and emit five tab-separated fields per row.
for line in sys.stdin:
    line = line.strip()
    eid, ename, age = line.split('\t')
    ename = ename.upper()
    print '\t'.join([ename, 'with eid', eid, ' is of ', age])

---------------  parquet ------
hive (vgatst)> create table pemp(eid int,ename string,age int) stored as parquet;

------------   Analytical queries -----------

hive> set hive.cli.print.current.db=true;
hive (vgatst)> create table stocks (dateyymd String, Ticker String, Open Double, High Double, 
                    > Low Double, Close Double, Volume_for_the_day int)
                    > row format delimited fields terminated by ',';

hive (vgatst)> load data local inpath '/home/abc/mukesh/stock.csv' into table stocks;

Here, using lag, we can display the previous day's closing price for a ticker.
lag is used with the over() function; inside over() you can use partition by and order by clauses.

hive (vgatst)> select ticker,dateyymd,close,lag(close,1) over(partition by ticker order by dateyymd) as yesterday_price
                    > from stocks where ticker='A';

Now, using the lead function, we will find whether the following day's closing price is higher or lower than today's.
Let's confine this to tickers that start with 'A' to avoid a huge result.

hive (vgatst)> select ticker,dateyymd,close,
                    > case (lead(close,1) over(partition by ticker order by dateyymd) - close) > 0 
                    > when true then "higher" 
                    > when false then "lower" 
                    > end as Changes from stocks where ticker like 'A%';

hive (vgatst)> select ticker,high as high,Close AS Close, 
                    > first_value(high) over(partition by ticker) AS FIRST_VAL ,
                    > last_value(high) over(partition by ticker) AS LAST_VAL from stocks where ticker='A';

hive (vgatst)> select ticker,count(ticker) over(partition by ticker) as cnt from stocks;


-- Best-rising stock of the day --- % growth per day
  select ticker,open,close, (close - open), ((close - open )/open)*100 ,
  rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
  from stocks where dateyymd ='20100721';

-- Top ten performing stocks of the day ---

hive (vgatst)> select ticker,open,close, (close - open), ((close - open )/open)*100 ,
                    > rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
                 > from stocks where dateyymd ='20100721' limit 10;
 
-----------------------------------------------   Indexes  -----------------------------------------------

CREATE INDEX idx_cmp_stocks_dateyymd
ON TABLE stocks (dateyymd)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX idx_cmp_stocks_dateyymd on stocks REBUILD;

          select ticker,open,close, (close - open), ((close - open )/open)*100 ,
          rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
          from stocks where dateyymd ='20100721' limit 10;
Time taken: 25.472 seconds, Fetched: 10 row(s)  -- with the index

hive> show formatted index on stocks;
OK
idx_name                    tab_name   col_names   idx_tab_name                                 idx_type   comment
idx_cmp_stocks_dateyymd     stocks     dateyymd    default__stocks_idx_cmp_stocks_dateyymd__   compact
Time taken: 0.071 seconds, Fetched: 4 row(s)

DROP INDEX IF EXISTS idx_cmp_stocks_dateyymd ON stocks;

Time taken: 22.078 seconds, Fetched: 10 row(s)   -- without the index: faster, so the compact index does not help here (a bad index)

--------- BIT MAP --------------------
CREATE INDEX idx_bmp_stocks_dateyymd
ON TABLE stocks (dateyymd)
AS 'BITMAP'
WITH DEFERRED REBUILD;

ALTER INDEX idx_bmp_stocks_dateyymd on stocks REBUILD;

       select ticker,open,close, (close - open), ((close - open )/open)*100 ,
       rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
       from stocks where dateyymd ='20100721' limit 10;

Time taken: 22.838 seconds, Fetched: 10 row(s)  -- with the index

DROP INDEX IF EXISTS idx_bmp_stocks_dateyymd ON stocks;

Time taken: 23.857 seconds, Fetched: 10 row(s)  -- without the index: slower, so the bitmap index helps here (a good index)



Reading: List of Functions
