These are the kind of questions you will need as a reference in your daily programming.
1) How do you access the underlying Unix box from the Hive shell ?
Ans: Prefix the command with ! . Example: hive> !pwd; prints the present working directory.
2) What is a command that has no effect in Hive ?
Ans: TRUNCATE TABLE, when used on an external table: the underlying records are not removed, so it has no effect.
3) What is the format used for binary input data (e.g. images) ?
Ans: Use SequenceFileInputFormat.
4) How do you run Hive Query from command line ?
Ans: hive -e 'select a.Ename from Emp a'
5) How to set Hive configuration variables ?
Ans: hive -e 'select a.Ename from Emp a' --hiveconf hive.exec.scratchdir=/home/my/hive_scratch --hiveconf mapred.reduce.tasks=16
6) How to dump data from a query into a file using silent mode ?
Ans: hive -S -e 'select a.Ename from Emp a' > a.txt
7) How to improve performance of ORDER BY as it goes to single reducer ?
Ans: The short answer is CLUSTER BY; here are the details.
Notes:
ORDER BY x: guarantees global ordering, but does this by pushing all data through
just one reducer. This is basically unacceptable for large datasets.
You end up with one sorted file as output.
SORT BY x: orders data at each of N reducers, but each reducer can receive
overlapping ranges of data.
You end up with N or more sorted files with overlapping ranges.
DISTRIBUTE BY x: ensures each of N reducers gets a non-overlapping set of x values
(by default via hashing, so the sets are not contiguous ranges), but doesn't sort
the output of each reducer.
You end up with N or more unsorted files with non-overlapping key sets.
CLUSTER BY x: ensures each of N reducers gets a non-overlapping set of x values,
then sorts each reducer's output by x. It is the same as doing
(DISTRIBUTE BY x and SORT BY x).
You end up with N or more sorted files with non-overlapping key sets. Note that
because the distribution is hash-based, concatenating the files does not by itself
give one globally ordered file, but each file is fully sorted and can be processed in parallel.
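The DISTRIBUTE BY + SORT BY combination can be sketched in a few lines of Python (a toy model of the shuffle, not Hive's actual implementation; `cluster_by` and `n_reducers` are made-up names):

```python
from collections import defaultdict

def cluster_by(rows, key, n_reducers=3):
    """Toy model of CLUSTER BY x: hash-distribute rows on x
    (DISTRIBUTE BY), then sort each reducer's share on x (SORT BY)."""
    reducers = defaultdict(list)
    for row in rows:                      # shuffle: same key -> same reducer
        reducers[hash(key(row)) % n_reducers].append(row)
    # each reducer sorts only its own share, so the work is parallel
    return [sorted(part, key=key) for part in reducers.values()]

parts = cluster_by(list(range(20)), key=lambda x: x)
```

Every output "file" ends up sorted, and no key appears in more than one file, which is exactly the N-sorted-non-overlapping-files outcome described above.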
8) How to improve performance of Hive queries ?
Ans: Different techniques can be used as required; here are some details.
Notes:
Execution Engine: Switching to Tez improves performance dramatically.
Caution: it may require additional hardware resources.
Set up: set hive.execution.engine=tez;
Reference: Horton notes
ORC Files: ORC-based tables are faster thanks to inline indexes and compression.
The compression ratio is high and compression/decompression is quick.
Ex: CREATE TABLE A_ORC ( customerID int, name string, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
Reference : Horton notes
Vectorization: Vectorized query execution improves the performance of operations like
scans, aggregations, filters and joins by performing them in batches of
1024 rows at once instead of a single row at a time.
Set up: set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-based Query Optimization: Query performance improves once table and column statistics are computed.
Set up: set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Usage: analyze table Tweets compute statistics;
analyze table Tweets compute statistics for columns sender, topic;
(in most recent versions)
analyze table tweets compute statistics for columns;
Use Analytical functions: It is often better to use analytical (window) functions than joins.
Ex: Each record in the "clicks" table represents a click event, and
we would like to find the latest URL for each sessionID.
General Query:
SELECT clicks.* FROM clicks inner join (select sessionID,
max(timestamp) as max_ts from clicks group by sessionID) latest
ON clicks.sessionID = latest.sessionID and
clicks.timestamp = latest.max_ts;
Analytical Query:
SELECT * FROM (SELECT *, RANK() over (partition by sessionID
order by timestamp desc) as rank
FROM clicks) ranked_clicks
WHERE ranked_clicks.rank=1;
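Both queries boil down to "keep the row with the max timestamp per sessionID", which a short Python sketch (with hypothetical click rows) makes concrete:

```python
# Hypothetical click rows: (sessionID, timestamp, url)
clicks = [
    ("s1", 1, "/home"), ("s1", 3, "/cart"), ("s1", 2, "/item"),
    ("s2", 5, "/home"), ("s2", 9, "/checkout"),
]

# Single pass: remember the latest (timestamp, url) seen per session --
# the same result the RANK()-based query produces, without a self-join.
latest = {}
for sid, ts, url in clicks:
    if sid not in latest or ts > latest[sid][0]:
        latest[sid] = (ts, url)
```

The join version scans the table twice (once to find max_ts, once to match it back); the window-function version needs only one pass, which is why it tends to win on large tables.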
9) What is strict mode in Hive ?
Ans: Strict mode relates to partitioned tables.
If hive.exec.dynamic.partition.mode is set to strict, at least one partition column
must be given a static value; in non-strict mode, all partitions may be dynamic.
If your partitioned table is very large, you can block full-table-scan queries by
putting Hive into strict mode with the set hive.mapred.mode=strict command.
In this mode, when a user submits a query that would result in a full table scan
(i.e. one with no filter on any partition column), an error is issued.
Also, in strict mode ORDER BY must be used with a LIMIT clause.
10) How do you create a compressed table in Hive and how to view the content ?
Ans: In a session, you can use the CREATE TABLE command to create a table after the following
commands are issued.
hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=
org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
hive> set hive.exec.compress.intermediate=true;
Now create the table with CREATE TABLE ... AS SELECT:
hive> CREATE TABLE zip_emp ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' AS SELECT * FROM emp;
Garbage is shown if you cat the compressed file directly:
hive> dfs -cat /user/hive/warehouse/zip_emp/000000_0.gz;
To see the content, use text mode:
hive> dfs -text /user/hive/warehouse/zip_emp/000000_0.gz;
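Why -cat shows garbage while -text works can be reproduced locally with Python's gzip module (a stand-in for the GzipCodec output; the path and row here are made up):

```python
import gzip, os, tempfile

# Write a tiny gzip file, standing in for the compressed output Hive produces.
path = os.path.join(tempfile.mkdtemp(), "000000_0.gz")
with gzip.open(path, "wt") as f:
    f.write("1201\tsatish\t25\n")

raw = open(path, "rb").read()          # what `dfs -cat` shows: binary bytes
text = gzip.open(path, "rt").read()    # what `dfs -text` recovers
```

`dfs -cat` dumps the raw bytes (starting with the gzip magic number 1f 8b), while `dfs -text` detects the codec and decompresses before printing.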
11) Use DISTINCT in a tricky way ?
Ans: Instead of
SELECT count(DISTINCT field) FROM tab;
you can use the following so that the work spreads across more reducers:
SELECT count(1) FROM ( SELECT DISTINCT field FROM tab) t;
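The reason the subquery form is faster can be sketched in Python: the distinct step can be split across many "reducers", and only the small deduplicated sets reach the final count (a toy model with made-up data):

```python
# count(DISTINCT field) forces one reducer to see every value.
values = ["a", "b", "a", "c", "b", "a"]

# The subquery form runs in two stages:
shards = [values[0::2], values[1::2]]       # pretend two reducers got the data
stage1 = [set(s) for s in shards]           # SELECT DISTINCT field, per reducer
distinct_count = len(set().union(*stage1))  # final count(1) over the subquery
```

Stage 1 parallelizes the expensive deduplication; stage 2 only has to merge the already-small distinct sets.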
12) Use the high-cardinality column first ?
Ans: Instead of
SELECT ..... GROUP BY country, eid;
you can use the following (eid has far more distinct values than country) to improve performance:
SELECT ..... GROUP BY eid, country;
13) How to create a User Defined Function (UDF) in Hive ?
Ans: Create a Java class extending org.apache.hadoop.hive.ql.exec.UDF.
Build a jar (or add the class to your own utility jar) and upload it to the Hive cluster.
Then, in Hive:
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> create temporary function to_upper as 'com.org.hive.udf.ToUpper';
hive> select to_upper(name) from emp;
14) How to remove duplicate records from a Hive table ?
Ans: create table temp as select distinct * from original_tab;
truncate table original_tab;
insert into original_tab select * from temp;
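The same three steps, modeled on a small Python list of rows (hypothetical data; `dict.fromkeys` stands in for SELECT DISTINCT):

```python
rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]

# CREATE TABLE temp AS SELECT DISTINCT * FROM original_tab;
temp = list(dict.fromkeys(rows))   # dedup, keeping first-seen order

# TRUNCATE TABLE original_tab; INSERT INTO original_tab SELECT * FROM temp;
original_tab = temp                # original now holds only unique rows
```

Note the truncate-then-insert window means the table is briefly empty; on a busy table an INSERT OVERWRITE from the distinct select is the safer one-step variant.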
15) Compare and contrast Static and Dynamic Partitions in hive ?
Ans:
Static Partition in Hive
- Loading input data files individually into a named partition is a static partition.
- Static partitions are usually preferred when loading big files into Hive tables.
- Static partitioning saves loading time compared to dynamic partitioning.
- You "statically" add a partition to the table and move the file into that partition.
- You can alter a partition in a static-partitioned table.
Dynamic Partition in Hive
- A single insert that spreads rows across partitions is a dynamic partition load.
- Dynamic partitioning usually loads data from a non-partitioned table.
- Dynamic partitioning takes more time to load data than static partitioning.
- If you want to partition by a column but don't know the partition values in advance, dynamic partitioning is the suitable choice.
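How dynamic partitioning lays out the warehouse can be simulated in Python: each distinct value of the partition column becomes its own country=<value> directory (a local-filesystem sketch with made-up rows, not HDFS):

```python
import os, tempfile

# Rows to load: (country, eid, name); country is the partition column.
rows = [("IN", 1201, "satish"), ("US", 1202, "krishna"), ("IN", 1203, "amith")]
warehouse = tempfile.mkdtemp()

# Dynamic partitioning: the target directory is derived from each row's own
# partition-column value -- no value needs to be named in advance.
for country, eid, name in rows:
    part_dir = os.path.join(warehouse, "country=%s" % country)
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data"), "a") as f:
        f.write("%d\t%s\n" % (eid, name))

partitions = sorted(os.listdir(warehouse))
```

A static load would be the degenerate case: you name one `country=` directory yourself and move a whole file into it, which is why it is faster but requires knowing the value up front.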
hive> show tables;
OK
categories
customers
---------------------------
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false
hive> set hive.cli.print.current.db=true;
hive (default)> desc formatted movie;
OK
# Detailed Table Information
hive (default)> create database if not exists vgatst;
OK
Time taken: 0.132 seconds
hive (default)> use vgatst;
-------------------------
hive (vgatst)>
> create table if not exists employee(
> eid int,
> ename string,
> age int
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> with SERDEPROPERTIES (
> "separatorChar" = ",",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> Stored as textfile;
hive (vgatst)> show tables;
OK
employee
Time taken: 0.015 seconds, Fetched: 1 row(s)
hive (vgatst)> LOAD DATA LOCAL INPATH '/home/abc/mukesh/employee.csv'
> INTO TABLE employee;
hive (vgatst)> select * from employee; -- observe that MapReduce is not run for a simple SELECT
OK
1201 satish 25
1202 krishna 28
1203 amith 39
1204 javed 23
1205 prudvi 23
1594 mukesh 38
1595 mahesh 35
Time taken: 0.156 seconds, Fetched: 7 row(s)
hive> add FILE /home/abc/mukesh/pyhive.py;
Added resources: [/home/abc/mukesh/pyhive.py]
hive> select TRANSFORM(eid,ename,age) USING 'python pyhive.py' AS (ename,val1,eid,val2,age) FROM employee; -- 5 columns returned
Total MapReduce CPU Time Spent: 1 seconds 330 msec
OK
SATISH with eid 1201 is of 25
KRISHNA with eid 1202 is of 28
AMITH with eid 1203 is of 39
JAVED with eid 1204 is of 23
PRUDVI with eid 1205 is of 23
MUKESH with eid 1594 is of 38
MAHESH with eid 1595 is of 35
hive> !cat /home/abc/mukesh/pyhive.py ;
import sys
for line in sys.stdin:
    line = line.strip()
    eid, ename, age = line.split('\t')
    ename = ename.upper()
    print('\t'.join([ename, 'with eid', eid, ' is of ', age]))
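Because TRANSFORM just streams tab-separated lines through the script, the per-line logic can be tested locally without a cluster (a hypothetical helper mirroring pyhive.py's body):

```python
def transform(line):
    # Same per-line logic as pyhive.py: uppercase the name and
    # re-emit the fields tab-joined, as TRANSFORM expects on stdout.
    eid, ename, age = line.strip().split("\t")
    return "\t".join([ename.upper(), "with eid", eid, " is of ", age])

out = transform("1201\tsatish\t25")
```

Hive hands the script each row as tab-separated text on stdin and splits the script's tab-separated stdout back into the five AS(...) columns, which is exactly what the transcript above shows.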
--------------- parquet ------
hive (vgatst)> create table pemp(eid int,ename string,age int) stored as parquet;
------------ Analytical queries -----------
hive> set hive.cli.print.current.db=true;
hive (vgatst)> create table stocks (dateyymd String, Ticker String, Open Double, High Double,
> Low Double, Close Double, Volume_for_the_day int)
> row format delimited fields terminated by ',';
hive (vgatst)> load data local inpath '/home/abc/mukesh/stock.csv' into table stocks;
Using lag, we can display the previous day's closing price for a ticker.
lag is used with an OVER clause; inside OVER you can use PARTITION BY and ORDER BY clauses.
hive (vgatst)> select ticker,dateyymd,close,lag(close,1) over(partition by ticker order by dateyymd) as yesterday_price
> from stocks where ticker='A';
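What lag(close,1) computes can be reproduced in Python on a single ticker's rows (hypothetical prices, already ordered by date, i.e. one PARTITION BY ticker slice):

```python
# (date, close) rows for one ticker, sorted as ORDER BY dateyymd would sort them.
closes = [("20100719", 27.1), ("20100720", 27.6), ("20100721", 27.4)]

# lag(close, 1): pair each row with the previous row's close;
# the first row has no predecessor, so it gets NULL (None here).
prev = [None] + [c for _, c in closes[:-1]]
with_lag = [(d, c, p) for (d, c), p in zip(closes, prev)]
```

Without an ORDER BY inside OVER, "previous" is undefined, which is why the query above orders the partition by dateyymd.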
Now, using the lead function, we will find whether the following day's closing price is higher or lower than today's.
Let's confine the query to tickers starting with 'A' to avoid a huge result.
hive (vgatst)> select ticker,dateyymd,close,
> case (lead(close,1) over(partition by ticker order by dateyymd) - close) > 0
> when true then "higher"
> when false then "lesser"
> end as Changes from stocks where ticker like 'A%';
hive (vgatst)> select ticker,high as high,Close AS Close,
> first_value(high) over(partition by ticker) AS FIRST_VAL ,
> last_value(high) over(partition by ticker) AS LAST_VAL from stocks where ticker='A';
hive (vgatst)>select ticker,count(ticker) over(partition by ticker) as cnt from stocks;
-- Best-performing stock for a day: % growth per day
select ticker,open,close, (close - open), ((close - open )/open)*100 ,
rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
from stocks where dateyymd ='20100721';
-- Top ten performing stocks of the day ---
hive (vgatst)> select ticker,open,close, (close - open), ((close - open )/open)*100 ,
> rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
> from stocks where dateyymd ='20100721' limit 10;
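The rank-and-limit logic, modeled in Python on one day's partition (hypothetical tickers and prices):

```python
# Hypothetical (ticker, open, close) rows for a single dateyymd partition.
day = [("AAA", 10.0, 11.0), ("BBB", 20.0, 21.0), ("CCC", 5.0, 6.0)]

growth = [(t, (c - o) / o * 100) for t, o, c in day]  # % growth per day
ranked = sorted(growth, key=lambda r: -r[1])          # ORDER BY growth DESC
top10 = ranked[:10]                                   # LIMIT 10
```

One caveat: in the Hive query, LIMIT is applied after ranking, but since rank() restarts per dateyymd partition, filtering on `closing <= 10` in an outer query is the more precise "top ten per day" when multiple dates are involved.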
----------------------------------------------- Indexes -----------------------------------------------
CREATE INDEX idx_cmp_stocks_dateyymd
ON TABLE stocks (dateyymd)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx_cmp_stocks_dateyymd on stocks REBUILD;
select ticker,open,close, (close - open), ((close - open )/open)*100 ,
rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
from stocks where dateyymd ='20100721' limit 10;
Time taken: 25.472 seconds, Fetched: 10 row(s) -- with index
hive> show formatted index on stocks;
OK
idx_name tab_name col_names idx_tab_name idx_type comment
idx_cmp_stocks_dateyymd stocks dateyymd default__stocks_idx_cmp_stocks_dateyymd__ compact
Time taken: 0.071 seconds, Fetched: 4 row(s)
DROP INDEX IF EXISTS idx_cmp_stocks_dateyymd ON stocks;
Time taken: 22.078 seconds, Fetched: 10 row(s) -- without the index: faster than with it, so this compact index does not help here
--------- BIT MAP --------------------
CREATE INDEX idx_bmp_stocks_dateyymd
ON TABLE stocks (dateyymd)
AS 'BITMAP'
WITH DEFERRED REBUILD;
ALTER INDEX idx_bmp_stocks_dateyymd on stocks REBUILD;
select ticker,open,close, (close - open), ((close - open )/open)*100 ,
rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
from stocks where dateyymd ='20100721' limit 10;
Time taken: 22.838 seconds, Fetched: 10 row(s) -- with index
DROP INDEX IF EXISTS idx_bmp_stocks_dateyymd ON stocks;
Time taken: 23.857 seconds, Fetched: 10 row(s) -- without the index: slower than with it, so the bitmap index helps (marginally)
Reading : List of Functions