These are the kind of questions you will need as a reference in your daily programming.
1) How do you access the underlying Unix box from the Hive shell ?
Ans: Prefix the command with ! . Example: hive> !pwd; prints the present working directory.
2) What is a command that has no effect in Hive ?
Ans: TRUNCATE TABLE, when used on an external table: the underlying records are not removed, so it has no effect.
3) What is the format used for binary input data (e.g. images) ?
Ans: Use SequenceFileInputFormat.
4) How do you run Hive Query from command line ?
Ans: hive -e 'select a.Ename from Emp a'
5) How to set Hive configuration variables ?
Ans: hive -e 'select a.Ename from Emp a' --hiveconf hive.exec.scratchdir=/home/my/hive_scratch --hiveconf mapred.reduce.tasks=16
6) How to dump data from a query into a file using silent mode ?
Ans: hive -S -e 'select a.Ename from Emp a' > a.txt
7) How to improve performance of ORDER BY as it goes to single reducer ?
Ans: The short answer is CLUSTER BY; here are the details.
Notes:
ORDER BY x: guarantees global ordering, but does this by pushing all data through
just one reducer. This is basically unacceptable for large datasets.
You end up with one sorted file as output.
SORT BY x: orders data at each of N reducers, but each reducer can receive
overlapping ranges of data.
You end up with N or more sorted files with overlapping ranges.
DISTRIBUTE BY x: ensures each of N reducers gets a non-overlapping set of x values
(by default via hashing, so the sets are not contiguous ranges), but doesn't sort
the output of each reducer.
You end up with N or more unsorted files with non-overlapping key sets.
CLUSTER BY x: ensures each of N reducers gets a non-overlapping set of x values,
then sorts each reducer's output by x. It is the same as doing
(DISTRIBUTE BY x and SORT BY x).
You end up with N or more sorted files with non-overlapping key sets. Note that
because the distribution is hash-based, concatenating the files does not by itself
give one globally ordered file, but each file is fully sorted and can be processed in parallel.
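The DISTRIBUTE BY + SORT BY combination can be sketched in a few lines of Python (a toy model of the shuffle, not Hive's actual implementation; `cluster_by` and `n_reducers` are made-up names):

```python
from collections import defaultdict

def cluster_by(rows, key, n_reducers=3):
    """Toy model of CLUSTER BY x: hash-distribute rows on x
    (DISTRIBUTE BY), then sort each reducer's share on x (SORT BY)."""
    reducers = defaultdict(list)
    for row in rows:                      # shuffle: same key -> same reducer
        reducers[hash(key(row)) % n_reducers].append(row)
    # each reducer sorts only its own share, so the work is parallel
    return [sorted(part, key=key) for part in reducers.values()]

parts = cluster_by(list(range(20)), key=lambda x: x)
```

Every output "file" ends up sorted, and no key appears in more than one file, which is exactly the N-sorted-non-overlapping-files outcome described above.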
8) How to improve performance of Hive queries ?
Ans: Different techniques can be used as required; here are some details.
Notes:
Execution Engine: Switching to Tez improves performance dramatically.
Caution: it may require additional hardware resources.
Set up: set hive.execution.engine=tez;
Reference: Horton notes
ORC Files: ORC-based tables are faster thanks to inline indexes and compression.
The compression ratio is high and compression/decompression is quick.
Ex: CREATE TABLE A_ORC ( customerID int, name string, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
Reference : Horton notes
Vectorization: Vectorized query execution improves the performance of operations like
scans, aggregations, filters and joins by performing them in batches of
1024 rows at once instead of a single row at a time.
Set up: set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Cost-based Query Optimization: Query performance improves once table and column statistics are computed.
Set up: set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Usage: analyze table Tweets compute statistics;
analyze table Tweets compute statistics for columns sender, topic;
(in most recent versions)
analyze table tweets compute statistics for columns;
Use Analytical functions: It is often better to use analytical (window) functions than joins.
Ex: Each record in the "clicks" table represents a click event, and
we would like to find the latest URL for each sessionID.
General Query:
SELECT clicks.* FROM clicks inner join (select sessionID,
max(timestamp) as max_ts from clicks group by sessionID) latest
ON clicks.sessionID = latest.sessionID and
clicks.timestamp = latest.max_ts;
Analytical Query:
SELECT * FROM (SELECT *, RANK() over (partition by sessionID
order by timestamp desc) as rank
FROM clicks) ranked_clicks
WHERE ranked_clicks.rank=1;
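Both queries boil down to "keep the row with the max timestamp per sessionID", which a short Python sketch (with hypothetical click rows) makes concrete:

```python
# Hypothetical click rows: (sessionID, timestamp, url)
clicks = [
    ("s1", 1, "/home"), ("s1", 3, "/cart"), ("s1", 2, "/item"),
    ("s2", 5, "/home"), ("s2", 9, "/checkout"),
]

# Single pass: remember the latest (timestamp, url) seen per session --
# the same result the RANK()-based query produces, without a self-join.
latest = {}
for sid, ts, url in clicks:
    if sid not in latest or ts > latest[sid][0]:
        latest[sid] = (ts, url)
```

The join version scans the table twice (once to find max_ts, once to match it back); the window-function version needs only one pass, which is why it tends to win on large tables.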
9) What is strict mode in Hive ?
Ans: Strict mode relates to partitioned tables.
If hive.exec.dynamic.partition.mode is set to strict, at least one partition column
must be given a static value; in non-strict mode, all partitions may be dynamic.
If your partitioned table is very large, you can block full-table-scan queries by
putting Hive into strict mode with the set hive.mapred.mode=strict command.
In this mode, when a user submits a query that would result in a full table scan
(i.e. one with no filter on any partition column), an error is issued.
Also, in strict mode ORDER BY must be used with a LIMIT clause.
10) How do you create a compressed table in Hive and how to view the content ?
Ans: In a session, you can use the CREATE TABLE command to create a table after the following
commands are issued.
hive> set hive.exec.compress.output=true;
hive> set mapreduce.output.fileoutputformat.compress=true;
hive> set mapreduce.output.fileoutputformat.compress.codec=
org.apache.hadoop.io.compress.GzipCodec;
hive> set mapreduce.output.fileoutputformat.compress.type=BLOCK;
hive> set hive.exec.compress.intermediate=true;
Now create the table with CREATE TABLE ... AS SELECT:
hive> CREATE TABLE zip_emp ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t' AS SELECT * FROM emp;
Garbage is shown if you cat the compressed file directly:
hive> dfs -cat /user/hive/warehouse/zip_emp/000000_0.gz;
To see the content, use text mode:
hive> dfs -text /user/hive/warehouse/zip_emp/000000_0.gz;
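Why -cat shows garbage while -text works can be reproduced locally with Python's gzip module (a stand-in for the GzipCodec output; the path and row here are made up):

```python
import gzip, os, tempfile

# Write a tiny gzip file, standing in for the compressed output Hive produces.
path = os.path.join(tempfile.mkdtemp(), "000000_0.gz")
with gzip.open(path, "wt") as f:
    f.write("1201\tsatish\t25\n")

raw = open(path, "rb").read()          # what `dfs -cat` shows: binary bytes
text = gzip.open(path, "rt").read()    # what `dfs -text` recovers
```

`dfs -cat` dumps the raw bytes (starting with the gzip magic number 1f 8b), while `dfs -text` detects the codec and decompresses before printing.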
11) Use DISTINCT in a tricky way ?
Ans: Instead of
SELECT count(DISTINCT field) FROM tab;
you can use the following so that the work spreads across more reducers:
SELECT count(1) FROM ( SELECT DISTINCT field FROM tab) t;
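The reason the subquery form is faster can be sketched in Python: the distinct step can be split across many "reducers", and only the small deduplicated sets reach the final count (a toy model with made-up data):

```python
# count(DISTINCT field) forces one reducer to see every value.
values = ["a", "b", "a", "c", "b", "a"]

# The subquery form runs in two stages:
shards = [values[0::2], values[1::2]]       # pretend two reducers got the data
stage1 = [set(s) for s in shards]           # SELECT DISTINCT field, per reducer
distinct_count = len(set().union(*stage1))  # final count(1) over the subquery
```

Stage 1 parallelizes the expensive deduplication; stage 2 only has to merge the already-small distinct sets.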
12) Use the high-cardinality column first ?
Ans: Instead of
SELECT ..... GROUP BY country, eid;
you can use the following (eid has far more distinct values than country) to improve performance:
SELECT ..... GROUP BY eid, country;
13) How to create a User Defined Function (UDF) in Hive ?
Ans: Create a Java class extending org.apache.hadoop.hive.ql.exec.UDF.
Build a jar (or add the class to your own utility jar) and upload it to the Hive cluster.
Then, in Hive:
hive> ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
hive> create temporary function to_upper as 'com.org.hive.udf.ToUpper';
hive> select to_upper(name) from emp;
14) How to remove duplicate records from a Hive table ?
Ans: create table temp as select distinct * from original_tab;
truncate table original_tab;
insert into original_tab select * from temp;
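The same three steps, modeled on a small Python list of rows (hypothetical data; `dict.fromkeys` stands in for SELECT DISTINCT):

```python
rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]

# CREATE TABLE temp AS SELECT DISTINCT * FROM original_tab;
temp = list(dict.fromkeys(rows))   # dedup, keeping first-seen order

# TRUNCATE TABLE original_tab; INSERT INTO original_tab SELECT * FROM temp;
original_tab = temp                # original now holds only unique rows
```

Note the truncate-then-insert window means the table is briefly empty; on a busy table an INSERT OVERWRITE from the distinct select is the safer one-step variant.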
15) Compare and contrast Static and Dynamic Partitions in hive ?
Ans:
Static Partition in Hive
- Loading input data files individually into a named partition is a static partition.
- Static partitions are usually preferred when loading big files into Hive tables.
- Static partitioning saves loading time compared to dynamic partitioning.
- You "statically" add a partition to the table and move the file into that partition.
- You can alter a partition in a static-partitioned table.
Dynamic Partition in Hive
- A single insert that spreads rows across partitions is a dynamic partition load.
- Dynamic partitioning usually loads data from a non-partitioned table.
- Dynamic partitioning takes more time to load data than static partitioning.
- If you want to partition by a column but don't know the partition values in advance, dynamic partitioning is the suitable choice.
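How dynamic partitioning lays out the warehouse can be simulated in Python: each distinct value of the partition column becomes its own country=<value> directory (a local-filesystem sketch with made-up rows, not HDFS):

```python
import os, tempfile

# Rows to load: (country, eid, name); country is the partition column.
rows = [("IN", 1201, "satish"), ("US", 1202, "krishna"), ("IN", 1203, "amith")]
warehouse = tempfile.mkdtemp()

# Dynamic partitioning: the target directory is derived from each row's own
# partition-column value -- no value needs to be named in advance.
for country, eid, name in rows:
    part_dir = os.path.join(warehouse, "country=%s" % country)
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data"), "a") as f:
        f.write("%d\t%s\n" % (eid, name))

partitions = sorted(os.listdir(warehouse))
```

A static load would be the degenerate case: you name one `country=` directory yourself and move a whole file into it, which is why it is faster but requires knowing the value up front.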
hive> show tables;
OK
categories
customers
---------------------------
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false
hive> set hive.cli.print.current.db=true;
hive (default)> desc formatted movie;
OK
# Detailed Table Information
hive (default)> create database if not exists vgatst;
OK
Time taken: 0.132 seconds
hive (default)> use vgatst;
-------------------------
hive (vgatst)>
> create table if not exists employee(
> eid int,
> ename string,
> age int
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> with SERDEPROPERTIES (
> "separatorChar" = ",",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> Stored as textfile;
hive (vgatst)> show tables;
OK
employee
Time taken: 0.015 seconds, Fetched: 1 row(s)
hive (vgatst)> LOAD DATA LOCAL INPATH '/home/abc/mukesh/employee.csv'
> INTO TABLE employee;
hive (vgatst)> select * from employee; -- observe that MapReduce is not run for a simple SELECT
OK
1201 satish 25
1202 krishna 28
1203 amith 39
1204 javed 23
1205 prudvi 23
1594 mukesh 38
1595 mahesh 35
Time taken: 0.156 seconds, Fetched: 7 row(s)
hive> add FILE /home/abc/mukesh/pyhive.py;
Added resources: [/home/abc/mukesh/pyhive.py]
hive> select TRANSFORM(eid,ename,age) USING 'python pyhive.py' AS (ename,val1,eid,val2,age) FROM employee; -- 5 columns returned
Total MapReduce CPU Time Spent: 1 seconds 330 msec
OK
SATISH with eid 1201 is of 25
KRISHNA with eid 1202 is of 28
AMITH with eid 1203 is of 39
JAVED with eid 1204 is of 23
PRUDVI with eid 1205 is of 23
MUKESH with eid 1594 is of 38
MAHESH with eid 1595 is of 35
hive> !cat /home/abc/mukesh/pyhive.py ;
import sys
for line in sys.stdin:
    line = line.strip()
    eid, ename, age = line.split('\t')
    ename = ename.upper()
    print('\t'.join([ename, 'with eid', eid, ' is of ', age]))
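Because TRANSFORM just streams tab-separated lines through the script, the per-line logic can be tested locally without a cluster (a hypothetical helper mirroring pyhive.py's body):

```python
def transform(line):
    # Same per-line logic as pyhive.py: uppercase the name and
    # re-emit the fields tab-joined, as TRANSFORM expects on stdout.
    eid, ename, age = line.strip().split("\t")
    return "\t".join([ename.upper(), "with eid", eid, " is of ", age])

out = transform("1201\tsatish\t25")
```

Hive hands the script each row as tab-separated text on stdin and splits the script's tab-separated stdout back into the five AS(...) columns, which is exactly what the transcript above shows.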
--------------- parquet ------
hive (vgatst)> create table pemp(eid int,ename string,age int) stored as parquet;
------------ Analytical queries -----------
hive> set hive.cli.print.current.db=true;
hive (vgatst)> create table stocks (dateyymd String, Ticker String, Open Double, High Double,
> Low Double, Close Double, Volume_for_the_day int)
> row format delimited fields terminated by ',';
hive (vgatst)> load data local inpath '/home/abc/mukesh/stock.csv' into table stocks;
Using lag, we can display the previous day's closing price for a ticker.
lag is used with an OVER clause; inside OVER you can use PARTITION BY and ORDER BY clauses.
hive (vgatst)> select ticker,dateyymd,close,lag(close,1) over(partition by ticker order by dateyymd) as yesterday_price
> from stocks where ticker='A';
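What lag(close,1) computes can be reproduced in Python on a single ticker's rows (hypothetical prices, already ordered by date, i.e. one PARTITION BY ticker slice):

```python
# (date, close) rows for one ticker, sorted as ORDER BY dateyymd would sort them.
closes = [("20100719", 27.1), ("20100720", 27.6), ("20100721", 27.4)]

# lag(close, 1): pair each row with the previous row's close;
# the first row has no predecessor, so it gets NULL (None here).
prev = [None] + [c for _, c in closes[:-1]]
with_lag = [(d, c, p) for (d, c), p in zip(closes, prev)]
```

Without an ORDER BY inside OVER, "previous" is undefined, which is why the query above orders the partition by dateyymd.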
Now, using the lead function, we will find whether the following day's closing price is higher or lower than today's.
Let's confine the query to tickers starting with 'A' to avoid a huge result.
hive (vgatst)> select ticker,dateyymd,close,
> case (lead(close,1) over(partition by ticker order by dateyymd) - close) > 0
> when true then "higher"
> when false then "lesser"
> end as Changes from stocks where ticker like 'A%';
hive (vgatst)> select ticker,high as high,Close AS Close,
> first_value(high) over(partition by ticker) AS FIRST_VAL ,
> last_value(high) over(partition by ticker) AS LAST_VAL from stocks where ticker='A';
hive (vgatst)>select ticker,count(ticker) over(partition by ticker) as cnt from stocks;
-- Best-performing stock for a day: % growth per day
select ticker,open,close, (close - open), ((close - open )/open)*100 ,
rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
from stocks where dateyymd ='20100721';
-- Top ten performing stocks of the day ---
hive (vgatst)> select ticker,open,close, (close - open), ((close - open )/open)*100 ,
> rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
> from stocks where dateyymd ='20100721' limit 10;
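The rank-and-limit logic, modeled in Python on one day's partition (hypothetical tickers and prices):

```python
# Hypothetical (ticker, open, close) rows for a single dateyymd partition.
day = [("AAA", 10.0, 11.0), ("BBB", 20.0, 21.0), ("CCC", 5.0, 6.0)]

growth = [(t, (c - o) / o * 100) for t, o, c in day]  # % growth per day
ranked = sorted(growth, key=lambda r: -r[1])          # ORDER BY growth DESC
top10 = ranked[:10]                                   # LIMIT 10
```

One caveat: in the Hive query, LIMIT is applied after ranking, but since rank() restarts per dateyymd partition, filtering on `closing <= 10` in an outer query is the more precise "top ten per day" when multiple dates are involved.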
----------------------------------------------- Indexes -----------------------------------------------
CREATE INDEX idx_cmp_stocks_dateyymd
ON TABLE stocks (dateyymd)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX idx_cmp_stocks_dateyymd on stocks REBUILD;
select ticker,open,close, (close - open), ((close - open )/open)*100 ,
rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
from stocks where dateyymd ='20100721' limit 10;
Time taken: 25.472 seconds, Fetched: 10 row(s) -- with index
hive> show formatted index on stocks;
OK
idx_name tab_name col_names idx_tab_name idx_type comment
idx_cmp_stocks_dateyymd stocks dateyymd default__stocks_idx_cmp_stocks_dateyymd__ compact
Time taken: 0.071 seconds, Fetched: 4 row(s)
DROP INDEX IF EXISTS idx_cmp_stocks_dateyymd ON stocks;
Time taken: 22.078 seconds, Fetched: 10 row(s) -- without the index: faster than with it, so this compact index does not help here
--------- BIT MAP --------------------
CREATE INDEX idx_bmp_stocks_dateyymd
ON TABLE stocks (dateyymd)
AS 'BITMAP'
WITH DEFERRED REBUILD;
ALTER INDEX idx_bmp_stocks_dateyymd on stocks REBUILD;
select ticker,open,close, (close - open), ((close - open )/open)*100 ,
rank() over(partition by dateyymd order by ((close - open )/open)*100 desc ) as closing
from stocks where dateyymd ='20100721' limit 10;
Time taken: 22.838 seconds, Fetched: 10 row(s) -- with index
DROP INDEX IF EXISTS idx_bmp_stocks_dateyymd ON stocks;
Time taken: 23.857 seconds, Fetched: 10 row(s) -- without the index: slower than with it, so the bitmap index helps (marginally)
Reading : List of Functions