Support Questions

daniel_zafar · ‎08-22-2018

Can please let me know why my Hive query is taking such a long time?

The query:

SELECT DISTINCT res_falg as n FROM my_table

took ~75 mins to complete

The query:

SELECT COUNT(*) FROM my_table WHERE res_flag = 1;

took 73 minutes to complete

The table is stored on HDFS (replicated on 4 nodes) as a CSV and is only 100MB in size. It has 6 columns, the types are VARCHAR or TINYINT. The column I'm querying has NAs. It is an external table.

My hive query is running as a Tez job on YARN using 26 cores and ~120 GB of memory. I am not using LLAP.

Any idea what's going on? I'm on HDP 3.0.

EDIT:
I imported the csv into hdfs using the following command:

hdfs dfs -Ddfs.replication=4 -put '/mounted/path/to/file.csv' /dir/file.csv

I used these commands to create the table in Hive:

CREATE EXTERNAL TABLE my_table ( 
svcpt_id VARCHAR(50),
start_date VARCHAR(8),
end_date VARCHAR(8), 
prem_id VARCHAR(20),
res_flag TINYINT,
num_prem_id TINYINT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
stored as textfile
LOCATION 'hdfs://ncienspk01/APS';
<br>LOAD DATA INPATH '/dir/file.csv' into table my_table;

kgautam · ‎09-10-2018

Try setting the parameters before running the query. Play with

set mapred.min.split.size=100000000; and max.split.size for optimal performance.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;

Cloudera Community

Support Questions

Simple hive query on small dataset taking over an hour to compute on small file