Support Questions

Find answers, ask questions, and share your expertise

Simple hive query on small dataset taking over an hour to compute on small file

avatar
Contributor

Can please let me know why my Hive query is taking such a long time?

The query:

SELECT DISTINCT res_falg as n FROM my_table

took ~75 mins to complete

The query:

SELECT COUNT(*) FROM my_table WHERE res_flag = 1;

took 73 minutes to complete


The table is stored on HDFS (replicated on 4 nodes) as a CSV and is only 100MB in size. It has 6 columns, the types are VARCHAR or TINYINT. The column I'm querying has NAs. It is an external table.

My hive query is running as a Tez job on YARN using 26 cores and ~120 GB of memory. I am not using LLAP.

Any idea what's going on? I'm on HDP 3.0.

EDIT:
I imported the csv into hdfs using the following command:

hdfs dfs -Ddfs.replication=4 -put '/mounted/path/to/file.csv' /dir/file.csv

I used these commands to create the table in Hive:

CREATE EXTERNAL TABLE my_table ( 
svcpt_id VARCHAR(50),
start_date VARCHAR(8),
end_date VARCHAR(8), 
prem_id VARCHAR(20),
res_flag TINYINT,
num_prem_id TINYINT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
stored as textfile
LOCATION 'hdfs://ncienspk01/APS';
<br>LOAD DATA INPATH '/dir/file.csv' into table my_table;
1 REPLY 1

avatar

Try setting the parameters before running the query. Play with

set mapred.min.split.size=100000000; and max.split.size for optimal performance.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;