Can please let me know why my Hive query is taking such a long time?
The query:
SELECT DISTINCT res_falg as n FROM my_table
took ~75 mins to complete
The query:
SELECT COUNT(*) FROM my_table WHERE res_flag = 1;
took 73 minutes to complete
The table is stored on HDFS (replicated on 4 nodes) as a CSV and is only 100MB in size. It has 6 columns, the types are VARCHAR or TINYINT. The column I'm querying has NAs. It is an external table.
My hive query is running as a Tez job on YARN using 26 cores and ~120 GB of memory. I am not using LLAP.
Any idea what's going on? I'm on HDP 3.0.
EDIT:
I imported the csv into hdfs using the following command:
hdfs dfs -Ddfs.replication=4 -put '/mounted/path/to/file.csv' /dir/file.csv
I used these commands to create the table in Hive:
CREATE EXTERNAL TABLE my_table (
svcpt_id VARCHAR(50),
start_date VARCHAR(8),
end_date VARCHAR(8),
prem_id VARCHAR(20),
res_flag TINYINT,
num_prem_id TINYINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile
LOCATION 'hdfs://ncienspk01/APS';
<br>LOAD DATA INPATH '/dir/file.csv' into table my_table;