
Hive Queries run slowly



I have a table containing about 4 million rows, and all queries like:

select count(*) from table1;
select col1 from table1 where col2=xxx;

always take more than 15 seconds to run.

The table has been stored as ORC with index enabled.

My environment is Cloudera 5.8.2 and the execution engine is MapReduce. For some reason, the Tez engine does not work.


Any idea how to make the queries run faster?







There are many ways to improve performance.


In your statement, you mentioned that indexing is enabled for ORC (I assume you are referring to row-group indexes, bloom filters, etc.)


1. In addition to that, you can also create an index on particular columns (on col1 and col2, which you mentioned in your example)

2. You can also change the properties below as needed. Note: I would recommend setting these parameters temporarily in the Hive/Beeline CLI before changing them permanently in hive-site.xml or the Cloudera Manager configuration.



set mapreduce.map.java.opts=...;    # Should be 80% of (mapreduce.map.memory.mb)
set mapreduce.reduce.memory.mb=5120;
set mapreduce.reduce.java.opts=...; # Should be 80% of (mapreduce.reduce.memory.mb)
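Point 1 above (a column index) can be sketched as follows, reusing the table1/col2 names from the question; the index name and the COMPACT handler are just examples ('BITMAP' is the other built-in handler), so adjust them to your schema. Note that an index created with deferred rebuild is empty until you rebuild it:

```sql
-- Hypothetical index name; COMPACT is one of Hive's built-in index handlers
CREATE INDEX table1_col2_idx
ON TABLE table1 (col2)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate the index; queries cannot benefit from it until this runs
ALTER INDEX table1_col2_idx ON table1 REBUILD;
```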





Hi Kumar,


I set the parameters in the Hive CLI and it broke my query counting the number of rows:


Error Code: 2, SQL state: 08S01
org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 2
(stack trace omitted)


I'd also like to mention that my environment is a 3-node cluster.







Follow the points below one by one:


1. As I mentioned, if you changed the parameters temporarily via the Hive CLI/Beeline, just exit from Hive and log back in; that will restore your original settings. Run the query again to confirm whether the issue you are seeing is due to the parameter change.


2. As I mentioned, you can change the properties "as needed", meaning: I don't know your memory capacity. In my example I gave 5120 MB (5 GB), but you have to adjust the numbers based on your own memory capacity.

Check your memory capacity at CM -> Hosts (menu) -> get the memory capacity for each node


2.1. To make it easier, get the current memory allocation for map and reduce tasks: go to CM -> Yarn -> Configuration -> search for "memory.mb"


Then increase it a little, based on your memory capacity.


3. Also, the log you posted is not the actual job log. Get it via the steps below:
Cloudera Manager -> Yarn -> Web UI (menu) -> ResourceManager Web UI -> (it will open the port-8088 window) -> click the Failed link (left) -> click the Application/History link -> get the diagnostics information and log


If you still need assistance, hide only the confidential information and share the complete log and diagnostics information.





A few suggestions that could help you out.


What bytes-per-reducer value have you configured? Check how many reducers are being invoked.

If you are doing a join, make sure that the largest table is listed last in the query so it can be streamed.

I would consider enabling the parallel execution mode in Hive.

Enable local mode if you can.

Have you enabled JVM reuse?
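The suggestions above can be sketched as session settings; the values are illustrative, not tuned for your 3-node cluster, and JVM reuse only applies to MR1 (it is ignored under YARN/MR2):

```sql
-- Run independent stages of a query in parallel
set hive.exec.parallel=true;

-- Let Hive run small jobs locally instead of submitting them to the cluster
set hive.exec.mode.local.auto=true;

-- Bytes of input per reducer; lowering it spawns more reducers
set hive.exec.reducers.bytes.per.reducer=268435456;

-- JVM reuse (MR1 only; has no effect under YARN)
set mapred.job.reuse.jvm.num.tasks=8;
```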


I've made a few changes, but they did not really improve performance:


set mapreduce.reduce.memory.mb=4096;

select COUNT(*) from table1;


Query ID = user1_20170116100000_878d81d3-6503-48c2-868f-1a9dcc949b63
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1484355422771_0045, Tracking URL = xxxxxxxxxxxx
Kill Command = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1484355422771_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-01-16 10:01:02,660 Stage-1 map = 0%, reduce = 0%
2017-01-16 10:01:07,780 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.25 sec
2017-01-16 10:01:12,900 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.76 sec
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1484355422771_0045
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.76 sec HDFS Read: 89678 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
Time taken: 16.164 seconds, Fetched: 1 row(s)




Complex queries can be tuned, but a count(*) query on a Hive table with 4 million records returning a result in 15 seconds is not an issue from a Hive point of view.


Still, if you need quick results, you can log in to impala-shell instead of Hive and run your query there. But please be aware that Impala will use more memory.
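For example, from impala-shell (table name taken from the question; the INVALIDATE METADATA step is only needed so Impala picks up tables created or loaded through Hive):

```sql
-- Make Impala aware of metadata changes made through Hive
INVALIDATE METADATA table1;

-- Same count; Impala's long-running daemons avoid MapReduce job-start overhead
SELECT COUNT(*) FROM table1;
```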


Even a simple query runs at the same speed as a SELECT COUNT(*)...

select * from table1 d where d.col1=375;

Both query examples read through all of the data. The expectation is that they take however long it takes to read all of it across X mappers. You need to look at the MR job counters to see whether there is a bottleneck somewhere. On a 3-node cluster, it is probably maxing out on what it can do.

With that said, having an index should have sped up the query with the restrictive WHERE clause. Was the index over the column in the WHERE clause? How did you set it up, and have you checked its existence with show index on table1;?
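A quick sketch of that check (names reused from the question):

```sql
-- List indexes defined on table1; an empty result means no index exists
SHOW FORMATTED INDEX ON table1;

-- Inspect the plan to see whether the col2 filter benefits from the index
EXPLAIN SELECT col1 FROM table1 WHERE col2=123456;
```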


I've improved the query execution time a bit by setting and running:

SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;

CREATE INDEX table1_idx_1 ON TABLE table1 (col2) AS 'BITMAP';

SELECT col1 FROM table1 WHERE col2=123456;

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1484355422771_0225, Tracking URL = xxxxxxxxxxx
Kill Command = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1484355422771_0225
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-01-17 09:48:31,281 Stage-1 map = 0%, reduce = 0%
2017-01-17 09:48:36,413 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1484355422771_0225
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.75 sec HDFS Read: 4925140 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
Time taken: 9.996 seconds, Fetched: 1 row(s)

It performs in under 10 seconds; however, I'm not sure why it takes so much time to start the query job.