Support Questions

Find answers, ask questions, and share your expertise

Hive Queries run slowly

avatar

Hello,

I've a table containing about 4 millions rows and all queries like:

select count(*) from table1 or select col1 from table1 where col2=xxx

runs always in more than 15 seconds.

The table has been stored as ORC with index enabled.

My Env. is Cloudera 5.8.2 and the engine execution is mapreduce. For some reason, TEZ engine does not work.

 

Any idea how making queries running faster ?

 

Thanks.

12 REPLIES 12

avatar
Champion

@MasterOfPuppets

 

There are so many methods to improve performance. 

 

In your statement, you have mentioned index enabled for ORC (hope you are referring to Row group/bloom filter, etc)

 

1. In addition to that, you can also create index on particular columns (On the col1, col2 that you have mentioned in your example)

2. Also You can change the property as needed.  Note: I would recommand to set the below parameters temporarily in hive/beeline CLI before change permenantly in hive-site.xml/Cloudera Manager configuration

 

 

set mapreduce.map.memory.mb=5120;
set mapreduce.map.java.opts=-Xmx4g # Should be 80% of (mapreduce.map.memory.mb)
set mapreduce.reduce.memory.mb=5120;
set mapreduce.reduce.java.opts==-Xmx4g ; # Should be 80% of (mapreduce.reduce.memory.mb)

 

Thanks

Kumar

avatar

Hi Kumar,

 

I did set the parameters in the Hive CLI and it broke my query counting number of rows:

 

Error Code: 2, SQL state: TStatus(statusCode:ERROR_STATUS, infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask:28:27, org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:374, org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:180, org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:208, org.apache.hive.service.cli.operation.Operation:run:Operation.java:316, org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:424, org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:395, sun.reflect.GeneratedMethodAccessor35:invoke::-1, sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43, java.lang.reflect.Method:invoke:Method.java:606, org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78, org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36, org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63, java.security.AccessController:doPrivileged:AccessController.java:-2, javax.security.auth.Subject:doAs:Subject.java:415, org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1693, org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59, com.sun.proxy.$Proxy25:executeStatement::-1, org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:245, org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:503, org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313, org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298, org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39, org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39, org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56, org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286, java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145, java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615, java.lang.Thread:run:Thread.java:745], sqlState:08S01, errorCode:2, errorMessage:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask),

 

I'd like also to mention that my Env. is a 3 nodes cluster.

 

Thanks

 

avatar
Champion

@MasterOfPuppets

 

Follow the below points one by one

 

1. As I mentioned already, if you change the parameter temporarily via Hive CLI/beeline, just exit from hive and login back, so it will set back to your original status now. Run the query again, confirm the issue that you are getting due to parameter change

 

2. As I mentioned already, You can change the property "as needed", meaning... I don't know your memory capacity, In my example, i've given 5120 mb (or) 5 GB... but you have to alter the numbers based on your memory capacity.

Check your memory capacity at CM -> Hosts (menu) -> Get memory capacity for each node

 

2.1. To make it more easier, get the current memory allocation for Map & Reduce by : Go to CM -> Yarn -> Configuration -> search for "memory.mb"  

 

Then increase little bit based on your memory capacity 

 

3. Also the log you are getting is not an actual log... Get it from below steps
Cloudera Manager -> Yarn -> Web UI (Menu)-> ResourceManager Web UI -> (It will open 8088 window) -> Click on Failed link (left) -> Click on Application/History link -> Get Diagnostics informations & Log

 

If you still need assitance, Hide only confidential information and share the complete log and Diagnostics informations

 

Thanks

Kumar

avatar
Champion

Few Suggestions that could help you out. 

 

How many reducers per btypes have you configured ? Check how many reducers are being invoked ? 

If you are doing a Join make sure that the large table is stated in last of the query , so it can be streamable. 

Would consider enabling the parallel execution mode in Hive.

Enable if you can the Local Mode 

Have you enabled Jvm reuse ? 

avatar

I've made few changes and it did not really improve perf:

 

set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx4g;
set mapreduce.reduce.java.opts=-Xmx4g;

select COUNT(*) from tabl1;

 


Query ID = user1_20170116100000_878d81d3-6503-48c2-868f-1a9dcc949b63
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1484355422771_0045, Tracking URL = xxxxxxxxxxxx
Kill Command = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1484355422771_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-01-16 10:01:02,660 Stage-1 map = 0%, reduce = 0%
2017-01-16 10:01:07,780 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.25 sec
2017-01-16 10:01:12,900 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.76 sec
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1484355422771_0045
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.76 sec HDFS Read: 89678 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
4019581
Time taken: 16.164 seconds, Fetched: 1 row(s)

avatar
Champion

@MasterOfPuppets

 

Complex query can be tuned but applying count(*) query on hive table with 4 million records returning result in 15 seconds is not an issue from Hive point of view

 

Still if you need quick result, you have to login to impala-shell instead of Hive and run your query. But pls be aware that impala will use more memory

avatar

Even a simple query has same velocity as a SELECT COUNT...

select * from table1 d where d.col1=375;

avatar
Champion
Both query examples read through all of the data. The expectation is that they take how ever long it takes to read all of it across X mappers. You need to get to the MR job counters to see if there is a bottleneck somewhere. On a 3 node, it is probably peaking out on what it can do.

With that said, having an index should have sped up the query with the restrictive where clause. Was the index over the column in the where clause? How did you set it up and have you check its existence with show index on table1;?

avatar

I've improved a bit the query executiont time by setting and running :

SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;


CREATE INDEX table1_idx_1 ON TABLE table1 (col2) AS BITMAP;
ANALYZE TABLE table1 COMPUTE STATISTICS for COLUMNS col2;

SELECT col1 FROM table1 WHERE col2=123456;

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1484355422771_0225, Tracking URL = xxxxxxxxxxx
Kill Command = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1484355422771_0225
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-01-17 09:48:31,281 Stage-1 map = 0%, reduce = 0%
2017-01-17 09:48:36,413 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1484355422771_0225
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.75 sec HDFS Read: 4925140 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
-3769610
Time taken: 9.996 seconds, Fetched: 1 row(s)


It performs in less than 10seconds however not sure now why it takes so much time to start the Query Job.