Created on 01-14-2017 05:55 PM - edited 09-16-2022 03:54 AM
Hello,
I have a table containing about 4 million rows, and every query like:
select count(*) from table1 or select col1 from table1 where col2=xxx
always takes more than 15 seconds.
The table has been stored as ORC with index enabled.
My environment is Cloudera 5.8.2 and the execution engine is MapReduce. For some reason, the Tez engine does not work.
Any idea how to make these queries run faster?
Thanks.
Created 01-14-2017 08:12 PM
There are many ways to improve performance.
In your post you mention that indexing is enabled for ORC (I assume you are referring to row-group indexes, bloom filters, etc.).
1. In addition to that, you can create an index on specific columns (on the col1 and col2 you mentioned in your example).
2. You can also change the properties below as needed. Note: I would recommend setting these parameters temporarily in the hive/beeline CLI before changing them permanently in hive-site.xml / the Cloudera Manager configuration.
set mapreduce.map.memory.mb=5120;
set mapreduce.map.java.opts=-Xmx4g;    # Should be ~80% of mapreduce.map.memory.mb
set mapreduce.reduce.memory.mb=5120;
set mapreduce.reduce.java.opts=-Xmx4g; # Should be ~80% of mapreduce.reduce.memory.mb
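The "80% of container memory" rule above can be sketched as a tiny helper (the function name and rounding are my own, not a Hive API):

```python
def heap_opts(container_mb):
    """Return a -Xmx value sized to ~80% of the YARN container memory,
    leaving ~20% headroom for non-heap JVM memory (stack, metaspace, etc.)."""
    heap_mb = int(container_mb * 0.8)
    return f"-Xmx{heap_mb}m"

# For a 5120 MB container this yields -Xmx4096m, i.e. the 4g in the example above.
print(heap_opts(5120))
```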
Thanks
Kumar
Created 01-14-2017 08:51 PM
Hi Kumar,
I set the parameters in the Hive CLI and they broke my query counting the number of rows:
Error Code: 2, SQL state: TStatus(statusCode:ERROR_STATUS, infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask:28:27, org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:374, org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:180, org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:208, org.apache.hive.service.cli.operation.Operation:run:Operation.java:316, org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:424, org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:395, sun.reflect.GeneratedMethodAccessor35:invoke::-1, sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43, java.lang.reflect.Method:invoke:Method.java:606, org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78, org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36, org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63, java.security.AccessController:doPrivileged:AccessController.java:-2, javax.security.auth.Subject:doAs:Subject.java:415, org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1693, org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59, com.sun.proxy.$Proxy25:executeStatement::-1, org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:245, org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:503, org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313, org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298, 
org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39, org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39, org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56, org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286, java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145, java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615, java.lang.Thread:run:Thread.java:745], sqlState:08S01, errorCode:2, errorMessage:Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask),
I'd also like to mention that my environment is a 3-node cluster.
Thanks
Created 01-14-2017 09:25 PM
Follow the points below one by one:
1. As I mentioned, if you changed the parameters temporarily via the Hive CLI/beeline, just exit from hive and log back in; that restores your original settings. Run the query again to confirm whether the issue you are seeing is due to the parameter change.
2. As I mentioned, you can change the properties "as needed", meaning: I don't know your memory capacity. In my example I used 5120 MB (5 GB), but you have to adjust the numbers based on your own memory capacity.
Check your memory capacity at CM -> Hosts (menu) -> get the memory capacity for each node.
2.1. To make it easier, get the current memory allocation for map and reduce tasks: go to CM -> Yarn -> Configuration -> search for "memory.mb".
Then increase it a little based on your memory capacity.
3. Also, the log you posted is not the actual job log. Get it from the steps below:
Cloudera Manager -> Yarn -> Web UI (Menu)-> ResourceManager Web UI -> (It will open 8088 window) -> Click on Failed link (left) -> Click on Application/History link -> Get Diagnostics informations & Log
If you still need assistance, hide only the confidential information and share the complete log and diagnostics information.
Thanks
Kumar
Created 01-15-2017 07:32 PM
A few suggestions that could help you out:
How many bytes per reducer have you configured (hive.exec.reducers.bytes.per.reducer)? Check how many reducers are being invoked.
If you are doing a join, make sure the largest table is stated last in the query so it can be streamed.
Consider enabling parallel execution mode in Hive.
Enable local mode if you can.
Have you enabled JVM reuse?
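Roughly, those suggestions map to the following Hive session settings (a sketch only; verify the property names against your CDH 5.8 Hive version, and note that JVM reuse may be ignored under YARN/MR2):

```sql
-- Run independent stages of a query in parallel (parallel execution mode)
SET hive.exec.parallel=true;

-- Let Hive run small jobs locally instead of submitting them to YARN (local mode)
SET hive.exec.mode.local.auto=true;

-- Reuse JVMs across map tasks to cut per-task startup cost
-- (classic MapReduce setting; may have no effect under YARN/MR2)
SET mapreduce.job.jvm.numtasks=10;

-- Control how much input data each reducer handles, and thus how many
-- reducers are spawned (here: 256 MB per reducer)
SET hive.exec.reducers.bytes.per.reducer=268435456;
```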
Created 01-16-2017 07:04 AM
I made a few changes and they did not really improve performance:
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx4g;
set mapreduce.reduce.java.opts=-Xmx4g;
select COUNT(*) from tabl1;
Query ID = user1_20170116100000_878d81d3-6503-48c2-868f-1a9dcc949b63
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1484355422771_0045, Tracking URL = xxxxxxxxxxxx
Kill Command = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1484355422771_0045
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-01-16 10:01:02,660 Stage-1 map = 0%, reduce = 0%
2017-01-16 10:01:07,780 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.25 sec
2017-01-16 10:01:12,900 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.76 sec
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1484355422771_0045
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.76 sec HDFS Read: 89678 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
4019581
Time taken: 16.164 seconds, Fetched: 1 row(s)
Created 01-16-2017 07:58 AM
Complex queries can be tuned, but a count(*) query on a Hive table with 4 million records returning a result in 15 seconds is not a problem from Hive's point of view.
Still, if you need quick results, you can log in to impala-shell instead of Hive and run your query there. But please be aware that Impala will use more memory.
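For example, inside impala-shell it would look roughly like this (a sketch; INVALIDATE METADATA is needed so Impala can see a table that was created through Hive):

```sql
-- Make the Hive-created table visible to Impala's catalog first
INVALIDATE METADATA table1;

-- Then run the same query; on 4 million rows this typically returns
-- much faster than a MapReduce-backed Hive query
SELECT count(*) FROM table1;
```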
Created 01-16-2017 08:08 AM
Even a simple query takes just as long as the SELECT COUNT:
select * from table1 d where d.col1=375;
Created 01-17-2017 06:52 AM
I've improved the query execution time a bit by setting and running:
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
SET hive.cbo.enable=true;
SET hive.stats.autogather=true;
CREATE INDEX table1_idx_1 ON TABLE table1 (col2) AS BITMAP;
ANALYZE TABLE table1 COMPUTE STATISTICS for COLUMNS col2;
SELECT col1 FROM table1 WHERE col2=123456;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1484355422771_0225, Tracking URL = xxxxxxxxxxx
Kill Command = /opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hadoop/bin/hadoop job -kill job_1484355422771_0225
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-01-17 09:48:31,281 Stage-1 map = 0%, reduce = 0%
2017-01-17 09:48:36,413 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_1484355422771_0225
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.75 sec HDFS Read: 4925140 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
-3769610
Time taken: 9.996 seconds, Fetched: 1 row(s)
It now performs in less than 10 seconds; however, I'm not sure why it takes so much time to start the query job.