Created 06-23-2021 07:31 AM
Dear All,
Can MapReduce2 or Tez return query output in less than 4 seconds?
Before going into a detailed explanation, here are the environment versions.
HDFS - 3.1.1.3.1
YARN - 3.1.1
MapReduce2 - 3.1.1
Tez - 0.9.1
Hive - 3.1.0
Data is in ORC file format; assume the hardware infrastructure is sufficient. Can we expect output from any data query in less than five seconds? Assume the table has been organized in the most optimal way.
Thanks in advance for your analysis.
Created 06-23-2021 10:50 PM
Hello @K_K
Hope you are doing great.
MapReduce2 and Tez can return output in less than 4 seconds, but it depends on many factors, such as query complexity, queue sizing, input data, and resource availability.
Created 06-28-2021 12:01 AM
@K_K, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur,
Created 06-28-2021 10:54 PM
Hello Shifu,
Thanks for your response.
We tried all the possibilities on an Ambari 2.7.4 cluster, but a simple query against a managed table in ORC file format still did not return output in under 4 to 5 seconds. It would be great if you could elaborate further.
Thanks,
Created 06-29-2021 07:32 AM
Hello @K_K
Once you run a query in Beeline, pick up the query ID and trace it in the HiveServer2 logs. This shows how much time the query spends in the HTTP handler thread and in the background thread, so you can spot any slowness in that part.
After that, the job reaches YARN, so check the YARN application log for the query to see where it slows down: at the AM level, at the container-assignment level, or at the task level. This way you can see where the time is going.
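The tracing step above can be roughed out in Python. This is only a minimal sketch, not a HiveServer2 tool: it keeps the log lines that mention a given query ID so they can be read in order. The function name is hypothetical, and the log file path is passed in by the caller because its location depends on the cluster configuration.

```python
import sys
from typing import Iterable, List


def lines_for_query(log_lines: Iterable[str], query_id: str) -> List[str]:
    """Return the log lines that mention query_id, in their original order."""
    if not query_id:
        raise ValueError("query_id must be a non-empty string")
    # Plain substring match: no assumption is made about the log line layout.
    return [line.rstrip("\n") for line in log_lines if query_id in line]


if __name__ == "__main__":
    # Usage (both arguments are supplied by the caller):
    #   python trace_query.py <path-to-hiveserver2-log> <query-id>
    log_path, query_id = sys.argv[1], sys.argv[2]
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for matched in lines_for_query(handle, query_id):
            print(matched)
```

The YARN side can be handled the same way: save the output of `yarn logs -applicationId <application ID>` to a file and filter it for the container or task attempt you are interested in.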
If it is a managed table, you can run a major compaction on the table to merge all the delta files into a single base file. This avoids scanning multiple delta files on HDFS each time the query runs.
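For reference, a major compaction is requested with an `ALTER TABLE ... COMPACT 'major'` statement, and `SHOW COMPACTIONS` lists its progress. The sketch below only builds those statement strings so they can be pasted into Beeline; the helper name and the table name `sales.orders` are hypothetical, and a non-partitioned table is assumed (a partitioned table would also need a `PARTITION (...)` clause, which is left out here).

```python
import re

# Accept only simple `table` or `db.table` names so that nothing unexpected
# ends up inside the generated statement.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def major_compaction_statements(table: str) -> list:
    """Build the HiveQL to request a major compaction and to check on it."""
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return [
        f"ALTER TABLE {table} COMPACT 'major';",  # queue the compaction request
        "SHOW COMPACTIONS;",                      # list compaction requests and their state
    ]


if __name__ == "__main__":
    # `sales.orders` is a made-up table name used for illustration only.
    for statement in major_compaction_statements("sales.orders"):
        print(statement)
```

Compaction runs asynchronously, so the query should be re-timed only after `SHOW COMPACTIONS` reports that the request has finished.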
You can also run EXPLAIN against the query to see the execution plan and how much data it processes.
You can also run ANALYZE against the table to collect table and column statistics, which can improve query performance.
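As a rough illustration of the last two suggestions, the sketch below builds the `EXPLAIN` and `ANALYZE TABLE` statements as strings for pasting into Beeline. The helper names, the sample query, and the table name `sales.orders` are hypothetical, and a non-partitioned table is assumed.

```python
import re

# Accept only simple `table` or `db.table` names.
_TABLE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def explain_statement(query: str) -> str:
    """Prefix a query with EXPLAIN, dropping any trailing semicolon first."""
    body = query.strip().rstrip(";").strip()
    if not body:
        raise ValueError("query must not be empty")
    return f"EXPLAIN {body};"


def analyze_statements(table: str) -> list:
    """Build the statements that collect table-level and column-level stats."""
    if not _TABLE_NAME.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return [
        f"ANALYZE TABLE {table} COMPUTE STATISTICS;",              # table stats
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS;",  # column stats
    ]


if __name__ == "__main__":
    # Made-up query and table name, for illustration only.
    print(explain_statement("SELECT COUNT(*) FROM sales.orders;"))
    for statement in analyze_statements("sales.orders"):
        print(statement)
```

Running the `EXPLAIN` statement before and after collecting statistics is one way to see whether the plan or its row estimates change.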
Not every job can complete in less than 4 seconds.
Reference:
https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ANALYZETABLE%3Ctable1%3ECACHEMETA...
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/performance-tuning/content/hive_query_result_c...
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-hiveql/content/hive_hive_3_tables.html
Created 07-06-2021 01:28 AM
Hi @K_K, has the reply helped resolve your issue? If so, can you please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future?
Regards,
Vidya Sargur,