Can MapReduce2 or Tez return output in less than 4 seconds?

Explorer

Dear All,

 

Can MapReduce2 or Tez return query output in less than 4 seconds?

Before going into a detailed explanation, here are the environment versions.

HDFS - 3.1.1.3.1

YARN - 3.1.1

MapReduce2 - 3.1.1

Tez - 0.9.1

Hive - 3.1.0

The data is in ORC file format, and please assume that the hardware infrastructure is sufficient. Can we expect output from any query in less than five seconds? Assume that the table has been organized in the most optimal way.

 

Thanks in advance for your analysis.

5 REPLIES

Expert Contributor

Hello @K_K 

Hope you are doing great.

MapReduce2 and Tez can return output in less than 4 seconds, but it depends on many factors, namely query complexity, queue sizing, input data, resource availability, and so on.

Community Manager

@K_K, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. 



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Explorer

Hello Shifu,

 

Thanks for your response.

We tried all the possibilities on an Ambari 2.7.4 cluster, but a simple query on a managed table in ORC file format still did not return output in under 4 to 5 seconds. It would be great if you could elaborate further.

 

Thanks,

Expert Contributor

Hello @K_K 

 

Once you run a query in Beeline, pick up its query ID and trace it in the HiveServer2 logs to see how much time is spent in the HTTP handler thread and in the background thread. This tells you whether there is any slowness in this part.
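As a rough illustration of the log tracing described above, here is a minimal Python sketch that keeps only the log lines mentioning a given query ID and measures the time between the first and the last of them. The timestamp layout at the start of each line is an assumption (it depends on the log4j2 pattern configured on your cluster), and the helper names are hypothetical, so adjust both to your environment.

```python
import re
from datetime import datetime

# Assumption: every log line starts with a timestamp such as
# 2024-01-01T10:00:00,123. Change both constants if your layout differs.
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})")
TS_FORMAT = "%Y-%m-%dT%H:%M:%S,%f"


def lines_for_query(log_lines, query_id):
    """Return (timestamp, line) pairs for the lines that mention query_id."""
    hits = []
    for line in log_lines:
        if query_id not in line:
            continue
        match = TS_PATTERN.match(line)
        if match:  # skip continuation lines that carry no timestamp
            stamp = datetime.strptime(match.group(1), TS_FORMAT)
            hits.append((stamp, line.rstrip("\n")))
    return hits


def elapsed_seconds(hits):
    """Seconds between the first and last log line that mention the query."""
    if len(hits) < 2:
        return 0.0
    return (hits[-1][0] - hits[0][0]).total_seconds()
```

To use it, open the HiveServer2 log file of your installation, pass the file object and the query ID from Beeline to `lines_for_query`, and then print the matching lines together with `elapsed_seconds(hits)`. Large gaps between consecutive timestamps point to the phase worth investigating.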

After this stage the job reaches YARN, so check the YARN application log for the query to see where it slows down: at the AM level, at the container assignment level, or at the task level. This way you can see where the time is going.
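One possible way to pull the YARN application log for inspection is to call the `yarn logs -applicationId` command from a small script. This is only a sketch: the function names are hypothetical, and it assumes the `yarn` client is on the PATH and that log aggregation is enabled so the logs of a finished application can be fetched.

```python
import subprocess


def yarn_logs_command(application_id):
    """Build the command line that fetches the logs of one YARN application."""
    return ["yarn", "logs", "-applicationId", application_id]


def fetch_yarn_logs(application_id):
    """Run the command and return the log text (raises if the command fails)."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The application ID is the one shown for the query in the Beeline output or in the ResourceManager UI. Once you have the text, you can search it for the AM start, container allocation, and task messages and compare their timestamps.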

If it is a managed table, you can run a major compaction on the table to merge all the delta files into a single base file. This avoids scanning multiple files in HDFS while running the query.

You can also run EXPLAIN against the query to see the execution flow and how much data it processes.

You can also run an ANALYZE query against the table to collect column statistics and table statistics, which can improve query performance.
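The compaction, EXPLAIN, and ANALYZE suggestions above could be sketched as follows. The script only builds the HiveQL statements and the matching Beeline command lines; the function names, the JDBC URL, and the table and query you pass in are placeholders of mine, not something taken from your cluster.

```python
import subprocess


def tuning_statements(table, query):
    """HiveQL for the compaction, plan, and statistics steps, in that order."""
    return [
        # Request a major compaction (merges delta files into one base file).
        f"ALTER TABLE {table} COMPACT 'major'",
        # Show the execution plan of the slow query.
        f"EXPLAIN {query}",
        # Collect table-level, then column-level statistics.
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS",
    ]


def beeline_command(jdbc_url, statement):
    """Command line that runs one statement through Beeline (-u URL, -e query)."""
    return ["beeline", "-u", jdbc_url, "-e", statement + ";"]


def run_all(jdbc_url, table, query):
    """Run every statement in turn; stops at the first failing command."""
    for statement in tuning_statements(table, query):
        subprocess.run(beeline_command(jdbc_url, statement), check=True)
```

Keep in mind that the compaction request is queued and carried out in the background, so it may not have finished when the later statements run. `SHOW COMPACTIONS` reports its state, and it can make sense to rerun EXPLAIN and ANALYZE once the compaction has completed.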

Not every job can be completed in less than 4 seconds.

Reference:
https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ANALYZETABLE%3Ctable1%3ECACHEMETA...
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/performance-tuning/content/hive_query_result_c...

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-hiveql/content/hive_hive_3_tables.html

 

Community Manager

Hi @K_K, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager

