I am trying to get results from Spark into Beeline. For one complex query I am working with, it takes anywhere from 5 to 15 seconds for the results to be displayed in Beeline.
That is, when I run the query in the Spark shell as:
val q1 = <query>
it takes 16 seconds to show the results, whereas the same query in Beeline (connected to Spark2) takes 20-24 seconds to display. After looking at the logs from the history server, I can see the following:
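To make the spark-shell side of the comparison repeatable, I time the terminal action with a small helper. This is just a sketch: `timed` is a hypothetical helper I wrote, and the `spark.sql("<query>").collect()` call in the usage comment is a placeholder for the actual query.

```scala
// Hypothetical helper: time a block of code and return its result together
// with the elapsed wall-clock seconds. Wrapping the terminal action (e.g.
// collect()) counts scheduling, execution, and result-fetch together.
def timed[T](body: => T): (T, Double) = {
  val start = System.nanoTime
  val result = body
  (result, (System.nanoTime - start) / 1e9) // elapsed seconds
}

// Usage in spark-shell (placeholder query):
//   val (rows, secs) = timed { spark.sql("<query>").collect() }
//   println(f"query took $secs%.2f s")
```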
1. The code initialization (run at ThreadPoolExecutor.java) takes the same time for both applications (Spark shell and Spark from Beeline).
2. The time taken for the "reduce" stage is the same for both.
3. Only the time taken for the map/execution stage differs (I can see this in the Summary Metrics for completed tasks). For the Spark shell, the "Max Duration" is much less than the "Max Duration" for Beeline-Spark.
I have a number of questions here that I have not been able to answer through a Google search:
1. Beeline does not display any information about the task stages on the console; it just prints the final result and the time taken. I tried changing the "advanced log4j properties" for Beeline in the Hive configs, and also some properties in the Spark log4j properties, but it did not help. How can I see the logging info?
2. What parameters might I have to change to reduce the delay between the Spark job finishing and the result being displayed on the UI?
3. What are the best values for the number of executors and the number of cores for my cluster? I have a total of 96 processors spread across 4 data nodes. Would changing these numbers help reduce the delay mentioned in question 2?
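For context on questions 1 and 2, these are the kinds of settings I mean. This is only a sketch: I am assuming the Spark2 Thrift Server honors the HiveServer2 operation-logging options, and that `spark.sql.thriftServer.incrementalCollect` applies to my Spark2 version.

```properties
# Question 1 -- surface operation logs in the Beeline console
# (assumption: the Spark2 Thrift Server honors these HiveServer2 options):
hive.server2.logging.operation.enabled=true
hive.server2.logging.operation.level=VERBOSE

# Question 2 -- stream results back to the client partition by partition
# instead of collecting everything on the driver first (spark-defaults.conf):
spark.sql.thriftServer.incrementalCollect=true
```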
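To make question 3 concrete, this is the arithmetic I would start from. The "about 5 cores per executor, 1 core reserved per node, 1 executor slot for the ApplicationMaster" rule of thumb is an assumption, not a measured result for my workload.

```scala
// Rule-of-thumb sizing sketch for 96 cores spread across 4 data nodes.
// Assumptions: reserve 1 core per node for OS/Hadoop daemons, use ~5 cores
// per executor, and leave 1 executor slot for the YARN ApplicationMaster.
val nodes            = 4
val coresPerNode     = 96 / nodes                       // 24 cores per node
val usablePerNode    = coresPerNode - 1                 // 23 after the OS reserve
val coresPerExecutor = 5
val executorsPerNode = usablePerNode / coresPerExecutor // 4 executors per node
val numExecutors     = nodes * executorsPerNode - 1     // 15, minus 1 for the AM
println(s"--num-executors $numExecutors --executor-cores $coresPerExecutor")
```

Whether these numbers actually shrink the Beeline-side gap is exactly what I am unsure about, since the extra time seems to be in result delivery rather than in the map stage itself.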
Any pointers would be highly appreciated. I am stuck and not sure how to proceed. Thanks in advance!