I recently added a compute-only node with a bigger CPU and more memory, hoping it would run Hive on Spark queries faster, but I ran into a weird problem.
With regular Hive on MapReduce, the resources of all NodeManagers (datanodes + compute node) are used.
But with Hive on Spark, the compute node is used for only 1-2 minutes and then is never used again until the query finishes.
Is there any specific configuration needed to use the compute node with Hive on Spark?
I'm using CDH 5.8.0-1.cdh5.8.0.p0.42
We have the same issue, and this is apparently an unsupported use case. It works OK with Hive on MapReduce, except that the cluster favors nodes with disks because Hadoop wants data locality (CPU, RAM, and data all on the same machine). We have observed that Impala just doesn't work, even though we can run the daemons on the compute-only boxes (they have enough disk for scratch space). We have also observed that Hive on MapReduce will use containers on those boxes only when the rest of the system is completely full, in other words, as a fallback at best. I think this is a characteristic of YARN, so it probably applies to Hive on Spark as well.
Sigh -- we could save large amounts of money and take advantage of a very high-speed network if only....
How many executors are you requesting, and how many cores? Are you using dynamic allocation? Do you see many Spark tasks waiting for executors? tharrison is correct that, like MapReduce, Spark prefers reading data locally over reading it remotely. As long as the nodes containing the data are not saturated, Spark will use the executors on those nodes to complete tasks.
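For reference, these are the settings I am asking about. Below is a minimal Python sketch that prints them as Hive session `set` statements. The property names are standard Spark properties; the values and the helper name `to_hive_set_statements` are placeholders I made up for illustration, not recommendations for your cluster.

```python
# Hypothetical values for illustration only; tune them for your own cluster.
# The property names are standard Spark configuration properties.
settings = {
    "spark.executor.instances": 12,             # executors requested (static allocation)
    "spark.executor.cores": 4,                  # cores per executor
    "spark.executor.memory": "8g",              # heap per executor
    "spark.dynamicAllocation.enabled": "false", # "true" lets Spark scale executors itself
}


def to_hive_set_statements(props):
    """Render a dict of properties as Hive session `set` statements."""
    return [f"set {key}={value};" for key, value in props.items()]


if __name__ == "__main__":
    for statement in to_hive_set_statements(settings):
        print(statement)
```

The printed lines can be pasted at the top of a Hive session before running the query, assuming your Hive on Spark setup accepts Spark properties set per session.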
The idea is that tasks complete faster when the data is local, but this does not stop tasks from reading data remotely. It sounds like the datanodes have been able to handle the workload so far, so Spark has had no need to use the compute node. You may need to increase your executors and cores until you saturate the datanodes or there are no tasks waiting to be executed in Spark. Otherwise, you could try YARN's node labels to force it to use the compute node, but you may see an increase in processing time if the CPU characteristics of the datanodes and the compute node are similar.
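To judge whether your current request can even saturate the datanodes, a back-of-the-envelope estimate like the following may help. It is only a rough sketch: the node sizes and counts are invented, the 10% / 384 MB memory overhead is my assumption about the executor overhead on YARN for your Spark version, and it ignores YARN rounding container sizes up to its allocation increment.

```python
def executors_per_node(node_mem_mb, node_vcores, exec_mem_mb, exec_cores,
                       overhead_factor=0.10, min_overhead_mb=384):
    """Estimate how many executors fit on one NodeManager.

    The overhead defaults are an assumption about Spark-on-YARN behaviour;
    check the values that apply to your Spark version.
    """
    overhead_mb = max(min_overhead_mb, int(exec_mem_mb * overhead_factor))
    by_memory = node_mem_mb // (exec_mem_mb + overhead_mb)
    by_cpu = node_vcores // exec_cores
    return min(by_memory, by_cpu)


def datanodes_saturated(requested_executors, datanode_count, per_node):
    """True if the request exceeds what the datanodes alone can host."""
    return requested_executors > datanode_count * per_node


if __name__ == "__main__":
    # Hypothetical cluster: 4 datanodes, each offering 64 GB and 16 vcores to YARN.
    per_node = executors_per_node(65536, 16, 8192, 4)
    print("executors per datanode:", per_node)
    print("12 executors saturate datanodes:", datanodes_saturated(12, 4, per_node))
    print("20 executors saturate datanodes:", datanodes_saturated(20, 4, per_node))
```

With these made-up numbers, a request for 12 executors fits entirely on the datanodes, so the compute node could stay idle, while 20 executors would spill over onto it.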