i've added a couple of nodes with more memory to my cluster (non-data nodes with only the node manager installed) and assigned them a node label: llap
i've created my llap queue, assigned it the node label llap, and set it to default.
i also set up a config group for these nodes and set the max memory appropriately, but for some reason, when i try to start hive interactive, it dies saying it doesn't have enough memory.
Component llap: specified memory size (224256) is larger than configured max container memory size (94208)
my other nodes only have 96gb of memory, so it looks like the service isn't being placed on the appropriate nodes when it starts. how can i get it to use the correct nodes to spin up the service?
@albert_ - Can you check with the command below whether the correct nodes are assigned to the node label?
yarn node -status <Node_ID>
In the config group, can you also check that the YARN memory settings are high enough for what LLAP is requesting?
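If there are several nodes to check, the two things asked about above (label and memory) can be pulled out of the `yarn node -status` output with a small script. This is only a sketch: the sample report below is hypothetical (host name and values are made up), and it assumes the report prints `Node-Labels` and `Memory-Capacity` as `key : value` lines, which may differ between Hadoop versions. `parse_node_report` and `check_node` are illustrative helper names, not part of any YARN tooling.

```python
import re

# Hypothetical sample of `yarn node -status <Node_ID>` output.
# The host name and numbers are invented for illustration.
SAMPLE_REPORT = """\
Node Report :
	Node-Id : llap-node-1.example.com:45454
	Node-State : RUNNING
	Memory-Used : 0MB
	Memory-Capacity : 230400MB
	Node-Labels : llap
"""


def parse_node_report(text):
    """Turn 'key : value' lines into a dict; lines without a value are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(" : ")
        if sep:
            fields[key] = value.strip()
    return fields


def check_node(text, expected_label, min_memory_mb):
    """Return (has_label, has_enough_memory) for one node report."""
    fields = parse_node_report(text)
    labels = [l.strip() for l in fields.get("Node-Labels", "").split(",") if l.strip()]
    match = re.search(r"\d+", fields.get("Memory-Capacity", "0"))
    capacity_mb = int(match.group()) if match else 0
    return expected_label in labels, capacity_mb >= min_memory_mb


if __name__ == "__main__":
    # 224256 is the memory size LLAP asked for in the error message.
    print(check_node(SAMPLE_REPORT, "llap", 224256))
```

To use it against a real cluster, the text passed to `check_node` would be the captured output of the command for each node ID.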
yes, the memory for that specific config group (also named llap) is set properly, and the nodes are assigned to that group. also, the max memory for both
yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb is set to 230400 for that group. i also confirmed that ONLY the 3 llap nodes are assigned the proper label, both in resource manager and with yarn node -status.
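For reference, here are the three numbers from this thread side by side. The values come straight from the error message and the post above; the `fits` helper is just an illustrative name. The request would fit under the 230400 set in the llap config group, but the limit quoted in the error is 94208, so the check seems to be reading the maximum from somewhere other than that group. Since yarn.scheduler.maximum-allocation-mb is a scheduler-side setting, one possible reading (an assumption, not confirmed in this thread) is that the value in the default config is the one being applied.

```python
MB_PER_GB = 1024

requested_mb = 224256     # "specified memory size" in the error
reported_max_mb = 94208   # "configured max container memory size" in the error
group_max_mb = 230400     # value set in the llap config group


def fits(requested, limit):
    """A container request is accepted only if it does not exceed the limit."""
    return requested <= limit


if __name__ == "__main__":
    for name, mb in [
        ("requested", requested_mb),
        ("max reported in error", reported_max_mb),
        ("max set in llap group", group_max_mb),
    ]:
        print(f"{name}: {mb} MB = {mb / MB_PER_GB:g} GB")

    print("fits under llap group max:", fits(requested_mb, group_max_mb))
    print("fits under reported max:", fits(requested_mb, reported_max_mb))
```

The 92 GB limit in the error is in line with what a 96gb node would be configured for, which matches the suspicion that the settings for the smaller nodes are the ones being used.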
@ngarg - so i kind of did the inverse: i set my default config to have the 225gb max settings, assigned all the other nodes in my cluster a label, and left the 3 new llap nodes on the default node label. i then assigned 100% of the cluster to the llap queue (with no node label) and was able to get it all to spin up. this all seems very wrong, but it's up and working. if i shut it all down and bring it back up using node labels for llap, it hangs after it spins up one container (the am one, i believe) and then eventually times out.
any thoughts? my llap queue has the llap node label assigned, no other queue has that label, and it's also set to default.
i am VERY confused 😞
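One thing that may be worth ruling out for the "AM starts, then it hangs" symptom is the per-label capacity in the capacity scheduler. As far as the YARN node label docs describe it, `yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity` defaults to 0, and the docs' example sets it at the root level as well as on the queue. The sketch below walks a queue path and lists the levels where that capacity is missing or 0. The queue path `root.llap` and the example properties are assumptions for illustration; the thread does not show the actual scheduler config, and `zero_capacity_levels` is a made-up helper name.

```python
def zero_capacity_levels(props, queue_path, label):
    """List the per-label capacity keys that are unset or 0 from root down to the queue."""
    parts = queue_path.split(".")
    problems = []
    for depth in range(1, len(parts) + 1):
        path = ".".join(parts[:depth])
        key = f"yarn.scheduler.capacity.{path}.accessible-node-labels.{label}.capacity"
        # A missing key is treated as 0, matching the documented default.
        if float(props.get(key, 0)) == 0:
            problems.append(key)
    return problems


# Hypothetical capacity-scheduler properties: the label is set up on the
# llap queue itself, but nothing is set for it on root.
example_props = {
    "yarn.scheduler.capacity.root.llap.accessible-node-labels": "llap",
    "yarn.scheduler.capacity.root.llap.accessible-node-labels.llap.capacity": "100",
    "yarn.scheduler.capacity.root.llap.default-node-label-expression": "llap",
}

if __name__ == "__main__":
    for key in zero_capacity_levels(example_props, "root.llap", "llap"):
        print("missing or zero:", key)
```

An empty result would mean every level has a non-zero capacity for the label, in which case the hang is probably caused by something else.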
@ngarg any thoughts? i think the way i got it to work is messing up other things now too. mainly, my ats-hbase service doesn't seem to be working correctly. all of my map reduce jobs are taking at least 5 minutes because they time out reporting to ats. also, the flow activity page on resource manager is broken with a 500 error... i'm wondering if my node label config has broken other services now.