
ResourceManager HA vs. YARN work-preserving restart


When ResourceManager HA is deployed, the active RM stores its state information under a ZooKeeper base path. My question is: when RM HA is enabled, should work-preserving restart for YARN be enabled along with it?

I thought that if RM HA is enabled (yarn.resourcemanager.ha.automatic-failover.enabled = true), then yarn.resourcemanager.work-preserving-recovery.enabled should be false, so that only one of these two options is true at any time.

However, the work-preserving recovery configuration also asks for a ZooKeeper address, a state-store class, and a state-store parent path. Could you give me an idea of how the two settings relate?

1 ACCEPTED SOLUTION


@sirisha A

Work-preserving ResourceManager restart ensures that applications keep running during a ResourceManager restart, with minimal impact on end users.

The overall concept is that the ResourceManager preserves application queue state in a pluggable state store, and reloads that state on restart. While the ResourceManager is down, ApplicationMasters and NodeManagers continuously poll the ResourceManager until it restarts.

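The two settings are not mutually exclusive. If automatic failover is enabled, the polling period is reduced, because a standby ResourceManager takes over instead of everyone waiting for the failed one to restart, and your jobs resume within a short time. So I would suggest setting both options to true in the configuration.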
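As a rough sketch, "both options true" could look something like the yarn-site.xml fragment below. This is an illustration, not the configuration of any particular cluster: the ZooKeeper hosts are placeholders, the store class and parent path are the commonly documented ZooKeeper-based values, and the other properties that HA needs (RM IDs, hostnames, cluster ID) are left out.

```xml
<!-- yarn-site.xml (fragment): HA with automatic failover plus work-preserving recovery -->

<!-- ResourceManager HA and automatic failover -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- Recovery: persist RM state and reload it on restart or failover -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>

<!-- Pluggable state store, here backed by ZooKeeper -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <!-- Placeholder hosts; replace with your own ZooKeeper quorum -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-state-store.parent-path</name>
  <value>/rmstore</value>
</property>
```

The ZooKeeper address, store class, and parent path appear here because the state store is what a restarted or newly active ResourceManager reads from. In a typical setup the same ZooKeeper quorum serves both HA and recovery, which is likely why those settings show up in both places.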

Hope this information helps.
