I can find very little on OS patching methodology. We are required to apply patches monthly.
I have been using the CM API to stop and start roles on a host before patching and rebooting it.
The issue appears after the reboot: CM marks every stopped role on the host as FATAL. Roles that were running before the reboot restart cleanly, but roles that were stopped first do not come back in a clean "stopped" state. The main problem is that health reporting then shows poor health because of the FATAL status, rather than a benign stopped/exited status.
What are others doing to patch on a regular basis?
The mgmt service roles run on their own host (not the one being patched) and are fine. The issue is that rebooting a host with stopped roles guarantees services with health issues afterwards; if the roles are running before the reboot, the services recover. CM does not keep track that the roles were deliberately stopped: stopping roles causes no health issues on its own, but rebooting with stopped roles does. For now I am forgoing the health check after the reboot and simply starting every role on the host regardless of its recorded state.
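For reference, a minimal sketch of that "start everything after reboot" workaround. The cm_api calls in the comments are my best understanding of the Python bindings and the hostnames/credentials are placeholders, so verify against your client version; the helper itself is plain Python.

```python
def roles_by_service(role_refs):
    # Group a host's roleRefs by service so each service's
    # start_roles() can be issued once with all of that
    # service's role names on the host.
    grouped = {}
    for ref in role_refs:
        grouped.setdefault(ref.serviceName, []).append(ref.roleName)
    return grouped

# With the cm_api Python bindings (hedged; "cm-host", the credentials,
# and "cluster1" are placeholders):
#
#   from cm_api.api_client import ApiResource
#   api = ApiResource("cm-host", username="admin", password="admin")
#   cluster = api.get_cluster("cluster1")
#   host = api.get_host(host_id)  # the host just patched and rebooted
#   for svc_name, role_names in roles_by_service(host.roleRefs).items():
#       # Start every role regardless of recorded state, since CM
#       # shows pre-stopped roles as FATAL after the reboot.
#       cluster.get_service(svc_name).start_roles(*role_names)
```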
My method is working on a small test cluster, but I am leery of using the process on our production cluster and doing OS patching in a rolling manner without taking any downtime.
To veer a little from the topic: I have not found a way to access the mgmt service via the API. The mgmt service roles show up in a host's host.roleRefs, but they are not accessible because they have no handle (that I can figure out via the API).
'cluster.get_service(rref.serviceName).stop_roles(rref.roleName)' does not work because the mgmt service is not part of 'cluster'. I can get a ref to the role but cannot actually act on it. Maybe someone knows what I am missing and can point to a link or share the secret.
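In the cm_api Python bindings, I believe the mgmt service is reached through the ClouderaManager resource rather than through a cluster, roughly as sketched below. The 'get_cloudera_manager().get_service()' call is my understanding of the client, so verify against your version; the helper assumes mgmt roleRefs carry no clusterName, since the mgmt service lives outside any cluster.

```python
def split_mgmt_refs(role_refs):
    # Assumption: roleRefs for the Cloudera Management Service have no
    # clusterName, because the mgmt service is not part of any cluster.
    cluster_refs, mgmt_refs = [], []
    for ref in role_refs:
        (mgmt_refs if ref.clusterName is None else cluster_refs).append(ref)
    return cluster_refs, mgmt_refs

# With cm_api (hedged; check these calls against your client version):
#
#   mgmt = api.get_cloudera_manager().get_service()
#   for ref in mgmt_refs:
#       # instead of cluster.get_service(ref.serviceName)
#       mgmt.stop_roles(ref.roleName)
```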