I work in a big telecom shop. One of our main Hadoop clusters (HDP) has
about 600 nodes. It is upgraded almost monthly, on top of other
maintenance. Each upgrade takes anywhere from hours to a couple of
days, and every app running on the cluster has to be shut off. I cannot
imagine that clusters doing such important work at other companies get
interrupted so often and for so long. I asked why we don't do rolling
upgrades. Here is the answer from one of our main architects. Is it
true? How about the upgrades in
Regarding rolling upgrades, I want to be careful that everyone
understands what happens during this process. Up to 12 nodes per hour
get upgraded to the next version of HDP. As this process continues,
with each passing hour the capacity of the cluster is reduced by X,
the number of nodes that get completed. When the cluster gets in the
neighborhood of 75%, a restart is required for most of the services.
The core services are kept up during the upgrade: MapReduce, HDFS,
NameNode HA, ResourceManager HA, ZooKeeper, and Hive HA if it is
configured. Spark, Kafka, Storm, and the other services are not
included in a rolling upgrade with no downtime. Express upgrade has
allowed our team to upgrade the clusters in a much shorter timeframe.
The last upgrade of the cluster took 5 hours. I believe the downtime
you stated above, 2 days and 4 hours, is not correct for the actual
HDP downtime. That figure likely covers the entire maintenance, which
would include Ambari upgrades, HDP upgrades, stopping jobs, sanity
checks, and restarting all of the jobs to catch up on batch
processing. I would suggest that your team watch for the messages
that will be sent out and stop your job at the time the upgrade will
be executing, which will be on Saturday morning. When the upgrade is
completed, another notification will be sent out and you will be able
to start your job again.
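To put the quoted rate in perspective, here is a rough back-of-the-envelope
sketch. It uses only the two figures from this thread (about 600 nodes, up
to 12 nodes per hour) and assumes the rate stays constant at its maximum,
so the result is a best case. The function names are made up for
illustration and are not part of Ambari or HDP.

```python
import math


def rolling_upgrade_hours(total_nodes: int, nodes_per_hour: int) -> int:
    """Whole hours needed to cycle every node at a fixed hourly rate."""
    if total_nodes < 0 or nodes_per_hour <= 0:
        raise ValueError("total_nodes must be >= 0 and nodes_per_hour > 0")
    return math.ceil(total_nodes / nodes_per_hour)


def nodes_upgraded_after(hours: int, total_nodes: int, nodes_per_hour: int) -> int:
    """Nodes already on the new version after a number of hours."""
    return min(total_nodes, hours * nodes_per_hour)


# Figures taken from the post: ~600 nodes, up to 12 nodes per hour.
print(rolling_upgrade_hours(600, 12))     # 50
print(nodes_upgraded_after(24, 600, 12))  # 288
```

So even at the maximum quoted rate, a full pass over 600 nodes takes about
50 hours, a little over two days, compared with the 5 hours quoted for the
last express upgrade.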
Pipeline pausing is supported during Rolling Upgrade. Essentially, when a data node is brought down for the upgrade, any clients communicating with it are temporarily paused and resume after the data node is upgraded and brought back up.
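As an illustration of the pausing idea only, here is a toy model in Python. It is not HDFS client code: `ToyDataNode`, `write_with_pause`, the tick-based clock, and the `max_wait_ticks` limit are all assumptions made up for this sketch. It shows a writer that waits while its data node is down for the upgrade and resumes with the same packet once the node is back, instead of failing the write.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Stand-in for a data node (hypothetical, not an HDFS class)."""
    up: bool = True
    received: list = field(default_factory=list)

    def write(self, packet: str) -> None:
        if not self.up:
            raise ConnectionError("data node is restarting for upgrade")
        self.received.append(packet)


def write_with_pause(node, packets, restart_at, max_wait_ticks=3):
    """Send packets in order, pausing while the node restarts.

    restart_at maps a packet index to the number of ticks the node
    stays down starting at that packet. Returns the ticks spent paused.
    """
    pauses = 0
    for i, packet in enumerate(packets):
        down_for = restart_at.get(i, 0)
        if down_for:
            node.up = False          # node is taken down for the upgrade
        waited = 0
        while not node.up:
            if waited >= max_wait_ticks:
                raise TimeoutError("node did not come back in time")
            waited += 1              # client pauses for one tick
            pauses += 1
            down_for -= 1
            if down_for <= 0:
                node.up = True       # node is back on the new version
        node.write(packet)           # resume with the same packet
    return pauses


node = ToyDataNode()
paused = write_with_pause(node, ["a", "b", "c", "d"], restart_at={2: 2})
print(paused, node.received)
```

In this run the node goes down for two ticks at the third packet; the writer pauses for those two ticks and all four packets still arrive in order. If the node stayed down longer than `max_wait_ticks`, the sketch gives up with a `TimeoutError`, which stands in for the client falling back to its normal failure handling.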
For details on how rolling upgrade works take a look at the below: