Rolling Upgrade Hadoop Cluster Questions

I work in a big telecom shop. One of our main Hadoop clusters (HDP) has about 600 nodes. It gets upgraded almost monthly, on top of other maintenance. Each upgrade takes anywhere from hours to a couple of days, and all apps running on the cluster have to be shut off. I cannot imagine that clusters doing such important work at other companies get interrupted so often and for so long. I asked why we don't do rolling upgrades. Below is the answer from one of our main architects. Is it true? How are upgrades handled at your company?

Regarding rolling upgrades, I want to be careful that everyone understands what happens during this process. Up to 12 nodes per hour get upgraded to the next version of HDP. As the process continues, with each passing hour the capacity of the cluster is reduced by the X nodes that get completed. When the cluster gets in the neighborhood of 75%, a restart is required for most of the services. The core services, such as MapReduce, HDFS, NameNode HA, ResourceManager HA, ZooKeeper, and Hive HA (if it is configured), are covered by the no-downtime process. Spark, Kafka, Storm, and the other services are not included in the no-downtime rolling upgrade.

Express upgrade has allowed our team to upgrade the clusters much faster. The last upgrade of the cluster took 5 hours. I believe the downtime you stated above, 2 days and 4 hours, would not be correct for the actual HDP downtime. That is likely the entire maintenance window, which would include Ambari upgrades, HDP upgrades, stopping jobs, sanity checks, and restarting all of the jobs to catch up with batch processing.

I would like to suggest that your team watch for the messages that will be sent out and stop your job at the time the upgrade is executing, which will be on Saturday morning. When the upgrade is completed, another notification will be sent out and you will be able to start your job again.
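To put the architect's numbers side by side, here is a rough back-of-the-envelope sketch. It assumes the rolling upgrade runs at the quoted maximum of 12 nodes per hour for the whole cluster of about 600 nodes, with no pauses or retries; the function name is my own and the real rate will likely vary.

```python
import math


def rolling_upgrade_hours(total_nodes: int, nodes_per_hour: int) -> int:
    """Whole hours needed to cycle through every node at a fixed rate.

    Assumption: the rate stays constant and no node needs a retry.
    """
    if total_nodes < 0 or nodes_per_hour <= 0:
        raise ValueError("total_nodes must be >= 0 and nodes_per_hour > 0")
    return math.ceil(total_nodes / nodes_per_hour)


# Figures taken from the post: about 600 nodes, up to 12 nodes per hour.
hours = rolling_upgrade_hours(600, 12)
print(f"{hours} hours (~{hours / 24:.1f} days)")  # 50 hours (~2.1 days)
```

Under those assumptions a full rolling pass takes about 50 hours, compared with the 5 hours quoted for the last express upgrade. The trade-off is elapsed time versus keeping the core services available.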


@Paul Wu

Pipeline pausing is supported during Rolling Upgrade. Essentially, when a data node is brought down for the upgrade, any clients communicating with it are temporarily paused and resume after the data node is upgraded and brought back up.
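For reference, the per-DataNode step in a manual HDFS rolling upgrade comes down to two `hdfs dfsadmin` calls. The sketch below only builds and prints those command lines; it does not run them. The hostname is a placeholder, the IPC port 50020 is an assumption (use whatever `dfs.datanode.ipc.address` is set to on your cluster), and on an Ambari-managed HDP cluster Ambari would normally drive these steps for you.

```python
from typing import List


def datanode_upgrade_commands(host: str, ipc_port: int) -> List[List[str]]:
    """Build the dfsadmin calls for upgrading a single DataNode.

    Returns the argument lists only; nothing is executed here.
    """
    target = f"{host}:{ipc_port}"
    return [
        # Ask the DataNode to shut down for an upgrade, so that clients
        # writing to it pause and wait for the restart.
        ["hdfs", "dfsadmin", "-shutdownDatanode", target, "upgrade"],
        # Check on the DataNode; repeat until it no longer responds,
        # then upgrade and restart it.
        ["hdfs", "dfsadmin", "-getDatanodeInfo", target],
    ]


# Hypothetical host and assumed IPC port, for illustration only.
for cmd in datanode_upgrade_commands("dn01.example.com", 50020):
    print(" ".join(cmd))
```

The `upgrade` argument on the shutdown call is what signals to writing clients that the node is coming back, which is the pausing behavior described above.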

For details on how rolling upgrade works take a look at the below:

Blog post on rolling upgrades:

Jira for rolling upgrades

High level design document for rolling upgrades