Created on 04-25-2016 11:40 PM
When performing a Rolling or Express Upgrade, failures can naturally happen because large clusters are bound to have problematic hosts.
Here are 10 easy tips to prevent, diagnose and fix errors.
1. Always upgrade Ambari to the most recent version, even if it's a dot release.
Often, there are fixes and optimizations that make the stack upgrade smoother.
2. Ensure all services are up, service checks are passing, there are no critical alerts, etc.
This helps ensure that the cluster is fully operational and helps to isolate any failures.
3. Pre-Install the bits and make sure all hosts have enough disk space. You can check that the version is found on all hosts. E.g.,
hdp-select versions | grep <new_version> | sort | tail -1
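To repeat that check across many hosts, something like the following could work. It is only a sketch: it assumes passwordless SSH to every host and a hypothetical hosts.txt file with one hostname per line, and the function names are made up for illustration.

```shell
# Succeeds if the version string "$1" appears in the text on stdin.
# stdin is expected to be the output of `hdp-select versions`.
# -F treats the dots in the version as literal characters, not wildcards.
version_installed() {
  grep -Fq -- "$1"
}

# Prints every host listed in file "$2" on which version "$1" is missing.
# Assumption: passwordless SSH works for each host in the file.
report_missing_version() {
  local version="$1"
  local hosts_file="$2"
  local host
  while read -r host; do
    [ -n "$host" ] || continue
    # ssh -n keeps ssh from swallowing the rest of the hosts file on stdin.
    if ! ssh -n "$host" hdp-select versions 2>/dev/null | version_installed "$version"; then
      echo "$host"
    fi
  done < "$hosts_file"
}

# Usage (hypothetical file name):
#   report_missing_version "<new_version>" hosts.txt
```

An empty result means every reachable host reported the version; a host that cannot be reached is also printed, so it is worth checking those by hand.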
4. Do not ignore warnings. Starting in Ambari 2.2.2, there's a flag in the ambari.properties file that allows users to bypass PreCheck errors; make sure it is either not present or set to false.
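A quick way to verify this on the Ambari server host is sketched below. The article does not name the flag; the property name stack.upgrade.bypass.prechecks and the path /etc/ambari-server/conf/ambari.properties are assumptions here, so confirm both against your Ambari version. The function name is hypothetical.

```shell
# Succeeds only if properties file "$1" sets property "$2" to true.
# Assumption: the bypass flag is named stack.upgrade.bypass.prechecks.
precheck_bypass_enabled() {
  local file="$1"
  local prop="${2:-stack.upgrade.bypass.prechecks}"
  awk -F= -v key="$prop" '
    {
      k = $1; v = $2
      gsub(/[[:space:]]/, "", k)   # tolerate spaces around key and value
      gsub(/[[:space:]]/, "", v)
    }
    k == key { last = tolower(v) }  # the last occurrence in the file wins
    END { if (last == "true") exit 0; exit 1 }
  ' "$file"
}

# Usage (assumed default location of the file):
#   if precheck_bypass_enabled /etc/ambari-server/conf/ambari.properties; then
#     echo "PreCheck bypass is ON; remove the flag or set it to false"
#   fi
```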
5. Take a backup of the Ambari database. E.g.,
# Postgres
pg_dump -U ambari ambari > /tmp/ambari_bk.psql
# MySQL
mysqldump -u ambari ambari > /tmp/ambari_bk.mysql
6. Rolling Upgrade will pause after 30% of the DataNodes have been upgraded. This allows the customer to run additional jobs and ensure that the partial upgrade is still healthy.
7. If a failure occurs, make sure that all other dependent services and masters are up, then click on "Retry".
Often, a retry will work if the previous command failed due to a timeout, a network glitch, a host that went down and came back up, etc. Capture the logs from both the component that failed and the ambari-agent; the agent's command output is at /var/lib/ambari-agent/data/output-*.txt and /var/lib/ambari-agent/data/errors-*.txt
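The agent files can be bundled into a single archive before they are rotated or overwritten. The following is a minimal sketch, not an official tool; the function name is hypothetical and the archive path must be absolute because the function changes directory.

```shell
# Packs the ambari-agent output-*.txt and errors-*.txt files into a tarball.
# $1: absolute path of the archive to create
# $2: agent data directory (defaults to the path mentioned in tip 7)
collect_agent_logs() (
  archive="$1"
  data_dir="${2:-/var/lib/ambari-agent/data}"
  cd "$data_dir" || exit 1
  set --                                  # reuse "$@" as the list of files
  for f in output-*.txt errors-*.txt; do
    if [ -f "$f" ]; then set -- "$@" "$f"; fi   # skip unmatched patterns
  done
  if [ "$#" -eq 0 ]; then
    echo "no agent logs found in $data_dir" >&2
    exit 1
  fi
  tar -czf "$archive" "$@"
)

# Usage (hypothetical archive name):
#   collect_agent_logs "/tmp/agent_logs_$(hostname).tar.gz"
```

Run it on the host where the task failed, and add the failed component's own logs separately.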
8. If the failure requires changing configs or restarting a component on a host, then click on the "Pause" button. This will temporarily suspend the Upgrade/Downgrade and allow the user to change configs and execute other commands, such as restarting services or running service checks. Once done, click on the "Resume" button.
CAUTION: do not ever add or move hosts, add or delete services, enable High Availability, or change topology while the upgrade is in progress.
If you cannot Finalize ...
9. Find the problematic hosts and components. In Ambari 2.0 - 2.2, you can run the following queries against the Ambari database:
SELECT repo_version_id, version, display_name
FROM repo_version;

-- The state for your version may be UPGRADING or UPGRADED.
-- UPGRADING: some component on a host is still not on the newer version
-- UPGRADED: all components on all hosts are on the newer version
SELECT version, state
FROM cluster_version cv
JOIN repo_version rv ON cv.repo_version_id = rv.repo_version_id
ORDER BY version DESC;

-- Find how many hosts are in each state
SELECT version, state, COUNT(*)
FROM host_version hv
JOIN repo_version rv ON hv.repo_version_id = rv.repo_version_id
GROUP BY version, state
ORDER BY version DESC, state;

-- Find components on hosts still not on the newer version
SELECT service_name, component_name, version, host_name
FROM hostcomponentstate hcs
JOIN hosts h ON hcs.host_id = h.host_id
WHERE service_name NOT IN ('AMBARI_METRICS', 'KERBEROS')
  AND component_name NOT IN ('ZKFC')
ORDER BY version, service_name, component_name, host_name;
On these hosts, do the following:
1. hdp-select set all <new_version>
2. Restart any components still on the older version (you may have to click on the "Pause" button first).
Once all hosts are on the newer version, then the Cluster Version status should transition to UPGRADED; this will allow you to Finalize the upgrade.
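To turn the component query from tip 9 into a plain list of hosts that still need attention, a small filter like this could help. It is a sketch under assumptions: the query output is expected as unaligned, pipe-separated rows (service, component, version, host), the file name components.sql is hypothetical, and the function name is made up.

```shell
# Reads rows of "service|component|version|host" on stdin and prints each
# host, once, that has at least one component whose version is not "$1".
hosts_behind() {
  awk -F'|' -v target="$1" '$3 != target && !seen[$4]++ { print $4 }'
}

# Usage (Postgres; components.sql is a hypothetical file holding the last
# query of tip 9; -A and -t give unaligned rows without headers):
#   psql -U ambari -At -F'|' -f components.sql ambari | hosts_behind "<new_version>"
```

The comparison is an exact string match, so pass the full version string as it appears in the version column, and review any rows whose version column holds something other than a version number.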
10. If you still run into problems, gather all of the logs and the results of the SQL queries, and email either Hortonworks Support or the mailing list of the component that failed.
Here's another useful query. It lists every task of the upgrade along with its group, item text and status.

Postgres:
SELECT u.upgrade_id, u.direction, u.from_version, u.to_version,
       hrc.request_id, hrc.task_id,
       substr(g.group_title, 1, 30), substr(i.item_text, 1, 30), hrc.status
FROM upgrade_group g
JOIN upgrade u ON g.upgrade_id = u.upgrade_id
JOIN upgrade_item i ON i.upgrade_group_id = g.upgrade_group_id
JOIN host_role_command hrc ON hrc.stage_id = i.stage_id AND hrc.request_id = u.request_id
ORDER BY hrc.task_id;

MySQL:
SELECT u.upgrade_id, u.direction, u.from_version, u.to_version,
       hrc.request_id, hrc.task_id,
       left(g.group_title, 30), left(i.item_text, 30), hrc.status
FROM upgrade_group g
JOIN upgrade u ON g.upgrade_id = u.upgrade_id
JOIN upgrade_item i ON i.upgrade_group_id = g.upgrade_group_id
JOIN host_role_command hrc ON hrc.stage_id = i.stage_id AND hrc.request_id = u.request_id
ORDER BY hrc.task_id;
Have fun upgrading.