Created on 10-21-2018 10:52 PM - edited 10-21-2018 10:56 PM
We have a cluster that was installed and is managed by CM6/CDH6: 1 machine for CM and 4 other machines for CDH. The embedded DB is not used; MySQL is deployed as an external DB. The cluster ran well, but then the CM machine crashed due to a hardware failure. Is there a way to replace the hardware, reinstall the same version of CM, and add the existing hosts (datanodes) to the same cluster again?
It would be sufficient for us if we could reinstall the CM machine after the crash and then add the host machines back to the existing cluster that was previously installed and managed by the same version of CM.
I tried to add the existing hosts (datanodes), but the installation stopped with the message below at Cluster Installation -> Install Parcels:
Src file /opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel does not exist
Any suggestions? Am I doing this the right way, or is there another, correct way to achieve this?
Created 10-22-2018 09:47 AM
If you lost your database and then reinstalled CM, the agents will not complete the heartbeat to the new CM since the cm_guid does not match the value in CM.
To correct this, on all hosts with agents running:
- # rm /var/lib/cloudera-scm-agent/cm_guid
- # service cloudera-scm-agent restart
I think you are seeing those errors on the parcels page because the agents are in bad health, which is due to the cm_guid mismatch.
The cm_guid is generated by CM, and the agent stores it to make sure the agent does not communicate with an unexpected CM / database. Removing the file allows the agent to accept communication with the new CM server/db that you have.
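If there are several hosts, the two agent-side steps in this answer could be scripted. The following is only a rough sketch, not an official Cloudera procedure: the host names (`cdh-node1` ... `cdh-node4`), root SSH access, and the `reset_agent_guid` helper are all assumptions made for illustration. It uses `rm -f` rather than plain `rm` so that a host where the file is already gone does not abort the run, and it defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only. The host names below are hypothetical placeholders;
# replace them with the hosts that run cloudera-scm-agent.
HOSTS="${HOSTS:-cdh-node1 cdh-node2 cdh-node3 cdh-node4}"

# File named in the answer above; the agent stores the CM GUID here.
GUID_FILE=/var/lib/cloudera-scm-agent/cm_guid

# DRY_RUN=1 (the default) only prints what would be run.
# Set DRY_RUN=0 to actually execute the commands over SSH (assumes root SSH access).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: remove the stale cm_guid and restart the agent on one host.
reset_agent_guid() {
  local host="$1"
  local cmd="rm -f ${GUID_FILE} && service cloudera-scm-agent restart"
  if [ "$DRY_RUN" = "1" ]; then
    echo "ssh root@${host} '${cmd}'"
  else
    ssh "root@${host}" "$cmd"
  fi
}

for host in $HOSTS; do
  reset_agent_guid "$host"
done
```

Run it once as-is to review the printed commands, then rerun with `DRY_RUN=0` (and your own `HOSTS` list) to apply them.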
Created 10-22-2018 10:58 AM