CM machine crashes, re-install CM on new machine and add existing Hosts(datanodes)

Contributor

 

We have a cluster that was installed and is managed by CM6/CDH6: one machine for CM and four other machines for CDH. The embedded DB is not used; MySQL is deployed as an external DB. It ran well, but then the CM machine crashed due to a hardware failure. Is there a way to replace the hardware, reinstall the same version of CM, and add the existing hosts (datanodes) to the same cluster again?

 

If there is a way to reinstall the CM machine after it crashes and then add the host machines back to an existing cluster that was previously installed and managed by the same version of CM, that would be sufficient for us.

 

I tried to add the existing hosts (datanodes), but the installation stopped with the message below at Cluster Installation -> Install Parcels:
Src file /opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel does not exist

ParcelError.png

 

Any suggestions? Am I doing this the right way, or is there another, correct way to achieve this?

1 ACCEPTED SOLUTION

Master Guru

@manjj,

 

If you lost your database and then reinstalled CM, the agents will not complete the heartbeat to the new CM because the cm_guid stored on each agent does not match the value in CM.

 

To correct this, on all hosts with agents running:

 

- # rm /var/lib/cloudera-scm-agent/cm_guid

- # service cloudera-scm-agent restart
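
For anyone repeating this on several hosts, the two steps above could be wrapped in a small shell function. This is only a sketch: the function name reset_cm_guid is made up for illustration, and the file path and restart command are passed in as arguments rather than hard-coded, so nothing here is assumed beyond the two commands already shown.

```shell
#!/bin/sh
# reset_cm_guid GUID_FILE RESTART_CMD [ARGS...]
# Hypothetical helper (not part of Cloudera Manager): removes the stored
# cm_guid file if it exists, then runs the given restart command.
reset_cm_guid() {
    if [ "$#" -lt 2 ]; then
        echo "usage: reset_cm_guid GUID_FILE RESTART_CMD [ARGS...]" >&2
        return 2
    fi

    guid_file=$1
    shift

    if [ -f "$guid_file" ]; then
        # Drop the old GUID so the agent can accept the new CM server.
        rm -f "$guid_file" || return 1
        echo "removed $guid_file"
    else
        echo "no cm_guid found at $guid_file, nothing to remove"
    fi

    # Run the restart command exactly as given (remaining arguments).
    "$@"
}
```

Run as root on each agent host, the call matching the steps above would be: reset_cm_guid /var/lib/cloudera-scm-agent/cm_guid service cloudera-scm-agent restart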

 

I think you are seeing those errors on the parcels page because the agents are in bad health, due to the cm_guid mismatch.

 

The cm_guid is generated by CM, and the agent stores it to make sure it does not communicate with an unexpected CM / database. Removing the file allows the agent to accept communication with the new CM server/db that you have.

 

 

2 REPLIES

Contributor
Thanks a lot, this works for me.