Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

CM machine crashes, re-install CM on new machine and add existing Hosts(datanodes)

Contributor

 

We have a cluster that was installed and is managed by CM6/CDH6: 1 machine for CM and 4 other machines for CDH. The embedded DB is not used; MySQL is deployed as an external DB. It ran well, but then the CM machine crashed due to a hardware failure. Is there a way to replace the hardware, reinstall the same version of CM, and add the existing hosts (datanodes) to the same cluster again?

 

All we need is a way to reinstall the CM machine after it crashes and then add the host machines back to the existing cluster that was previously installed and managed by the same version of CM.

 

I tried to add the existing hosts (datanodes), but the installation stopped at Cluster Installation -> Install Parcels with the message below:
Src file /opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel/CDH-5.15.1-1.cdh5.15.1.p0.4-el6.parcel does not exist

ParcelError.png

 

Any suggestions? Am I doing this the right way, or is there another, correct way to achieve this?

1 ACCEPTED SOLUTION

Master Guru

@manjj,

 

If you lost your database and then reinstalled CM, the agents will not complete the heartbeat to the new CM since the cm_guid does not match the value in CM.

 

To correct this, on all hosts with agents running:

 

- # rm /var/lib/cloudera-scm-agent/cm_guid

- # service cloudera-scm-agent restart
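
The two steps above could be wrapped in a small helper like the one below. This is only a sketch: the function name `reset_agent_guid`, the `GUID_FILE` and `RESTART_CMD` variables, and the choice to keep a `.bak` copy instead of deleting the file outright are assumptions added for illustration and are not part of the original answer. Run it as root on each host that has an agent.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): move the stale
# cm_guid out of the way and restart the agent so it can register
# with the newly installed CM server.

# Both values can be overridden, e.g. for a dry run against a copy.
GUID_FILE="${GUID_FILE:-/var/lib/cloudera-scm-agent/cm_guid}"
RESTART_CMD="${RESTART_CMD:-service cloudera-scm-agent restart}"

reset_agent_guid() {
    if [ -f "$GUID_FILE" ]; then
        # Assumption: keep a backup rather than a plain rm, so the old
        # value can be inspected or restored later.
        mv "$GUID_FILE" "$GUID_FILE.bak" || return 1
        echo "moved $GUID_FILE to $GUID_FILE.bak"
    else
        echo "no cm_guid found at $GUID_FILE, nothing to remove"
    fi
    # Restart the agent (intentionally unquoted so the command and its
    # arguments are split into separate words).
    $RESTART_CMD
}
```

After sourcing the script, calling `reset_agent_guid` with no arguments performs both steps on the local host. Setting `RESTART_CMD=true` and pointing `GUID_FILE` at a scratch file first is one way to try it without touching the real agent.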

 

I think you are seeing those errors on the parcels page because the agents are in bad health, which in turn is due to the cm_guid mismatch.

 

The cm_guid is generated by CM, and the agent stores it to make sure it does not communicate with an unexpected CM / database. Removing the file allows the agent to accept communication with the new CM server/db that you have.

 

 


2 REPLIES

Contributor
Thanks a lot, this works for me.