I hope someone can help me with removing a parcel from my 146-node cluster running Cloudera Manager 4.6.3. Cloudera Manager reports that the removal is stuck at 96% complete, and every attempt I have made to fix the issue has failed.
By inspecting the page with Firebug, I can see that it polls /parcels/details for the progress of the undistribution operation. Interestingly, the progress field of the response reports 140 / 146, which works out to roughly 96% complete. Curiously, 6 nodes are absent from the status, with no indication as to which ones.
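As a quick sanity check on those numbers, the arithmetic can be written out as below. This is only an illustration; the function name is made up for the example and is not part of Cloudera Manager.

```python
def progress_summary(done, total):
    """Return (rounded percent complete, number of hosts unaccounted for)."""
    percent = round(100 * done / total)
    return percent, total - done


# 140 of 146 hosts reported: about 96% complete, 6 hosts missing.
print(progress_summary(140, 146))  # (96, 6)
```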
I wrote a script to check each of the nodes for the installed parcel, but strangely all of them report that the parcel has been successfully removed (the parcel is no longer in /opt/cloudera/parcels). I have also tried reactivating the parcel through the RESTful interface. That way I am able to redistribute and reactivate it, but any attempt to remove the parcel produces the same problem again.
I have watched the logs in /var/log/cloudera-scm-server and can see the commands passing through, but there is no helpful information for debugging what exactly is going on.
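A per-node check of this kind can be sketched as follows. This is a minimal illustration, not the script actually used: it assumes passwordless SSH from the machine running it to every node, and the function names, host names, and parcel directory name are hypothetical placeholders.

```python
import subprocess

PARCEL_ROOT = "/opt/cloudera/parcels"  # parcel directory mentioned in the post


def check_parcel(host, parcel_dir, run=subprocess.run):
    """Return 'present', 'absent', or 'unknown' for one host."""
    path = f"{PARCEL_ROOT}/{parcel_dir}"
    try:
        # `test -e` exits 0 if the path exists and 1 if it does not.
        result = run(
            ["ssh", "-o", "BatchMode=yes", host, "test", "-e", path],
            capture_output=True,
            timeout=30,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "unknown"
    if result.returncode == 0:
        return "present"
    if result.returncode == 1:
        return "absent"
    return "unknown"  # for example, ssh itself failed (exit status 255)


def survey(hosts, parcel_dir, run=subprocess.run):
    """Group hosts by whether the parcel directory is still on disk."""
    report = {"present": [], "absent": [], "unknown": []}
    for host in hosts:
        report[check_parcel(host, parcel_dir, run)].append(host)
    return report


# Example call (hypothetical host names and parcel directory):
# survey(["node001", "node002"], "EXAMPLE_PARCEL_DIR")
```

Keeping unreachable hosts in a separate "unknown" bucket matters here, because a failed SSH connection would otherwise look the same as a parcel that has been removed.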
Has anyone experienced anything similar, or have any suggestions?
So, you have 146 nodes in your cluster? Are they all healthy, with CM Agents heartbeating as they should (as indicated on the Hosts page in CM)? Can you cancel the remove operation?
At a high level, I'd recommend upgrading to a newer version of CM. There have been many substantial improvements in 4.7, 4.8 and now 5.0, including more detailed error reporting and handling in these situations. If the option exists, I'd highly recommend upgrading to 5.0 (which you can do independently of the version of CDH running on your cluster, as long as it's not CDH 3).
Yeah, all 146 are in the cluster, and are reporting healthy.
The remove operation does not have a cancel option, frustratingly.
The plan is to move to a newer version, but as of this time it's not logistically feasible. I am currently in the process of making the CDH3 to 4 upgrade.
As a first step, you should restart all the CM Agents. Sometimes (in this version) they can get confused under certain conditions, and their internal knowledge of which parcels are present diverges from what's on disk. Restarting means they'll initialise themselves from the actual on-disk state, which should bring you back into sync.
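With 146 hosts, the restart is easier to script than to do by hand. The following is a rough sketch only: it assumes passwordless SSH and sudo rights on every node, and that the agent runs as the `cloudera-scm-agent` init service. The function name and host names are hypothetical.

```python
import subprocess

# Assumed service name for the CM Agent on each host.
RESTART_CMD = ["sudo", "service", "cloudera-scm-agent", "restart"]


def restart_agents(hosts, run=subprocess.run):
    """Restart the agent on each host; return the hosts where it failed."""
    failed = []
    for host in hosts:
        try:
            result = run(
                ["ssh", "-o", "BatchMode=yes", host] + RESTART_CMD,
                capture_output=True,
                timeout=120,
            )
            ok = result.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            ok = False
        if not ok:
            failed.append(host)
    return failed


# Example call (hypothetical host names):
# restart_agents(["node001", "node002"])
```

Any hosts in the returned list are worth checking by hand, since an agent that did not restart will keep its stale view of the parcels.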