After a major upgrade, our production cluster is having trouble starting its datanodes.
At first the datanodes failed to start due to OOM, so we raised their heap size from 8G to 40G.
Now some datanodes try to connect to master1 and others try to connect to master2, but only a few of them succeed.
As a result, we cannot leave safe mode.
Any suggestions are appreciated. Our online services are all down.
Here is part of the log:
Retrying connect to server: master0/IP:8020. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
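To get a rough feel for how long a datanode keeps retrying under the policy shown in that log line, here is a small Python sketch that parses the message and multiplies maxRetries by sleepTime. The helper name parse_retry_line and the result keys are made up for illustration, and the estimate only counts the fixed sleeps between attempts; it assumes nothing about per-attempt connection timeouts, which would add to the real wait.

```python
import re

# Pattern for the retry message quoted above (format taken from the log line itself).
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<server>\S+)\. "
    r"Already tried (?P<tried>\d+) time\(s\); "
    r"retry policy is (?P<policy>\w+)"
    r"\(maxRetries=(?P<max_retries>\d+), sleepTime=(?P<sleep_ms>\d+) MILLISECONDS\)"
)


def parse_retry_line(line):
    """Hypothetical helper: extract retry settings from one log line.

    Returns None when the line is not a retry message.
    """
    m = RETRY_RE.search(line)
    if m is None:
        return None
    tried = int(m.group("tried"))
    max_retries = int(m.group("max_retries"))
    sleep_ms = int(m.group("sleep_ms"))
    return {
        "server": m.group("server"),
        "policy": m.group("policy"),
        "tried": tried,
        "max_retries": max_retries,
        "sleep_ms": sleep_ms,
        # Sleep time only; connection timeouts per attempt are not included.
        "total_sleep_s": max_retries * sleep_ms / 1000,
        "remaining_sleep_s": (max_retries - tried) * sleep_ms / 1000,
    }


if __name__ == "__main__":
    sample = (
        "Retrying connect to server: master0/IP:8020. Already tried 5 time(s); "
        "retry policy is RetryUpToMaximumCountWithFixedSleep"
        "(maxRetries=50, sleepTime=1000 MILLISECONDS)"
    )
    print(parse_retry_line(sample))
```

With the values in the log (50 retries, 1000 ms sleep), the sleeps alone add up to about 50 seconds per connection attempt cycle, which may be far shorter than the time a datanode needs to finish upgrading its data.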
It turns out the datanodes spend a huge amount of time upgrading their data before sending block info to the namenode.
At present, some of them are connecting to master1 while others are connecting to master2.