Datanode unable to communicate on AmazonWS

Explorer

Dear Support

 

I have a small 3-node cluster based on the Cloudera framework, with HBase, Solr and Lily configured to mirror data.

Everything works fine, but from time to time one of the HDFS DataNodes goes down with this error:

"The Cloudera Manager Agent is not able to communicate with this role's web server."

I can see the Packet Ack Round Trip average time growing to 300 ms, and from that moment on Solr and HBase are almost unusable (the whole infrastructure becomes really slow).

 

After a variable amount of time everything seems to return to normal, but this happens 2 or 3 times a day, which makes the cluster very hard to use.

 

The architecture runs on Amazon Web Services (AWS) infrastructure, and I cannot see any particular problem on its network.

 

Can anyone help me?

Thanks in advance

 

David

4 REPLIES

Re: Datanode unable to communicate on AmazonWS

Master Guru
The packet ack avg. time measures the time taken for the HDFS packet write acknowledgements between datanodes in a write pipeline. This is a pure network write operation, i.e. no disk or service locks involved. If this grows, it definitely would indicate a delay in transfer and receipt over the network side.

Could you observe your network utilisation graphs to see if any usage spikes are observable during the time?

Also, does your DN crash and have to be restarted, or does its health just change to bad/concerning because of the agent miscommunication?

Re: Datanode unable to communicate on AmazonWS

Explorer

I attach some screenshots of the performance graphs. As you can see, there is a moment at which the garbage collection time and other timing metrics increase, and the same goes for JVM memory usage. This behaviour leads to an unexpected exit of the DataNode because of the JVM memory limit of 1024 MB.

I can also see the Host Network Throughput increase to 30M/s.

 

Concerning the DataNode state, I see an unexpected exit in 80% of cases, and the node restarts after 10-15 minutes.

Obviously these are 10-15 minutes of downtime, because it causes Solr problems. Furthermore, when it happens we sometimes end up with corrupted indexes or non-replicated data on Solr.

 

[Attached screenshots: Schermata 2015-08-24 alle 16.59.15.png, Schermata 2015-08-24 alle 16.59.24.png, Schermata 2015-08-24 alle 16.58.56.png]

Re: Datanode unable to communicate on AmazonWS

Master Guru
Per that GC time growth chart, it does look like your overall increase in block count now requires a larger heap size for the DataNodes.
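
One way to confirm that the heap is the bottleneck before resizing it is to look at how close the DataNode JVM runs to its limit. The following is only a minimal sketch: it assumes the JSON comes from the daemon's /jmx web endpoint (host and port depend on your setup and are not shown), it relies on the standard JVM `java.lang:type=Memory` bean, and the sample numbers are made up to match the 1024 MB limit mentioned earlier in the thread. The helper name `heap_usage_fraction` is hypothetical.

```python
import json


def heap_usage_fraction(jmx_payload: dict) -> float:
    """Return used/max heap from a parsed /jmx JSON response."""
    for bean in jmx_payload.get("beans", []):
        # Standard JVM memory MBean; HeapMemoryUsage holds byte counts.
        if bean.get("name") == "java.lang:type=Memory":
            usage = bean["HeapMemoryUsage"]
            return usage["used"] / usage["max"]
    raise ValueError("java.lang:type=Memory bean not found")


# Illustrative payload only (assumption): shaped like a /jmx response,
# with max set to 1024 MB and an invented "used" value.
sample = json.loads("""
{"beans": [{"name": "java.lang:type=Memory",
            "HeapMemoryUsage": {"init": 1073741824,
                                "committed": 1073741824,
                                "max": 1073741824,
                                "used": 966367641}}]}
""")

print(f"heap used: {heap_usage_fraction(sample):.0%}")
```

If the fraction stays near 1.0 while GC time climbs, that would be consistent with the heap being too small for the current block count.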

Re: Datanode unable to communicate on AmazonWS

Explorer

Thank you for your answer. During the previous week we suspected the same problem, caused by the increasing number of small files.

We are now moving these small files to other backup systems and increasing the heap memory on the DataNodes.

 

Is simply deleting the files a solution to our problem? Do we need to run specific commands to replace the files, or do anything else after deleting them?
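
To illustrate why many small files drive up the block count discussed in this thread, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (file count, 100 KB average size, 128 MiB block size, replication factor 3), not a measurement from this cluster, and `replicas_per_datanode` is a hypothetical helper. It relies only on the fact that each file occupies at least one block and each block is stored once per replica.

```python
import math


def replicas_per_datanode(num_files, avg_file_bytes, block_bytes,
                          replication, num_datanodes):
    """Estimate the average number of block replicas held by each DataNode."""
    # A file always occupies at least one block, however small it is.
    blocks_per_file = max(1, math.ceil(avg_file_bytes / block_bytes))
    total_replicas = num_files * blocks_per_file * replication
    return total_replicas / num_datanodes


BLOCK = 128 * 1024 * 1024   # assumed block size: 128 MiB
SMALL = 100 * 1024          # assumed average small-file size: 100 KB
FILES = 2_000_000           # assumed number of small files

# Many small files: one block replica per file on every node (3 nodes, RF 3).
small = replicas_per_datanode(FILES, SMALL, BLOCK, 3, 3)

# Same data volume packed into block-sized files instead.
packed_files = math.ceil(FILES * SMALL / BLOCK)
packed = replicas_per_datanode(packed_files, BLOCK, BLOCK, 3, 3)

print(f"small files: {small:,.0f} replicas per DataNode")
print(f"packed:      {packed:,.0f} replicas per DataNode")
```

Under these assumptions the replica count per DataNode depends on the number of files rather than on the volume of data, which is why removing or consolidating small files reduces the memory pressure.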

 

 

 
