I have a small 3-node cluster based on Cloudera, with HBase, Solr and Lily configured to mirror data.
Everything works fine, except that from time to time one of the HDFS DataNodes goes down with this error:
"The Cloudera Manager Agent is not able to communicate with this role's web server."
I can see the average Packet Ack Round Trip time growing to 300 ms, and from that moment Solr and HBase are almost unusable
(the whole infrastructure becomes really slow).
After a variable amount of time everything seems to return to normal, but this happens 2 or 3 times a day, which makes the cluster very hard to use.
The cluster runs on Amazon Web Services, and I cannot see any particular problem on its network.
Can anyone help me?
Thanks in advance.
I attach some screenshots of the performance graphs. As you can see, there is a moment at which the garbage collection time and the other timing metrics
increase, and the same happens to the JVM memory usage. This leads to an unexpected exit of the DataNode when it hits the JVM memory limit of 1024 MB.
I can also see the Host Network Throughput increase to 30M/s.
Concerning the DataNode state, I see an unexpected exit in 80% of cases, and after 10-15 minutes the node restarts.
Those 10-15 minutes are effectively downtime, because the outage causes Solr problems. Furthermore, when it happens we sometimes end up with corrupted indexes
or non-replicated data on Solr.
Thank you for your answer. During the previous week we suspected the same problem, caused by the increasing number of small files.
Now we are moving these small files to other backup systems and increasing the heap memory on the DataNodes.
Is simply deleting the files a solution for our problem? Do we need to run specific commands to replace the files, or do anything else, after deleting?
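
A rough way to see how many small files are still on HDFS is to parse a recursive listing. The sketch below assumes the usual eight-column output of `hdfs dfs -ls -R` (permissions, replication, owner, group, size, date, time, path). The function name `find_small_files` and the 1 MB threshold are arbitrary choices made for illustration, not Cloudera settings.

```python
from typing import Iterable, List, Tuple

def find_small_files(ls_lines: Iterable[str],
                     threshold_bytes: int = 1024 * 1024) -> List[Tuple[str, int]]:
    """Return (path, size) for every regular file smaller than threshold_bytes.

    ls_lines is assumed to be the text output of `hdfs dfs -ls -R <dir>`.
    """
    small = []
    for line in ls_lines:
        # Split into at most 8 fields so that paths containing spaces stay whole.
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue  # e.g. the "Found N items" header line
        perms, _repl, _owner, _group, size, _date, _time, path = parts
        if perms.startswith("d") or not size.isdigit():
            continue  # skip directories and anything that is not a file entry
        if int(size) < threshold_bytes:
            small.append((path.rstrip("\n"), int(size)))
    return small
```

It can be fed a saved listing, for example `find_small_files(open("listing.txt"))`, where `listing.txt` is a hypothetical file holding the output of the `hdfs dfs -ls -R` command.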