HBase Region Server down frequently

Contributor

We have a 7-node cluster in which 5 nodes run a region server. On a daily basis one of the region server nodes goes down, and I am getting the same error from each of the affected nodes. Please find the attached log (hbase-error.txt).

6 REPLIES


@Mathi Murugan

Are all of your DataNodes healthy, and do they have enough available disk space? For some reason writing a block to one of them fails, and because your replication factor is 2 and dfs.client.block.write.replace-datanode-on-failure.policy=DEFAULT, the NameNode will not try another DataNode and the write fails. So, first make sure your DataNodes are all right. If they look good, then try to set:

  1. dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
  2. dfs.client.block.write.replace-datanode-on-failure.best-effort=true
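
If you are not managing configuration through Ambari or Cloudera Manager, a minimal sketch of how these two settings would look in the client-side hdfs-site.xml (the file location depends on your distribution) is:

  <!-- hdfs-site.xml (HBase client side): keep the HDFS write pipeline going when a DataNode fails -->
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
    <value>ALWAYS</value>
  </property>
  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
    <value>true</value>
  </property>

Restart HBase afterwards so the RegionServers pick up the new client settings. You can also confirm DataNode health and free space first with hdfs dfsadmin -report.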

Contributor

@nshelke

All of my DataNodes are healthy and have enough space. My replication factor is the default of 3. Will setting dfs.client.block.write.replace-datanode-on-failure.best-effort=true result in any data loss? Kindly suggest.


@Mathi Murugan

It will not result in data loss. Can you try setting the above properties and check again?

Contributor

Hi @nshelke,

I have set these properties as you mentioned, but the HBase Master and RegionServer are still going down. There is a backup process running behind this, and it fails too.

dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS

dfs.client.block.write.replace-datanode-on-failure.best-effort=true

Please find the attached logs of the HBase Master (hmas.txt) and RegionServer (regionserver.txt).

Kindly suggest.

New Contributor

From your logs I see there are no healthy DataNodes available for the client to replace the bad DataNodes with. In addition, I see several slow sync errors, for which you will have to tune your memstore's lower and upper limit configuration to reduce how frequently data is flushed, in order to get the best out of the available heap.
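
A minimal sketch of that memstore tuning in hbase-site.xml is below. The values shown are only illustrative and have to be sized against your RegionServer heap; on HBase versions before 1.0 the equivalent properties are hbase.regionserver.global.memstore.upperLimit and hbase.regionserver.global.memstore.lowerLimit.

  <!-- hbase-site.xml: global memstore limits (illustrative values, tune against your RegionServer heap) -->
  <property>
    <!-- Fraction of the RegionServer heap shared by all memstores before updates are blocked -->
    <name>hbase.regionserver.global.memstore.size</name>
    <value>0.4</value>
  </property>
  <property>
    <!-- Fraction of the global memstore size at which flushes are forced, before the hard limit is hit -->
    <name>hbase.regionserver.global.memstore.size.lower.limit</name>
    <value>0.95</value>
  </property>

Restart the RegionServers after the change and watch whether the slow sync and flush messages become less frequent.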