Support Questions

HBase Flush?



Champion Alumni



I have a coordinator that runs every day and writes to HBase tables.

Yesterday the job failed with:


12477375 [hconnection-0x17444a28-shared--pool954-t772] INFO org.apache.hadoop.hbase.client.AsyncProcess
- #3792, table=table_name, attempt=31/35 failed 1 ops,
last exception: org.apache.hadoop.hbase.NotServingRegionException:
Region table_name,dsqdqs|122A48C3-,1439883135077.f07d81b4d4ff8e9d4170cce187fc2027.
is not online on <IP>,60020,1447053312111
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(
    at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(
    at
    at
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(
    at org.apache.hadoop.hbase.ipc.RpcExecutor$
    at


I did the following checks:

- hbase hbck ==> no errors

- hbase fsck / ==> no errors

- major_compact 'table_name' ==> after this, I managed to run the job


However, even though the workflow finished successfully, no data was written to the HBase tables.


I tried:

- flush 'table_name' ==> didn't change anything.


Do you have any suggestions on why the data is not written?

(I tried the flush command because I suspected the files were not being written.)
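
For reference (and assuming "hbase fsck /" above referred to the HDFS fsck), the checks were run roughly like this from the command line and the HBase shell:

```shell
# Check HBase region consistency (reported no errors)
hbase hbck

# Check HDFS block health (reported no errors)
hdfs fsck /

# Force a major compaction, then a flush, via the HBase shell
echo "major_compact 'table_name'" | hbase shell
echo "flush 'table_name'" | hbase shell
```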





Re: HBase Flush?

Master Guru
Are you able to run a org.apache.hadoop.hbase.mapreduce.RowCounter job on this table successfully?

A few reasons you may see NotServingRegionException are ongoing splits or balancer-invoked moves of regions. However, these should ideally last under a minute in good cases, so exhausting all 35 retries seems excessive.

I'd recommend searching for your region ID (f07d81b4d4ff8e9d4170cce187fc2027) in the HMaster log, and then in the log of the RegionServer that hosted it (per the HMaster Web UI), to see what went wrong (or was going on) during that timeframe.
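
A sketch of both steps (the log path below is a typical default, not guaranteed; adjust it to your installation):

```shell
# Run the MapReduce row counter; a clean run means every region responded.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter table_name

# Trace the region's history in the HMaster log, then repeat the same
# search in the hosting RegionServer's log.
grep f07d81b4d4ff8e9d4170cce187fc2027 /var/log/hbase/*master*.log
```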

Re: HBase Flush?

Champion Alumni



The row counter was not completing successfully. I managed to make it work only after I manually ran a major_compact on this table and balanced the regions across the Region Servers. However, balancing the tables took about 20-30 minutes.


In Cloudera Manager I had major compaction configured to run every 2 days.


How can I be sure that I won't have this situation again?


Thank you! 


Re: HBase Flush?

Champion Alumni

I searched for the root cause and found this:

DatanodeRegistration(<ip>, datanodeUuid=5a1b56f4-34ac-48da-bfd1-5b8107c26705,
infoPort=50075, ipcPort=50020,
storageInfo=lv=-56;cid=cluster18;nsid=1840079577;c=0): Got exception while serving
BP-1623273649-ip-1419337015794:blk_1076874382_3137087 to /<ip>:49919
480000 millis timeout while waiting for channel to be ready for write.
ch : java.nio.channels.SocketChannel[connected local=/<ip>:50010 remote=/<ip>:49919]
    at
    at
    at
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(
    at
    at

In fact, the RegionServer went down for 4 minutes, and then the job failed with the

NotServingRegionException (the exception that I posted in my first post).

Is the real solution to increase dfs.datanode.socket.write.timeout, as suggested in  and ?
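
If raising the timeout is the route taken, it would normally be set in hdfs-site.xml on the DataNodes (or via the equivalent safety valve in Cloudera Manager); the value below is purely an illustrative example, not a recommendation:

```xml
<!-- hdfs-site.xml: DataNode socket write timeout, in milliseconds.
     The default is 480000 (8 minutes); the doubled value here is
     only an example. -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>960000</value>
</property>
```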


In fact, what is even stranger is that the error shows

SocketTimeoutException: 480000 millis

while in my actual configuration, under HDFS -> Service Monitor Client Config Overrides, I have:



Also, in the JIRA , Uma Maheswara Rao G said that "In our observation this issue came in long run with huge no of blocks in Data Node".


In my case we have between 56334 and 80512 blocks per DataNode. Is this considered huge?



Thank you!
