Champion Alumni
Posts: 196
Registered: 11-18-2014

HBase Flush?

[ Edited ]

Hello,

 

I have a coordinator that runs every day and writes to HBase tables.

Yesterday the job failed with the following error:

 

12477375 [hconnection-0x17444a28-shared--pool954-t772] INFO org.apache.hadoop.hbase.client.AsyncProcess
- #3792, table=table_name, attempt=31/35 failed 1 ops, last exception:
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
Region table_name,dsqdqs|122A48C3-,1439883135077.f07d81b4d4ff8e9d4170cce187fc2027. is not online on <IP>,60020,1447053312111
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2762)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4268)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:3476)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:30069)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:116)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:96)
    at java.lang.Thread.run(Thread.java:745)

 

I did the following checks:

- hbase hbck ==> no error

- hbase fsck / ==> no error

- major_compact 'table_name' ==> after this, I managed to run the job

 

However, even though the workflow finished successfully, no data was written to the HBase tables.

 

I tried:

- flush 'table_name' ==> didn't change anything.

 

Do you have any suggestions on why the data is not written?

(I tried the flush command because I suspected the files had not been written to disk.)
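For reference, a quick way to check whether any data actually landed would be something like the following (table name taken from this thread; the HDFS path assumes the default namespace and the standard /hbase root, so treat it as an illustration only):

  # in the hbase shell:
  count 'table_name'                      # row count (slow on large tables)
  scan 'table_name', {LIMIT => 5}         # sample a few rows
  # on HDFS (path is an assumption based on the default layout):
  hdfs dfs -du -h /hbase/data/default/table_name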

 

 

 

GHERMAN Alina
Posts: 1,836
Kudos: 415
Solutions: 295
Registered: 07-31-2013

Re: HBase Flush?

Are you able to run a org.apache.hadoop.hbase.mapreduce.RowCounter job on this table successfully?
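For reference, the RowCounter job can be launched from the command line roughly like this, with table_name standing in for the real table:

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter table_name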

A few reasons you may see NotServingRegionException are ongoing splits or balancer-invoked moves of regions. However, these should ideally last under a minute in healthy cases, so exhausting all 35 retries seems excessive.

I'd recommend searching the history of your region ID (f07d81b4d4ff8e9d4170cce187fc2027) in the HMaster log, and then in the log of the RegionServer that hosted it (per the HMaster Web UI), to see what went wrong (or what was going on) during that timeframe.
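For example, something along these lines; the log locations below are just an assumption for a typical Cloudera Manager layout:

  # on the HMaster host
  grep f07d81b4d4ff8e9d4170cce187fc2027 /var/log/hbase/*MASTER*.log*
  # on the RegionServer host that hosted the region
  grep f07d81b4d4ff8e9d4170cce187fc2027 /var/log/hbase/*REGIONSERVER*.log*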
Champion Alumni
Posts: 196
Registered: 11-18-2014

Re: HBase Flush?

Hello,

 

The RowCounter job was not running successfully. I managed to make it work only after I manually ran a major_compact on this table and rebalanced the regions across the RegionServers. However, the region balancing took about 20-30 minutes.
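For reference, these steps can be run from the hbase shell roughly as follows (table_name is a placeholder; balance_switch and balancer act on the whole cluster, not on a single table):

  major_compact 'table_name'    # force a major compaction of the table
  balance_switch true           # make sure the balancer is enabled
  balancer                      # ask the HMaster to run a balancing pass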

 

In Cloudera Manager I have the configuration set to run a major compaction every 2 days.
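If it is useful, the periodic major compaction interval corresponds to the hbase.hregion.majorcompaction property (in milliseconds); the snippet below is only an illustration of a 2-day interval, not the exact Cloudera Manager override:

  <property>
    <name>hbase.hregion.majorcompaction</name>
    <!-- 2 days in milliseconds -->
    <value>172800000</value>
  </property>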

 

How can I be sure that I won't have this situation again?

 

Thank you! 

GHERMAN Alina
Champion Alumni
Posts: 196
Registered: 11-18-2014

Re: HBase Flush?

[ Edited ]

I searched for the root cause and found this:

DatanodeRegistration(<ip>, datanodeUuid=5a1b56f4-34ac-48da-bfd1-5b8107c26705, infoPort=50075, ipcPort=50020,
storageInfo=lv=-56;cid=cluster18;nsid=1840079577;c=0): Got exception while serving
BP-1623273649-ip-1419337015794:blk_1076874382_3137087 to /<ip>:49919
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write.
ch : java.nio.channels.SocketChannel[connected local=/<ip>:50010 remote=/<ip>:49919]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:487)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:226)
    at java.lang.Thread.run(Thread.java:745)

In fact, the RegionServer went down for 4 minutes, and then the job failed with the NotServingRegionException (the exception that I posted in my first post).

Is the real solution to increase dfs.datanode.socket.write.timeout, as suggested in
http://blog.csdn.net/odailidong/article/details/46433205 and https://issues.apache.org/jira/browse/HDFS-693 ?
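For illustration only, raising that timeout in the HDFS configuration (hdfs-site.xml, or the equivalent safety valve in Cloudera Manager) would look roughly like this; the 20-minute value is just an example, not a recommendation:

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- example: 20 minutes instead of the default 480000 ms (8 minutes) -->
    <value>1200000</value>
  </property>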

 

In fact, what is even stranger is that in the error I have

SocketTimeoutException: 480000 millis

And in my actual configuration, under HDFS -> Service Monitor Client Config Overrides, I have:

<property>
  <name>dfs.socket.timeout</name>
  <value>3000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3000</value>
</property>
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>1</value>
</property>
<property>
  <name>fs.permissions.umask-mode</name>
  <value>000</value>
</property>

 

Also, in the https://issues.apache.org/jira/browse/HDFS-693 JIRA, Uma Maheswara Rao G said that "In our observation this issue came in long run with huge no of blocks in Data Node".

 

In my case we have between 56334 and 80512 blocks per DataNode. Is this considered huge?
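For what it's worth, and if I'm not mistaken, the per-DataNode block counts can also be read from the NameNode JMX endpoint (host is a placeholder; 50070 is the default NameNode HTTP port, and the LiveNodes attribute should list a numBlocks figure per node):

  curl -s 'http://<namenode_host>:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo' | grep -o 'numBlocks[^,]*'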

 

 

Thank you!

GHERMAN Alina