
When a disk develops bad sectors, the HDFS and HBase logs show error messages

We have encountered issues on our cluster that seem to be caused by bad disks. The issues impacted the applications (APs) running on the cluster: users reported that the AP could not respond to their read requests.

Our cluster has 15 datanodes, each with 16 x 1.2 TB SAS disks on an ext4 filesystem, and the HDFS replication factor is 3. The HDFS version is 2.6.0+cdh5.5.4 and the HBase version is 1.0.0+cdh5.5.4.
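For reference, this is roughly how we verify the replication factor and the replica locations of an HBase store file on HDFS. It is only a minimal sketch: the class name is ours, and the store file path is simply one taken from the HBase log excerpt further down.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReplicas {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath, so dfs.replication = 3 applies.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Example store file path, copied from the compaction error in the HBase log below.
        Path storeFile = new Path(
                "/hbase/data/default/LSR_INDEX/8d4cfbb882b3fce2530d504809ddf394/f/c3f337357e7b4b338011fd02390d0837");
        FileStatus status = fs.getFileStatus(storeFile);
        System.out.println("replication = " + status.getReplication());
        // Show which datanodes hold each block of the file.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
        }
    }
}

The same information can be obtained from the command line with hdfs fsck -files -blocks -locations.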

One day a disk on the 8th datanode started reporting errors. Below are the kernel messages for the bad disk.

...

end_request: critical medium error, dev sdj, sector 33554583

EXT4-fs (sdj1): delayed block allocation failed for inode 53882602 at logical offset 0 with max blocks 2048 with error -5

...

In the 8th datanode's HDFS (datanode) log we see warnings such as the messages below. They appeared about 175 times between 17:30 and 23:30 while the disk was failing (a rough way to count them is sketched after the excerpt).

...

2018-12-21 18:25:11,868 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1175954675, srvID: b71a5942-2c61-432e-8b3f-c9223d0527c4, success: true
2018-12-21 18:25:11,870 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opReadBlock BP-477378406-172.17.37.101-1467787109028:blk_1175954675_104601855 received exception java.io.IOException: Could not create DataChecksum of type 0 with bytesPerChecksum 0
2018-12-21 18:25:11,870 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.17.37.111, datanodeUuid=b71a5942-2c61-432e-8b3f-c9223d0527c4, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=cluster14;nsid=1810448338;c=0):Got exception while serving BP-477378406-172.17.37.101-1467787109028:blk_1175954675_104601855 to /172.17.37.111:57826
java.io.IOException: Could not create DataChecksum of type 0 with bytesPerChecksum 0
at org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:125)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:175)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:140)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readDataChecksum(BlockMetadataHeader.java:102)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:302)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:530)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
at java.lang.Thread.run(Thread.java:745)
2018-12-21 18:25:11,870 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: cemdn08:50010:DataXceiver error processing READ_BLOCK operation src: /172.17.37.111:57826 dst: /172.17.37.111:50010
java.io.IOException: Could not create DataChecksum of type 0 with bytesPerChecksum 0
at org.apache.hadoop.util.DataChecksum.newDataChecksum(DataChecksum.java:125)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:175)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readHeader(BlockMetadataHeader.java:140)
at org.apache.hadoop.hdfs.server.datanode.BlockMetadataHeader.readDataChecksum(BlockMetadataHeader.java:102)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:302)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:530)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
at java.lang.Thread.run(Thread.java:745)
2018-12-21 18:25:11,902 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: 1175947831, srvID: b71a5942-2c61-432e-8b3f-c9223d0527c4, success: true

...
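For reference, a count like the 175 above can be obtained with a simple line count over the datanode log; a minimal sketch, in which the log file path is only an example and should point at the actual datanode log for cemdn08:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CountChecksumErrors {
    public static void main(String[] args) throws Exception {
        // Example path; adjust to the real datanode log location on cemdn08.
        try (Stream<String> lines = Files.lines(
                Paths.get("/var/log/hadoop-hdfs/hadoop-hdfs-datanode-cemdn08.log"))) {
            long count = lines
                    .filter(line -> line.contains("Could not create DataChecksum of type 0"))
                    .count();
            System.out.println("DataChecksum errors: " + count);
        }
    }
}

A plain grep -c on the log file gives the same number, of course.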

 

We also see errors such as the messages below in the HBase RegionServer log on the 8th datanode. They appeared about 91 times between 17:30 and 23:30 while the disk was failing.

...

2018-12-21 18:37:12,669 INFO org.apache.hadoop.hbase.regionserver.HStore: Starting compaction of 3 file(s) in f of LSR_INDEX,466xxxxxxx:922xxxxxxx,1478909856539.8d4cfbb882b3fce2530d504809ddf394. into tmpdir=hdfs://nameservice1/hbase/data/default/LSR_INDEX/8d4cfbb882b3fce2530d504809ddf394/.tmp, totalSize=35.7 M
2018-12-21 18:37:12,712 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction failed Request = regionName=LSR_INDEX,466xxxxxxx1791:922xxxxxxx,1478909856539.8d4cfbb882b3fce2530d504809ddf394., storeName=f, fileCount=3, fileSize=35.7 M (11.4 M, 12.0 M, 12.3 M), priority=4, time=44854987223299137
java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://nameservice1/hbase/data/default/LSR_INDEX/8d4cfbb882b3fce2530d504809ddf394/f/c3f337357e7b4b338011fd02390d0837, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=22017, currentSize=529459528, freeSize=1967278520, maxSize=2496738048, heapSize=529459528, minSize=2371901184, minFactor=0.95, multiSize=1185950592, multiFactor=0.5, singleSize=592975296, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=466xxxxxxx:922xxxxxxx/f:1545CellID/1545381369014/Put, lastKey=466xxxxxxx:922xxxxxxx/f:1645VOLTE_ERAB_RELEASE/1545384926020/Put, avgKeyLen=65, avgValueLen=4, entries=152250, length=12592273, cur=null] to key /f:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:172)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:363)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:281)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:243)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:342)
at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:88)
at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1233)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1777)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:520)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Invalid HFile block magic: \x00\x00\x00\x00\x00\x00\x00\x00
at org.apache.hadoop.hbase.io.hfile.BlockType.parse(BlockType.java:154)
at org.apache.hadoop.hbase.io.hfile.BlockType.read(BlockType.java:167)
at org.apache.hadoop.hbase.io.hfile.HFileBlock.<init>(HFileBlock.java:252)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1644)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1467)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:430)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:870)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:254)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:156)
... 12 more

...

 

We also checked the AP's error log (excerpt below) and found the message "Caused by: java.io.IOException: java.io.IOException: Could not seek StoreFileScanner", the same error that appears in the HBase log. However, the error shows up far fewer times in the AP's log (about 5) than in the HBase log (91).

....

[QkCob4l8iM9xDG6PphpoZIDa][ERROR][18/10/22 09:58:47][com.sti.cem.portal.action.ComplaintAction-676][csc04227]-[ComplainAction][showBasic]BigDataException:com.sti.cem.portal.exception.BigDataException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=32, exceptions:
Mon Oct 22 09:58:47 CST 2018, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68396: row '466xxxxxxx' on table 'LSR_INDEX' at region=LSR_INDEX,466xxxxxxx:922xxxxxxx,1478635524420.e7f95888e5d7f322a40e8e539d4d68f4., hostname=cemdn10,60020,1505974861959, seqNum=20554820

...

Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68396: row '466977101897942' on table 'LSR_INDEX' at region=LSR_INDEX,466xxxxxxx:922xxxxxxx,1478635524420.e7f95888e5d7f322a40e8e539d4d68f4., hostname=cemdn10,60020,1505974861959, seqNum=20554820
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more
[HFileScanner for reader reader=hdfs://nameservice1/hbase/data/default/LSR_INDEX/e7f95888e5d7f322a40e8e539d4d68f4/f/01fb1150c5604dd787e44b47aceebd96, compression=none, cacheConf=blockCache=LruBlockCache{blockCount=53269, currentSize=1248319520, freeSize=1248418528, maxSize=2496738048, heapSize=1248319520, minSize=2371901184, minFactor=0.95, multiSize=1185950592, multiFactor=0.5, singleSize=592975296, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=466xxxxxxx:922xxxxxxx/f:0830CellID/1540171211958/Put, lastKey=466xxxxxxx:922xxxxxxx/f:2345VOLTE_ERAB_RELEASE/1540139796879/Put, avgKeyLen=65, avgValueLen=4, entries=865015, length=68091187, cur=null] to key 466xxxxxxx/f:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:172)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:363)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:217)
at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2033)
at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2023)
at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5145)
at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2464)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2450)
at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2427)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2154)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2034)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
at java.lang.Thread.run(Thread.java:745)

...
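For reference, the "Failed after attempts=32" and "callTimeout=60000" in the AP's exception are the HBase client's own retry/timeout settings, so a single error surfaced to the AP can hide many retried RPCs against the RegionServer. Below is a minimal sketch of a client read with those knobs made explicit; the values simply mirror the numbers in our logs, and we are not certain whether callTimeout is governed by hbase.rpc.timeout or hbase.client.operation.timeout in this client version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Client-side retry/timeout knobs; the values only mirror the numbers seen in the AP log.
        conf.setInt("hbase.client.retries.number", 32);
        conf.setInt("hbase.rpc.timeout", 60000);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("LSR_INDEX"))) {
            // Row key is masked the same way as in the logs.
            Result r = table.get(new Get(Bytes.toBytes("466xxxxxxx")));
            System.out.println("row empty? " + r.isEmpty());
        }
    }
}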

 

My questions are:

1. Is the error in the HBase log related to the error in the HDFS log?

2. Why does the HBase log show so many error messages (91) while the AP log shows only a few (about 5)?

3. The replication factor is 3. Why couldn't HDFS serve the reads from one of the other two replicas?

 

I would be grateful if someone could shed some light on the behaviour we are seeing on our datanodes.

Thanks.
