Data Nodes displaying incorrect block report

Contributor

I am seeing a strange issue with 3 of the 8 data nodes in our HDP 2.6.0 cluster. These 3 data nodes are not reporting the correct number of blocks and are not sending block reports to the name node at regular intervals.

Ambari is reporting:

[Alert][datanode_storage] Unable to extract JSON from JMX response

Any suggestions as to what is wrong with our cluster?

Thanks in advance for your assistance.


Attachments: namenode-ui.png, datanode-ui.png, data-node-jmx.png
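A quick way to narrow down the "Unable to extract JSON from JMX response" alert is to fetch the daemon's /jmx page yourself and check whether the body parses as JSON with a top-level "beans" list. The sketch below only covers the parsing step. The function name and the sample body are hypothetical, and the bean attributes in the sample are illustrative rather than copied from a real DataNode.

```python
import json


def extract_beans(jmx_text):
    """Return the list of JMX beans, or raise ValueError if the body is unusable."""
    try:
        payload = json.loads(jmx_text)
    except json.JSONDecodeError as exc:
        # An HTML error page or a truncated body ends up here.
        raise ValueError(f"JMX response is not valid JSON: {exc}") from exc
    beans = payload.get("beans") if isinstance(payload, dict) else None
    if not isinstance(beans, list):
        raise ValueError("JMX response has no 'beans' list")
    return beans


# Hypothetical sample body, shaped like a Hadoop /jmx response.
sample = '{"beans": [{"name": "Hadoop:service=DataNode,name=FSDatasetState", "Capacity": 100, "Remaining": 40}]}'
print(extract_beans(sample)[0]["Remaining"])
```

To run it against a live node, pass in the text returned by the DataNode web port (50075 by default on Hadoop 2.x, assuming the port was not changed) under the /jmx path. A ValueError here suggests the daemon is answering slowly or with a non-JSON body, which is what the Ambari alert complains about.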

New Contributor
Services Reporting Alerts
UNKNOWN [AMBARI_METRICS]
AMBARI_METRICS
UNKNOWN Metrics Collector - HBase Master CPU Utilization
[Alert][ams_metrics_collector_hbase_master_cpu] Unable to extract JSON from JMX response

Is this a big issue? Is there anything to worry about?

Another alert is:

Services Reporting Alerts
OK [AMBARI_METRICS]
AMBARI_METRICS
OK Metrics Collector - HBase Master CPU Utilization
12 CPU, load 12.8%

Please advise. I am waiting for your kind response.

Contributor

Thank you so much @Geoffrey Shelton Okot for assistance on this. I really appreciate it.

1. The MTU setting is the same on all our data nodes. I have verified it.

2. I have performed the TestDFSIO test. Please see the attachment for the test results.

3. I enabled GC debugging. My hadoop-env template looks like this:

export HADOOP_DATANODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=800m -XX:MaxNewSize=800m -XX:PermSize=128m -XX:MaxPermSize=256m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms{{dtnode_heapsize}} -Xmx{{dtnode_heapsize}} -Dhadoop.security.logger=ERROR,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_DATANODE_OPTS} -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseParNewGC"

After enabling GC debugging and restarting the name nodes and data nodes, the alarm below disappeared:

Unable to extract JSON from JMX response error

But now I am getting the error below on the problematic data node in hadoop-hdfs-datanode-.log:

2018-05-30 19:53:32,985 WARN datanode.DataNode (BPServiceActor.java:offerService(673)) - IOException in offerService
java.io.EOFException: End of File Exception between local host is: "datanodehost/"; destination host is: "Namenodehost":8020; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:211)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:374)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:645)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:785)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1119)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1014)
2018-05-30 19:53:33,100 INFO datanode.DataNode (DataXceiver.java:writeBlock(669)) - Receiving BP-1033621575--1507285615620:blk_1461467777_387788610 src: /:42658 dest: /:50010
2018-05-30 19:53:33,878 INFO datanode.DataNode (DataXceiver.java:writeBlock(669)) - Receiving BP-1033621575--1507285615620:blk_1461467782_387788615 src: /:43782 dest: /:50010
2018-05-30 19:53:36,197 INFO datanode.DataNode (DataXceiver.java:writeBlock(669)) - Receiving BP-1033621575--1507285615620:blk_1368137451_294431710 src: /:52176 dest: /:50010

GC.log

9239114K(31375360K), 0.0954324 secs] [Times: user=0.75 sys=0.00, real=0.10 secs]
2018-05-30T20:37:23.000+0200: 15180.545: [GC (Allocation Failure) 2018-05-30T20:37:23.000+0200: 15180.545: [ParNew: 733378K->81919K(737280K), 0.0994234 secs] 9892898K->9739137K(31375360K), 0.0996623 secs] [Times: user=0.78 sys=0.01, real=0.10 secs]
2018-05-30T20:37:29.962+0200: 15187.508: [GC (Allocation Failure) 2018-05-30T20:37:29.963+0200: 15187.508: [ParNew: 727808K->81689K(737280K), 0.1043798 secs] 10385026K->10379938K(31375360K), 0.1046235 secs] [Times: user=0.83 sys=0.00, real=0.11 secs]
2018-05-30T20:37:33.884+0200: 15191.430: [GC (Allocation Failure) 2018-05-30T20:37:33.885+0200: 15191.430: [ParNew: 733664K->81919K(737280K), 0.1201577 secs] 11031913K->10881691K(31375360K), 0.1203890 secs] [Times: user=0.95 sys=0.00, real=0.12 secs]
2018-05-30T20:37:41.029+0200: 15198.574: [GC (Allocation Failure) 2018-05-30T20:37:41.029+0200: 15198.575: [ParNew: 727734K->78326K(737280K), 0.1015139 secs] 11527506K->11522912K(31375360K), 0.1017500 secs] [Times: user=0.81 sys=0.00, real=0.10 secs]
2018-05-30T20:37:44.780+0200: 15202.325: [GC (Allocation Failure) 2018-05-30T20:37:44.780+0200: 15202.325: [ParNew: 730789K->81920K(737280K), 0.0937630 secs] 12175374K->12020024K(31375360K), 0.0939903 secs] [Times: user=0.74 sys=0.00, real=0.09 secs]
2018-05-30T20:37:51.818+0200: 15209.363: [GC (Allocation Failure) 2018-05-30T20:37:51.818+0200: 15209.363: [ParNew: 723037K->78409K(737280K), 0.1089323 secs] 12661141K->12638859K(31375360K), 0.1091735 secs] [Times: user=0.87 sys=0.01, real=0.11 secs]
2018-05-30T20:37:55.071+0200: 15212.616: [GC (Allocation Failure) 2018-05-30T20:37:55.071+0200: 15212.616: [ParNew: 733424K->81919K(737280K), 0.0912281 secs] 13293874K->13139143K(31375360K), 0.0914462 secs] [Times: user=0.72 sys=0.00, real=0.09 secs]
2018-05-30T20:38:02.582+0200: 15220.127: [GC (Allocation Failure) 2018-05-30T20:38:02.582+0200: 15220.127: [ParNew: 731000K->80436K(737280K), 0.1039197 secs] 13788224K->13781232K(31375360K), 0.1041447 secs] [Times: user=0.82 sys=0.00, real=0.10 secs]
2018-05-30T20:38:05.811+0200: 15223.356: [GC (Allocation Failure) 2018-05-30T20:38:05.811+0200: 15223.356: [ParNew: 734976K->81919K(737280K), 0.0843448 secs] 14435772K->14285826K(31375360K), 0.0845672 secs] [Times: user=0.67 sys=0.00, real=0.09 secs]
2018-05-30T20:38:13.249+0200: 15230.794: [GC (Allocation Failure) 2018-05-30T20:38:13.249+0200: 15230.794: [ParNew: 725770K->80833K(737280K), 0.0967994 secs] 14929677K->14924119K(31375360K), 0.0970191 secs] [Times: user=0.76 sys=0.00, real=0.10 secs]
2018-05-30T20:38:16.685+0200: 15234.231: [GC (Allocation Failure) 2018-05-30T20:38:16.686+0200: 15234.231: [ParNew: 735203K->81920K(737280K), 0.0984436 secs] 15578489K->15419615K(31375360K), 0.0986753 secs] [Times: user=0.78 sys=0.00, real=0.10 secs]
2018-05-30T20:38:24.385+0200: 15241.930: [GC (Allocation Failure) 2018-05-30T20:38:24.385+0200: 15241.930: [ParNew: 735008K->79750K(737280K), 0.0981608 secs] 16072704K->16066284K(31375360K), 0.0983850 secs] [Times: user=0.78 sys=0.00, real=0.09 secs]
2018-05-30T20:38:27.513+0200: 15245.058: [GC (Allocation Failure) 2018-05-30T20:38:27.513+0200: 15245.058: [ParNew: 731825K->81920K(737280K), 0.0928862 secs] 16718359K->16566812K(31375360K), 0.0931079 secs] [Times: user=0.73 sys=0.00, real=0.10 secs]
2018-05-30T20:38:35.118+0200: 15252.664: [GC (Allocation Failure) 2018-05-30T20:38:35.119+0200: 15252.664: [ParNew: 728589K->81823K(737280K), 0.1155139 secs] 17213482K->17208899K(31375360K), 0.1157287 secs] [Times: user=0.91 sys=0.01, real=0.11 secs]
2018-05-30T20:38:39.004+0200: 15256.549: [GC (Allocation Failure) 2018-05-30T20:38:39.004+0200: 15256.549: [ParNew: 735843K->81920K(737280K), 0.0939004 secs] 17862919K->17682067K(31375360K), 0.0941023 secs] [Times: user=0.74 sys=0.00, real=0.10 secs]
2018-05-30T20:38:46.888+0200: 15264.433: [GC (Allocation Failure) 2018-05-30T20:38:46.888+0200: 15264.433: [ParNew: 730708K->78583K(737280K), 0.0952740 secs] 18330855K->18343737K(31375360K), 0.0954785 secs] [Times: user=0.75 sys=0.01, real=0.09 secs]
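One way to read a GC.log excerpt like this: in each ParNew entry, the figures after the ParNew section give whole-heap occupancy before and after the young collection, followed by the total heap size. The sketch below pulls out those figures and computes how fast the post-collection occupancy is rising. It assumes the ParNew/CMS log layout shown in this post; the function names are illustrative, and the two sample lines are the first and last complete entries from the excerpt.

```python
import re

# JVM uptime, then the whole-heap "before->after(total)" triple after the ParNew section.
PATTERN = re.compile(
    r"(\d+\.\d+): \[GC \(Allocation Failure\).*?"
    r"\[ParNew: \d+K->\d+K\(\d+K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\((\d+)K\)"
)


def heap_after_gc(log_text):
    """Return (uptime_seconds, heap_used_after_kb, heap_total_kb) per young collection."""
    return [
        (float(uptime), int(after), int(total))
        for uptime, _before, after, total in PATTERN.findall(log_text)
    ]


def growth_kb_per_sec(samples):
    """Average rise in post-collection heap occupancy between first and last sample."""
    (t0, used0, _), (t1, used1, _) = samples[0], samples[-1]
    return (used1 - used0) / (t1 - t0)


LOG = """\
2018-05-30T20:37:23.000+0200: 15180.545: [GC (Allocation Failure) 2018-05-30T20:37:23.000+0200: 15180.545: [ParNew: 733378K->81919K(737280K), 0.0994234 secs] 9892898K->9739137K(31375360K), 0.0996623 secs] [Times: user=0.78 sys=0.01, real=0.10 secs]
2018-05-30T20:38:46.888+0200: 15264.433: [GC (Allocation Failure) 2018-05-30T20:38:46.888+0200: 15264.433: [ParNew: 730708K->78583K(737280K), 0.0952740 secs] 18330855K->18343737K(31375360K), 0.0954785 secs] [Times: user=0.75 sys=0.01, real=0.09 secs]
"""

samples = heap_after_gc(LOG)
for uptime, used_kb, total_kb in samples:
    print(f"{uptime:.3f}s  used={used_kb}K of {total_kb}K")
print(f"growth: {growth_kb_per_sec(samples):.0f} KB/s")
```

On these two entries the occupancy after collection rises from 9,739,137K to 18,343,737K in about 84 seconds, roughly 100 MB per second, while each young collection itself takes only about 0.1 seconds. That figure only shows how much data is surviving into the old generation between young collections. A later CMS cycle may reclaim much of it, so it is a prompt for further checking rather than proof of a leak.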

The issue still persists with the data nodes. 3 out of 8 data nodes are reporting a very low number of blocks.

Please assist.

Master Mentor

@Samant Thakur

There is definitely a network problem with the 3 nodes. Are they the same hardware, NIC, and network?

Average IO rate mb/sec: 27.063095092773438 
Average IO rate mb/sec: 19.786481857299805

"Allocation Failure" is the cause that triggers a GC cycle.

A GC allocation failure means that the garbage collector could not move objects from young gen to old gen fast enough because it does not have enough memory in old gen. This can cause application slowness.

What is your DataNode maximum Java heap size?

https://community.hortonworks.com/questions/64677/datanode-heapsize-computation.html

https://community.hortonworks.com/questions/45381/do-i-need-to-tune-java-heap-size.html

https://community.hortonworks.com/questions/78981/data-node-heap-size-warning.html

Do you have NameNode HA configured? If so, it may be that a failover has occurred but the client did not detect it and retry its operation.

If it isn't a production cluster, can you restart all the components?

Contributor

Thank you! I really appreciate your time and efforts.

1. The data node heap size is 30 GB. My worry is why only 3 nodes show the issue and not the others, if something is wrong with the configuration. What would be the ideal heap size for data nodes? Do you have any idea? I did not find any formula to calculate the heap size for data nodes.

2. We are using name node HA. I suspect that an HA switchover might have caused this problem. I have restarted all the components. What should I check for if the issue was caused by name node HA? The name node heap size is 75 GB, with 70% used.

Master Mentor

@Samant Thakur

Did you go through the links I posted above?

Your data node and Namenode heap sizes need some tuning.

Are you seeing any high heap size alerts for the data nodes?

Estimating NameNode Heap Memory Needed

Memory is estimated by considering the capacity of a cluster. Values are rounded. The example cluster below physically stores 4800 TB, or approximately 36 million block files (at the default block size). Replication determines how many namespace blocks represent these block files.

200 hosts of 24 TB each = 4800 TB.

  • Blocksize=128 MB, Replication=3
  • Cluster capacity in MB: 200 * 24,000,000 MB = 4,800,000,000 MB (4800 TB)
  • Disk space needed per block: 128 MB per block * 3 = 384 MB storage per block
  • Cluster capacity in blocks: 4,800,000,000 MB / 384 MB = 12,500,000 blocks (rounded to 12 million)

At capacity, with the recommended allocation of 1 GB of memory per million blocks, this cluster will need 12 GB of maximum heap space.

Hope that helps. Please revert!
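The estimate above can be written as a small calculation. This is a minimal sketch of the rule of thumb quoted in this post (1 GB of heap per million blocks). The function name and parameter names are illustrative, and 1 TB is taken as 1,000,000 MB to match the figures above.

```python
def estimate_namenode_heap(hosts, tb_per_host, block_mb=128, replication=3,
                           gb_per_million_blocks=1.0):
    """Return (namespace_blocks_at_capacity, heap_gb) for a cluster filled with full blocks."""
    capacity_mb = hosts * tb_per_host * 1_000_000      # 1 TB counted as 1,000,000 MB
    storage_per_block_mb = block_mb * replication      # every block is stored `replication` times
    blocks = capacity_mb / storage_per_block_mb
    heap_gb = blocks / 1_000_000 * gb_per_million_blocks
    return blocks, heap_gb


blocks, heap_gb = estimate_namenode_heap(hosts=200, tb_per_host=24)
print(f"{blocks:,.0f} blocks, {heap_gb:.1f} GB heap")
```

For the 200-host example this gives 12,500,000 blocks and 12.5 GB, which the text rounds to 12 million blocks and 12 GB. The result assumes every block is a full 128 MB, so a cluster holding many small files will have far more blocks than this figure for the same disk capacity.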

Contributor

@Geoffrey Shelton Okot

Yes, I have been through the posts you mentioned. We had data node failure issues in the past; increasing the heap size fixed them, but I will fine-tune the sizes. Below is the heap utilization for the data nodes (max heap 30 GB). The data nodes with high heap usage (marked in red) are the problematic ones.

76545-heap-size.png

Hadoop env

      SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile={{hdfs_log_dir_prefix}}/$USER/hs_err_pid%p.log -XX:NewSize={{namenode_opt_newsize}} -XX:MaxNewSize={{namenode_opt_maxnewsize}} -Xloggc:{{hdfs_log_dir_prefix}}/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms{{namenode_heapsize}} -Xmx{{namenode_heapsize}} -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT"
      export HADOOP_NAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node\" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 ${HADOOP_NAMENODE_OPTS}"
      
export HADOOP_DATANODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=800m -XX:MaxNewSize=800m -Xloggc:/var/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms{{dtnode_heapsize}} -Xmx{{dtnode_heapsize}} -Dhadoop.security.logger=ERROR,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_DATANODE_OPTS} -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseParNewGC "

export HADOOP_SECONDARYNAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS} -XX:OnOutOfMemoryError=\"/usr/hdp/current/hadoop-hdfs-secondarynamenode/bin/kill-secondary-name-node\" ${HADOOP_SECONDARYNAMENODE_OPTS}"

You mentioned: "A GC allocation failure means that the garbage collector could not move objects from young gen to old gen fast enough because it does not have enough memory in old gen."

Which parameter holds the value for the old gen?

We have 8 data nodes: CPU 2*8, memory 256 GB, disk 12*6 = 72 TB.

8 hosts of 72 TB each = 576 TB.

  • our cluster Blocksize=128 MB, Replication=3
  • Cluster capacity in MB: 8* 72,000,000 MB = 576,000,000 MB (576TB)
  • Disk space needed per block: 128 MB per block * 3 = 384 MB storage per block
  • Cluster capacity in blocks: 576,000,000 MB / 384 MB = 1,500,000 blocks

But Ambari is reporting 156,710,872 blocks. Am I missing something here?
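One way to read the gap between the two numbers: the capacity formula assumes every block is a full 128 MB, whereas an HDFS block only occupies as much disk as the data it holds. The sketch below turns the reported count into an upper bound on the average block size. It assumes the Ambari figure counts namespace blocks (not replicas) and that replication is 3 throughout; the function name is illustrative.

```python
def max_avg_block_mb(raw_capacity_mb, reported_blocks, replication=3):
    """Upper bound on average block size if all reported blocks fit on the raw disks."""
    return raw_capacity_mb / (reported_blocks * replication)


raw_mb = 8 * 72_000_000            # 8 hosts of 72 TB each, from the figures above
reported = 156_710_872             # block count reported by Ambari
full_block_estimate = raw_mb // (128 * 3)

print(full_block_estimate)         # 1500000
print(f"{max_avg_block_mb(raw_mb, reported):.2f} MB per block at most")
print(f"{reported / full_block_estimate:.0f}x the full-block estimate")
```

Under these assumptions the blocks can average no more than about 1.2 MB each, which points to a very large number of small files rather than an error in the arithmetic.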

76547-blocks.png

Awaiting your response. Thank you so much!


Master Mentor

@Samant Thakur

Have you configured your cluster for rack awareness?

  • Rack awareness prevents data loss
  • Rack awareness improves network performance

HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.

You will need the help of your network/data center team to share the network topology and how the nodes are spread out in the racks.

You can use Ambari UI --> Hosts to set the rack topology once you know the subnets and DC setup. To understand this better, see HDP rack awareness, and also HCC rack-awareness-series-1 and HCC rack-awareness-series-2.

Hope that helps.

Contributor

@Geoffrey Shelton Okot, thank you so much for getting back to me.

We don't have rack awareness enabled on our DR cluster, as it is only an 8 data node cluster. We do have rack awareness in our production cluster.

We can enable rack awareness later, but my first priority is to get the blocks back on the data nodes, as the faulty data nodes are not sending any block reports to the name node. Here is the current status as of today.

76584-data-node-06052018.png

I am still getting the EOFException error on the problematic data nodes. The other data nodes are not giving this error.

I checked with our network team, and they said all the data nodes are connected to the same NIC and there is no packet loss.

The hardware team found some correctable memory errors but nothing major.

Is there a maximum number of blocks that a particular data node can retain? I mean, is it possible that the problematic data nodes exceeded such a limit and, because of that, stopped sending block reports to the name node due to some capacity or resource constraint? Please guide. Do I need to report this as a bug to the Apache foundation?

java.io.EOFException: End of File Exception between local host is: "DATANODE HOST"; destination host is: "NAMENODE HOST":8020; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException
        at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:765)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1558)
        at org.apache.hadoop.ipc.Client.call(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1398)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
        at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:211)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:374)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:645)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:785)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1119)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1014)