Created 05-24-2016 10:11 AM
I have observed that copying data to HDFS takes a lot of time.
I was trying to copy a 100 MB file to HDFS.
I never experienced this issue earlier.
I have found the following logs on my datanode.
76 INFO datanode.DataNode (DataXceiver.java:writeBlock(658)) - Receiving BP-1475253775-10.200.146.164-1463754036445:blk_1073742241_1417 src: /10.200.146.165:51570 dest: /10.200.146.165:50010
2016-05-24 15:33:03,397 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 488ms (threshold=300ms)
2016-05-24 15:33:05,175 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 327ms (threshold=300ms)
2016-05-24 15:33:07,961 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 334ms (threshold=300ms)
2016-05-24 15:33:11,061 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 426ms (threshold=300ms)
2016-05-24 15:33:17,277 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 336ms (threshold=300ms)
Created 05-24-2016 10:13 AM
There might be a network issue.
SYMPTOM: Slow reading/writing of data to HDFS with -put.
ROOT CAUSE: The message "Slow BlockReceiver write packet to mirror" is normally an indication that there is a problem with the underlying networking infrastructure.
SOLUTION: Ensure at the OS level that the networking and NIC parameters are set up correctly:
- verify that the MTU value is set as expected
- the communication mode is correct (full duplex)
- there are not too many errors at the interface level (dropped packets/overruns)
COMMANDS TO USE FOR DEBUGGING:
# dmesg <-- identify issues with the NIC device driver
# ifconfig -a <-- MTU, errors (dropped packets/buffer overruns)
# ethtool ethX <-- identify/set speed, negotiation mode and duplex setting for the interface
In addition, running iperf between datanodes will highlight overall network transfer issues:
# iperf -s (server)
# iperf -c 10.1.1.1 -f m -d (client)
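For example, a quick way to gather the interface settings and error counters from every datanode in one pass could look like the sketch below (dn1..dn3 and eth0 are placeholders; it assumes passwordless SSH between the nodes):
# loop over the datanodes and print link statistics plus negotiated speed/duplex
for h in dn1 dn2 dn3; do
  echo "== $h =="
  ssh "$h" "ip -s link show eth0; ethtool eth0 | egrep 'Speed|Duplex|Auto-negotiation'"
done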
Created 05-31-2016 06:56 PM
The duplex mode was half on all the machines, so I switched it back to full.
The MTU value was set to 1500; I tried changing it to 9000.
The moment I changed the MTU to 9000, every time I copy data I get a bad datanode exception.
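A quick way to check whether jumbo frames actually make it end to end between the nodes (the switch ports have to allow MTU 9000 as well, otherwise the write pipeline can fail with exactly this kind of bad-datanode error) is a non-fragmenting ping of jumbo size, for example using one of the datanode IPs from the logs above:
# 8972 bytes of ICMP payload + 28 bytes of headers = 9000; -M do forbids fragmentation
ping -M do -s 8972 -c 3 10.200.146.165
If this fails or reports that fragmentation is needed, some hop in between is still limited to 1500.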
Created 05-31-2016 07:11 PM
Whenever I run ifconfig -a I get the following output.
eth0      Link encap:Ethernet  HWaddr 2C:59:E5:3A:AB:60
          inet addr:10.200.146.164  Bcast:10.200.146.191  Mask:255.255.255.224
          inet6 addr: fe80::2e59:e5ff:fe3a:ab60/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:133963209 errors:0 dropped:0 overruns:0 frame:0
          TX packets:122587120 errors:5654707 dropped:0 overruns:0 carrier:56547
So in the TX packets I see a lot of errors; will that cause any problem?
Created 06-01-2016 05:38 AM
Yes, there seems to be some problem. Below are a few details about TX packets and errors:
- TX packets: the total number of transmitted packets.
- TX errors: the sum of errors encountered while transmitting packets, including errors due to the transmission being aborted, carrier errors, FIFO errors, heartbeat errors, and window errors. This particular struct in the source code isn't commented.
- There are also itemized error counts for dropped, overruns, and carrier.
- collisions: the number of transmissions terminated due to CSMA/CD (Carrier Sense Multiple Access with Collision Detection).
I would suggest a few basic troubleshooting steps before you escalate this issue to the network team:
http://www.tuxradar.com/content/diagnose-and-fix-network-problems-yourself
# ethtool -i eth0 <-- driver and firmware information
# ethtool -a eth0 <-- pause (flow-control) parameters
# ethtool -g eth0 <-- ring buffer sizes
# ethtool -S eth0 <-- detailed NIC statistics and error counters
I would not suggest changing the default values for the NIC unless you know that the NIC supports them; otherwise this can lead to other issues. The best way is to reach out to the hardware team with the above analysis.
Created 05-24-2016 06:23 PM
This looks like a performance issue:
1. Are you using physical servers or virtual machines?
2. Have you disabled THP (Transparent Huge Pages)?
3. What is MTU set to?
4. How many disks have been configured on each Datanode?
5. Are you using shared disks for datanode storage?
6. Did you check disk I/O using the iostat command? If yes, have you noticed high reads/writes?
7. If you are using virtual machines, can you please check whether the network is working fine? You can check /var/log/messages and the output of the dmesg command to see if the network is okay.
8. Please see what values are configured for the datanode and namenode handler counts (a couple of quick checks for points 2 and 8 are sketched after this post).
Please also have a look at suggestions given by @Sagar Shimpi
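For points 2 and 8, a couple of quick checks are sketched below (paths and property names as in a typical install; on RHEL 6 the THP path may be /sys/kernel/mm/redhat_transparent_hugepage/ instead):
# THP should show [never] (or at least [madvise]) selected
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
# handler counts as picked up from the client-side HDFS configuration
hdfs getconf -confKey dfs.namenode.handler.count
hdfs getconf -confKey dfs.datanode.handler.count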
Created 05-31-2016 07:08 PM
Kuldeep, I am using physical servers.
I have disabled THP on all the machines.
MTU is set to 1500 on all nodes.
One disk for HDFS is configured on each node.
I have mounted my disk at a directory /hadoop, in the same place where /home is mounted.
The output of iostat on all the nodes is as follows.
Linux 2.6.32-573.26.1.el6.x86_64 (HadoopMaster)   06/01/2016   _x86_64_   (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.94   0.00     0.35     0.00    0.00  97.72
Device:   tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda       3.14   14.90        84.24        34145512    193106040

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave1)   06/01/2016   _x86_64_   (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.88   0.00     0.25     0.00    0.00  98.87
Device:   tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda       3.04   4.15         61.54        9603544     142454696

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave2)   06/01/2016   _x86_64_   (24 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.14   0.00     0.19     0.00    0.00  98.67
Device:   tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda       5.07   0.54         361.55       1252416     836898568
sdb       0.00   0.00         0.00         3349        0

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave3)   06/01/2016   _x86_64_   (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.91   0.00     0.19     0.06    0.00  98.83
Device:   tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda       1.82   0.57         53.04        1319364     122732448

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave4)   06/01/2016   _x86_64_   (2 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.31   0.00     0.84     0.00    0.00  95.85
Device:   tps    Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
sda       1.43   0.57         39.10        1308076     89922904
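One caveat worth noting: a bare iostat call only reports averages since boot, so it can hide short bursts. Running it with an interval on each datanode while the copy is in progress would be more telling, for example:
# extended per-device statistics, sampled every 5 seconds
iostat -x 5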
Created 05-30-2016 09:17 PM
What version of HDP are you using, and was everything working fine in terms of speed before you tried to copy the 100 MB file to HDFS?
Created 05-31-2016 06:58 PM
I am currently using HDP 2.4.
Yes, all my Hive queries do get executed and all the files are also copied, but they take a lot of time.
Even a simple "select count(1)" on a very small table takes a lot of time.
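To put a number on the slowness, it might help to time a small copy, for example (the file path here is just an example):
# time writing the 100 MB file to HDFS
time hadoop fs -put /tmp/100mb.file /tmp/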
Created 05-31-2016 07:32 PM
You have lots of transmitted packet errors, which will definitely cause performance degradation. It might be occurring due to many issues such as a faulty NIC, a faulty cable, a bad RJ45 connector, the duplex setting, or some other network-layer problem.
Please ask your OS team to rectify this issue and see whether you get an improvement in HDFS writes.
In the meantime, can you share the duplex setting?
ethtool eth0
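If it turns out a port has negotiated half duplex again, it can usually be corrected with ethtool (a sketch, assuming eth0; a forced setting must match the switch port configuration):
# preferred: re-enable auto-negotiation
ethtool -s eth0 autoneg on
# or force the link explicitly, only if the switch port is forced the same way
ethtool -s eth0 speed 1000 duplex full autoneg off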