
Copying Data to HDFS takes too much time

Explorer

@Kuldeep Kulkarni

I have observed that copying data to HDFS takes a lot of time.

I was trying to copy a 100 MB file to HDFS.

I never experienced this issue earlier.

I found the following logs on my datanode:

76 INFO datanode.DataNode (DataXceiver.java:writeBlock(658)) - Receiving BP-1475253775-10.200.146.164-1463754036445:blk_1073742241_1417 src: /10.200.146.165:51570 dest: /10.200.146.165:50010
2016-05-24 15:33:03,397 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 488ms (threshold=300ms)
2016-05-24 15:33:05,175 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 327ms (threshold=300ms)
2016-05-24 15:33:07,961 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 334ms (threshold=300ms)
2016-05-24 15:33:11,061 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 426ms (threshold=300ms)
2016-05-24 15:33:17,277 WARN datanode.DataNode (BlockReceiver.java:receivePacket(563)) - Slow BlockReceiver write packet to mirror took 336ms (threshold=300ms)

9 REPLIES

Super Guru

@hari kiran

There might be a network issue.

SYMPTOM:
Slow reading/writing of data to HDFS (e.g. with hdfs dfs -put).

ROOT CAUSE:

The message "Slow BlockReceiver write packet to mirror" is normally an indication that there is a problem with the underlying networking infrastructure.
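As a side note, the 300ms in "(threshold=300ms)" comes from the HDFS property dfs.datanode.slow.io.warning.threshold.ms (default 300). Raising it would only silence the warning, not fix the slowness, so the network is the right thing to chase. You can check the configured value with:

#hdfs getconf -confKey dfs.datanode.slow.io.warning.threshold.ms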

SOLUTION:

Ensure at the OS level that the networking and NIC parameters are set up correctly:

- verify that the MTU value is set as expected
- verify that the communication mode is correct (full duplex)
- verify that there are not too many errors at the interface level (dropped packets/overruns)

COMMANDS TO USE FOR DEBUGGING:

#dmesg        <-- identify issues with the NIC device driver
#ifconfig -a  <-- MTU, errors (dropped packets/buffer overruns)
#ethtool ethX <-- identify/set speed, negotiation mode, and duplex setting for the interface
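If you want to run these checks across all datanodes in one pass, a small loop helps (a minimal sketch: the hostnames are placeholders, passwordless SSH is assumed, and eth0 is assumed to be the data interface):

#!/bin/bash
# Run the basic NIC health checks on every datanode in one pass.
NODES="datanode1 datanode2 datanode3"   # hypothetical hostnames - replace with yours
IFACE=eth0

for node in $NODES; do
  echo "===== $node ====="
  # speed/duplex, RX/TX counters (incl. errors), and current MTU on each host
  ssh "$node" "ethtool $IFACE | grep -E 'Speed|Duplex|Auto-negotiation';
               ip -s link show $IFACE | tail -4;
               ip link show $IFACE | grep -o 'mtu [0-9]*'"
done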

In addition, running iperf between datanodes will highlight overall network transfer issues.

#iperf -s (server)  

#iperf -c 10.1.1.1 -f m -d (client)
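On a healthy 1 GbE link, iperf should report somewhere around 940 Mbit/s in each direction; numbers far below that, or a large asymmetry between the two directions, typically point at a duplex mismatch or a cabling problem.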

Explorer

@Sagar Shimpi

The duplex mode was half on all the machines, so I switched it back to full.

The MTU value was set to 1500; I tried changing it to 9000.

The moment I changed the MTU to 9000, every time I copy data I get a bad datanode exception.
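A bad datanode exception right after raising the MTU usually means jumbo frames are not supported end to end: every NIC and switch port in the path must allow them, otherwise oversized packets are dropped. You can verify the path before committing to MTU 9000 with a non-fragmenting ping (a minimal sketch; 10.200.146.165 is just an example peer taken from the logs above, 8972 = 9000 bytes minus 28 bytes of IP+ICMP headers, and -M do forbids fragmentation):

#ping -M do -s 8972 -c 3 10.200.146.165

If the pings fail while the MTU is 9000, stay at 1500 until the switches are configured for jumbo frames.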

Explorer

@Sagar Shimpi

Whenever I run ifconfig -a I get the following output:

eth0  Link encap:Ethernet  HWaddr 2C:59:E5:3A:AB:60
      inet addr:10.200.146.164  Bcast:10.200.146.191  Mask:255.255.255.224
      inet6 addr: fe80::2e59:e5ff:fe3a:ab60/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:133963209 errors:0 dropped:0 overruns:0 frame:0
      TX packets:122587120 errors:5654707 dropped:0 overruns:0 carrier:56547

So in the TX packets line I see a lot of errors. Will that cause any problem?

Super Guru

@hari kiran

Yes. There seems to be some problem. Below are a few details about TX packets and errors:

TX packets indicate the total number of transmitted packets. TX errors is a summation of errors encountered while transmitting packets, including errors due to the transmission being aborted, carrier errors, FIFO errors, heartbeat errors, and window errors. There are also itemized error counts for dropped, overruns, and carrier. Collisions is the number of transmissions terminated due to CSMA/CD (Carrier Sense Multiple Access with Collision Detection).
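To tell whether these errors are historical or still accumulating, sample the kernel counters while you reproduce the slow copy (a minimal sketch; eth0 is assumed, and /sys/class/net/<iface>/statistics is standard on Linux):

# Print TX error counters every 5 seconds; rising numbers mean the problem is live
while true; do
  cat /sys/class/net/eth0/statistics/tx_errors \
      /sys/class/net/eth0/statistics/tx_carrier_errors
  sleep 5
done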

I would suggest a few basic troubleshooting steps before you escalate this issue to the network team -

http://www.tuxradar.com/content/diagnose-and-fix-network-problems-yourself

#ethtool -i eth0  <-- driver and firmware information

#ethtool -a eth0  <-- pause (flow control) parameters

#ethtool -g eth0  <-- ring buffer sizes

#ethtool -S eth0  <-- NIC statistics (look for tx_errors / tx_carrier_errors)

I would not suggest changing the default NIC values unless you know the NIC supports them; otherwise this can lead to other issues. The best way is to take the above analysis to the hardware team.

Master Guru
@hari kiran

This looks like a performance issue:

1. Are you using physical servers or virtual machines?

2. Have you disabled THP (Transparent Huge Pages)?

3. What is the MTU set to?

4. How many disks have been configured on each datanode?

5. Are you using shared disks for datanode storage?

6. Did you check disk I/O using the iostat command? If yes, have you noticed high reads/writes?

7. If you are using virtual machines, can you please check whether the network is working fine? You can check /var/log/messages and the output of the dmesg command.

8. Please check what values are configured for the datanode and namenode handler counts (a quick way to check this and item 2 is sketched below the list).
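For items 2 and 8, something like this works on each node (a minimal sketch; on RHEL 6 kernels the THP path may be /sys/kernel/mm/redhat_transparent_hugepage/enabled instead, and hdfs must be on the PATH):

#cat /sys/kernel/mm/transparent_hugepage/enabled    <-- THP; should show [never], or at least not [always]
#hdfs getconf -confKey dfs.namenode.handler.count   <-- namenode handler count
#hdfs getconf -confKey dfs.datanode.handler.count   <-- datanode handler count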

Please also have a look at the suggestions given by @Sagar Shimpi.

Explorer

@Kuldeep Kulkarni

Kuldeep, I am using physical servers.

I have disabled THP on all the machines.

MTU is set to 1500 on all nodes.

1 disk for HDFS is configured on each node.

The disk is mounted at /hadoop, on the same device where /home is mounted.

The output of iostat on all the nodes is as follows:

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopMaster) 06/01/2016 _x86_64_ (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.94   0.00     0.35     0.00    0.00  97.72
Device:  tps  Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda     3.14       14.90       84.24  34145512   193106040

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave1) 06/01/2016 _x86_64_ (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.88   0.00     0.25     0.00    0.00  98.87
Device:  tps  Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda     3.04        4.15       61.54   9603544   142454696

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave2) 06/01/2016 _x86_64_ (24 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.14   0.00     0.19     0.00    0.00  98.67
Device:  tps  Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda     5.07        0.54      361.55   1252416   836898568
sdb     0.00        0.00        0.00      3349   0

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave3) 06/01/2016 _x86_64_ (8 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.91   0.00     0.19     0.06    0.00  98.83
Device:  tps  Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda     1.82        0.57       53.04   1319364   122732448

Linux 2.6.32-573.26.1.el6.x86_64 (HadoopSlave4) 06/01/2016 _x86_64_ (2 CPU)
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           3.31   0.00     0.84     0.00    0.00  95.85
Device:  tps  Blk_read/s  Blk_wrtn/s  Blk_read   Blk_wrtn
sda     1.43        0.57       39.10   1308076   89922904
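Note that plain iostat only shows averages since boot, which can hide a busy disk. Extended samples taken while the slow copy is actually running are more telling (an illustrative invocation, not from the thread):

#iostat -x 5   <-- 5-second extended samples; watch await and %util on the HDFS disk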

@hari kiran

What version of HDP are you using? And was everything working fine in terms of slowness before you tried to copy the 100 MB file to HDFS?

Explorer

@nmaheshwari

I am currently using HDP 2.4.

Yeah, all my Hive queries do get executed and all the files do get copied, but they take a lot of time.

Even a simple "select count(1)" from a very small table takes a lot of time.

Super Guru
@hari kiran

You have lots of transmitted packet errors, which will definitely cause performance degradation. This might be occurring due to many issues, like a faulty NIC, a faulty cable, an RJ45 connector, the duplex setting, or some other network-layer problem.

Please ask your OS team to rectify this issue and see whether you get an improvement in HDFS writes.

In the meantime, can you share the duplex setting?

ethtool eth0
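For reference, on a healthy gigabit link the relevant part of the ethtool output should look roughly like this (illustrative values, not taken from this cluster):

Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on

If the duplex reverts to half after a reboot, the setting was not persisted. Forcing it (e.g. #ethtool -s eth0 speed 1000 duplex full autoneg off) must match the switch port configuration, otherwise the mismatch just comes back.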