
Occasionally uploading a small file to HDFS takes 60+ seconds

Explorer

Hi all,

My Hadoop cluster has about 54 million files. Most of the time, clients on about 10 PCs can upload files to HDFS successfully in about 100 ms, but occasionally a client takes 60+ seconds to upload a file (about 20 MB in size). The network between the clients and the Hadoop cluster is Gigabit Ethernet. When I run "ping" from a client to a datanode there is a slight packet loss rate, maybe 0.1%, but the slow uploads always take 60+ seconds. Any advice?
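For what it's worth, here is a rough sketch of how a single upload could be timed from the client side with the Hadoop FileSystem API, to see whether the 60+ seconds are spent streaming the bytes or waiting in close() for the last packets to be acknowledged (the local path, HDFS path, and namenode URI below are only placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.FileInputStream;
import java.io.InputStream;

// Minimal sketch (placeholder paths and URI) to time one upload and see
// whether the delay is in the write loop or in close().
public class TimedUpload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://nn-host:8020");          // placeholder namenode address

    try (FileSystem fs = FileSystem.get(conf);
         InputStream in = new FileInputStream("/tmp/sample-20mb.bin")) {   // placeholder local file

      long t0 = System.currentTimeMillis();
      FSDataOutputStream out = fs.create(new Path("/upload/sample-20mb.bin"));  // placeholder HDFS path

      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);
      }
      long afterWrite = System.currentTimeMillis();

      out.close();                                            // waits for the last packets to be acked
      long afterClose = System.currentTimeMillis();

      System.out.println("write: " + (afterWrite - t0) + " ms, close: "
          + (afterClose - afterWrite) + " ms");
    }
  }
}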

Thanks in advance

8 REPLIES

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Rising Star

Can you please check the GC logs on the NameNode, and use a website like GCEasy to get a feel for GC issues? If you see specific issues, increase the heap allocated to the NameNode.

Here are some recommendations for File Size/Block Size and Memory

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-809...

Here is a more in-depth resource that deals with GC tuning for the NameNode.

https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra....
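If it helps, here is a rough sketch of one way to sample the NameNode's collectors over remote JMX in addition to the GC logs; it assumes remote JMX is already enabled on the NameNode JVM, and the host and port below are placeholders:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Rough sketch: poll GC counts and accumulated times from the NameNode over
// remote JMX. Host and port are placeholders; remote JMX must be enabled.
public class NameNodeGcSampler {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://nn-host:8004/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection conn = jmxc.getMBeanServerConnection();
      Set<ObjectName> gcBeans =
          conn.queryNames(new ObjectName("java.lang:type=GarbageCollector,*"), null);
      for (ObjectName name : gcBeans) {
        GarbageCollectorMXBean gc = ManagementFactory.newPlatformMXBeanProxy(
            conn, name.getCanonicalName(), GarbageCollectorMXBean.class);
        System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
            + ", time=" + gc.getCollectionTime() + " ms");
      }
    } finally {
      jmxc.close();
    }
  }
}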

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Explorer

Thank you @aengineer

The following is the jstat output from the NameNodes (HA pair):

active namenode:

[hdfs@n01 ~]$ jstat -gcutil 4244 2000 2000

S0 S1 E O P YGC YGCT FGC FGCT GCT

5.33 0.00 63.02 45.26 34.72 93224 12395.497 12 11.087 12406.584

5.33 0.00 73.16 45.26 34.72 93224 12395.497 12 11.087 12406.584

5.33 0.00 84.45 45.26 34.72 93224 12395.497 12 11.087 12406.584

5.33 0.00 96.23 45.26 34.72 93224 12395.497 12 11.087 12406.584

standby namenode:

[hdfs@n02 hdfs]$ jstat -gcutil 9621 2000 2000

S0 S1 E O P YGC YGCT FGC FGCT GCT

0.00 2.81 70.93 81.54 34.87 174079 23438.283 20 6.195 23444.478

0.00 2.81 74.50 81.54 34.87 174079 23438.283 20 6.195 23444.478

0.00 2.81 74.64 81.54 34.87 174079 23438.283 20 6.195 23444.478

The GC counts and times are stable across the samples, so I think there is no problem with the NameNode.

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Rising Star

@xu jerry

Just to confirm: was this sample collected while you were seeing the 60-second pauses on the file put?

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Explorer

@aengineer

Yes, it was. It now happens at random times in every hour. The log is below:

At 2017-01-11 01:55:11,029 the client sent the write request to datanode h1d009 and h1d009 started receiving packets from the client, but the datanode did not log the detailed information for the block until 2017-01-11 01:56:04,490 (the duration of 53458215121 in the clienttrace line is in nanoseconds, about 53.5 seconds, which matches the gap between the two timestamps). I have checked datanode h1d009 and there is nothing wrong with it, and there was no full GC.

[hdfs@h1d009 hdfs]$ grep blk_1292363673_222747771 hadoop-hdfs-datanode-h1d009.log

2017-01-11 01:55:11,029 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1965373941-192.168.1.132-1446953113751:blk_1292363673_222747771 src: /192.168.3.164:50886 dest: /192.168.1.142:1019

2017-01-11 01:56:04,490 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.3.164:50886, dest: /192.168.1.142:1019, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1834460450_1, offset: 0, srvID: 2d803dfa-8173-4be0-b41e-a08aaa7fb2fd, blockid: BP-1965373941-192.168.1.132-1446953113751:blk_1292363673_222747771, duration: 53458215121

2017-01-11 01:56:04,490 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1965373941-192.168.1.132-1446953113751:blk_1292363673_222747771, type=HAS_DOWNSTREAM_IN_PIPELINE terminating

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Rising Star

@xu jerry

At this point we would not be able to make much progress without actually looking at the logs. It could be any number of issues, for example a full RPC queue. My guess is that the PacketResponder "terminating" message is not the cause. If you have a support contract, please file a ticket with Hortonworks support; that will allow us to take a look at the logs and see whether there are any issues.

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Explorer

Thank you @aengineer

The following log is from one of the datanodes. We can see that block blk_2045907093_972182808 never receives data from the client and fails with a 60000 ms timeout, while for the two subsequent requests, blk_2045907138_972182853 and blk_2045916198_972191913, the datanode receives the data from the client normally, so I don't think the problem is the RPC queue.

2017-01-16 09:56:54,256 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-328647654-192.168.1.161-1470977918959:blk_2045907093_972182808 src: /192.168.5.132:36890 dest: /192.168.1.162:1019
2017-01-16 09:56:54,577 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-328647654-192.168.1.161-1470977918959:blk_2045907138_972182853 src: /192.168.3.80:37048 dest: /192.168.1.162:1019
2017-01-16 09:56:54,582 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.3.80:37048, dest: /192.168.1.162:1019, bytes: 92030, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_372616400_1, offset: 0, srvID: 8ae45f23-ce9f-4eb0-8312-ae76eb3a20ca, blockid: BP-328647654-192.168.1.161-1470977918959:blk_2045907138_972182853, duration: 2173162
......
2017-01-16 09:57:57,859 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-328647654-192.168.1.161-1470977918959:blk_2045916198_972191913 src: /192.168.1.160:33093 dest: /192.168.1.162:1019
2017-01-16 09:57:57,862 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.160:33093, dest: /192.168.1.162:1019, bytes: 96755, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_496079284_1, offset: 0, srvID: 8ae45f23-ce9f-4eb0-8312-ae76eb3a20ca, blockid: BP-328647654-192.168.1.161-1470977918959:blk_2045916198_972191913, duration: 2251546
2017-01-16 09:57:57,862 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-328647654-192.168.1.161-1470977918959:blk_2045916198_972191913, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2017-01-16 09:57:57,934 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-328647654-192.168.1.161-1470977918959:blk_2045907093_972182808
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.162:1019 remote=/192.168.5.132:36890]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    at java.io.DataInputStream.read(DataInputStream.java:149)

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Explorer

And the jstack output of the datanode is below. We can see that the receiving thread is in the RUNNABLE state, so can I say that there is nothing wrong with the datanode and the problem is with the client or the network? No data arrives from the client within 60 seconds, and after 60 seconds the datanode thread throws the timeout exception:

"DataXceiver for client DFSClient_NONMAPREDUCE_-88028072_1 at /192.168.5.132:36890 [Receiving block BP-328647654-192.168.1.161-1470977918959:blk_2045907093_972182808]" daemon prio=10 tid=0x00007f3b247c0800 nid=0x6bde runnable [0x00007f32d233a000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x00007f339a145af0> (a sun.nio.ch.Util$2)
    - locked <0x00007f339a145ad8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00007f339a1459e8> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
    - locked <0x00007f3348a0f0c0> (a java.io.BufferedInputStream)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:730)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
    at java.lang.Thread.run(Thread.java:745)

"Async disk worker #22294 for volume /data01/hadoop/hdfs/data/current" daemon prio=10 tid=0x00007f3b14522000 nid=0x6657 waiting on condition [0x00007f32d344b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f33bf9bedb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Re: Occasionally uploading a small file to HDFS takes 60+ seconds

Explorer

The following is the jstack output of the HDFS client.

I think the packet queue is full, so the client thread enters the WAITING state. But why is the queue full? Is there something wrong with the network?

"upload-thread_31" #45 prio=5 os_prio=57 tid=0x000000012da57800 nid=0x55f03f3 in Object.wait() [0x0000000132e27000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at org.apache.hadoop.hdfs.DFSOutputStream.waitAndQueueCurrentPacket(DFSOutputStream.java:1833)
    - locked <0x0700000020917530> (a java.util.LinkedList)
    at org.apache.hadoop.hdfs.DFSOutputStream.writeChunk(DFSOutputStream.java:1900)
    - locked <0x0700000020917550> (a org.apache.hadoop.hdfs.DFSOutputStream)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
    at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:163)

The following is the code of waitAndQueueCurrentPacket:

private void waitAndQueueCurrentPacket() throws IOException {
  synchronized (dataQueue) {
    try {
      // If queue is full, then wait till we have enough space
      while (!closed && dataQueue.size() + ackQueue.size() > dfsClient.getConf().writeMaxPackets) {
        try {
          dataQueue.wait();
        } catch (InterruptedException e) {
          // If we get interrupted while waiting to queue data, we still need to get rid
          // of the current packet. This is because we have an invariant that if
          // currentPacket gets full, it will get queued before the next writeChunk.
          //
          // Rather than wait around for space in the queue, we should instead try to
          // return to the caller as soon as possible, even though we slightly overrun
          // the MAX_PACKETS length.
          Thread.currentThread().interrupt();
          break;
        }
      }
      checkClosed();
      queueCurrentPacket();
    } catch (ClosedChannelException e) {
    }
  }
}
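To make the blocking behaviour easier to follow, here is a simplified, self-contained model of the dataQueue/ackQueue hand-off visible in the stack traces. This is not the real Hadoop code; the class, field, and method names below are made up for the sketch. The writing thread blocks once the two queues together exceed the limit, and it is only woken again when acks come back from the pipeline and free a slot:

import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model (NOT the real Hadoop classes; all names here are made up)
// of the bounded dataQueue/ackQueue hand-off between the writer, the streamer
// that sends packets to the datanode, and the ack-processing thread.
public class PacketPipelineSketch {

  private final Queue<byte[]> dataQueue = new ArrayDeque<>();  // packets waiting to be sent
  private final Queue<byte[]> ackQueue  = new ArrayDeque<>();  // packets sent, ack not yet received
  private final int maxPackets = 80;                           // plays the role of writeMaxPackets
  private volatile boolean closed = false;

  // Called by the writing thread (the "upload-thread" in the client jstack).
  public void queuePacket(byte[] packet) throws InterruptedException {
    synchronized (dataQueue) {
      // Same shape as waitAndQueueCurrentPacket(): block while too many
      // packets are queued or still un-acknowledged.
      while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) {
        dataQueue.wait();                       // where the client thread sits in WAITING
      }
      dataQueue.add(packet);
      dataQueue.notifyAll();
    }
  }

  // Rough stand-in for the streamer thread that pushes packets to the datanode.
  public void streamLoop() throws InterruptedException {
    while (!closed) {
      byte[] packet;
      synchronized (dataQueue) {
        while (dataQueue.isEmpty()) {
          dataQueue.wait();
        }
        packet = dataQueue.poll();
        ackQueue.add(packet);                   // now in flight, waiting for the pipeline's ack
      }
      sendToDatanode(packet);                   // if this stalls, acks stop and the queues stay full
    }
  }

  // Rough stand-in for the ack-processing thread: an ack frees a slot and
  // wakes the writer blocked in queuePacket().
  public void onAckReceived() {
    synchronized (dataQueue) {
      ackQueue.poll();
      dataQueue.notifyAll();
    }
  }

  private void sendToDatanode(byte[] packet) {
    // network write omitted in this sketch
  }
}

Read this way, the WAITING client thread and the datanode's 60000 ms SocketTimeoutException would fit the same picture: no packets moved on that connection for about a minute, so the ack queue never drained and the writer was never notified, which would point at the network path (or the client host) rather than at the datanode itself.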
