
Taking a long time to copy files from HDFS


Hi,

I am running a client on a different network from the Hadoop cluster. When I try to copy 60 MB of data (300 small files) from HDFS to the client machine, it takes almost 20 minutes, and I see a warning like "Input stream closed". Is this because of the network between the client and the cluster, or is there something else I need to look at?

1 ACCEPTED SOLUTION

Master Guru

How do you copy the small files? Are you running one `hadoop fs -put` for every small file (for example, in a shell script)? Then I would expect bad performance, because the Hadoop client is a Java application and needs some setup time for each command.

If you run it as a single put command, this would be very bad performance. I normally get 200-300 GB/hour, so 60 MB should be done in seconds. I would check the network speed by doing a simple scp from your client to a node of the cluster.

Regarding small files:

- A put of small files is definitely slower than a put of one big file, but it shouldn't take 20 minutes. I once benchmarked it, and I think writing very small files was 2-3 times slower.

- Why do you copy such tiny files into HDFS? This is bad for Hadoop in general. Try to find a way to merge them. (That applies to data files; if they are Oozie definitions or similar, it's obviously different.)

The "input stream closed" warning is by itself not dangerous. Normal put commands can show it in many scenarios (a minor bug introduced to HDFS and since fixed).
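To illustrate the difference, here is a minimal sketch (paths and hostnames are made up, not taken from the original question). The loop launches one JVM per file; the single command launches one JVM for the whole transfer:

```shell
# Anti-pattern: one hadoop client invocation (one JVM startup) per file.
# With 300 files, JVM startup time alone can dominate the runtime.
for f in $(hadoop fs -ls -C /user/arun/output); do
    hadoop fs -get "$f" /local/dest/
done

# Better: a single client invocation for the whole directory.
hadoop fs -get /user/arun/output /local/dest/

# Sanity-check raw network speed with a plain scp to one cluster node:
scp /local/dest/somefile user@datanode1:/tmp/
```

This is a command-line sketch against a hypothetical cluster, not something to run as-is.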


8 REPLIES

Master Mentor

@ARUNKUMAR RAMASAMY

How are you copying the files? The time taken depends on factors like network speed, system load, and the mechanism used to download the files.

Since you are communicating over different VLANs, that does add some overhead, and other networking settings configured on the network (timeouts, etc.) can also play a role.


Hi @Neeraj Sabharwal, we are using just a plain HDFS get command.

Master Mentor
@ARUNKUMAR RAMASAMY

A copy command is slower than, for example, a move or distcp. Zipping the 300 files into one larger file would make things better for you, as Hadoop prefers large individual files over many small files/directories. You can use the merge command, perhaps compress the result, and take a look at the Hadoop Archive format, then try copying again.
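As a sketch of the merge idea, the demo below simulates 300 small output files locally and bundles them into one compressed archive before transfer; the commented `getmerge` line at the end is the HDFS-side equivalent (all paths and filenames here are invented for the demo):

```shell
# Simulate 300 small output files (filenames are made up for the demo).
mkdir -p /tmp/smallfiles_demo && cd /tmp/smallfiles_demo
for i in $(seq 1 300); do
    echo "record $i" > "part-$i.txt"
done

# One compressed archive instead of 300 tiny files to transfer:
tar czf merged.tar.gz part-*.txt
ls -l merged.tar.gz

# HDFS-side equivalent: concatenate all files under a directory into
# one local file in a single client call (hypothetical path):
# hadoop fs -getmerge /user/arun/output /local/dest/merged.txt
```

Transferring one archive avoids both the per-file client startup cost and the per-file network round trips.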



We don't copy small files into HDFS. An MR job runs and creates small files based on the operation. These files are then copied (using an HDFS get) to the client machine and uploaded into a MySQL DB. This is a legacy process and I am new to it; I am trying to find out the reasons.

Also, @Benjamin Leonhardi, do you know the bug # for the HDFS issue?

Master Mentor

If it's an MR program, you can write out fewer files: consider using a smaller number of reducers and use compression. The specifics can be a separate question on this website. @ARUNKUMAR RAMASAMY
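For many jobs, the reducer count and output compression can be set from the command line without code changes; a sketch using the standard Hadoop 2.x property names (the jar name, driver class, and paths below are placeholders):

```shell
# Fewer reducers => fewer output files; compression shrinks each one.
# my-job.jar, MyDriver, and the paths are hypothetical placeholders.
hadoop jar my-job.jar MyDriver \
    -D mapreduce.job.reduces=1 \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
    /user/arun/input /user/arun/output
```

This is a configuration sketch against a hypothetical job; it assumes the driver passes `-D` options through `ToolRunner`/`GenericOptionsParser`, which is the common convention.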

Master Guru

Regarding the bug (with thanks to @Neeraj Sabharwal):

https://community.hortonworks.com/questions/14383/dfsinputstream-has-been-closed-already.html

So the get is simply a single get on an HDFS folder? Then a slow network connection would be my only guess.