
Copy Files from Linux to HDFS - individually vs in batch

Expert Contributor

I am copying a large number of small files (HL7 message files) from local Linux storage to HDFS. I wonder whether there is a performance difference between copying the files one by one (through a script) and using a single command like "hadoop fs -put ./* /hadoop_path".

 

Additional background: some files have spaces in their file names. If I use the command "hadoop fs -put ./* /hadoop_path", I get the error "put: unexpected URISyntaxException" for those files. If there is no performance difference, I would just copy one file at a time and have my script replace the spaces with "%20". Otherwise, I have to rename all the files, replacing spaces with underscores, and then use the batch copy.
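For reference, a minimal sketch of the one-file-at-a-time approach described above (the /hadoop_path destination is from the question, and the %20 substitution is the workaround you mention; treat this as illustrative rather than tested on every Hadoop version):

# Copy each file individually, encoding spaces as %20 to avoid the URISyntaxException
for f in ./*; do
    encoded=$(printf '%s' "$f" | sed 's/ /%20/g')
    hadoop fs -put "$encoded" /hadoop_path/
done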

4 REPLIES

Master Mentor

@Seaport 

It shouldn't surprise you that Hadoop doesn't perform well with small files. With that in mind, the best solution would be to zip all your small files locally and then copy the zipped file to HDFS using copyFromLocal. The one restriction is that the source files must be on a local file system. I assume the local Linux box is the edge node and has the HDFS client installed; if not, you will have to copy myzipped.gz to a node (usually the edge node) and then perform the steps below.

$ hdfs dfs -copyFromLocal myzipped.gz /hadoop_path

Then decompress the gzipped file myzipped.gz in HDFS using:

$ hdfs dfs -cat /hadoop_path/myzipped.gz | gzip -d | hdfs dfs -put - /hadoop_path2

Hope that helps

Expert Contributor

@Shelton Thanks for the quick response. Here is the command I used to create the gz file.

 

tar cvzf  ~/stage1.tar.gz ./*

 

I tried the following commands to upload it and unpack it into the HDFS directory /user/testuser/test3.

 

hdfs dfs -copyFromLocal stage1.tar.gz /user/testuser

hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3

 

However, what I got in /user/testuser/test3 is a single file named "-", not the multiple files from stage1.tar.gz. Does your solution mean to concatenate all the files together?

Please advise. Thanks.

Master Mentor

@Seaport 

I think there is a typo: the dash [-] after -put and before the HDFS path.

hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3

Try this after removing the dash:

 

hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put /user/testuser/test3

 Hope that helps 

Expert Contributor

The unpack command will not work without that extra dash.

https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop/4370...
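The dash tells hdfs dfs -put to read from standard input instead of a local file, which is what lets the decompressed stream be written into HDFS. A trivial illustration of that idiom (the path is just an example):

echo "hello" | hdfs dfs -put - /user/testuser/hello.txt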

I tried again with a file name as the destination.

 

hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3/stage1

 

The file stage1 appeared in the test3 directory. Here is something interesting:

  • The stage1.tar.gz contains three empty txt files.
  • "hdfs dfs -cat /user/testuser/test3/-" outputs nothing, and the file size is 0.1k.
  • "hdfs dfs -cat /user/testuser/test3/stage1" outputs some text, including the original file names. Also, the file size is 10k.
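That behavior is consistent with gzip -d only removing the gzip layer: what lands in HDFS is still a tar archive, not the member files (tar stores the file names in its headers and pads output to 10240-byte records, which would explain roughly 10k for three empty files). If the goal is to have the individual files in HDFS, one possible alternative, sketched here with illustrative paths, is to untar locally and then put the extracted files:

# Extract the archive on the local file system, then copy the individual files to HDFS
mkdir -p /tmp/stage1
tar -xzf ~/stage1.tar.gz -C /tmp/stage1
hdfs dfs -put /tmp/stage1/* /user/testuser/test3/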