Created 07-30-2020 11:31 AM
I am copying a large number of small files (HL7 message files) from Linux local storage to HDFS. I wonder whether there is a performance difference between copying the files one by one (through a script) or just using one statement like "hadoop fs -put ./* /hadoop_path".
Additional background info: some files have spaces in their file names, and if I use the command "hadoop fs -put ./* /hadoop_path", I get the error "put: unexpected URISyntaxException"
for those files. If there is no performance difference, I would just copy the files one at a time, with my script replacing each space with "%20". Otherwise, I would have to rename all the files, replacing spaces with underscores, and then use a batch copy.
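A minimal sketch of the two approaches being compared (assuming bash; the file patterns and destination path are just placeholders):

# Option 1: copy one file at a time, URI-encoding spaces so -put can parse the source path
for f in ./*; do
  encoded=$(printf '%s' "$f" | sed 's/ /%20/g')
  hadoop fs -put "$encoded" /hadoop_path/
done

# Option 2: rename locally (spaces -> underscores), then do one bulk put
for f in ./*\ *; do
  mv "$f" "${f// /_}"
done
hadoop fs -put ./* /hadoop_path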
Created 07-30-2020 01:41 PM
It shouldn't be a surprise that Hadoop doesn't perform well with small files. With that in mind, the best solution would be to zip all your small files locally and then copy the zipped file to HDFS using copyFromLocal; the one restriction is that the source of the files can only be on a local file system. I assume the local Linux box is the edge node and has the HDFS client installed. If not, you will have to copy the myzipped.gz to a node, usually the edge node, and perform the steps below.
$ hdfs dfs -copyFromLocal myzipped.gz /hadoop_path
Then unzip the gzipped file myzipped.gz in HDFS using:
$ hdfs dfs -cat /hadoop_path/myzipped.gz | gzip -d | hdfs dfs -put - /hadoop_path2
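(For clarity: the trailing - tells hdfs dfs -put to read its data from standard input, so the decompressed stream is written straight to the destination without an intermediate local file. A roughly equivalent two-step version, using a hypothetical local temp file, would be:)

$ hdfs dfs -cat /hadoop_path/myzipped.gz | gzip -d > /tmp/unzipped_data
$ hdfs dfs -put /tmp/unzipped_data /hadoop_path2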
Hope that helps
Created on 07-30-2020 03:01 PM - edited 07-30-2020 03:08 PM
@Shelton Thanks for the quick response. Here is my command to create the .gz file:
tar cvzf ~/stage1.tar.gz ./*
I tried the following commands to upload and unzip it into an HDFS directory, /user/testuser/test3:
hdfs dfs -copyFromLocal stage1.tar.gz /user/testuser
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3
However, what I got in /user/testuser/test3 is a file named "-", not the multiple files from stage1.tar.gz. Does your solution mean that all the files get concatenated together?
Please advise. Thanks.
Created 07-30-2020 03:19 PM
I would think there is a typo: the dash [-] after -put and before the HDFS path
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3
Try this after removing the dash:
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put /user/testuser/test3
Hope that helps
Created on 07-30-2020 03:38 PM - edited 07-30-2020 03:39 PM
The unpack command will not work without that extra dash.
I gave it another try with a file name as the destination:
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3/stage1
The file stage1 appeared in the test3 directory. There is something interesting going on here.
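One note that may explain the result: gzip -d only strips the gzip compression layer, so what ends up in HDFS is the tar archive itself as one file; the original small files are still bundled inside it. A quick way to confirm, reusing the paths from the commands above:

# List the entries of the tar archive that is now stored as the single HDFS file "stage1"
hdfs dfs -cat /user/testuser/test3/stage1 | tar tvf -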