Created 07-30-2020 11:31 AM
I am copying a large number of small files (HL7 message files) from Linux local storage to HDFS. I wonder whether there is a performance difference between copying the files one by one (through a script) or just using one statement like "hadoop fs -put ./* /hadoop_path".
Additional background info: some files have spaces in their file names, and if I use the command "hadoop fs -put ./* /hadoop_path", I get the error "put: unexpected URISyntaxException"
for those files. If there is no performance difference, I would just copy the files one at a time, with my script replacing each space with "%20". Otherwise, I would have to rename all the files, replacing spaces with underscores, and then use a batch copy.
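A minimal sketch of the two approaches being compared (assuming bash; the file patterns and destination path are just placeholders):

# Option 1: copy one file at a time, URI-encoding spaces so -put can parse the source path
for f in ./*; do
  encoded=$(printf '%s' "$f" | sed 's/ /%20/g')
  hadoop fs -put "$encoded" /hadoop_path/
done

# Option 2: rename locally (spaces -> underscores), then do one bulk put
for f in ./*\ *; do
  mv "$f" "${f// /_}"
done
hadoop fs -put ./* /hadoop_path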
Created 07-30-2020 01:41 PM
It shouldn't be a surprise that Hadoop doesn't perform well with small files. With that in mind, the best solution would be to zip all your small files locally and then copy the zipped file to HDFS using copyFromLocal; the one restriction is that the source of the files can only be on a local file system. I assume the local Linux box is the edge node and has the HDFS client installed. If not, you will have to copy the myzipped.gz to a node, usually the edge node, and perform the steps below.
$ hdfs dfs -copyFromLocal myzipped.gz /hadoop_path
Then unzip the gzipped file myzipped.gz in HDFS using:
$ hdfs dfs -cat /hadoop_path/myzipped.gz | gzip -d | hdfs dfs -put - /hadoop_path2
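(For clarity: the trailing - tells hdfs dfs -put to read its data from standard input, so the decompressed stream is written straight to the destination without an intermediate local file. A roughly equivalent two-step version, using a hypothetical local temp file, would be:)

$ hdfs dfs -cat /hadoop_path/myzipped.gz | gzip -d > /tmp/unzipped_data
$ hdfs dfs -put /tmp/unzipped_data /hadoop_path2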
Hope that helps
Created on 07-30-2020 03:01 PM - edited 07-30-2020 03:08 PM
@Shelton Thanks for the quick response. Here is my command to create the .gz file:
tar cvzf ~/stage1.tar.gz ./*
I tried the following commands to upload and unzip it into an HDFS directory, /user/testuser/test3:
hdfs dfs -copyFromLocal stage1.tar.gz /user/testuser
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3
However, what I got in /user/testuser/test3 is a file named "-", not the multiple files from stage1.tar.gz. Does your solution mean that all the files get concatenated together?
Please advise. Thanks.
Created 07-30-2020 03:19 PM
I would think there is a typo: the dash [-] after -put and before the HDFS path
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3
Try this after removing the dash:
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put /user/testuser/test3
Hope that helps
Created on 07-30-2020 03:38 PM - edited 07-30-2020 03:39 PM
The unpack command will not work without that extra dash.
I gave it another try with a file name as the destination:
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3/stage1
The file stage1 appeared in the test3 directory. There is something interesting going on here.
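One note that may explain the result: gzip -d only strips the gzip compression layer, so what ends up in HDFS is the tar archive itself as one file; the original small files are still bundled inside it. A quick way to confirm, reusing the paths from the commands above:

# List the entries of the tar archive that is now stored as the single HDFS file "stage1"
hdfs dfs -cat /user/testuser/test3/stage1 | tar tvf -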