Merge two folders in HDFS ( like Windows move )

I have a scenario where data needs to be split up by date as it arrives. That means my Pig script splits the data by timestamp using piggybank MultiStorage (we can't use Hive partition loading).

For example, the output of the first script can be

/output1/2015/file1-job1.gz
/output1/2016/file1-job1.gz

Now the second job comes and creates

/output2/2016/file1-job2.gz
/output2/2017/file1-job2.gz

And I have to merge the files in the end into

/target/2015/file1-job1.gz
/target/2016/file1-job1.gz
/target/2016/file1-job2.gz
/target/2017/file1-job2.gz

In Windows I could simply copy a folder onto another one and it would essentially merge the two. I had hoped the hdfs mv operation would have a similar option, but I didn't see one.

I could change MultiStorage to write directly into the target folder, but in that case a failed Pig job would leave broken gzip files around. So I need to write to a temp dir and move the files on success in Oozie.

So at the moment I am thinking of writing a script (Python or shell) that recursively lists all files ( hadoop fs -ls -R /output ) and then moves them one by one into the new directory. Has anybody done something similar before?
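
The list-then-move idea described above could be sketched in Python roughly as follows. This is only an illustration, not a tested production script: the function names `parse_ls_recursive` and `plan_moves` are made up here, and the sketch assumes the listing has the usual eight-column `hadoop fs -ls -R` layout with plain absolute paths (no `hdfs://` prefix). It only plans the moves; running them is left to the caller.

```python
import posixpath


def parse_ls_recursive(listing):
    """Return the file paths found in `hadoop fs -ls -R` output, skipping directories."""
    files = []
    for line in listing.splitlines():
        # Expected columns: permissions, replication, owner, group,
        # size, date, time, path. The path is kept whole even if it has spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # not a file entry (directory, header, or blank line)
        files.append(fields[7])
    return files


def plan_moves(files, source_root, target_root):
    """Pair each file under source_root with the same relative path under target_root."""
    moves = []
    for path in files:
        rel = posixpath.relpath(path, source_root)
        moves.append((path, posixpath.join(target_root, rel)))
    return moves
```

Each resulting pair could then be handed to `hadoop fs -mv <src> <dst>` once the destination directory exists.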

10 REPLIES

Re: Merge two folders in HDFS ( like Windows move )

Which merge do you mean? getmerge? That merges files together, which I do not want. I think I will need to write a small shell or Python script. It just might be slow, because it needs to make one call for each move and the command line has a bit of overhead per call.
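
One possible way to reduce the per-call overhead mentioned here is to issue a single `hdfs dfs -mv` per destination directory, since `-mv` accepts several sources when the destination is a directory. The sketch below only builds the argument lists; the helper name `batch_moves` is hypothetical, and it assumes every file keeps its own name in the target and that the destination directories already exist.

```python
import posixpath
from collections import OrderedDict


def batch_moves(moves):
    """Group (src, dst) pairs by destination directory.

    Returns one `hdfs dfs -mv` argument list per destination directory,
    so N files going to the same place need one call instead of N.
    """
    by_dir = OrderedDict()
    for src, dst in moves:
        by_dir.setdefault(posixpath.dirname(dst), []).append(src)
    return [["hdfs", "dfs", "-mv"] + srcs + [dst_dir]
            for dst_dir, srcs in by_dir.items()]
```

Each returned list could be passed to something like `subprocess.check_call`. Very large groups may need to be split to stay under the operating system's command-line length limit.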

Re: Merge two folders in HDFS ( like Windows move )

@Benjamin Leonhardi +1 on your thought.

Re: Merge two folders in HDFS ( like Windows move )

You won't have that overhead if you do the move within your Pig script.

Re: Merge two folders in HDFS ( like Windows move )

Hey, Pig has fs commands, how cool is that. Thanks for the tip. Unfortunately I can't get it done with pure fs commands, since I need to create folders if they are not there and I don't know in advance which files are created. My current approach is below in case you have any tips on making it better.

Re: Merge two folders in HDFS ( like Windows move )

Would it help to use glob pattern matching to select all files for a desired date and move them all? Here is an example based on the inputs in your question.

> hdfs dfs -ls -R /output*/*
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /output1/2015/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /output1/2016/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /output2/2016/file1-job2.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /output2/2017/file1-job2.gz

> hdfs dfs -mv /output*/2015/* /target/2015

> hdfs dfs -mv /output*/2016/* /target/2016

> hdfs dfs -mv /output*/2017/* /target/2017

> hdfs dfs -ls -R /target/*/*
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /target/2015/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /target/2016/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /target/2016/file1-job2.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /target/2017/file1-job2.gz

More information on the supported glob pattern syntax is available here:

http://hadoop.apache.org/docs/r2.7.1/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache....

Re: Merge two folders in HDFS ( like Windows move )

Unfortunately I don't know the subfolders in advance. They are created out of the data, i.e. I don't know whether it is 2015 or 2016 inside, and the directory structure is much deeper, like 2015/01/01/09/13. So I need not only a move but a mkdir -p as well. I suppose my ugly little Python script below is needed. But thanks for the link.
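
For the `mkdir -p` part, the target directories can be derived from the file paths themselves, whatever their depth. The following is a minimal sketch under assumptions, not the poster's script: the helper name `mkdir_command` is hypothetical, and paths are assumed to be plain absolute HDFS paths. `hdfs dfs -mkdir -p` accepts several paths and does not fail on directories that already exist, so one call can cover all of them.

```python
import posixpath


def mkdir_command(files, source_root, target_root):
    """Build one `hdfs dfs -mkdir -p` call covering every needed target directory."""
    dirs = set()
    for path in files:
        # Directory of the file relative to the source root, e.g. 2015/01/01/09/13.
        rel_dir = posixpath.dirname(posixpath.relpath(path, source_root))
        dirs.add(posixpath.normpath(posixpath.join(target_root, rel_dir)))
    return ["hdfs", "dfs", "-mkdir", "-p"] + sorted(dirs)
```

Running this command before the moves means each subsequent `-mv` lands in an existing directory. With an empty file list the command has no paths, so a caller would want to skip it in that case.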

Re: Merge two folders in HDFS ( like Windows move )

Yes, considering all of that, the requirements are more dynamic, so you'll need some amount of external scripting.
