I have a scenario where data needs to be split up by date as it arrives, which means my Pig script divides the data up by timestamp using Piggybank's MultiStorage. (We can't use Hive partition loading.)
I.e. the output of the first job can be

/output1/2015/file1-job1.gz
/output1/2016/file1-job1.gz

Now the second job comes and creates

/output2/2016/file1-job2.gz
/output2/2017/file1-job2.gz

And in the end I have to merge the files into

/target/2015/file1-job1.gz
/target/2016/file1-job1.gz
/target/2016/file1-job2.gz
/target/2017/file1-job2.gz
Now in Windows I could simply copy one folder onto the other and the two would essentially be merged. I had hoped the HDFS mv operation would have a similar option, but I didn't see one.
I could change the MultiStorage to write directly into the target folder, but then a failed Pig job would leave broken gzip files around. So I need to write to a temp dir and move the files on success in Oozie.
So at the moment I am thinking of writing a script that recursively lists all files (hadoop fs -ls -R /output) and then moves them one by one into the new directory, either in Python or as a shell script. Has anybody done something similar before?
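For reference, a rough sketch of that one-by-one approach in Python (the helper names are mine, and it assumes the target date directories already exist):

```python
import subprocess

def parse_ls(listing):
    """Extract file paths from `hadoop fs -ls -R` output, skipping directories."""
    paths = []
    for line in listing.splitlines():
        fields = line.split()
        # File entries have 8 columns; directory entries start with 'd'.
        if len(fields) == 8 and not fields[0].startswith("d"):
            paths.append(fields[-1])
    return paths

def target_path(src, src_root, target_root):
    """Map e.g. /output1/2015/file1-job1.gz to /target/2015/file1-job1.gz."""
    return target_root + src[len(src_root):]

def move_all(src_root, target_root):
    listing = subprocess.check_output(
        ["hadoop", "fs", "-ls", "-R", src_root]).decode()
    for src in parse_ls(listing):
        subprocess.check_call(
            ["hadoop", "fs", "-mv", src,
             target_path(src, src_root, target_root)])
```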
Which merge do you mean? getmerge? That concatenates files together, which I do not want. I think I will need to write a small shell or Python script. It just might be slow, because it needs to make one call for each move, and the command line has a bit of overhead.
Would it help to use glob pattern matching to select all files for a desired date and move them all? Here is an example based on the inputs in your question.
> hdfs dfs -ls -R /output*/*
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /output1/2015/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /output1/2016/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /output2/2016/file1-job2.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /output2/2017/file1-job2.gz
> hdfs dfs -mv /output*/2015/* /target/2015
> hdfs dfs -mv /output*/2016/* /target/2016
> hdfs dfs -mv /output*/2017/* /target/2017
> hdfs dfs -ls -R /target/*/*
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /target/2015/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:29 /target/2016/file1-job1.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /target/2016/file1-job2.gz
-rw-r--r--   3 chris supergroup          0 2016-01-14 10:31 /target/2017/file1-job2.gz
More information on the supported glob pattern syntax is available here:
Unfortunately I don't know the subfolders in advance; they are created from the data. I.e. I don't know whether it will be 2015 or 2016, and the directory structure is much deeper, like 2015/01/01/09/13, so I need not only a move but a mkdir -p as well. I suppose my ugly little Python script below is needed after all. But thanks for the link.
Yes, considering all of that, the requirements are more dynamic, so you'll need some amount of external scripting.
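One way to sketch that scripting (all names here are mine): parse the recursive listing, group files by their destination directory, run `hadoop fs -mkdir -p` once per directory, and then issue a single `hadoop fs -mv` per directory — the shell's `mv` accepts multiple source URIs when the destination is a directory, which also reduces the per-command overhead mentioned in the comments.

```python
import os
import subprocess
from collections import defaultdict

def plan_moves(src_files, src_roots, target_root):
    """Group source files by destination directory, preserving the
    arbitrarily deep date partition, e.g.
    /output1/2015/01/01/09/13/f.gz -> dir /target/2015/01/01/09/13."""
    by_dest = defaultdict(list)
    for src in src_files:
        for root in src_roots:
            if src.startswith(root + "/"):
                rel = src[len(root) + 1:]            # 2015/01/01/09/13/f.gz
                dest_dir = os.path.join(target_root, os.path.dirname(rel))
                by_dest[dest_dir].append(src)
                break
    return dict(by_dest)

def run_moves(plan):
    for dest_dir, srcs in sorted(plan.items()):
        # mkdir -p is a no-op if the directory already exists.
        subprocess.check_call(["hadoop", "fs", "-mkdir", "-p", dest_dir])
        # One mv per target directory; multiple sources are allowed
        # when the destination is a directory.
        subprocess.check_call(["hadoop", "fs", "-mv"] + srcs + [dest_dir])
```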