Created 02-03-2016 05:07 PM
I've been trying to find the solution to this problem for a while. I have found that on a normal file system, using the shell, you can use this command to move all files under a location while leaving the directories alone.
find . -maxdepth 1 -type f -exec mv {} destination_path \;
I was wondering if there is a command to do the same in HDFS.
So if I have a folder in HDFS called "folder1" which contains the files "copyThis.txt", "copyThisAsWell.txt", and "theFinalCopy.txt", as well as a folder "doNotCopy", and I want to copy the files into a new folder called "folder2" but leave the folder "doNotCopy" behind, how can this be done in HDFS?
Thanks for any help you can provide.
Created 02-03-2016 05:07 PM
hdfs dfs -cp folder1/filename filedestination/
Specify each file, or use the wildcard * to copy multiple files. @Daniel Perry
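For example, since all of the files in your example end in .txt, a glob like this should copy just the files and leave the doNotCopy folder behind (quoting the pattern so the HDFS shell does the expansion):
hdfs dfs -cp 'folder1/*.txt' folder2/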
Created 02-03-2016 05:20 PM
My issue is that the files to be copied across will have the date and time in the filename. They will also be updated daily, so it will be next to impossible to know in advance what the names of the files to be copied will be.
Created 02-03-2016 05:21 PM
@Daniel Perry I recommend looking at Apache NiFi; it has strong features in that regard. You can track what has been processed, though perhaps not per file but per event, where an event is typically a line within a file.
Created 02-04-2016 10:10 AM
I will look into it. Thanks for your help.
Created 02-03-2016 05:45 PM
Starting in HDP 2.3, the Hadoop shell ships with a find command. Full details are available in the FileSystemShell find documentation in Apache.
However, unlike the standard Unix command, the Hadoop version does not yet implement the "maxdepth" or "type" options shown in your example. There are several uncommitted patches still in progress to add these features. HADOOP-10578 implements "maxdepth". HADOOP-10579 implements "type". These features are not yet available in any release of either HDP or Apache Hadoop.
Until these features become generally available, I think your only other option is to use wildcard glob matching as suggested in prior answers. I understand you said that there is some variability to the names because of dates and times embedded into the names. You would need to stage these files in a predictable way, so that a wildcard can effectively match only the files that you want. This might require renaming files or moving them into a different directory structure at time of ingest.
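For example, if your ingest process dropped each day's files into a dated subdirectory, a single glob per day would pick up only those files. The paths and date format below are just illustrative:
TODAY=$(date +%Y-%m-%d)
hdfs dfs -mkdir -p "/data/incoming/$TODAY"    # ingest writes the day's files here
hdfs dfs -cp "/data/incoming/$TODAY/*" /data/folder2/    # matches only that day's files, no sibling directories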
Another possible option could be to script it externally, such as using bash to run an ls -R through the Hadoop shell, parse the results, and then call the Hadoop shell again with only the files that you want. However, this would introduce the overhead of starting a separate Hadoop shell process (a JVM) for each command, which might be unacceptable.
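As a rough sketch of what that script could look like (the paths are placeholders; the awk filter simply drops directory entries, whose permission string starts with 'd'):
SRC=/user/daniel/folder1
DST=/user/daniel/folder2
hdfs dfs -ls "$SRC" | awk '$1 !~ /^d/ && NF >= 8 {print $NF}' | while read -r f; do
  hdfs dfs -cp "$f" "$DST/"    # each invocation starts a new JVM, hence the overhead mentioned above
done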
Created 02-04-2016 10:55 AM
@Chris Nauroth This is good information. Thanks for sharing this.