Created 08-25-2016 06:50 PM
Hi experts, I have multiple Parquet files in an HDFS directory and I want to merge them all into one file using Apache Pig. I don't know how many files this directory will contain, so I can't declare a variable for each file. Is there a way to load all the files in the same directory that share the same schema?
Thanks!
Regards!!!
Created 08-26-2016 12:41 PM
Just use the directory name in your LOAD statement. Pig will read every file in that directory.
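A minimal sketch of what that could look like for Parquet files. The jar location, the HDFS paths, and the alias name are all hypothetical, and the loader's package name depends on your parquet-pig version, so treat this as a starting point rather than a tested script:

```pig
-- Sketch only: the jar path, the HDFS paths, and the alias are hypothetical.
-- Older parquet-pig releases use parquet.pig.ParquetLoader; newer ones use
-- org.apache.parquet.pig.ParquetLoader. Adjust to whatever your cluster ships.
REGISTER /path/to/parquet-pig-bundle.jar;

-- Pointing LOAD at a directory reads every file inside it.
all_rows = LOAD '/data/parentDir' USING parquet.pig.ParquetLoader();

-- STORE writes an output directory, not a single named file.
STORE all_rows INTO '/data/merged' USING parquet.pig.ParquetStorer();
```

Note that the output directory gets one part file per task that writes results, so you may still end up with more than one file unless the job runs with a single writing task.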
Created 09-01-2016 01:27 PM
For unioning all the files in one directory, my answer is the same as @Lester Martin's.
You can use globs (wildcard characters) in your LOAD path to pull a subset of files from a directory, based on the filename pattern. See http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_load.
For example you could LOAD the path
'parentDir/myFile_*'
to load only files beginning with the name myFile_.
You suggested in your final sentence that the files in the same directory may have different schemas. If that is the case, and files with the same schema follow a common naming pattern, you can use globs in the LOAD path as shown above to pull only the same-schema files from the directory.
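As a rough illustration, a glob-based load of the myFile_ files might look like the following. The jar path and the alias are hypothetical, and the loader class name is an assumption that depends on your parquet-pig version:

```pig
-- Sketch only: the jar path and the alias are hypothetical.
-- Use org.apache.parquet.pig.ParquetLoader instead if your parquet-pig
-- version has moved to the org.apache package.
REGISTER /path/to/parquet-pig-bundle.jar;

-- The glob matches only files in parentDir whose names start with myFile_.
my_files = LOAD 'parentDir/myFile_*' USING parquet.pig.ParquetLoader();

-- Print the schema Pig inferred, to confirm the matched files agree.
DESCRIBE my_files;
```

Running DESCRIBE before any further processing is a cheap way to check that the pattern picked up the schema you expected.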