Apache PIG - Join all the files from a specific directory

Hi experts, I have multiple Parquet files in an HDFS directory and I want to join all of them into one file using Apache Pig. I don't know how many files will be in this directory, so I can't declare a variable for each file. Is there a way to pick up all the files in the same directory that share the same schema?

Thanks!

Regards!!!

Re: Apache PIG - Join all the files from a specific directory

Just use the directory name in your LOAD statement.
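
As a rough sketch of what that could look like for Parquet input: the jar location and the HDFS paths below are placeholders, not values from the question, and the loader class name assumes a recent parquet-pig bundle (older releases ship the same classes under the parquet.pig package instead).

```pig
-- Assumption: location of the parquet-pig bundle jar on your cluster.
REGISTER /path/to/parquet-pig-bundle.jar;

-- Pointing LOAD at a directory reads every file in it into one relation,
-- so the number of files does not need to be known in advance.
-- '/data/input_dir' is a placeholder path.
all_rows = LOAD '/data/input_dir'
           USING org.apache.parquet.pig.ParquetLoader();

-- Write the combined rows back out as Parquet.
-- '/data/output_dir' is a placeholder and must not already exist.
STORE all_rows INTO '/data/output_dir'
      USING org.apache.parquet.pig.ParquetStorer();
```

Note that the STORE target is itself a directory and may still contain several part files, one per task. If a single physical file is really required, one option is to add a reduce step with PARALLEL 1 before the STORE, at the cost of pushing all the data through one reducer.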

Re: Apache PIG - Join all the files from a specific directory

For unioning all the files in one directory, the answer is the same as @Lester Martin's.

You can use globs (wildcard characters) in your LOAD path to pull a subset of files from a directory, based on the filename pattern. See http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_load.

For example you could LOAD the path

'parentDir/myFile_*' 

to load only files beginning with the name myFile_.

You suggested in your final sentence that files in the same directory may have different schemas. If that is the case and files with the same schema have similar names, you can use a glob in the LOAD path, as shown above, to pull only the same-schema files from the directory.
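
A minimal sketch of that idea, assuming two hypothetical filename families (myFile_ and otherFile_) that each share one schema. The otherFile_ prefix, the jar path, and the output paths are made up for illustration, and the loader class name again assumes a recent parquet-pig bundle.

```pig
-- Assumption: location of the parquet-pig bundle jar on your cluster.
REGISTER /path/to/parquet-pig-bundle.jar;

-- Each glob matches only the files that share one schema.
my_rows    = LOAD 'parentDir/myFile_*'
             USING org.apache.parquet.pig.ParquetLoader();
other_rows = LOAD 'parentDir/otherFile_*'
             USING org.apache.parquet.pig.ParquetLoader();

-- Keep the two schema families in separate outputs.
-- Both output paths are placeholders.
STORE my_rows    INTO 'outputDir/myFile_merged'
      USING org.apache.parquet.pig.ParquetStorer();
STORE other_rows INTO 'outputDir/otherFile_merged'
      USING org.apache.parquet.pig.ParquetStorer();
```

Because each relation is loaded through its own pattern, the number of files per family can change from run to run without touching the script.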
