Apache PIG - Join all the files from a specific directory

Hi experts, I have multiple Parquet files in an HDFS directory and I want to combine them all into one file using Apache Pig. I don't know how many files the directory will contain, so I can't declare a variable for each file. Is there a way to load all the files in the same directory that share the same schema?

Thanks!

Regards!!!

2 REPLIES

Re: Apache PIG - Join all the files from a specific directory

Just use the directory name in your LOAD statement.
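
A minimal sketch of what that could look like for Parquet input. The paths and the jar location are hypothetical, and it assumes the parquet-pig bundle is available on your cluster (older releases use the package name parquet.pig rather than org.apache.parquet.pig):

```pig
-- Hypothetical jar location; adjust to wherever parquet-pig lives on your cluster.
REGISTER /path/to/parquet-pig-bundle.jar;

-- Pointing LOAD at a directory reads every file in it into one relation,
-- so you never need to know how many files there are.
-- '/data/parentDir' is a placeholder path.
all_rows = LOAD '/data/parentDir' USING org.apache.parquet.pig.ParquetLoader();

-- Write the combined rows back out as Parquet ('/data/combined' is a placeholder).
STORE all_rows INTO '/data/combined' USING org.apache.parquet.pig.ParquetStorer();
```

Note that STORE writes an output directory of part files rather than a single named file, so how many files you end up with depends on how many tasks the job runs.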

Re: Apache PIG - Join all the files from a specific directory

To union all the files in one directory, the answer is the same as @Lester Martin's: point LOAD at the directory.

You can also use globs (wildcard characters) in your LOAD path to pull only a subset of files from a directory, based on a filename pattern. See http://chimera.labs.oreilly.com/books/1234000001811/ch05.html#pl_load.

For example you could LOAD the path

'parentDir/myFile_*' 

to load only files beginning with the name myFile_.

Your final sentence suggests that files in the same directory may have different schemas. If that is the case, and files with the same schema have similar names, you can use a glob in the LOAD path as shown above to pull only the same-schema files from the directory.
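
As a rough sketch of the mixed-schema case, suppose parentDir holds two families of Parquet files named orders_* and customers_*. Those names, the jar path, and the loader package are assumptions for illustration only; substitute your own:

```pig
-- Hypothetical jar location; older parquet-pig releases use parquet.pig.ParquetLoader.
REGISTER /path/to/parquet-pig-bundle.jar;

-- One LOAD per filename pattern, so each relation only sees files with one schema.
-- 'orders_*' and 'customers_*' are made-up prefixes.
orders    = LOAD 'parentDir/orders_*'    USING org.apache.parquet.pig.ParquetLoader();
customers = LOAD 'parentDir/customers_*' USING org.apache.parquet.pig.ParquetLoader();

-- Print the schema Pig picked up for each relation to confirm the split worked.
DESCRIBE orders;
DESCRIBE customers;
```

Each relation can then be processed or stored separately, which avoids mixing rows with incompatible schemas in a single LOAD.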