AIM: To take a daily extract of data stored in HDFS/Hive, process it with Pig, and then make the results available externally as a single CSV file (automated with a bash script).
1. Force output from Pig script to be stored as one file using 'PARALLEL 1' and then copy out using '-copyToLocal'
-- PARALLEL belongs on the reduce-side operator (here ORDER), not on STORE;
-- CSVExcelStorage lives in piggybank, which must be registered first
REGISTER piggybank.jar;
extractAlias = ORDER stuff BY something ASC PARALLEL 1;
STORE extractAlias INTO '/hdfs/output/path'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage();
2. Allow default parallelism during the Pig STORE and use '-getmerge' when copying out the extract results
hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'
Which way is more efficient/practical, and why? Are there any other ways?
I believe hard-coding PARALLEL in your Pig script is generally a bad idea. With PARALLEL 1 you force the entire job through a single reducer, which hurts scalability and performance.
I would allow default parallelism and use the hdfs dfs -getmerge option.
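The effect of -getmerge can be sketched locally with plain shell (the directory and data below are hypothetical stand-ins, not real HDFS paths): it concatenates the job's part files, in name order, into a single destination file.

```shell
# Hypothetical local simulation of what 'hdfs dfs -getmerge' does:
# concatenate the reducer part files, in name order, into one file.
demo_dir=$(mktemp -d)                        # stands in for /hdfs/output/path
printf 'a,1\n' > "$demo_dir/part-r-00000"    # reducer outputs (made-up data)
printf 'b,2\n' > "$demo_dir/part-r-00001"
cat "$demo_dir"/part-* > "$demo_dir/merged.csv"   # the merge step
cat "$demo_dir/merged.csv"
```

In the real flow the source is the HDFS output directory and the destination is a local path, so -getmerge also replaces the separate -copyToLocal step from option 1.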
From an input point of view, here is a tip on combining small files.