AIM: To grab a daily extract of data stored in HDFS/Hive, process it using Pig, then make the results available externally as a single CSV file (automated with a bash script).
OPTIONS:
1. Force the Pig output into a single file by running the final ORDER with 'PARALLEL 1' (a single reducer), then copy that file out using '-copyToLocal'
REGISTER '/path/to/piggybank.jar';  -- CSVExcelStorage ships in Piggybank (jar path is an example)
extractAlias = ORDER stuff BY something ASC PARALLEL 1;  -- PARALLEL belongs on the reduce-side operator (ORDER), not on STORE
STORE extractAlias INTO '/hdfs/output/path' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
hdfs dfs -copyToLocal '/hdfs/output/path/part-r-00000' '/local/dest/path'
2. Allow Pig's default parallelism during STORE, then merge the part files with '-getmerge' when copying the results out
hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'
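For context, the daily automation around option 2 might look roughly like the sketch below. The Pig script name (extract.pig), the date-stamped paths, and the dry-run toggle are placeholders I have assumed, not details from the actual job:

```shell
#!/usr/bin/env bash
# Hypothetical daily wrapper for option 2: run Pig with default parallelism,
# then getmerge the part files into one local CSV. Paths and script name are
# assumptions for illustration only.
set -euo pipefail

RUN_DATE=$(date +%Y-%m-%d)
HDFS_OUT="/hdfs/output/path/${RUN_DATE}"
LOCAL_CSV="/local/dest/path/extract_${RUN_DATE}.csv"

# Dry-run by default so the command sequence can be inspected;
# set DRY_RUN=0 to execute against a real cluster.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

run hdfs dfs -rm -r -f "$HDFS_OUT"               # clear any previous run's output
run pig -param OUT="$HDFS_OUT" extract.pig       # Pig writes part-* files, default parallelism
run hdfs dfs -getmerge "$HDFS_OUT" "$LOCAL_CSV"  # concatenate part files into one local CSV
```

Scheduling this from cron would then only require the one script, since getmerge handles the many-part-files-to-one-CSV step without constraining the Pig job to a single reducer.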
QUESTION:
Which approach is more efficient/practical, and why? Are there any other ways?