Best practice for extract/output data (generated by Pig script) to be stored in a single file?

emilysharpe — Thu, 10 Dec 2015 14:11:19 GMT

AIM: To grab a daily extract of data stored in HDFS/Hive, process it using Pig, then make results available externally as a single CSV file (automated using bash script).

OPTIONS:

1. Force output from Pig script to be stored as one file using 'PARALLEL 1' and then copy out using '-copyToLocal'

extractAlias = ORDER stuff BY something ASC;
STORE extractAlias INTO '/hdfs/output/path' USING CSVExcelStorage() PARALLEL 1;

2. Allow default parallelism during Pig STORE and use '-getmerge' when copying out extract results

hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'

QUESTION:

Which way is more efficient/practical and why? Are there any other ways?

Re: Best practice for extract/output data (generated by Pig script) to be stored in a single file?

amcbarnett — Thu, 10 Dec 2015 23:08:11 GMT

I believe hard-coding PARALLEL in your Pig script is generally a bad idea. With PARALLEL 1, you are effectively having a single reducer perform the job, which can hurt scalability and performance.

I would allow default parallelism and use the hdfs dfs -getmerge option. From an input point of view, here is a tip to combine small files.
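
For the bash automation mentioned in the question, the -getmerge route could look roughly like the sketch below. It is only an illustration: the function name export_extract and the .tmp suffix are made up for this example, and it assumes the job writes a _SUCCESS marker into the output directory on completion (the usual default for MapReduce-based jobs, but worth verifying on your cluster).

```shell
#!/usr/bin/env bash
# Sketch: merge the part files written by a Pig STORE into one local CSV.
# export_extract is a hypothetical helper name, not part of Pig or HDFS.

export_extract() {
    local hdfs_dir="$1"     # HDFS directory written by the Pig STORE
    local local_file="$2"   # final single CSV on the local filesystem
    local tmp_file="${local_file}.tmp"

    # Assumption: the job leaves a _SUCCESS marker when it finishes cleanly.
    # Refuse to export a partial result if the marker is missing.
    if ! hdfs dfs -test -e "${hdfs_dir}/_SUCCESS"; then
        echo "no _SUCCESS marker in ${hdfs_dir}; skipping export" >&2
        return 1
    fi

    rm -f "$tmp_file"

    # -getmerge concatenates the files in the HDFS directory into one local file.
    hdfs dfs -getmerge "$hdfs_dir" "$tmp_file" || return 1

    # getmerge may leave a hidden checksum file next to the target; remove it if present.
    rm -f "$(dirname "$tmp_file")/.$(basename "$tmp_file").crc"

    # Rename last so downstream consumers never pick up a half-written file.
    mv -f "$tmp_file" "$local_file"
}
```

It would be called after the Pig job finishes, for example: export_extract '/hdfs/output/path' '/local/dest/path'. Writing to a temporary name and renaming at the end is a design choice for the "available externally" part, so whoever collects the CSV only ever sees a complete file.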
