Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Best practice for extract/output data (generated by Pig script) to be stored in a single file?

Rising Star

AIM: To grab a daily extract of data stored in HDFS/Hive, process it using Pig, then make results available externally as a single CSV file (automated using bash script).

OPTIONS:

1. Force output from Pig script to be stored as one file using 'PARALLEL 1' and then copy out using '-copyToLocal'

REGISTER /path/to/piggybank.jar;  -- CSVExcelStorage ships in Piggybank (jar path is illustrative)
extractAlias = ORDER stuff BY something ASC PARALLEL 1;  -- PARALLEL belongs on the reduce-side operator (ORDER), not on STORE
STORE extractAlias INTO '/hdfs/output/path' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
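The copy-out step for this option could then be sketched as below. The paths and part-file name are made up, and the HDFS copy is simulated with a local directory so the "exactly one part file" check is easy to follow:

```shell
# On a real cluster the copy would be something like:
#   hdfs dfs -copyToLocal /hdfs/output/path/part-r-00000 /local/dest/path/extract.csv
# Simulated locally: verify exactly one part file exists, then copy it out.
src=/tmp/pig_out_demo
dst=/tmp/extract.csv
mkdir -p "$src"
printf 'a,1\n' > "$src/part-r-00000"   # stand-in for the single reducer's output
parts=$(ls "$src"/part-* | wc -l)
if [ "$parts" -eq 1 ]; then
    cp "$src"/part-* "$dst"
else
    echo "expected exactly one part file, found $parts" >&2
    exit 1
fi
```

The guard matters in practice: if someone later removes PARALLEL 1, a blind copy of one part file would silently drop data.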

2. Allow default parallelism during Pig STORE and use '-getmerge' when copying out extract results

hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'
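For context, -getmerge essentially concatenates the job's part files in name order into one local file. A minimal local illustration of that behavior (directory and file names are made up):

```shell
# Create a fake job output directory with two part files, as a reduce job would.
mkdir -p /tmp/getmerge_demo
printf 'a,1\n' > /tmp/getmerge_demo/part-r-00000
printf 'b,2\n' > /tmp/getmerge_demo/part-r-00001
# -getmerge is effectively this: concatenate the parts in name order.
cat /tmp/getmerge_demo/part-* > /tmp/merged.csv
```

Note that because each part file is sorted independently, the merged file preserves the ORDER BY result only if the job used a single global sort; otherwise the concatenation is simply all rows, part by part.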

QUESTION:

Which way is more efficient/practical and why? Are there any other ways?

1 ACCEPTED SOLUTION


Generally, hard-coding PARALLEL in a Pig script is a bad idea. With PARALLEL 1 you are effectively having a single reducer perform the whole job, which can hurt scalability and performance.

I would allow default parallelism and use the hdfs dfs -getmerge option. From an input point of view, here is a tip on combining small files.

