Best practice for extracting/outputting data (generated by a Pig script) into a single file?
Labels:
- Apache Hadoop
- Apache Pig
Created 12-10-2015 06:11 AM
AIM: To grab a daily extract of data stored in HDFS/Hive, process it with Pig, then make the results available externally as a single CSV file (automated with a bash script).
OPTIONS:
1. Force the Pig script's output into a single file using 'PARALLEL 1', then copy it out with '-copyToLocal'
extractAlias = ORDER stuff BY something ASC PARALLEL 1; -- PARALLEL attaches to the reduce-side ORDER, not to STORE
STORE extractAlias INTO '/hdfs/output/path' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
2. Allow default parallelism during the Pig STORE and use '-getmerge' when copying the extract results out
hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'
QUESTION:
Which way is more efficient/practical and why? Are there any other ways?
Created 12-10-2015 03:08 PM
Generally, hard-coding PARALLEL in a Pig script is a bad idea. With PARALLEL 1 you force a single reducer to perform the entire job, which limits scalability and hurts performance.
I would allow default parallelism and use the hdfs dfs -getmerge option, as sketched below. From an input point of view, here is a tip on combining small files.
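A minimal sketch of that flow, assuming a hypothetical daily_extract.pig script and hypothetical paths (adjust to your environment):

#!/usr/bin/env bash
# Hypothetical daily job: run Pig with default parallelism, then merge the
# resulting part-* files into a single local CSV.
set -euo pipefail

OUT_DIR='/hdfs/output/path'                          # HDFS directory written by the Pig STORE
LOCAL_CSV="/local/dest/path/extract_$(date +%F).csv" # single merged file for external consumers

# Run the Pig script; with no PARALLEL clause the default reducer count applies.
pig -f daily_extract.pig

# Concatenate every part file in the output directory into one local file.
hdfs dfs -getmerge "$OUT_DIR" "$LOCAL_CSV"

Since Pig's ORDER BY produces a total ordering across part files and getmerge concatenates them in filename order, the global sort should survive the merge.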
