Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Best practice for extract/output data (generated by Pig script) to be stored in a single file?

Rising Star

AIM: To grab a daily extract of data stored in HDFS/Hive, process it using Pig, then make results available externally as a single CSV file (automated using bash script).

OPTIONS:

1. Force output from Pig script to be stored as one file using 'PARALLEL 1' and then copy out using '-copyToLocal'

REGISTER /path/to/piggybank.jar;  -- CSVExcelStorage ships in Piggybank (jar path is illustrative)
extractAlias = ORDER stuff BY something ASC PARALLEL 1;  -- PARALLEL belongs on the reduce-side operator (ORDER), not on STORE
STORE extractAlias INTO '/hdfs/output/path' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
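The copy-out step for this option could then be sketched as below. The paths and part-file name are made up, and the HDFS copy is simulated with a local directory so the "exactly one part file" check is easy to follow:

```shell
# On a real cluster the copy would be something like:
#   hdfs dfs -copyToLocal /hdfs/output/path/part-r-00000 /local/dest/path/extract.csv
# Simulated locally: verify exactly one part file exists, then copy it out.
src=/tmp/pig_out_demo
dst=/tmp/extract.csv
mkdir -p "$src"
printf 'a,1\n' > "$src/part-r-00000"   # stand-in for the single reducer's output
parts=$(ls "$src"/part-* | wc -l)
if [ "$parts" -eq 1 ]; then
    cp "$src"/part-* "$dst"
else
    echo "expected exactly one part file, found $parts" >&2
    exit 1
fi
```

The guard matters in practice: if someone later removes PARALLEL 1, a blind copy of one part file would silently drop data.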

2. Allow default parallelism during Pig STORE and use '-getmerge' when copying out extract results

hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'
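For context, -getmerge essentially concatenates the job's part files in name order into one local file. A minimal local illustration of that behavior (directory and file names are made up):

```shell
# Create a fake job output directory with two part files, as a reduce job would.
mkdir -p /tmp/getmerge_demo
printf 'a,1\n' > /tmp/getmerge_demo/part-r-00000
printf 'b,2\n' > /tmp/getmerge_demo/part-r-00001
# -getmerge is effectively this: concatenate the parts in name order.
cat /tmp/getmerge_demo/part-* > /tmp/merged.csv
```

Note that because each part file is sorted independently, the merged file preserves the ORDER BY result only if the job used a single global sort; otherwise the concatenation is simply all rows, part by part.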

QUESTION:

Which way is more efficient/practical and why? Are there any other ways?

1 ACCEPTED SOLUTION


Generally, hard-coding PARALLEL in a Pig script is a bad idea. With PARALLEL 1 you are effectively having a single reducer perform the whole job, which can hurt scalability and performance.

I would allow default parallelism and use the hdfs dfs -getmerge option. From an input point of view, here is a tip on combining small files.

