Best practice for extracting/outputting data (generated by a Pig script) into a single file?
Labels:
- Apache Hadoop
- Apache Pig
Created 12-10-2015 06:11 AM
AIM: To grab a daily extract of data stored in HDFS/Hive, process it with Pig, then make the results available externally as a single CSV file (automated with a bash script).
OPTIONS:
1. Force the Pig script's output into a single file using 'PARALLEL 1', then copy it out with '-copyToLocal'
extractAlias = ORDER stuff BY something ASC PARALLEL 1; -- PARALLEL attaches to the reduce-side ORDER, not to STORE
STORE extractAlias INTO '/hdfs/output/path' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
2. Allow default parallelism during the Pig STORE and use '-getmerge' when copying the extract results out
hdfs dfs -getmerge '/hdfs/output/path' '/local/dest/path'
QUESTION:
Which way is more efficient/practical and why? Are there any other ways?
Created 12-10-2015 03:08 PM
Generally, hard-coding PARALLEL in a Pig script is a bad idea. With PARALLEL 1 you force a single reducer to perform the entire job, which limits scalability and hurts performance.
I would allow default parallelism and use the hdfs dfs -getmerge option, as sketched below. From an input point of view, here is a tip on combining small files.
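A minimal sketch of that flow, assuming a hypothetical daily_extract.pig script and hypothetical paths (adjust to your environment):

#!/usr/bin/env bash
# Hypothetical daily job: run Pig with default parallelism, then merge the
# resulting part-* files into a single local CSV.
set -euo pipefail

OUT_DIR='/hdfs/output/path'                          # HDFS directory written by the Pig STORE
LOCAL_CSV="/local/dest/path/extract_$(date +%F).csv" # single merged file for external consumers

# Run the Pig script; with no PARALLEL clause the default reducer count applies.
pig -f daily_extract.pig

# Concatenate every part file in the output directory into one local file.
hdfs dfs -getmerge "$OUT_DIR" "$LOCAL_CSV"

Since Pig's ORDER BY produces a total ordering across part files and getmerge concatenates them in filename order, the global sort should survive the merge.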
