
Appending to existing partition with Pig


Appending to existing partition with Pig

Contributor

Pig does not support appending to an existing partition through HCatalog.

What workarounds are there to perform the append and get behavior similar to Hive's INSERT INTO TABLE with Pig?

1 ACCEPTED SOLUTION


Re: Appending to existing partition with Pig

New Contributor

Pig does not have any notion of appending results to an existing directory. You could write your Pig job to put results into a temporary directory and then use a DFS command (in or out of Pig) to copy the files into the existing directory that you wanted to append to. Note that this has some dangers: jobs starting during the copy may see none, some, or all of the copied files. Also note that this is exactly what Hive does in its INSERT INTO if you aren't using a lock manager.
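As a rough sketch of that workaround (the table name, schema, and paths below are made up for illustration), the Pig script stores into a staging directory and then copies the part files into the partition directory from within the script:

```
-- Stage the results in a temporary directory first
events = LOAD '/data/incoming/events' USING PigStorage('\t')
         AS (id:chararray, ts:long, payload:chararray);
STORE events INTO '/tmp/staging/events_batch' USING PigStorage('\t');

-- Then copy the part files into the existing partition directory.
-- Jobs that start mid-copy may see only some of these files, and the
-- copy will fail if a part file with the same name already exists there.
fs -cp /tmp/staging/events_batch/part-* /apps/hive/warehouse/mydb.db/events/dt=2015-01-01/
```

The `fs` command runs any HDFS shell command from inside a Pig script, so the copy can be the last step of the same job.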

5 REPLIES 5


Re: Appending to existing partition with Pig

Expert Contributor

To reduce the risk of the dangers mentioned above, two recommendations:

- write the results to a temporary HDFS directory on the same HDFS filesystem/volume as the target directory;

- use a DFS command to move (rather than copy) the files into the target directory.

Done this way, the "move" is a near-atomic rename and the risk of race conditions is quite low.
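A minimal sketch of the move variant (paths again hypothetical). Because source and target are on the same HDFS filesystem, each `fs -mv` is a metadata-only rename, so readers see a whole file or no file, never a partial one:

```
-- Stage on the same HDFS filesystem as the target partition
STORE results INTO '/tmp/staging/batch_0042' USING PigStorage('\t');

-- Rename each part file into the partition directory; prefixing the
-- batch id avoids name collisions with part files already there.
fs -mv /tmp/staging/batch_0042/part-r-00000 /apps/hive/warehouse/mydb.db/events/dt=2015-01-01/batch_0042-part-r-00000
```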

Re: Appending to existing partition with Pig

You could also add an additional partitioning value to the end of the existing Hive table's partition definition. This could be something like an "ingestion key" which itself is arbitrary, but could also have some tie-in with something like an Oozie job id. So, if you partition at the day level but ingest at an hourly level, you simply end up with up to 24 subfolders underneath your "real" partition's folder.
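With that layout, each hourly run writes its own fresh sub-partition through HCatStorer, so nothing is ever appended in place. A sketch, assuming a hypothetical table `mydb.events` partitioned by `(dt, ingest_hour)`:

```
-- The 07:00 run targets its own sub-partition of the day
events = LOAD '/data/incoming/2015-01-01/07' USING PigStorage('\t')
         AS (id:chararray, ts:long, payload:chararray);
STORE events INTO 'mydb.events'
      USING org.apache.hive.hcatalog.pig.HCatStorer('dt=2015-01-01,ingest_hour=07');
```

Queries that filter only on `dt` still work unchanged; the extra key just gives each ingest a directory of its own.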

This shouldn't cause you any trouble, and you don't have to think about this additional partition definition when you query the table. The only drawback is if your hourly ingests don't build significantly sized files, but you'd have that problem with the cp/mv approach presented by @Alan Gates and @Sourygna Luangsay, too.

In that case, you can create a separate process that (explicitly) compacts/rebuilds all the data once all the adds to the "real" partition are done. @David Streever has done something like this at scale for a large telecom customer and I'd be glad to share some second-hand knowledge of how this could work for you.
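Such a compaction pass could itself be a small Pig job (names and paths hypothetical): read the whole day back through HCatLoader, rewrite it with a single reducer to get one large file, then swap directories once the last hourly add is done:

```
-- Pull the whole day back through HCatalog, across all ingest_hour sub-partitions
day = LOAD 'mydb.events' USING org.apache.hive.hcatalog.pig.HCatLoader();
day = FILTER day BY dt == '2015-01-01';

-- Force one reducer so the rewrite produces a single large file
compacted = ORDER day BY ts PARALLEL 1;
STORE compacted INTO '/tmp/staging/events_2015-01-01' USING PigStorage('\t');

-- Swap directories; afterwards the metastore's partition entries need
-- refreshing (e.g. with Hive's MSCK REPAIR TABLE) to drop the old sub-partitions
fs -mv /apps/hive/warehouse/mydb.db/events/dt=2015-01-01 /tmp/old/events_2015-01-01
fs -mv /tmp/staging/events_2015-01-01 /apps/hive/warehouse/mydb.db/events/dt=2015-01-01
```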

Re: Appending to existing partition with Pig

New Contributor

I'm interested to hear more details of the telecom use case for the compact/rebuild process, specifically around optimizing its performance.

Re: Appending to existing partition with Pig

New Contributor

I'm interested in the telecom use case, too. We're dealing with hourly ingests that result in a number of small files we'd like compacted regularly.
