Appending to existing partition with Pig

Rising Star

Pig does not support appending to an existing partition through HCatalog.

What workarounds are there to perform the append and get behavior similar to Hive's INSERT INTO TABLE from Pig?

1 ACCEPTED SOLUTION

Contributor

Pig does not have any notion of appending results to an existing directory. You could write your Pig job to put its results into a temporary directory and then use a DFS command (in or out of Pig) to copy the files into the existing directory you want to append to. Note that this has some dangers: jobs that start while the copy is in progress may or may not see the copied files, or may see only some of them. Also note that this is exactly what Hive does in its INSERT INTO if you aren't using a lock manager.
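
A minimal sketch of that workaround, assuming a hypothetical tab-delimited table whose day partition lives under /apps/hive/warehouse/mydb.db/events/ds=2015-06-01 (all paths, table and field names below are made up):

-- made-up paths and schema; Pig will not STORE into an existing directory,
-- so the results go into a temporary directory first
new_rows = LOAD '/staging/events_delta' USING PigStorage('\t')
           AS (id:long, payload:chararray);

STORE new_rows INTO '/tmp/events_append' USING PigStorage('\t');

-- copy the part files into the partition directory you want to append to;
-- jobs that start while the copy runs may see none, some, or all of these files
fs -cp /tmp/events_append/part-* /apps/hive/warehouse/mydb.db/events/ds=2015-06-01/

One extra caveat with this sketch: part file names from different runs can collide with files already sitting in the partition, so you may need to rename them (for example with a run-specific prefix) as part of the copy.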

5 REPLIES

Super Collaborator

To reduce the risk of the dangers mentioned above, a couple of recommendations:

- write the results to a temporary HDFS directory on the same HDFS filesystem/volume as the target directory

- use a DFS command to move (instead of copy) the files into the target directory.

That way the "move" is essentially a rename, so it is close to atomic and the risk of race conditions is quite low.
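
For instance, reusing the made-up paths from the sketch above and keeping the temporary directory on the same filesystem as the table:

-- made-up paths; the temporary directory sits next to the table, so the move is
-- a rename in the NameNode rather than a byte-by-byte copy between filesystems
new_rows = LOAD '/staging/events_delta' USING PigStorage('\t')
           AS (id:long, payload:chararray);

STORE new_rows INTO '/apps/hive/warehouse/mydb.db/events/.tmp_append' USING PigStorage('\t');

-- each rename is atomic per file, though moving several part files is still not
-- one single atomic operation
fs -mv /apps/hive/warehouse/mydb.db/events/.tmp_append/part-* /apps/hive/warehouse/mydb.db/events/ds=2015-06-01/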


You could also add an additional partitioning value to the end of the existing Hive table's partition definition. This could be something like an "ingestion key", which is itself arbitrary but could also have some tie-in with something like an Oozie job id. So, if you partition at a day level but ingest at an hourly level, you simply end up with 24 subfolders underneath your "real" partition's folder.
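
A sketch of what that could look like from Pig with HCatStorer, assuming a hypothetical mydb.events table partitioned by (ds, ingest_key) and a script launched with pig -useHCatalog (table, column and key names are all made up):

-- made-up table and columns; run with: pig -useHCatalog
new_rows = LOAD '/staging/events_delta' USING PigStorage('\t')
           AS (id:long, payload:chararray);

-- each hourly run writes into its own sub-partition of the "real" day partition,
-- so no run ever has to append to an existing directory
STORE new_rows INTO 'mydb.events'
      USING org.apache.hive.hcatalog.pig.HCatStorer('ds=2015-06-01,ingest_key=h07');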

This shouldn't cause you any trouble, and you don't have to think about this additional partition definition when you query the table. The only consequence is that each hourly ingest may not build a significantly sized file, but you'd have that problem with the cp/mv approach presented by @Alan Gates and @Sourygna Luangsay, too.

In that case, you can create a separate process that (explicitly) compacts/rebuilds all the data once all the adds to the "real" partition are done. @David Streever has done something like this at scale for a large telecom customer and I'd be glad to share some second-hand knowledge of how this could work for you.

New Contributor

I'm interested to hear more details of the telecom use case for the compact/rebuild process, specifically around optimizing its performance.

Rising Star

I'm interested in the telecom use case, too. We're dealing with hourly ingests that result in a number of small files we'd like compacted regularly.