Appending to existing partition with Pig

Contributor

Pig does not support appending to an existing partition through HCatalog.

What workarounds are there to perform the append and get behavior similar to Hive's INSERT INTO TABLE from Pig?

1 ACCEPTED SOLUTION

Explorer

Pig does not have any notion of appending results to an existing directory. You could write your Pig job to put results into a temporary directory and then use a DFS command (in or out of Pig) to copy the files into the existing directory that you wanted to append to. Note that this has some dangers (jobs that start while the copy is in progress may or may not see the copied files, or may see only some of them). Also note that this is exactly what Hive does in its INSERT INTO if you aren't using a lock manager.
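
A minimal Pig sketch of this workaround; the table name, paths, and run id below are illustrative, not from the original post:

    -- minimal sketch of the temp-dir-then-copy workaround; all paths are hypothetical
    results = LOAD '/data/input/2015-06-01' USING PigStorage('\t');

    -- STORE refuses to write into an existing directory, so stage the output elsewhere
    STORE results INTO '/tmp/staging/run_001' USING PigStorage('\t');

    -- copy the part files into the existing partition directory; the part-* names
    -- can collide with files already there, so renaming them first (or using a
    -- unique prefix per run) may be needed in practice
    fs -cp /tmp/staging/run_001/part-* /apps/hive/warehouse/mytable/dt=2015-06-01/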

5 REPLIES

Expert Contributor

To avoid, or at least reduce, the risk of the dangers mentioned above, a couple of recommendations (see the sketch after this list):

- write the results to a temporary HDFS directory on the same HDFS volume as the target directory;

- use a DFS command to move (instead of copy) the files into the target directory.

Done this way, the move is a metadata-only rename, so the operation is close to atomic and the risk of race conditions is quite low.
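
A sketch of this safer variant, again with hypothetical paths; the staging directory sits on the same volume as the warehouse, so the move is a rename rather than a data copy:

    -- stage on the same HDFS volume as the target partition (paths are hypothetical)
    results = LOAD '/data/input/2015-06-01' USING PigStorage('\t');
    STORE results INTO '/apps/hive/warehouse/.staging/run_001' USING PigStorage('\t');

    -- a same-volume move is a metadata rename in the NameNode, so each file lands
    -- in the target atomically; file names still must not collide with existing ones
    fs -mv /apps/hive/warehouse/.staging/run_001/part-* /apps/hive/warehouse/mytable/dt=2015-06-01/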

You could also add an additional partitioning value to the end of the existing Hive table's partition definition. This could be something like an "ingestion key" which is itself arbitrary, but could also have some tie-in with something like an Oozie job id. So, if you partition at a day level but ingest at an hourly level, you'd simply have 24 subfolders underneath your "real" partition's folder, as sketched below.
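
A rough sketch of this scheme in Pig, assuming a hypothetical default.events table partitioned by (dt string, ingest_id string); because each hourly run writes a brand-new (dt, ingest_id) partition, HCatStorer is creating a partition rather than appending to one, so the restriction above never applies:

    -- run with: pig -useHCatalog
    -- hypothetical table: default.events, partitioned by (dt string, ingest_id string)
    hourly = LOAD '/landing/2015-06-01/09' USING PigStorage('\t')
             AS (user_id:chararray, action:chararray);

    -- every hourly load targets a fresh (dt, ingest_id) partition, so this is a
    -- create, not an append (older HCatalog releases use the
    -- org.apache.hcatalog.pig.HCatStorer class name instead)
    STORE hourly INTO 'default.events'
        USING org.apache.hive.hcatalog.pig.HCatStorer('dt=2015-06-01,ingest_id=h09');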

This shouldn't cause you any trouble, and you don't have to think about the additional partition key when you query the table. The only consequence is that hourly ingests may not build significantly sized files, but you'd have that problem with the cp/mv approach presented by @Alan Gates and @Sourygna Luangsay, too.

In that case, you can create a separate process that explicitly compacts/rebuilds all the data once all the additions to the "real" partition are done. @David Streever has done something like this at scale for a large telecom customer, and I'd be glad to share some second-hand knowledge of how this could work for you.
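
One way to build such a compaction pass in Pig, sketched under the same hypothetical layout as above; it leans on Pig's split combination to turn many small input files into a few large output files:

    -- combine small input splits into ~256 MB map tasks, so the map-only
    -- rewrite emits far fewer, larger part files
    SET pig.maxCombinedSplitSize 268435456;

    -- read the whole day across all ingest_id subfolders (paths are hypothetical)
    day = LOAD '/apps/hive/warehouse/events/dt=2015-06-01/*' USING PigStorage('\t');
    STORE day INTO '/tmp/compact/dt=2015-06-01' USING PigStorage('\t');

    -- swap the compacted output in; run only after the day's last hourly add, and
    -- remember the metastore's old ingest_id partitions must be dropped and a
    -- replacement partition registered (omitted here)
    fs -rm -r /apps/hive/warehouse/events/dt=2015-06-01
    fs -mv /tmp/compact/dt=2015-06-01 /apps/hive/warehouse/events/dt=2015-06-01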

New Contributor

I'm interested to hear more details of the telecom use case for the compact/rebuild process, specifically around optimizing its performance.

Explorer

I'm interested in the telecom use case, too. We're dealing with hourly ingests that result in a number of small files we'd like regularly compacted.
