
Appending to existing partition with Pig

Rising Star

Pig does not support appending to an existing partition through HCatalog.

What workarounds are there to perform the append and get behavior similar to Hive's INSERT INTO TABLE from Pig?

5 REPLIES


Super Collaborator

To avoid, or at least reduce the risk of, the dangers mentioned above, two recommendations:

- Write the Pig results to a temporary HDFS directory (on the same HDFS volume as the target directory).

- Use a DFS command to move (instead of copy) the files into the target directory.

Done this way, the "move" operation should be close to atomic, and the risk of race conditions quite low.
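A minimal sketch of that pattern in a single Pig script (Pig can run fs shell commands inline); the table name, paths, and fields here are all hypothetical:

    -- write the new rows to a temporary directory on the same HDFS volume
    events = LOAD '/landing/events/2015-06-01' AS (id:long, payload:chararray);
    STORE events INTO '/tmp/events_staging/dt=2015-06-01' USING PigStorage('\t');

    -- move (not copy) the part files into the existing partition directory;
    -- a same-volume move is just a namenode rename, so it is close to atomic.
    -- Note: rename the part files first if an earlier job already left files
    -- with the same part-* names in the partition.
    fs -mv /tmp/events_staging/dt=2015-06-01/part-* /apps/hive/warehouse/mydb.db/events/dt=2015-06-01/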


You could also add an additional partitioning value to the end of the existing Hive table's partition definition. This could be something like an "ingestion key", which is itself arbitrary but could also have some tie-in with something like an Oozie job id. So, if you partition at a day level but ingest at an hourly level, you simply end up with 24 subfolders underneath your "real" partition's folder.
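As an illustration (the table and field names are hypothetical, and this assumes a table declared with PARTITIONED BY on both dt and ingest_id), each hourly Pig run then targets a brand-new sub-partition, which HCatStorer can create; no append to an existing partition is ever needed:

    events = LOAD '/landing/events/2015-06-01/05' AS (id:long, payload:chararray);
    -- 'ingest_id' is arbitrary: the ingest hour here, but an Oozie job id works too.
    -- Every run writes a new (dt, ingest_id) partition, so this is always a
    -- new-partition write rather than an append.
    STORE events INTO 'mydb.events'
        USING org.apache.hive.hcatalog.pig.HCatStorer('dt=2015-06-01,ingest_id=h05');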

This shouldn't cause you any trouble, and you don't have to think about this additional partition definition when you query the table. The only consequence is if your hourly ingests don't build significantly sized files, but you'd have that problem with the cp/mv approach presented by @Alan Gates and @Sourygna Luangsay, too.

In that case, you can create a separate process that explicitly compacts/rebuilds all the data once all the adds to the "real" partition are done. @David Streever has done something like this at scale for a large telecom customer, and I'd be glad to share some second-hand knowledge of how this could work for you.
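A rough sketch of what such a compaction pass could look like in Pig, under the same hypothetical schema as above:

    -- once every hourly sub-partition for the day has landed, read the whole
    -- day back through HCatalog and rewrite it as a few large files
    day    = LOAD 'mydb.events' USING org.apache.hive.hcatalog.pig.HCatLoader();
    day    = FILTER day BY dt == '2015-06-01';
    day    = FOREACH day GENERATE id, payload;  -- drop the partition columns
    merged = ORDER day BY id PARALLEL 2;        -- force a reduce: 2 reducers -> 2 files
    STORE merged INTO '/tmp/events_compact/dt=2015-06-01' USING PigStorage('\t');
    -- the staged files can then be swapped into place with the same fs -mv
    -- approach shown earlier in the thread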

New Contributor

I'd be interested to hear more details of the telecom use case for the compact/rebuild process, specifically around optimizing its performance.

Rising Star

I'm interested in the telecom use case, too... We're dealing with hourly ingests that result in a number of small files we'd like regularly compacted.