Created 10-30-2015 11:26 PM
Pig does not support appending to an existing partition through HCatalog.
What workarounds are there to perform the append and get behavior similar to Hive's INSERT INTO TABLE from Pig?
Created 11-16-2015 08:06 PM
Pig does not have any notion of appending results to an existing directory. You could write your Pig job to put its results into a temporary directory and then use a DFS command (in or out of Pig) to copy the files into the existing directory you want to append to. Note that this has some dangers: jobs that start while the copy is in progress may see none, some, or all of the copied files. Also note that this is exactly what Hive's INSERT INTO does if you aren't using a lock manager.
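A minimal sketch of that pattern in Pig Latin; the input path, schema, and warehouse/partition paths below are hypothetical:

    -- write the new results into a temporary staging directory first
    raw = LOAD '/data/input/2015-10-30' USING PigStorage(',') AS (id:int, val:chararray);
    STORE raw INTO '/tmp/staging/dt=2015-10-30' USING PigStorage(',');

    -- Pig's fs command runs Hadoop FsShell, so the copy can stay in the same script.
    -- Beware that part-file names may collide with files already in the target
    -- partition, in which case the staged files need renaming first.
    fs -cp /tmp/staging/dt=2015-10-30/part-* /apps/hive/warehouse/mytable/dt=2015-10-30/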
Created 11-16-2015 08:32 PM
To avoid, or at least reduce, the dangers mentioned above, a couple of recommendations:
- write the results to a temporary HDFS directory that lives on the same HDFS volume as the target directory;
- use a DFS command to move (instead of copy) the files into the target directory.
Since a move within the same filesystem is a metadata-only rename, each file appears in the target directory atomically, which keeps the risk of race conditions quite low.
Created 01-20-2016 10:50 PM
You could also add an additional partition column to the end of the existing Hive table's partition definition. This could be something like an "ingestion key": arbitrary in itself, but possibly tied to something like an Oozie job id. So, if you partition at a day level but ingest at an hourly level, you simply end up with 24 subfolders underneath your "real" partition's folder.
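For example (the table, column, and partition names here are hypothetical), each hourly Pig job can target its own sub-partition through HCatStorer, so no append is ever needed:

    -- assumes the Hive table was declared with the extra partition column, e.g.:
    --   CREATE TABLE events (id INT, val STRING)
    --   PARTITIONED BY (dt STRING, ingest_key STRING) STORED AS ORC;
    raw = LOAD '/data/input/2015-10-30/09' USING PigStorage(',') AS (id:int, val:chararray);

    -- every hourly run writes a fresh (dt, ingest_key) sub-partition
    STORE raw INTO 'events'
        USING org.apache.hive.hcatalog.pig.HCatStorer('dt=2015-10-30,ingest_key=09');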
This extra partition column shouldn't cause you any trouble, and you don't have to think about it when you query the table. The only drawback appears if your hourly ingests don't build files of significant size, but you'd have that problem with the cp/mv approach presented by @Alan Gates and @Sourygna Luangsay, too.
In that case, you can create a separate process that (explicitly) compacts/rebuilds all the data once all the additions to the "real" partition are done. @David Streever has done something like this at scale for a large telecom customer and I'd be glad to share some second-hand knowledge of how this could work for you.
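To give a flavor of it, here is a minimal Pig sketch of one compact-and-swap pass; the paths and split size are illustrative, and this is not the telecom implementation itself:

    -- combine small input splits so the rewrite emits a few large files (~1 GB each)
    SET pig.splitCombination true;
    SET pig.maxCombinedSplitSize 1073741824;

    day = LOAD '/apps/hive/warehouse/events/dt=2015-10-30/*' USING PigStorage(',')
          AS (id:int, val:chararray);
    STORE day INTO '/tmp/compacted/dt=2015-10-30' USING PigStorage(',');

    -- swap the compacted copy in for the original partition directory;
    -- if sub-partitions were registered in the metastore, they must be
    -- dropped/re-added afterwards (e.g. ALTER TABLE ... DROP PARTITION)
    fs -mv /apps/hive/warehouse/events/dt=2015-10-30 /tmp/old_dt=2015-10-30
    fs -mv /tmp/compacted/dt=2015-10-30 /apps/hive/warehouse/events/dt=2015-10-30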
Created 08-08-2016 06:13 PM
I'm interested in hearing more details about the telecom use case for the compact/rebuild process, specifically around optimizing its performance.
Created 09-01-2016 05:08 PM
I'm interested in the telecom use case, too. We're dealing with hourly ingests that produce a number of small files we'd like to compact regularly.