Created 12-16-2016 06:45 AM
Created 12-16-2016 06:50 AM
it's easier to drop a table and recreate, delete from a table and truncate are not supported by API
set hcat.bin /usr/bin/hcat; sql drop table if exists codes; sql create table codes(code string, description string, total_emp int, salary int); a = load 'sample_07' using org.apache.hive.hcatalog.pig.HCatLoader(); b = load 'sample_08' using org.apache.hive.hcatalog.pig.HCatLoader(); c = join b by code, a by code;d = foreach c generate $0 as code, $1 as description, $2 as total_emp, $3 as salary; store d into 'codes' using org.apache.hive.hcatalog.pig.HCatStorer();
Created 12-18-2016 05:44 AM
I think that's the only solution i think is the best at the moment using temporary table. Does Overwrite exists in Pig? Like Hive Overwrite table data.
Created 12-18-2016 05:11 PM
According to the wiki it exists but in our mutual tests, we were not able to fit your use case. Perhaps you'd like to open an enhancement Jira. I do like integration of Pig and HCatalog though I would argue instead of spending hours to figure it out, consider looking into Spark as it's under a heavy development and benefits from a wider community. If my workaround fits your needs, please consider accepting my answer as best.
Created 12-19-2016 08:30 PM
Yes, This is testing only and i am doing it out of my curiosity. Thanks you very much for your valuable responses.
Could you let me know how to open an enhancement Jira request ?
Created 12-19-2016 08:33 PM
@Praveen PentaReddy create an Apache Jira account and open a ticket on https://issues.apache.org/jira/browse/pig/?selectedTab=com.atlassian.jira.jira-projects-plugin:summa... apache pig jira page.
Created 12-19-2016 08:58 PM
Thanks, i have raised an enhancement Jira.
Created 12-20-2016 12:44 AM
Can you paste a link to your Jira and accept my answer as workaround. Thanks
Created 12-29-2016 09:00 PM
to close the loop on this, turns out append is the default behavior and if you read the comments in https://issues.apache.org/jira/browse/HIVE-6897 you can see that it is not advisable to force an overwrite of a table via HCatalog. So to turn the feature off completely and not promote "bad" behavior I did the following
grunt> sql alter table codeZ set TBLPROPERTIES ('immutable' = 'true'); 2016-12-29 20:49:51,924 [main] INFO org.apache.pig.tools.grunt.GruntParser - Going to run hcat command: alter table codeZ set TBLPROPERTIES ('immutable' = 'true'); OK Time taken: 2.041 seconds grunt> a = load 'sample_07' using org.apache.hive.hcatalog.pig.HCatLoader(); 2016-12-29 20:51:00,125 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://sandbox.hortonworks.com:9083 2016-12-29 20:51:00,163 [main] INFO hive.metastore - Connected to metastore. grunt> b = load 'sample_08' using org.apache.hive.hcatalog.pig.HCatLoader(); grunt> c = join b by code, a by code; grunt> d = foreach c generate $0 as code, $1 as description, $2 as total_emp, $3 as salary; grunt> store d into 'codeZ' using org.apache.hive.hcatalog.pig.HCatStorer(); 2016-12-29 20:52:26,894 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000: <line 5, column 0> Output Location Validation Failed for: 'codeZ More info to follow: org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez Details at logfile: /home/guest/pig_1483044536026.log grunt> quit 2016-12-29 20:56:25,336 [main] INFO org.apache.pig.Main - Pig script completed in 7 minutes, 29 seconds and 397 milliseconds (449397 ms) [guest@sandbox ~]$ less /home/guest/pig_1483044536026.log
and snippet from log
ERROR 6000: <line 5, column 0> Output Location Validation Failed for: 'codeZ More info to follow: org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias d at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1778) at org.apache.pig.PigServer.registerQuery(PigServer.java:707) at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1075) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206) at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66) at org.apache.pig.Main.run(Main.java:566) at org.apache.pig.Main.main(Main.java:178) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.util.RunJar.run(RunJar.java:233) at org.apache.hadoop.util.RunJar.main(RunJar.java:148) Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000: <line 5, column 0> Output Location Validation Failed for: 'codeZ More info to follow: org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:95) at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66) at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64) at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66) at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66) at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66) at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53) at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52) at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212) at org.apache.pig.PigServer$Graph.compile(PigServer.java:1851) at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1527) at org.apache.pig.PigServer.execute(PigServer.java:1440) at org.apache.pig.PigServer.access$500(PigServer.java:118) at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1773) ... 14 more Caused by: org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.handleDuplicatePublish(FileOutputFormatContainer.java:206) at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.checkOutputSpecs(FileOutputFormatContainer.java:121) at org.apache.hive.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:65) at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:69)
in other words, it is not advisable to overwrite via HCatStorer as Hive handles append/overwrite. The only workaround here is to use a temporary table as I suggested earlier.
Created 12-16-2016 10:19 PM
Any Help highly appreciated.
Created 12-20-2016 06:28 PM
Created 12-20-2016 08:10 PM
Thanks for submitting the JIRA ticket. I reviewed Artem's response and it seems that until the enhancement is implemented, his response is still the best option as of right now.