Support Questions

Find answers, ask questions, and share your expertise

HCatStorer is not overwriting a Hive table. or How to overwrite a hive table using HCatStorer ?

avatar
Expert Contributor
 
1 ACCEPTED SOLUTION

avatar
Master Mentor

it's easier to drop a table and recreate, delete from a table and truncate are not supported by API

set hcat.bin /usr/bin/hcat;
sql drop table if exists codes;
sql create table codes(code string, description string, total_emp int, salary int);
a = load 'sample_07' using org.apache.hive.hcatalog.pig.HCatLoader();
b = load 'sample_08' using org.apache.hive.hcatalog.pig.HCatLoader();
c = join b by code, a by code;d = foreach c generate $0 as code, $1 as description, $2 as total_emp, $3 as salary;
store d into 'codes' using org.apache.hive.hcatalog.pig.HCatStorer();

View solution in original post

19 REPLIES 19

avatar
Expert Contributor

I think that's the only solution i think is the best at the moment using temporary table. Does Overwrite exists in Pig? Like Hive Overwrite table data.

avatar
Master Mentor

According to the wiki it exists but in our mutual tests, we were not able to fit your use case. Perhaps you'd like to open an enhancement Jira. I do like integration of Pig and HCatalog though I would argue instead of spending hours to figure it out, consider looking into Spark as it's under a heavy development and benefits from a wider community. If my workaround fits your needs, please consider accepting my answer as best.

avatar
Expert Contributor

Yes, This is testing only and i am doing it out of my curiosity. Thanks you very much for your valuable responses.

Could you let me know how to open an enhancement Jira request ?

avatar
Master Mentor

avatar
Expert Contributor

Thanks, i have raised an enhancement Jira.

avatar
Master Mentor

Can you paste a link to your Jira and accept my answer as workaround. Thanks

avatar
Master Mentor

@Praveen PentaReddy

to close the loop on this, turns out append is the default behavior and if you read the comments in https://issues.apache.org/jira/browse/HIVE-6897 you can see that it is not advisable to force an overwrite of a table via HCatalog. So to turn the feature off completely and not promote "bad" behavior I did the following

grunt> sql alter table codeZ set TBLPROPERTIES ('immutable' = 'true');
2016-12-29 20:49:51,924 [main] INFO  org.apache.pig.tools.grunt.GruntParser - Going to run hcat command: alter table codeZ set TBLPROPERTIES ('immutable' = 'true');
OK
Time taken: 2.041 seconds
grunt> a = load 'sample_07' using org.apache.hive.hcatalog.pig.HCatLoader();
2016-12-29 20:51:00,125 [main] INFO  hive.metastore - Trying to connect to metastore with URI thrift://sandbox.hortonworks.com:9083
2016-12-29 20:51:00,163 [main] INFO  hive.metastore - Connected to metastore.
grunt> b = load 'sample_08' using org.apache.hive.hcatalog.pig.HCatLoader();
grunt> c = join b by code, a by code;
grunt> d = foreach c generate $0 as code, $1 as description, $2 as total_emp, $3 as salary;
grunt> store d into 'codeZ' using org.apache.hive.hcatalog.pig.HCatStorer();
2016-12-29 20:52:26,894 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6000:
<line 5, column 0> Output Location Validation Failed for: 'codeZ More info to follow:
org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez
Details at logfile: /home/guest/pig_1483044536026.log
grunt> quit
2016-12-29 20:56:25,336 [main] INFO  org.apache.pig.Main - Pig script completed in 7 minutes, 29 seconds and 397 milliseconds (449397 ms)
[guest@sandbox ~]$ less /home/guest/pig_1483044536026.log

and snippet from log

ERROR 6000:
<line 5, column 0> Output Location Validation Failed for: 'codeZ More info to follow:
org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez


org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias d
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1778)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:707)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1075)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:505)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
        at org.apache.pig.Main.run(Main.java:566)
        at org.apache.pig.Main.main(Main.java:178)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 6000:
<line 5, column 0> Output Location Validation Failed for: 'codeZ More info to follow:
org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez
        at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:95)
        at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
        at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
        at org.apache.pig.newplan.logical.relational.LogicalPlan.validate(LogicalPlan.java:212)
        at org.apache.pig.PigServer$Graph.compile(PigServer.java:1851)
        at org.apache.pig.PigServer$Graph.access$300(PigServer.java:1527)
        at org.apache.pig.PigServer.execute(PigServer.java:1440)
        at org.apache.pig.PigServer.access$500(PigServer.java:118)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1773)
        ... 14 more
Caused by: org.apache.hive.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.codez
        at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.handleDuplicatePublish(FileOutputFormatContainer.java:206)
        at org.apache.hive.hcatalog.mapreduce.FileOutputFormatContainer.checkOutputSpecs(FileOutputFormatContainer.java:121)
        at org.apache.hive.hcatalog.mapreduce.HCatBaseOutputFormat.checkOutputSpecs(HCatBaseOutputFormat.java:65)
        at org.apache.pig.newplan.logical.visitor.InputOutputFileValidatorVisitor.visit(InputOutputFileValidatorVisitor.java:69)

in other words, it is not advisable to overwrite via HCatStorer as Hive handles append/overwrite. The only workaround here is to use a temporary table as I suggested earlier.

avatar
Expert Contributor

Any Help highly appreciated.

avatar
Expert Contributor

avatar
Super Guru

@Praveen PentaReddy

Thanks for submitting the JIRA ticket. I reviewed Artem's response and it seems that until the enhancement is implemented, his response is still the best option as of right now.