I read that Cloudera does not support ACID operations in Hive, and that ACID is in any case only supported with the ORC format.
I import data from an API using an `updated_at` field. I do not want my table to have multiple entries with the same id, but data for a particular id can be updated very frequently, which would result in a lot of duplication. I can access Hive from R using RHive, but UPDATE and DELETE operations do not seem to work.
I tried this in the query editor:
```sql
set hive.auto.convert.join.noconditionaltask.size = 10000000;
set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;

DELETE FROM employee WHERE eid IN (1, 4, 5);
```
It fails with this:
```
Error while compiling statement: FAILED: SemanticException [Error 10294]: Attempt to do update or delete using transaction manager that does not support these operations
```
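For context, if I understand the documentation correctly, this error usually means the target table itself is not transactional: Hive's ACID DELETE requires a bucketed, ORC-backed table created with `transactional=true`. A hypothetical definition (the table and column names are just placeholders matching my example) would look like:

```sql
-- Sketch of a table that Hive's ACID DELETE would accept
-- (the transaction-manager settings above are still required too).
CREATE TABLE employee (
  eid INT,
  name STRING
)
CLUSTERED BY (eid) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```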
The plan was to delete all the rows already there in hive which were imported as the updated rows from the API.
What alternatives do I have? I cannot afford that level of duplication for every id, so I need a way to delete or update data in Hive.
`INSERT OVERWRITE` on partitions can't help, since I import only the updated data and therefore can't rebuild an entire partition with it.
I can run the exact same queries through R that work in Hive directly.
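One workaround I am considering is a full-table rewrite that merges the freshly imported rows into the existing table and keeps only the newest row per id. This is only a sketch: it assumes a staging table `employee_staging` holding the rows imported from the API, and the column names are illustrative.

```sql
-- Merge the staged API rows into the main table, keeping only the
-- most recent version of each id (full-table rewrite, no ACID needed).
INSERT OVERWRITE TABLE employee
SELECT eid, name, updated_at
FROM (
  SELECT eid, name, updated_at,
         row_number() OVER (PARTITION BY eid ORDER BY updated_at DESC) AS rn
  FROM (
    SELECT eid, name, updated_at FROM employee
    UNION ALL
    SELECT eid, name, updated_at FROM employee_staging
  ) merged
) ranked
WHERE rn = 1;
```

The drawback is that every import rewrites the whole table, but it avoids duplicate ids without needing UPDATE or DELETE support.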
Hive is not well suited to INSERT, UPDATE, and DELETE. Current Hive does support INSERT, but it's not a good idea to use it in a production environment, since it's very slow.
There are two solutions for you:
1. Append versioned rows instead of updating them (I don't suggest this solution).

Suppose the table is `(id, name)` and holds the record `1, 'yourname'`. If you want to change the name to `'hisname'`, add a version column and insert a new row:

```sql
ALTER TABLE xxx ADD COLUMNS (record_index INT);
INSERT INTO TABLE xxx VALUES (1, 'hisname', 2);
```

Your records will then be `1, 'yourname', NULL` (pre-existing rows read NULL for the added column) and `1, 'hisname', 2`, and you select only the latest version:

```sql
SELECT * FROM xxx
WHERE record_index = (SELECT MAX(record_index) FROM xxx WHERE id = 1);
```

Hope you understand the above solution.
2. Don't put your data into Hive directly; use HBase, where you can DELETE, UPDATE, and INSERT. This is an easy way if you have a requirement to change Hive records.
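If you go the HBase route, one common pattern is to expose the HBase table to Hive through the HBase storage handler, so Hive queries keep working while updates happen in HBase. A sketch, where the table name and the `cf` column family are made-up examples:

```sql
-- Hive external table backed by an HBase table; updates and deletes
-- are performed in HBase, and Hive reads always see the current values.
CREATE EXTERNAL TABLE employee_hbase (
  eid INT,
  name STRING,
  updated_at STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,cf:name,cf:updated_at'
)
TBLPROPERTIES ('hbase.table.name' = 'employee');
```

Because the row key is the id, re-importing a row for an existing id simply overwrites it in HBase, which sidesteps the duplication problem entirely.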
Add the properties below to the Hive Client Advanced Configuration Snippet (Safety Valve) for hive-site.xml, under Gateway Default Group.
I also tried adding them in other places, but that did not work for me:
1. Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, Hive (Service-Wide) --- didn't work.
2. Hive Metastore Server Advanced Configuration Snippet (Safety Valve) for hive-site.xml, Hive Metastore Server Default Group --- didn't work.
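For reference, in the safety-valve field the same settings from the question are entered as hive-site.xml property entries, roughly like this (only the ACID-related ones shown):

```xml
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```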
I found that UPDATE and DELETE fire MapReduce jobs, so they are quite slow: a single-row update or delete took 21 seconds.