how to manage modified data in Apache Hive

New Contributor

We are working on Cloudera CDH and trying to run reports on data stored in Apache Hadoop. We send daily reports to a client, so we need to import data from the operational store into Hadoop every day.

Hadoop works in append-only mode, so we cannot run Hive update/delete queries. We can run INSERT OVERWRITE on the dimension tables and add delta rows to the fact tables. However, introducing thousands of delta rows every day does not seem like a good solution.
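
To make the delta approach above concrete, here is a minimal Python sketch of the reconciliation step: the base rows and the daily delta rows are merged, and only the most recent version of each key is kept. The column names `id` and `modified_at` and the function `reconcile` are hypothetical, chosen only for illustration. In Hive the same logic would typically be written as a query over the base and delta tables.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def reconcile(base: Iterable[Row], delta: Iterable[Row],
              key: str = "id", version: str = "modified_at") -> List[Row]:
    """Merge base and delta rows, keeping the latest version of each key.

    The column names are assumptions for illustration; they are not
    taken from the original post.
    """
    latest: Dict[object, Row] = {}
    # Base rows are visited first, so a delta row with an equal or newer
    # version value replaces the base row for the same key.
    for row in list(base) + list(delta):
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    # Sort by key so the output order is deterministic.
    return sorted(latest.values(), key=lambda r: r[key])


if __name__ == "__main__":
    base_rows = [
        {"id": 1, "amount": 10, "modified_at": "2015-01-01"},
        {"id": 2, "amount": 20, "modified_at": "2015-01-01"},
    ]
    delta_rows = [
        {"id": 2, "amount": 25, "modified_at": "2015-01-02"},  # updated row
        {"id": 3, "amount": 30, "modified_at": "2015-01-02"},  # new row
    ]
    for row in reconcile(base_rows, delta_rows):
        print(row)
```

The reconciled result can then replace the reporting table in one overwrite, so readers never see a mix of old and new versions of a row.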

Are there any better, standard ways to update modified data in Hadoop?



Re: how to manage modified data in Apache Hive

Cloudera Employee
Are your queries on the table mostly limited to certain partitions or key ranges? If so, have you considered using HBase for this? For added performance, you can run queries over snapshotted HBase data [1].

[1] - See the release notes, quoted in part:
Hive can now execute queries against HBase table snapshots. This feature is available for any table defined using the HBaseStorageHandler. It requires at least HBase 0.98.3.

To query against a snapshot instead of the online table, specify the snapshot name via [...]. The snapshot will be restored into a unique directory under /tmp. This location can be overridden by setting a path via hive.hbase.snapshot.restoredir.
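
As a rough illustration of the settings described above, the Python sketch below builds the `SET` statements a Hive session would run before querying a snapshot. The property that takes the snapshot name is not named in the text above, so it is passed in as a parameter and should be looked up in the release notes. Only `hive.hbase.snapshot.restoredir` comes from the quoted text. The function name `snapshot_settings` and the example values are hypothetical.

```python
from typing import List, Optional

# Named in the quoted release notes.
RESTORE_DIR_PROPERTY = "hive.hbase.snapshot.restoredir"


def snapshot_settings(name_property: str, snapshot_name: str,
                      restore_dir: Optional[str] = None) -> List[str]:
    """Return Hive SET statements for querying an HBase snapshot.

    name_property is the Hive property that takes the snapshot name.
    It is an input here because the quoted text does not name it.
    """
    statements = [f"SET {name_property}={snapshot_name};"]
    # Without an override, the snapshot is restored under /tmp.
    if restore_dir is not None:
        statements.append(f"SET {RESTORE_DIR_PROPERTY}={restore_dir};")
    return statements


if __name__ == "__main__":
    # "<snapshot-name-property>" is a placeholder, not a real property name.
    for statement in snapshot_settings("<snapshot-name-property>",
                                       "my_snapshot",
                                       restore_dir="/user/hive/restore"):
        print(statement)
```

These statements would be run in the same Hive session as the query against the table defined with the HBaseStorageHandler.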