How to manage modified data in Apache Hive

New Contributor

We are working on Cloudera CDH and trying to do reporting on data stored in Apache Hadoop. We send daily reports to a client, so we need to import data from the operational store into Hadoop every day.

Hadoop is append-only, so we cannot run Hive UPDATE/DELETE queries. What we can do is an INSERT OVERWRITE on the dimension tables and append delta rows to the fact tables, but loading thousands of delta rows every day does not seem like a great solution.
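For reference, a minimal HiveQL sketch of the approach we use today; the table, column, and partition names (dim_customer, staging_dim_customer, fact_sales, staging_fact_sales, load_date) are made up, and the staging tables are assumed to hold the daily extract from the operational store:

-- Dimension table: full refresh from the daily extract (hypothetical names).
INSERT OVERWRITE TABLE dim_customer
SELECT * FROM staging_dim_customer;

-- Fact table: append only the day's delta rows into a date partition
-- (assumes fact_sales is partitioned by load_date and the columns line up).
INSERT INTO TABLE fact_sales PARTITION (load_date = '2016-05-01')
SELECT * FROM staging_fact_sales;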

Is there a better, more standard way to handle modified data in Hadoop?

Thanks

1 REPLY

Re: How to manage modified data in Apache Hive

Master Guru
Are your queries on the table mostly limited to certain partitions/key ranges? If so, have you considered using HBase for this? For added performance, you can run queries over snapshotted HBase data [1].

[1] See the release notes on https://issues.apache.org/jira/browse/HIVE-6584, quoted below:
"""
Hive can now execute queries against HBase table snapshots. This feature is available for any table defined using the HBaseStorageHandler. It requires at least HBase 0.98.3.

To query against a snapshot instead of the online table, specify the snapshot name via hive.hbase.snapshot.name. The snapshot will be restored into a unique directory under /tmp. This location can be overridden by setting a path via hive.hbase.snapshot.restoredir.
"""