How to incrementally update a Hive external table?

Rising Star

(1) I have created a Hive external table, and data is coming from Netezza into HDFS.

(2) Every day we have to do an incremental append to this table. Also, what should we do if any data changes in the base table?

I can append as much as I want, but if changes happen in the base table, for example a few rows or a few columns change, how can an incremental load work for me?

Every day I have to take the same table from Netezza into HDFS. Most probably we can append by data_date.
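For example, the external table could be partitioned by data_date, something like this (column names and paths are just examples):

-- sketch of the external table, partitioned by data_date
CREATE EXTERNAL TABLE my_table (
    id   BIGINT,
    col1 STRING,
    col2 STRING
)
PARTITIONED BY (data_date STRING)
LOCATION '/data/my_table';

-- each day's load from Netezza is registered as a new partition
ALTER TABLE my_table ADD PARTITION (data_date = '2016-05-01')
LOCATION '/data/my_table/data_date=2016-05-01';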

1 ACCEPTED SOLUTION

@mike pal

The link below should cover your requirements. It describes a strategy for incremental updates/ingest and also covers the scenario where the base data may change:

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
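In short, the idea is to keep the base data and the incoming delta data in separate tables and reconcile them with a view that keeps only the newest version of each row. A rough sketch, assuming both tables share a key column id and a change timestamp modified_date (names are illustrative):

-- base_table holds the data loaded so far, incremental_table holds the new delta
CREATE VIEW reconcile_view AS
SELECT t2.*
FROM (
    SELECT * FROM base_table
    UNION ALL
    SELECT * FROM incremental_table
) t2
JOIN (
    SELECT id, MAX(modified_date) AS max_modified
    FROM (
        SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table
    ) t1
    GROUP BY id
) s
ON t2.id = s.id AND t2.modified_date = s.max_modified;

The view can then be materialized into a compacted table with INSERT OVERWRITE and the delta table purged, as the article describes.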


3 REPLIES


Master Guru

1) New data is added

You can import the data using Sqoop or Netezza's load/unload functions. Sqoop provides delta loading by a timestamp or ID column (any column that increments continuously).
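For example, a Sqoop delta import could look roughly like this (host, database, table, and column names are placeholders; --incremental append with an increasing ID column works the same way):

# import only rows whose LAST_UPDATED is newer than --last-value
sqoop import \
  --connect jdbc:netezza://netezza-host:5480/MYDB \
  --username myuser -P \
  --table MY_TABLE \
  --target-dir /data/my_table/incremental \
  --incremental lastmodified \
  --check-column LAST_UPDATED \
  --last-value "2016-05-01 00:00:00" \
  -m 4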

2) Old data is changed

This is a bigger problem. Hive has transactions, but they are still very new.

2.1 Changed small dimension tables

A good approach is to simply reload them, as long as they stay under a couple of GB and you have a nightly window to do it.
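For example, the nightly reload can be as simple as re-importing the dimension into a staging table and overwriting the copy your queries read from (table names are illustrative):

-- dim_customer_stage is an external table over the freshly imported files
INSERT OVERWRITE TABLE dim_customer
SELECT * FROM dim_customer_stage;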

2.2 Changes to big fact tables

Bigger problem.

- You can use Hive ACID transactions, but as said, they are still new.

- Alternatively, you can use a manual approach, such as adding a version column to your table and writing your queries so that they only read the newest version of each row (a rough sketch follows below this list).

- The last possibility is to load the delta changes and then merge them into the existing table in Hadoop. While loading terabytes of data into a Hadoop cluster can be a bottleneck, re-creating a table like that by joining old with new data is very fast, since it runs in parallel across the cluster.
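A minimal sketch of the version-column approach (table and column names are assumptions): every load appends rows with a load_version or load timestamp, and queries go through a view that keeps only the newest version per key.

-- fact_sales_versioned: every load appends rows with a load_version column
CREATE VIEW fact_sales_current AS
SELECT *
FROM (
    SELECT f.*,
           ROW_NUMBER() OVER (PARTITION BY sale_id
                              ORDER BY load_version DESC) AS rn
    FROM fact_sales_versioned f
) v
WHERE v.rn = 1;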

Rising Star

Thanks a lot