
SQOOP CDC Jobs with weekly full refreshes

Explorer

So I'm in a bit of a quandary here. I just assumed ownership of the administration and implementation of our Hadoop environment. We are using Zaloni Bedrock to manage the ingests, but Sqoop still does the heavy lifting. One thing I found quickly is that our CDC jobs were not configured in a way that captures all of the data. They had been built on a date/time field that contains a large number of values that are either empty or set far in the future (as far out as 1/1/3000). Because of that, updated data was never picked up: nothing in the source is ever newer than that 1/1/3000 watermark, and excluding dates more than a few days out returned very little and was nearly impossible to script in Zaloni. We determined our best bet would be to use an ID field that is auto-generated by MS SQL in the source database; however, a CDC job keyed on that ID only grabs new records, not updates. The tables in question are far too large to run a full reload daily. With that in mind, what my data scientists want is for me to fully refresh each table weekly (on Sunday) and then run CDC loads throughout the week.

My question is: how do I go about accomplishing this with Sqoop, and how do I ensure that the CDC fields are correctly defined in the Hive overlay every time the job runs? Zaloni support hasn't been much help, and admittedly I am very new to Hadoop.
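For reference, this is roughly the kind of Sunday full-refresh command I have in mind; the connection string, credentials, table, and column names below are all made up, and the weekday incremental piece would be a separate job keyed on the ID column:

# full weekly reload: re-imports the whole source table and overwrites the Hive copy
sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=SourceDB" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.pwd \
  --table BigTable \
  --hive-import \
  --hive-table big_table \
  --hive-overwrite \
  --split-by record_id \
  --num-mappers 8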

1 ACCEPTED SOLUTION

Master Guru

Basically syncing huge fact tables that are also updated is a pain in the neck.

1) Do it as you suggested: fully reload your tables every week and then run Sqoop jobs during the week based on the incremented ID. The problem is that in this case you do not get updates during the week. If that is acceptable, you can just sqoop the new rows into your table daily. Do you really get updates? In warehouse environments you normally have a FACT table that is not updated and dimension tables that you can indeed reload completely. If your fact table is not updated, you can just continue using Sqoop as before, using the ID field instead of the date (you just need to fix the old data).
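A rough sketch of that, assuming the auto-generated ID column is called record_id and all other names are placeholders: a saved Sqoop job keeps the last imported ID in its metastore, so each weekday run appends only the new rows, and after the Sunday full reload you recreate the job (or reset its stored last-value) so it lines up with the reloaded data.

# one-time setup: Sqoop remembers --last-value between runs in its metastore
sqoop job --create big_table_incr -- import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=SourceDB" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.pwd \
  --table BigTable \
  --target-dir /data/raw/big_table \
  --incremental append \
  --check-column record_id \
  --last-value 0

# weekday runs: only rows with record_id above the stored last-value are imported
sqoop job --exec big_table_incr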

2) If you want to get updates with Sqoop during the week, you will need something like a last-updated-at timestamp. If you have that, you can look at the approach pbalasundaram wrote about below. But I personally don't like this too much, since the reconcile view does a lot of processing and will make queries slower. If you can recreate the table every night based on the query from the article, you should do that; however, you need a short outage for it. (The good thing is that Hadoop is pretty good at writing multi-terabyte data once it is in the cluster, so you might be able to get it done at night and do a quick rename operation to swap the table in.)
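For completeness, if a usable last-updated column did exist, Sqoop's own lastmodified mode can pull the changed rows and merge them onto the previous extract by key. A sketch with the same placeholder names (last_updated is an assumed column in the source schema):

sqoop import \
  --connect "jdbc:sqlserver://sqlhost:1433;databaseName=SourceDB" \
  --username etl_user \
  --password-file /user/etl/.sqlserver.pwd \
  --table BigTable \
  --target-dir /data/raw/big_table \
  --incremental lastmodified \
  --check-column last_updated \
  --last-value "2016-01-01 00:00:00" \
  --merge-key record_id
# --merge-key makes Sqoop run a follow-up merge job that keeps only the newest row per record_id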

3) Tools like GoldenGate/IBM CDC are definitely an option as well. They monitor the transaction log of the source database and can, for example, insert into Kafka/HBase. Even a modest rate of change adds up to big volumes when replication runs continuously. The problem here is not the speed of these tools but Hive updates, which are still very new and mostly usable for streaming inserts. So unless you want to switch to something like Apache Phoenix as your data store (which is fine for small aggregation queries over millions of rows, but definitely not for fully aggregating a terabyte-scale table), you would need to use CDC into Kafka and then write your own Storm/Spark Streaming app that takes the values from Kafka and pushes them into Hive.

However, as mentioned, Hive ACID is currently very young and mostly good for streaming inserts. Inserting the new data might work well, and updating some old values may work as well, but a huge number of updates across a large time range would be asking for trouble. ACID tables also still have some limitations, but hopefully they will be much more stable in the near future, at which point this should be a valid option.

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/
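On the Hive ACID point: a table has to be declared transactional (bucketed, stored as ORC, and with the transaction manager enabled on the cluster) before streaming inserts or updates can target it at all. A minimal DDL sketch with made-up table and column names:

# hypothetical transactional table; ACID tables must be bucketed and stored as ORC
hive -e "
CREATE TABLE big_table_acid (
  record_id    BIGINT,
  payload      STRING,
  last_updated TIMESTAMP
)
CLUSTERED BY (record_id) INTO 16 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
"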


2 REPLIES

Expert Contributor

Hi,

If the tables are large (say, multi-terabyte), then managing the ingest through Sqoop and partitioned Hive tables is the best option from a performance standpoint.

Though there are CDC tools like Oracle GoldenGate, which write to HBase and handle frequent updates in near real time, the number of regions per region server in HBase will grow very rapidly when the tables are large.

The maximum throughput achievable by the near-real-time CDC replication processes is only around 10,000 transactions per second, and if CDC fails for a few days, all of the accumulated record changes have to be applied before the system catches up.

Please check the four-stage incremental update strategy for Hive for large-table updates, documented in the link below. The process merges the existing data in the tables with the new/changed data from the sources (a condensed sketch of the reconcile step follows the link).

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
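Roughly, that strategy keeps a base table plus an incremental table and periodically collapses them into a fresh copy. A condensed sketch of the reconcile step, with hypothetical table and column names (record_id as the key, last_updated to pick the newest version; base_table, incremental_table, and reporting_table are assumed to exist with matching schemas):

# build a view that keeps only the newest version of each record_id,
# whether it comes from the existing base table or the latest Sqoop increment
hive -e "
CREATE VIEW IF NOT EXISTS reconcile_view AS
SELECT t.*
FROM (SELECT * FROM base_table
      UNION ALL
      SELECT * FROM incremental_table) t
JOIN (SELECT record_id, MAX(last_updated) AS max_updated
      FROM (SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table) s
      GROUP BY record_id) m
ON t.record_id = m.record_id AND t.last_updated = m.max_updated;
"

# materialize the reconciled result into a fresh table, which then replaces the base
hive -e "INSERT OVERWRITE TABLE reporting_table SELECT * FROM reconcile_view;"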
