Member since: 04-11-2016 | Posts: 7 | Kudos Received: 1 | Solutions: 0
10-30-2018 12:35 AM
CLOUD.HORTONWORKS.COM was just an example; you can change this to whatever you like. If you are using AD, you would probably want to set it to your AD domain.
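If (as the uppercase name suggests) this thread is about the Kerberos realm, a minimal sketch of the relevant part of /etc/krb5.conf could look like the following; the realm and host names are placeholders for your own AD domain, not values from this thread:

    [libdefaults]
      default_realm = AD.EXAMPLE.COM

    [realms]
      AD.EXAMPLE.COM = {
        kdc = dc1.ad.example.com
        admin_server = dc1.ad.example.com
      }

    [domain_realm]
      .ad.example.com = AD.EXAMPLE.COM
      ad.example.com = AD.EXAMPLE.COM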
04-11-2016 07:46 PM
1 Kudo
Basically, syncing huge fact tables that are also updated is a pain in the neck. You have a few options:

1) Do it as you suggested: fully reload your tables every week and run sqoop jobs during the week based on the incremented ID. The problem is that in this case you do not get updates during the week. If that is acceptable, you can just sqoop the new rows into your table daily. Do you really get updates? In warehouse environments you normally have a FACT table that is not updated and dimension tables that you can indeed reload completely. If you do not actually get updates, you can just continue using sqoop as before, using the increment field instead of the date. (You just need to fix the old data.) A sqoop sketch follows below this list.

2) If you want to get updates with sqoop during the week, you will need something like a last-updated-at timestamp. If you have that, you can look at the approach pbalasundaram wrote about. Personally I don't like this too much, since that view does a lot of processing and will make queries slower. If you can recreate the table every night based on the query from the article, you should do it; however, you need a short outage for this. (The good thing is that Hadoop is pretty good at writing multi-terabyte data once it is in the cluster, so you might be able to get it done at night and do a quick rename operation to swap the new table in. A Hive sketch of that swap follows below this list.)

3) Tools like GoldenGate or IBM CDC are definitely an option as well. They monitor the transaction log of the source database and can, for example, insert into Kafka or HBase. Even modest throughput adds up to big volumes when replication runs continuously, so the problem here is not the speed of these tools but Hive updates, which are still very new and mostly usable for streaming inserts. So if you do not want to switch to something like Apache Phoenix as your data store (which is fine for small aggregation queries over millions of rows but definitely not for fully aggregating a terabyte-scale table), you would need to use CDC into Kafka and then write your own Storm/Spark Streaming etc. app that takes the values from Kafka and pushes them into Hive. However, as mentioned, Hive ACID is currently very young and mostly good for streaming inserts: inserting the new data might work well, and updating some old values may work as well, but a huge amount of updates across a large time range would be asking for trouble. ACID tables also still have some limitations (see the table sketch below this list), but hopefully they will be much more stable in the near future; then this should be a valid option. http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/
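For option 1, a minimal sketch of a saved sqoop job doing incremental append on an ID column. The connection string, password file, and the FACT_SALES / SALES_ID names are placeholders for illustration, not details from the original question:

    # Create a saved job; sqoop's metastore remembers the last imported SALES_ID
    sqoop job --create fact_sales_incremental -- import \
      --connect jdbc:mysql://dbhost:3306/warehouse \
      --username etl \
      --password-file /user/etl/.db_password \
      --table FACT_SALES \
      --target-dir /apps/hive/warehouse/fact_sales \
      --incremental append \
      --check-column SALES_ID \
      --last-value 0

    # Run it daily (e.g. from cron or Oozie); only rows with a higher SALES_ID are pulled
    sqoop job --exec fact_sales_incremental

If --target-dir points at the directory behind an external Hive table, the appended rows are queryable without a separate load step.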
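For option 2, a sketch of the nightly rebuild-and-rename in HiveQL. Table and column names are placeholders, and the inner query just stands in for whatever "latest version of each row" logic the referenced article describes:

    -- Slow part: build the refreshed copy next to the live table (no outage yet)
    CREATE TABLE fact_sales_new STORED AS ORC AS
    SELECT sales_id, amount, last_updated_at
    FROM (
      SELECT sales_id, amount, last_updated_at,
             ROW_NUMBER() OVER (PARTITION BY sales_id
                                ORDER BY last_updated_at DESC) AS rn
      FROM fact_sales_staging
    ) latest
    WHERE rn = 1;

    -- Short outage: swap the tables with cheap metadata-only renames
    ALTER TABLE fact_sales RENAME TO fact_sales_old;
    ALTER TABLE fact_sales_new RENAME TO fact_sales;
    DROP TABLE fact_sales_old;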
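And for option 3, a sketch of what a Hive transactional table looks like in the Hive 1.x era, to make the ACID limitations above concrete: it has to be ORC, bucketed and flagged transactional, and the cluster needs the DbTxnManager and compactor enabled. Names are again placeholders:

    -- Cluster side (Hive 1.x era): hive.txn.manager must be
    -- org.apache.hadoop.hive.ql.lockmgr.DbTxnManager and the compactor enabled
    CREATE TABLE fact_sales_acid (
      sales_id        BIGINT,
      amount          DECIMAL(18,2),
      last_updated_at TIMESTAMP
    )
    CLUSTERED BY (sales_id) INTO 16 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Streaming inserts and the occasional update are fine; mass updates across
    -- a large time range keep the compactor busy, which is the risk noted above
    UPDATE fact_sales_acid SET amount = 0 WHERE sales_id = 12345;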