I'm in a bit of a quandary. I recently took over administration and implementation of our Hadoop environment. We use Zaloni Bedrock to manage the ingests, but sqoop still does the heavy lifting.

I quickly found that our CDC jobs were not configured in a way that would capture all of the data. They had been built on a date/time field in which a large number of values were either empty or set in the future (as far out as 1/1/3000). As a result, none of the updated data was ever picked up, because nothing was newer than 1/1/3000. Even excluding dates more than 3 days out returned very little, and it made scripting the job in Zaloni almost impossible.

We determined that our best bet would be to use an ID field that is auto-generated by MS-SQL in the source database. However, CDC jobs keyed on that field only pick up new records, not updates, and the tables in question are far too large to fully reload every day. With that in mind, my data scientists want me to configure each table to fully refresh weekly (on Sunday) and then run CDC jobs throughout the week.

My question: how do I accomplish this with sqoop and ensure that the CDC fields are correctly defined in the Hive overlay every time the job runs? Zaloni support hasn't been of much use, and admittedly I am very new to Hadoop.
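To make the schedule concrete, here is a rough sketch of what I have in mind, not a working setup. It only builds the sqoop command line for a given run date: a full reload on Sunday and an ID-based incremental append on every other day. The function name, the JDBC URL, the table and column names, and the target directory are all placeholders I made up, and I am assuming the Hive overlay is an external table sitting on top of the target directory. Credentials and the Zaloni side are left out entirely.

```python
from datetime import date


def build_sqoop_args(run_date, jdbc_url, table, target_dir, id_column, last_id):
    """Return the sqoop argument list for one run (nothing is executed here).

    All parameter values are hypothetical placeholders; credentials such as
    --username / --password-file are deliberately omitted from this sketch.
    """
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
    ]
    if run_date.weekday() == 6:  # Monday is 0, so 6 is Sunday
        # Weekly full refresh: wipe the directory and reload the whole table.
        args.append("--delete-target-dir")
    else:
        # Weekday CDC: append only rows whose ID is greater than the last one seen.
        # This picks up new records only; updated rows wait for the Sunday reload.
        args += [
            "--incremental", "append",
            "--check-column", id_column,
            "--last-value", str(last_id),
        ]
    return args


if __name__ == "__main__":
    # Example values only; print the command rather than running it.
    cmd = build_sqoop_args(
        date.today(),
        "jdbc:sqlserver://dbhost;databaseName=sales",
        "orders",
        "/data/raw/orders",
        "order_id",
        0,
    )
    print(" ".join(cmd))
```

The last ID would have to be stored somewhere between runs (a saved sqoop job can track it, as far as I understand), and I have not worked out how that fits with Zaloni.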