I have a general question about how Sqoop works.
When you run an import, does Sqoop take a snapshot of the table at that moment and import that result, or does it keep an active window over the data?
If the table is changing rapidly, would that cause issues with the import?
My import query looks like this: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE $CONDITIONS
Upon doing some QA on the data imported to HDFS, I am seeing that the overall row count is close to the source table's, but the data is inconsistent: there are plenty of missing rows and duplicated rows. This is a large table, sitting at ~560 million rows.
I am calculating the row number in order to get a perfect split-by column, but I think that may be where the problem is coming from. It takes a few minutes to compute the row numbers, and my hunch is that while that is happening the source table is being updated, which throws off the row numbers and the subsequent import.
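For context, the setup described above would look roughly like the sketch below (the connection string, credentials, and paths are placeholders, not from the original post). It also shows why the approach is fragile: each mapper re-runs the inner query independently.

```shell
# Hypothetical invocation illustrating the ROW_NUMBER() split --
# connection details and paths are placeholders.
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
  --username etl_user --password-file /user/etl/.pw \
  --query 'SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY COL1) R FROM TABLE) A WHERE $CONDITIONS' \
  --split-by R \
  --num-mappers 8 \
  --target-dir /data/table_import

# Each mapper issues its own copy of the query with $CONDITIONS replaced by
# a range predicate such as "R >= 1 AND R <= 70000000". Because ROW_NUMBER()
# is recomputed per execution against live data, a given row's R value can
# differ between mappers' snapshots, so rows land in no mapper's range
# (missing rows) or in two mappers' ranges (duplicates).
```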
Does this table have a timestamp column, or a column whose highest value can be tracked? If it does, then a saved sqoop job with incremental import would be the most efficient and consistent way to move data from source to destination without losing data.
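As a minimal sketch of what that saved job could look like, assuming a monotonically increasing column (here called ID; the column, connection, and paths are illustrative):

```shell
# Create a saved job that does incremental append on the ID column.
sqoop job --create table_incremental -- import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
  --username etl_user --password-file /user/etl/.pw \
  --table TABLE \
  --incremental append \
  --check-column ID \
  --last-value 0 \
  --target-dir /data/table_import

# Each run imports only rows with ID greater than the stored last-value;
# sqoop records the new high-water mark in its metastore after a successful run.
sqoop job --exec table_incremental
```

With a timestamp column instead, you would use --incremental lastmodified and point --check-column at the timestamp.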
Please read this tutorial on extracting data from your source in a consistent way
Unfortunately there is no timestamp column or any other good candidate for a split-by, which is why I was trying to generate row numbers on the fly.
Can you write triggers that copy the data into a new table? Maybe you could introduce a timestamp there and mark each transaction as an update, insert, or delete. I know it may not be possible to influence the source database..
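If such a trigger-fed change table existed, the import side could be a lastmodified incremental job against it. A rough sketch, assuming an audit table TABLE_CHANGES with a MODIFIED_TS timestamp, an operation flag, and a primary key PK_COL (all names hypothetical):

```shell
# Incremental import of a hypothetical trigger-populated change table.
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=mydb' \
  --username etl_user --password-file /user/etl/.pw \
  --table TABLE_CHANGES \
  --incremental lastmodified \
  --check-column MODIFIED_TS \
  --last-value '2016-01-01 00:00:00' \
  --merge-key PK_COL \
  --target-dir /data/table_changes

# --merge-key tells sqoop to merge re-imported rows into the existing
# dataset by primary key, so updated rows replace their earlier versions.
```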