I have to import daily data from SQL Server using Sqoop (1.4.6). Naturally, incremental imports would seem to be the way to go, but here's the thing - the source tables I have to import have separate 'created' and 'updated' columns. That is, newly-created records only have a timestamp value under the 'created' column, and records will only have a value for the 'updated' column if they are edited anytime after first creation. Looking through the documentation, it doesn't look like it's possible to have 2 check columns when doing incremental imports, so the only way I've managed to get this done is with 2 separate imports:
This works, but I wonder if there's a better way to do this. Any thoughts?
Update: I tried the suggestion by @Sourygna Luangsay to use free-form query imports, however the documentation quite clearly states that the query can't contain OR conditions in the WHERE clause. Besides, since these will be incremental imports, the output directory would exist and I'd still need a two-step workflow: the first step for importing, then the second for merging.
To do that with a single query, you could do "manually" the incremental with a custom query in your Sqoop command. Something like that:
sqoop ... -query "select * from sourceTable where created >= yesterday or updated >= yesterday"
I'd thought of that actually, though I try to avoid hard-coding SQL whenever I use Sqoop. To me, it feels a bit hacky. It'd be interesting to hear people's thoughts on that, however.
Just went through the documentation though, and there's a note saying free-form queries can't have OR in the WHERE clause?
Give it a try.
I've just done a quick test and the OR condition worked for me (although I have to admit that I currently use the Sqoop Teradata driver, which is quite special).
I did try it before I updated my post, and the OR condition doesn't work for me. Only the condition on the left side of the OR seems to be evaluated. For example, in the condition "where created >= yesterday or updated >= yesterday" - only "created >= yesterday" is evaluated. I tried reversing the order of the conditions giving me "where updated >= yesterday or created >= yesterday" - and in this case only "updated >= yesterday" is evaluated.