I need to load Oracle data via Sqoop.
The data must be available in Hive / Impala and Spark batch, the best way is parquet/snappy format.
In Hive, data must be partitioned by key date (like 2018-01-01).
Once a day, I'm going to delete some partitions (a date window) and repeat the process, run Sqoop again, and add new data with the days that have been removed and more new records.
What would be the best way to do this?
I tried several ways without success.
Thx a lot guys!
You didn't say what are the challenges that you have.
You can use some inventory (book keeping) table in Oracle db to know all the time what date/day/hour you already imported or need to reload in Hadoop. You keep feeding this table with info about what you need to load.
On Hadoop side you can use perl DBI to connect to Oracle db and read the book keeping table so you know what is next for loading and mark flag/column imported = 'Y' in bookeeping if import is successfull (mark other flags based on the events if failed, errors, etc in same book keeping table, you can even capture the errors and log it into same book keeping table for current import so at the end you can simply query book keeping to look for errors).
You can then dinamically build, after reading book keeping table, in the same Perl script, the Sqoop query to load the data into Hadoop (at this stage you know what period you want, for what table, etc - all is in book keeping so you only need to loop that and Sqoop it).
Something like this to build the Sqoop command:
my $astrDir = shift (@_);
my $aStrAdaptor = shift (@_);
my $astrTable = shift (@_);
my $astrQuery = shift (@_);
my $astrConn = "";
my $delimiter = $DefaultDelimiter;
UpdateLogIfTrace("Building Sqoop command string", 1);
$delimiter = $AdapterDelimiter if ( $AdapterDelimiter );
$astrConn = "sqoop import $SQOOP_OPTS --connect \"" . $StrConnectSqoop . "\" --username $StrUserSqoop --password $StrPasswordSqoop " .
"--append --query \"" . $astrQuery . ' " --target-dir ' . $astrDir . " --fields-terminated-by '$delimiter' -m 1 --null-string '\\\\N' --null-non-string '\\\\N' --hive-delims-replacement ''";
$astrConn .= " --driver $dbDriver" if ( $dbDriver );
# loop book keeping table
$sqoopCmd = GetSqoopCommandString($tmpDir, $owner, $table_name, $sqlQuery);
Execute ($sqoopCmd, "Exception", \$sqoopDuration, $StrLogFile, \@retCode);
# end loop book keeping table