
Sqoop - dynamically import from SQL Server

New Contributor

I have a log table in SQL Server that tells me which tables need to be loaded into HDFS. Can I use Sqoop to loop through the log table and load only the flagged tables?

1 ACCEPTED SOLUTION

Master Mentor

@Grace Xu you'd have something like the example below; remove the call to print and run the command for real once your import statement is ready. Test this first, obviously.

collection = ['tbl1', 'tbl2', 'tbl3']

for i in collection:
    # print the generated command; swap this for an actual invocation when ready
    print('sqoop import --connect <connect-str> --table ' + i + ' --target-dir /dest/' + i)

Keep in mind this is Python, so take care of your indents. At this point your most important work is to figure out how to populate your collection. Cheers!
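
If it helps, here is a minimal sketch of what "removing the print" could look like, using Python's subprocess module to actually run each generated command (the connection string is still a placeholder):

import subprocess

collection = ['tbl1', 'tbl2', 'tbl3']
for i in collection:
    # run sqoop instead of just printing the command; check=True stops on a non-zero exit code
    subprocess.run(['sqoop', 'import', '--connect', '<connect-str>',
                    '--table', i, '--target-dir', '/dest/' + i], check=True)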


6 REPLIES

Master Mentor

@Grace Xu you might need to write a shell script that gets the list of tables and iterates through the collection, calling Sqoop for each one.

New Contributor

@Artem Ervits

Thanks. Do you know what the Sqoop query would look like to iterate through a collection and distribute the tables to different directories?


Your wrapper script will generate a sqoop command, substituting values each time it invokes Sqoop. Use any scripting language you are comfortable with that can access your RDBMS.
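
For example, here is a rough Python sketch of such a wrapper, assuming the pyodbc package is available and a hypothetical log table dbo.load_log with columns table_name and load_flag (all connection details are placeholders):

import subprocess
import pyodbc  # any SQL Server client library you are comfortable with would do

# hypothetical connection string and log table; substitute your own values
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=sqlhost;DATABASE=logdb;UID=user;PWD=secret')
cursor = conn.cursor()
cursor.execute('SELECT table_name FROM dbo.load_log WHERE load_flag = 1')
tables = [row[0] for row in cursor.fetchall()]
conn.close()

for table in tables:
    # substitute the table name into the sqoop command and give each table its own target directory
    subprocess.run(['sqoop', 'import',
                    '--connect', 'jdbc:sqlserver://sqlhost:1433;database=sourceDB',
                    '--username', 'user', '--password', 'secret',
                    '--table', table,
                    '--target-dir', '/dest/' + table],
                   check=True)

This also covers the question above about distributing tables to different directories: the --target-dir value is derived from each table name.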



@Grace Xu Here is another approach for you (a script will definitely work as well, but it will not make use of Oozie).

I believe we can make use of a decision node and do the following to get this done via Oozie:

Assumption: I am assuming the log table has some kind of id (log_entry_id) associated with the table names, along with other attributes the Sqoop job can use (HDFS/Hive destination, columns to import if not all, etc.).

This flow can be adjusted to match whatever constraints and existing design you are working with. For example, you can use the table name from the log if you do not have an id. Also, you may have tables that were relevant (active) in the past but are not any more, so you could keep an active flag in the log table that the shell action in step 2 below uses to decide whether a table still has to be imported. You get the idea.

This is a general structure that can be customized.

The workflow will have the following key steps:

1) Java action - Write a small Java program that uses JDBC to connect to SQL Server, reads the data from the log table, and creates a comma-delimited file on HDFS such as /tmp/ingest-tables.txt.

2) Shell action (input parameter: log_entry_id) - Read the file from HDFS and get the line whose id is log_entry_id + 1. The script outputs the values in Java Properties format (like param1=value1, param2=value2), and the workflow node captures that output for use in the subsequent steps. If the shell script does not find a next row, it returns -1. (A rough sketch of this step appears after the workflow description below.)

3) Decision node - If the value is less than 0, go to END; otherwise go to the Sqoop node.

4) Sqoop node - Execute the Sqoop task using the table name, destination, etc. received from the shell action's captured output. On success, go back to the shell action; otherwise go to END.

Workflow execution: When executing the workflow, a default starting value of 0 will be provided for log_entry_id.
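
As a rough illustration of step 2 only, here is a Python sketch of the shell action's logic, assuming the Java action wrote a comma-delimited file /tmp/ingest-tables.txt with lines of the form log_entry_id,table_name,target_dir (the file layout and all names are placeholders):

import subprocess
import sys

# the workflow passes in the previously processed log_entry_id (0 on the first run)
last_id = int(sys.argv[1])

# read the file produced by the Java action in step 1 from HDFS
listing = subprocess.run(['hdfs', 'dfs', '-cat', '/tmp/ingest-tables.txt'],
                         capture_output=True, text=True, check=True).stdout

for line in listing.splitlines():
    entry_id, table_name, target_dir = line.strip().split(',')
    if int(entry_id) == last_id + 1:
        # emit Java Properties so the Oozie shell action's capture-output can pick them up
        print('log_entry_id=' + entry_id)
        print('table_name=' + table_name)
        print('target_dir=' + target_dir)
        break
else:
    # no next row: the decision node sees a negative id and routes the workflow to END
    print('log_entry_id=-1')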

Rising Star

Dear Grace,

We can start with this template and improve it:

#!/bin/bash

# obtain a Kerberos ticket
kinit ......

# clear the previous target directory on HDFS
hdfs dfs -rm -r hdfs://....

# import the log table (only rows flagged > 0) from SQL Server into HDFS
sqoop import --connect "jdbc:sqlserver://....:1433;username=.....;password=.....;database=....DB" --table ..... \
    -m 1 --where "...... > 0"
CR=$?
if [ $CR -ne 0 ]; then
    echo 'Sqoop job failed'
    exit 1
fi

# pull the imported log rows back into a local file
hdfs dfs -cat hdfs://...../* > export_fs_table.txt
CR=$?
if [ $CR -ne 0 ]; then
    echo 'hdfs cat failed'
    exit 1
fi

# import each table listed in the log file
while IFS=',' read -r id tablename nbr flag; do
    sqoop import --connect "jdbc:sqlserver://......:1433;username=......;password=......;database=.......DB" --table "$tablename"
    CR=$?
    if [ $CR -ne 0 ]; then
        echo "sqoop import failed for $tablename"
        exit 1
    fi
done < export_fs_table.txt

Kind regards