
Best practice to import data from SQL server to Hive through Sqoop


Hi,

We are working on importing data from MS SQL Server into Hive through Sqoop. If we use incremental append mode, which is our requirement, then we need to specify --last-value with the row id that was inserted last time.

I have about 100 tables to import into Hive and keep up to date.

1. What is the best practice for saving the last row id of each table and passing it to Sqoop's --last-value option?

2. Why doesn't Sqoop itself compare the row ids of the source and destination tables and import only the rows beyond the destination table's last row id?

3. If I save the last row id value for all tables in a Hive table and want to use those values in a Sqoop job, how can I do that?

Above all, I want to automate the import job so that I do not have to provide the value manually for each table's daily import.

Any pointers ?

Thanks


Re: Best practice to import data from SQL server to Hive through Sqoop

You can use a saved Sqoop job. This will store the last value in the saved job's configuration.

"This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information."

https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_incremental_imports
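
For illustration, a minimal saved-job sketch, assuming a SQL Server source with a numeric row id column; the host, database, table, column, and password-file names below are placeholders, not values from this thread:

# Create the job once; Sqoop records the current --last-value in its metastore after every run.
sqoop job --create daily_import_your_table \
  -- import \
  --connect "jdbc:sqlserver://your-host:1433;database=yourdb" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username youruser \
  --password-file /user/youruser/.sqoop-password \
  --table your_table \
  --incremental append \
  --check-column row_id \
  --last-value 0

# Each execution imports only rows with row_id greater than the stored last value,
# and Sqoop updates that value automatically afterwards.
sqoop job --exec daily_import_your_table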

Re: Best practice to import data from SQL server to Hive through Sqoop

I am trying to execute the following job: sqoop job --create import_job_staging --hive-import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --connect "jdbc:sqlserver://asdf.com:1433;database=a" --table asdf --hive-table db.t_a --username sa --password a. It's throwing an "Error parsing arguments for job" exception. When I try the --import switch instead of the --hive-import switch, it works fine. Any idea how I could make it work? I executed the job with the --import switch, but I don't know where the data was imported on HDFS or how to insert it into the Hive table.

Re: Best practice to import data from SQL server to Hive through Sqoop

Mentor

Incremental import

It requires two additional parameters: --check-column indicates the column that should be checked for newly appended data, and --last-value contains the last value that was successfully imported into Hadoop.

The example below will transfer only those rows whose value in column id is greater than 5:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password dataman \
  --table charges \
  --incremental append \
  --check-column id \
  --last-value 5

Incremental import in append mode will allow you to transfer only the newly created rows. Any changed rows that were already imported in previous runs won't be transmitted again, so this method is meant for tables whose rows are not updated.

Incrementally Importing Mutable Data

Use the lastmodified mode instead of the append mode. For example, use the following command to transfer rows whose value in column last_update_date is greater than 2015-02-22 01:01:01:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table charges \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2015-02-22 01:01:01"

Replace jdbc:mysql://mysql.example.com/sqoop with the connection string for your own database.
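
Applied to the SQL Server source discussed in this thread, the lastmodified import would look roughly as follows; the host, database, table, column, and password-file names are placeholders:

sqoop import \
  --connect "jdbc:sqlserver://your-host:1433;database=yourdb" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username youruser \
  --password-file /user/youruser/.sqoop-password \
  --table your_table \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2015-02-22 01:01:01"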

Re: Best practice to import data from SQL server to Hive through Sqoop

@Nirvana India It's best to create Sqoop jobs; that way Sqoop keeps track of the lastmodified values for you. See here for an example of how to create and run a Sqoop job for lastmodified, and add the Hive-related options to import to Hive. If you have 100 tables, you need 100 jobs. Put the RDBMS password in a file when creating the jobs, and run those 100 jobs in a batch managed by crontab or by Oozie once a day or every hour. That's it, fully automatic, with no need to keep track of last values yourself.

Row id implies append, which is a different incremental import from lastmodified. Both are supported by Sqoop jobs. However, append mode is not supported when importing directly to Hive. You can work around that by importing to HDFS and creating external Hive tables on top of those target directories in HDFS. See here for such an example.
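
A rough sketch of the batch setup described above, assuming a plain text file with one table name per line, a common row id check column across the tables, and a password file on HDFS; every name and path here is a placeholder:

# create_jobs.sh - run once to define a saved incremental job per table.
while read -r tbl; do
  sqoop job --create "import_${tbl}" \
    -- import \
    --connect "jdbc:sqlserver://your-host:1433;database=yourdb" \
    --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
    --username youruser \
    --password-file /user/youruser/.sqoop-password \
    --table "${tbl}" \
    --incremental append \
    --check-column row_id \
    --last-value 0 \
    --target-dir "/data/staging/${tbl}"
done < tables.txt

# run_jobs.sh - execute all saved jobs; schedule via crontab or an Oozie shell action.
while read -r tbl; do
  sqoop job --exec "import_${tbl}"
done < tables.txt

# Example crontab entry to run the batch every day at 01:00:
# 0 1 * * * /path/to/run_jobs.sh >> /var/log/sqoop_incremental.log 2>&1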

Re: Best practice to import data from SQL server to Hive through Sqoop

The syntax to define a Sqoop job is: sqoop job --create <job-name> -- <sqoop-tool> <other options>. Note the space between "--" and the Sqoop tool; "import" is one such tool. In your case, with --import it somehow works without the space, and without "--hive-import" but with "--hive-table". The imported files go either to /apps/hive/warehouse/db.db/t_a (most likely) or to /user/<user-name>/db/t_a. This is not the preferred way; you need to give both "--hive-import" and "--hive-table". Also note that with your current command you are always importing the whole table.
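
Following that syntax, the failing command from the earlier reply could be rewritten along these lines; this is a sketch, and the password file path is an assumption based on the password-file suggestion given earlier in the thread:

# Note the space between "--" and the "import" tool, and both --hive-import and --hive-table.
sqoop job --create import_job_staging \
  -- import \
  --connect "jdbc:sqlserver://asdf.com:1433;database=a" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username sa \
  --password-file /user/sa/.sqoop-password \
  --table asdf \
  --hive-import \
  --hive-table db.t_a

sqoop job --exec import_job_staging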

Re: Best practice to import data from SQL server to Hive through Sqoop

Incremental import directly into Hive is not yet supported. Lastmodified is supported; try to create such a job based on the links I provided in my response above. You can also import to a Hive external table by importing to HDFS and defining a Hive external table based on the "--target-dir". Such a sample is given here.
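
A minimal sketch of that approach, assuming comma-delimited text output, a row id check column, and a two-column table; the directory, database, table, column, and password-file names are placeholders:

# Append new rows into a plain HDFS directory rather than a Hive-managed table.
sqoop import \
  --connect "jdbc:sqlserver://your-host:1433;database=yourdb" \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --username youruser \
  --password-file /user/youruser/.sqoop-password \
  --table your_table \
  --incremental append \
  --check-column row_id \
  --last-value 0 \
  --target-dir /data/staging/your_table \
  --fields-terminated-by ','

# Define a Hive external table over that directory once; new files show up in the
# table automatically after each incremental run.
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS db.your_table (
  row_id INT,
  some_column STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/staging/your_table';"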