Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Is it possible to do an incremental import using Sqoop free form query?

avatar
New Contributor

Please point me to documentation for the same. More helpful if I get examples as well

1 ACCEPTED SOLUTION

avatar
New Contributor

@Eyad Garelnabi Found a solution for the join queries.

I am using MySQL DB as source. I tried incremental with Sakila DB which comes with the MySQL installatoin.

I thought I will share my experience that why lengthy response. I apologize for the lengthy answer.

My first sqoop job was:

sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column last_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String

It was a straight forward join query.

select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS

But it gave exception: Imported Failed: Duplicate Column identifier specified: 'city_id'

city_id is available in both city as well as address tables.

I modified the job to:

sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String

I gave unique alias to individual column in select clause as mentioned in sqoop cook book Chapter 4. "use SQL projection to rename columns in the query so that each column in the output result set has a unique name." The SQL is

select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS

I am referring alias name for check-column (alast_update) attribute as well in sqoop job.

But it gave exception again: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'alast_update' in 'where clause'

I was going through lot of online material to resolve this issue. Then I saw this answer in stackoverflow (Click here). Though the answer talks about MS SQL, the solution provided there worked for MySQL and PostreSQL as well.

my new job is:

sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update FROM(SELECT a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update FROM sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String

The SQL is

SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update 
FROM (
               SELECT a.address_id AS address_id, a.address AS address, a.district AS district, a.city_id AS city_id, a.postal_code AS postal_code, a.last_update AS alast_update, c.city_id AS cityid, c.city AS city, c.country_id AS country_id, c.last_update AS clast_update 
               FROM sakila.address a 
               INNER JOIN sakila.city c 
               ON a.city_id=c.city_id) AS sub 
WHERE $CONDITIONS

The solution worked for us. @Eyad Garelnabi Please let me know your thoughts on this.

Thanks again to @Eyad Garelnabi

Suresh Babu

View solution in original post

7 REPLIES 7

avatar

@Suresh Babu Shanmugam

An Incremental Import in Sqoop is easily replicated in free form query

For example let’s assume you have the following table and the last import was on “2013-01-25” (column: createdOn) which corresponds to the “id” 3 (column:id):

2925-screen-shot-2016-03-21-at-52308-pm.png

You can import the changes/increment since that time using any of the following two “Incremental Update Import” commands (the first by id and the second by last modification date):

2926-screen-shot-2016-03-21-at-52800-pm.png

The equivalent “Free Form Query Import” commands would be:

2927-screen-shot-2016-03-21-at-52958-pm.png

So, if you require flexibility with your incremental import then expand/modify on the free form query above. Just keep in mind that the flag ’--m 1’ signifies that only 1 mapper will be used and jobs will run in serial form. This can be changed as needed.

For information on Incremental Imports see:

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

For information on Free Form Query see:

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_free_form_query_imports

For advanced examples of Free Form Query Imports I strongly recommend looking at:

https://www.safaribooksonline.com/library/view/apache-sqoop-cookbook/9781449364618/ch04.html

avatar
New Contributor

Hi Eyad Garelnabi,

I apologize for the delay. Your solution works. Thank you.

Actually we are trying Join queries. Till now we are not successful with incremental import.

I will definitely post the solution once we crack join queries incremental import.

Thanks once again,

Suresh

avatar

Great! Looking forward to seeing your final solution 🙂

avatar

@Suresh Babu Shanmugam I find using free form query to be more effective, because you can control exact interval/range for which you want to bring data.

avatar
New Contributor

@Eyad Garelnabi Found a solution for the join queries.

I am using MySQL DB as source. I tried incremental with Sakila DB which comes with the MySQL installatoin.

I thought I will share my experience that why lengthy response. I apologize for the lengthy answer.

My first sqoop job was:

sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column last_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String

It was a straight forward join query.

select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS

But it gave exception: Imported Failed: Duplicate Column identifier specified: 'city_id'

city_id is available in both city as well as address tables.

I modified the job to:

sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String

I gave unique alias to individual column in select clause as mentioned in sqoop cook book Chapter 4. "use SQL projection to rename columns in the query so that each column in the output result set has a unique name." The SQL is

select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS

I am referring alias name for check-column (alast_update) attribute as well in sqoop job.

But it gave exception again: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'alast_update' in 'where clause'

I was going through lot of online material to resolve this issue. Then I saw this answer in stackoverflow (Click here). Though the answer talks about MS SQL, the solution provided there worked for MySQL and PostreSQL as well.

my new job is:

sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update FROM(SELECT a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update FROM sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String

The SQL is

SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update 
FROM (
               SELECT a.address_id AS address_id, a.address AS address, a.district AS district, a.city_id AS city_id, a.postal_code AS postal_code, a.last_update AS alast_update, c.city_id AS cityid, c.city AS city, c.country_id AS country_id, c.last_update AS clast_update 
               FROM sakila.address a 
               INNER JOIN sakila.city c 
               ON a.city_id=c.city_id) AS sub 
WHERE $CONDITIONS

The solution worked for us. @Eyad Garelnabi Please let me know your thoughts on this.

Thanks again to @Eyad Garelnabi

Suresh Babu

avatar
New Contributor

@Suresh Babu Shanmugam I had the same problem and your solution worked for me. Thank you.

avatar
Master Collaborator

the free query form for lastmodified is not correct, because you can manage the updates rows, and also, you will get duplicate rows, so i dont recomend this solution alternative

109882-captura.jpg


2927-screen-shot-2016-03-21-at-52958-pm.png