Created 03-21-2016 12:11 PM
Please point me to documentation for the same. More helpful if I get examples as well
Created 04-11-2016 07:21 AM
@Eyad Garelnabi Found a solution for the join queries.
I am using MySQL DB as source. I tried incremental with Sakila DB which comes with the MySQL installatoin.
I thought I will share my experience that why lengthy response. I apologize for the lengthy answer.
My first sqoop job was:
sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column last_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String
It was a straight forward join query.
select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS
But it gave exception: Imported Failed: Duplicate Column identifier specified: 'city_id'
city_id is available in both city as well as address tables.
I modified the job to:
sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String
I gave unique alias to individual column in select clause as mentioned in sqoop cook book Chapter 4. "use SQL projection to rename columns in the query so that each column in the output result set has a unique name." The SQL is
select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS
I am referring alias name for check-column (alast_update) attribute as well in sqoop job.
But it gave exception again: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'alast_update' in 'where clause'
I was going through lot of online material to resolve this issue. Then I saw this answer in stackoverflow (Click here). Though the answer talks about MS SQL, the solution provided there worked for MySQL and PostreSQL as well.
my new job is:
sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update FROM(SELECT a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update FROM sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String
The SQL is
SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update FROM ( SELECT a.address_id AS address_id, a.address AS address, a.district AS district, a.city_id AS city_id, a.postal_code AS postal_code, a.last_update AS alast_update, c.city_id AS cityid, c.city AS city, c.country_id AS country_id, c.last_update AS clast_update FROM sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) AS sub WHERE $CONDITIONS
The solution worked for us. @Eyad Garelnabi Please let me know your thoughts on this.
Thanks again to @Eyad Garelnabi
Suresh Babu
Created on 03-21-2016 09:30 PM - edited 08-19-2019 01:44 AM
An Incremental Import in Sqoop is easily replicated in free form query
For example let’s assume you have the following table and the last import was on “2013-01-25” (column: createdOn) which corresponds to the “id” 3 (column:id):
You can import the changes/increment since that time using any of the following two “Incremental Update Import” commands (the first by id and the second by last modification date):
The equivalent “Free Form Query Import” commands would be:
So, if you require flexibility with your incremental import then expand/modify on the free form query above. Just keep in mind that the flag ’--m 1’ signifies that only 1 mapper will be used and jobs will run in serial form. This can be changed as needed.
For information on Incremental Imports see:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
For information on Free Form Query see:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_free_form_query_imports
For advanced examples of Free Form Query Imports I strongly recommend looking at:
https://www.safaribooksonline.com/library/view/apache-sqoop-cookbook/9781449364618/ch04.html
Created 04-01-2016 04:41 AM
Hi Eyad Garelnabi,
I apologize for the delay. Your solution works. Thank you.
Actually we are trying Join queries. Till now we are not successful with incremental import.
I will definitely post the solution once we crack join queries incremental import.
Thanks once again,
Suresh
Created 04-01-2016 05:21 AM
Great! Looking forward to seeing your final solution 🙂
Created 04-10-2016 03:05 AM
@Suresh Babu Shanmugam I find using free form query to be more effective, because you can control exact interval/range for which you want to bring data.
Created 04-11-2016 07:21 AM
@Eyad Garelnabi Found a solution for the join queries.
I am using MySQL DB as source. I tried incremental with Sakila DB which comes with the MySQL installatoin.
I thought I will share my experience that why lengthy response. I apologize for the lengthy answer.
My first sqoop job was:
sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column last_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String
It was a straight forward join query.
select * from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS
But it gave exception: Imported Failed: Duplicate Column identifier specified: 'city_id'
city_id is available in both city as well as address tables.
I modified the job to:
sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String
I gave unique alias to individual column in select clause as mentioned in sqoop cook book Chapter 4. "use SQL projection to rename columns in the query so that each column in the output result set has a unique name." The SQL is
select a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update from sakila.address a inner join sakila.city c on a.city_id=c.city_id WHERE $CONDITIONS
I am referring alias name for check-column (alast_update) attribute as well in sqoop job.
But it gave exception again: Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 'alast_update' in 'where clause'
I was going through lot of online material to resolve this issue. Then I saw this answer in stackoverflow (Click here). Though the answer talks about MS SQL, the solution provided there worked for MySQL and PostreSQL as well.
my new job is:
sqoop job --create myssb1 -- import --connect jdbc:mysql://<hostname>:<port>/sakila --username admin --password admin --driver=com.mysql.jdbc.Driver --query "SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update FROM(SELECT a.address_id as address_id, a.address as address, a.district as district, a.city_id as city_id, a.postal_code as postal_code, a.last_update as alast_update, c.city_id as cityid, c.city as city, c.country_id as country_id, c.last_update as clast_update FROM sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) as sub WHERE $CONDITIONS" --incremental lastmodified --check-column alast_update --last-value 1900-01-01 --target-dir /user/cloudera/ssb7 --hive-import --hive-table test.sakila -m 1 --hive-drop-import-delims --map-column-java address=String
The SQL is
SELECT address_id, address, district, city_id, postal_code, alast_update, cityid, city, country_id, clast_update FROM ( SELECT a.address_id AS address_id, a.address AS address, a.district AS district, a.city_id AS city_id, a.postal_code AS postal_code, a.last_update AS alast_update, c.city_id AS cityid, c.city AS city, c.country_id AS country_id, c.last_update AS clast_update FROM sakila.address a INNER JOIN sakila.city c ON a.city_id=c.city_id) AS sub WHERE $CONDITIONS
The solution worked for us. @Eyad Garelnabi Please let me know your thoughts on this.
Thanks again to @Eyad Garelnabi
Suresh Babu
Created 07-06-2017 07:25 AM
@Suresh Babu Shanmugam I had the same problem and your solution worked for me. Thank you.
Created on 07-15-2019 03:31 PM - edited 08-19-2019 01:43 AM
the free query form for lastmodified is not correct, because you can manage the updates rows, and also, you will get duplicate rows, so i dont recomend this solution alternative