Created 09-26-2016 02:51 PM
EDIT (SOLUTION):
Problem: Can't include the "with ur" statement DB2 requires to keep a table from locking. i.e. a "dirty read."
Solution: Make use of the --boundary-query option in Sqoop.
I modified my Sqoop import statement to leverage the --boundary-query option in Sqoop, which allows me to build the boundary query for determining the splits for the mappers. Typically, since I'm using the --split-by and --query options, this is generated automatically. By adding the new option at the end of my Sqoop statement, and adjusting both it and my standard query to include the desired "with ur", I can achieve a Sqoop job whose multiple mappers fire off non-locking select queries, with ranges of data, with the "with ur" statement.
Working code:
sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR" --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000 --boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"
* This Sqoop command is being dynamically generated from a bash script. If you experience problems adopting the above code, try the following:
- put the whole sqoop command on one line
- use \$CONDITIONS not $CONDITIONS (if this is coming from a text file, not hardcoded, you don't need the escape. If you're typing it directly in the sqoop command, you do
- ensure the information following both --query and --boundary-query are in double quotes as above
---
ORIGINAL QUESTION:
I have this Sqoop command that nearly works... up until I get a syntax exception from DB2:
Sqoop Command:
sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 AND $CONDITIONS with UR --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000
It works great up to this point. Compares the target data with the hcat/hive table, all checks out. Then it tries to do the splits with the bounding values, and flips.
Error (and log):
16/09/26 10:32:12 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(column1), MAX(column1) FROM (SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 AND (1 = 1) with UR) AS t1
Caused by: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-199, SQLSTATE=42601, SQLERRMC=WITH;AND OR HAVING GROUP INTERSECT ORDER ) FETCH EXCEPT MINUS UNION, DRIVER=4.16.53
I believe it has to do with DB2 needing the "with UR" phrase last, but Sqoop always tacks on the AS t1, causing it to break. Any suggestions to fixing this? We need to do non-locking reads on a DB2 table.
Created 09-26-2016 06:28 PM
I've resolved the issue by using the --boundary-query option in Sqoop, in tandem with my use of --query and --split-by. Will post answer shortly.
Edit: Posted answer at top of my question. Here's the code again for convenience:
sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR" --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000 --boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"
Created 09-26-2016 04:19 PM
Is your query wrapped in quotes? If not, can you try that (single quotes is fine)? If you wrap it in double quotes then you will need to use "\$CONDITIONS" instead of just "$CONDITIONS" so that your shell doesn't use it as a shell variable.
Created 09-26-2016 05:09 PM
I've attempted that, but it still feeds out the same error as above. Am trying now to use JDBC level connection parameters as below, but it doesn't seem to be passing it to the splitted sub queries. Initial connection is with UR, but subsequent subqueries are not.
--connection-param-file /just/some/path/sqoop.properties
jdbc.transaction.isolation=TRANSACTION_READ_UNCOMMITTED
Created 09-26-2016 06:28 PM
I've resolved the issue by using the --boundary-query option in Sqoop, in tandem with my use of --query and --split-by. Will post answer shortly.
Edit: Posted answer at top of my question. Here's the code again for convenience:
sqoop import -Ddb2.jcc.charsetDecoderEncoder=3 -Dmapred.job.queue.name=default --connect jdbc:db2://mybox.mycompany.com:1234/mydb2db --username myuser --password mypassword --query "select * from ( SELECT column1, column2, column3, current timestamp as load_time, CHAR(current date, iso) as load_dt FROM my_database.my_table where column1 <= 1000 )a WHERE $CONDITIONS with UR" --hcatalog-database myHiveDB --hcatalog-table myHiveTable --split-by column1 --fields-terminated-by \001 -m 5 --relaxed-isolation --fetch-size=100000 --boundary-query "SELECT MIN(column1), MAX(column1) FROM my_database.my_table where column1 <= 1000 with UR"