Created 07-06-2018 10:41 PM
We already have a post on this but the option given is not working and throwing my error and it was more than a year old
previous post https://community.hortonworks.com/questions/146261/sql-function-in-split-by.html
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \ --connect jdbc:mysql://hostname:3306/jslice_orders \ --username=** -P --table archive_orders --fields-terminated-by '|' \ --lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \ --optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \ --as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \ --target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by 'cast(order_number as UNSIGNED)
fails with error saying column not found while executing the sqoop command
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`cast(order_number as UNSIGNED)`), MAX(`cast(order_number as UNSIGNED)`) FROM `archive_orders` ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: java.sql.SQLSyntaxErrorException: (conn=472029) Unknown column 'cast(order_number as UNSIGNED)' in 'field list'
how can i get this to working since my PK on source tables are all in varchar and tables are off 50GB in size
Created 07-06-2018 11:13 PM
Hi @Krish E
Have you tried to set this property on your sqoop command?
-Dorg.apache.sqoop.splitter.allow_text_splitter=trueThen, in this case, you won't need to cast the PK.
Ps: Not sure if this property works alongside with the others parameters passed, guess it's worth to test it first 🙂
Hope this helps!
Created 07-07-2018 03:12 PM
My sqoop command above does have that
Created 07-09-2018 03:24 AM
Hi @Krish E!
Sorry, I miss that 😞
And have you tried to run without the cast? This parameter should allow you to run split-by with a varchar column.
Another thing that's intriguing me, I assume that this Sqoop Job worked in the past right?
If so, could check on the DB side if it's possible to run the following query? (Perhaps find an uncast exception)
select cast(order_number as UNSIGNED) from archive_orders;
I'm not sure if the error is in the sqoop side, cause I took a look at the sqoop github and didn't found any exception related to Unknown Column, so my guess would be that probably the JDBC got the error from MySQL and throw to Sqoop.
Hope this helps!
Created 07-09-2018 03:23 AM
Thanks for the reply, as i mentioned above all my columns are varchar on the source mysql tables and if i just mention those column name on the --split-by it brings me duplicate records.
This is a new setup of sqoop jobs so this never worked earlier as well. The select stmt on the DB side does work fine.
Created 07-09-2018 03:25 AM
@vmurakami please see my reply above
Created 07-09-2018 03:49 AM
Got it @Krish E!
If you change your split-by for boundary-queries, do you still have the same issue?
--boundary-query "SELECT MIN(cast(order_number as UNSIGNED)), MAX(cast(order_number as UNSIGNED)) FROM archive_orders"Not sure if your issue isn't related to the single-quote.
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(` <- cast(order_number as UNSIGNED) ->`), MAX(`<-cast(order_number as UNSIGNED)->`) FROM `archive_orders`
As you using the split-by alongside with the sqoop import-Dorg.apache.sqoop.splitter.allow_text_splitter=true, guess sqoop is taking the whole cast function as a column name, but again, it's just a guess 🙂
Hope this helps!
Created 07-09-2018 06:22 PM
No the Boundary query does work, but i figured that even though we do split by using a cast on a varchar column, once the split is identified sqoop internally does a sort of the column on split by which is a varchar according to Mysql and it brings in Duplicate on the target system.
I found unless you use a integer column or do an incremental load (after load to staging) will not solve the issues of duplicate on target with Varchar datatype with your split-by column
Created 07-10-2018 06:02 PM
Hmmm @Krish E, I'm not sure if I get it right, but in this case, you aren't using a primary key right? Or in fact it's a PK and sqoop is duplicating it?
Created 07-11-2018 02:09 PM
Its the PK that sqoop is duplicating and this only happens on tables which has millions of rows basically large tables more than 45GB of data.