Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Sqoop Import --split-by with sql function

avatar
Explorer

We already have a post on this but the option given is not working and throwing my error and it was more than a year old

previous post https://community.hortonworks.com/questions/146261/sql-function-in-split-by.html

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://hostname:3306/jslice_orders \
--username=** -P --table archive_orders --fields-terminated-by '|' \
--lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \
--optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \
--as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \
--target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by 'cast(order_number as UNSIGNED)

fails with error saying column not found while executing the sqoop command

INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT 
MIN(`cast(order_number as UNSIGNED)`), MAX(`cast(order_number as 
UNSIGNED)`) FROM `archive_orders`

ERROR tool.ImportTool: Encountered IOException running import job: 
java.io.IOException: java.sql.SQLSyntaxErrorException: (conn=472029) 
Unknown column 'cast(order_number as UNSIGNED)' in 'field list'

how can i get this to working since my PK on source tables are all in varchar and tables are off 50GB in size

13 REPLIES 13

avatar

Hi @Krish E
Have you tried to set this property on your sqoop command?

-Dorg.apache.sqoop.splitter.allow_text_splitter=true
Then, in this case, you won't need to cast the PK.

Ps: Not sure if this property works alongside with the others parameters passed, guess it's worth to test it first 🙂

Hope this helps!

avatar
Explorer

My sqoop command above does have that

  1. sqoop import-Dorg.apache.sqoop.splitter.allow_text_splitter=true \

avatar

Hi @Krish E!
Sorry, I miss that 😞
And have you tried to run without the cast? This parameter should allow you to run split-by with a varchar column.
Another thing that's intriguing me, I assume that this Sqoop Job worked in the past right?
If so, could check on the DB side if it's possible to run the following query? (Perhaps find an uncast exception)

select cast(order_number as UNSIGNED) from archive_orders;

I'm not sure if the error is in the sqoop side, cause I took a look at the sqoop github and didn't found any exception related to Unknown Column, so my guess would be that probably the JDBC got the error from MySQL and throw to Sqoop.


Hope this helps!

avatar
Explorer

Thanks for the reply, as i mentioned above all my columns are varchar on the source mysql tables and if i just mention those column name on the --split-by it brings me duplicate records.

This is a new setup of sqoop jobs so this never worked earlier as well. The select stmt on the DB side does work fine.

avatar
Explorer

@vmurakami please see my reply above

avatar

Got it @Krish E!
If you change your split-by for boundary-queries, do you still have the same issue?

--boundary-query "SELECT MIN(cast(order_number as UNSIGNED)), MAX(cast(order_number as UNSIGNED)) FROM archive_orders"
Not sure if your issue isn't related to the single-quote.
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(` <- cast(order_number as UNSIGNED) ->`), MAX(`<-cast(order_number as UNSIGNED)->`) FROM `archive_orders`

As you using the split-by alongside with the sqoop import-Dorg.apache.sqoop.splitter.allow_text_splitter=true, guess sqoop is taking the whole cast function as a column name, but again, it's just a guess 🙂

Hope this helps!

avatar
Explorer

No the Boundary query does work, but i figured that even though we do split by using a cast on a varchar column, once the split is identified sqoop internally does a sort of the column on split by which is a varchar according to Mysql and it brings in Duplicate on the target system.

I found unless you use a integer column or do an incremental load (after load to staging) will not solve the issues of duplicate on target with Varchar datatype with your split-by column

avatar

Hmmm @Krish E, I'm not sure if I get it right, but in this case, you aren't using a primary key right? Or in fact it's a PK and sqoop is duplicating it?

avatar
Explorer

Its the PK that sqoop is duplicating and this only happens on tables which has millions of rows basically large tables more than 45GB of data.