
Sqoop Import --split-by with sql function

Explorer

There is already a post on this, but the option given there is not working for me and throws an error, and the thread is more than a year old.

previous post https://community.hortonworks.com/questions/146261/sql-function-in-split-by.html

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://hostname:3306/jslice_orders \
--username=** -P --table archive_orders --fields-terminated-by '|' \
--lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \
--optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \
--as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \
--target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by 'cast(order_number as UNSIGNED)'

This fails with an error saying the column was not found while executing the Sqoop command:

INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`cast(order_number as UNSIGNED)`), MAX(`cast(order_number as UNSIGNED)`) FROM `archive_orders`

ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: java.sql.SQLSyntaxErrorException: (conn=472029) Unknown column 'cast(order_number as UNSIGNED)' in 'field list'
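The backticks in that BoundingValsQuery point at the root cause: MySQL treats anything inside backticks as a literal identifier, so the whole cast expression gets looked up as a single column name. A minimal sketch of the effect, using Python's built-in sqlite3 as a stand-in for MySQL (SQLite also accepts backtick-quoted identifiers); the table and column names are taken from this thread, the data is made up:

```python
import sqlite3

# In-memory stand-in for the MySQL table from the thread.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE archive_orders (order_number TEXT)")
con.execute("INSERT INTO archive_orders VALUES ('42')")

# Unquoted, the cast is parsed as an expression and works:
row = con.execute(
    "SELECT MIN(CAST(order_number AS INTEGER)) FROM archive_orders"
).fetchone()
print(row)  # (42,)

# Backtick-quoted -- which is how Sqoop embeds the --split-by value in
# its bounding query -- the whole string is treated as one column name
# and the lookup fails, mirroring MySQL's "Unknown column" error:
try:
    con.execute(
        "SELECT MIN(`CAST(order_number AS INTEGER)`) FROM archive_orders"
    )
except sqlite3.OperationalError as exc:
    print(exc)  # e.g. no such column: CAST(order_number AS INTEGER)
```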

How can I get this working? The PKs on my source tables are all VARCHAR, and the tables are around 50 GB in size.

13 Replies

Hi @Krish E
Have you tried setting this property in your Sqoop command?

-Dorg.apache.sqoop.splitter.allow_text_splitter=true
Then, in this case, you won't need to cast the PK.

PS: I'm not sure whether this property works alongside the other parameters you're passing; I guess it's worth testing first 🙂

Hope this helps!

Explorer

My sqoop command above does have that

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true

Hi @Krish E!
Sorry, I missed that 😞
And have you tried running it without the cast? That parameter should allow you to use --split-by with a VARCHAR column.
Another thing that's intriguing me: I assume this Sqoop job worked in the past, right?
If so, could you check on the DB side whether it's possible to run the following query? (It might surface a cast exception.)

select cast(order_number as UNSIGNED) from archive_orders;

I'm not sure the error is on the Sqoop side. I took a look at the Sqoop GitHub repo and didn't find any exception related to "Unknown column", so my guess is that the JDBC driver got the error back from MySQL and passed it up to Sqoop.


Hope this helps!

Explorer

Thanks for the reply. As I mentioned above, all my columns on the source MySQL tables are VARCHAR, and if I just put the column name in --split-by, it brings in duplicate records.

This is a new setup of Sqoop jobs, so it has never worked before either. The SELECT statement on the DB side does work fine.

Explorer

@vmurakami please see my reply above

Got it @Krish E!
If you swap your --split-by for a --boundary-query, do you still hit the same issue?

--boundary-query "SELECT MIN(cast(order_number as UNSIGNED)), MAX(cast(order_number as UNSIGNED)) FROM archive_orders"
I'm also not sure whether your issue is related to the single quotes:
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(` <- cast(order_number as UNSIGNED) ->`), MAX(`<-cast(order_number as UNSIGNED)->`) FROM `archive_orders`

Since you're using --split-by alongside -Dorg.apache.sqoop.splitter.allow_text_splitter=true, my guess is that Sqoop is taking the whole cast function as a column name; but again, it's just a guess 🙂

Hope this helps!

Explorer

No, the boundary query does work. But I figured out that even though we split using a cast on a VARCHAR column, once the splits are identified Sqoop internally sorts on the split-by column, which MySQL treats as VARCHAR, and that brings duplicates into the target system.

I found that unless you use an integer column, or do an incremental load (after loading to staging), the duplicates on the target caused by a VARCHAR split-by column won't go away.
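The duplicate behaviour described above is consistent with string vs. numeric ordering: lexicographic order on numeric strings ("10" sorts before "9") differs from numeric order, so split ranges compared as strings carve up the rows differently than their numeric bounds suggest. A small sketch of the mismatch in plain Python, with hypothetical order numbers stored as strings like the VARCHAR column in the thread:

```python
# Hypothetical order numbers stored as VARCHAR, as in the source table.
order_numbers = [str(n) for n in range(1, 21)]  # "1" .. "20"

# Numeric and lexicographic order disagree:
print(sorted(order_numbers, key=int)[:5])  # ['1', '2', '3', '4', '5']
print(sorted(order_numbers)[:5])           # ['1', '10', '11', '12', '13']

# A split predicate such as  '1' <= col < '2'  evaluated as a string
# comparison (which is what a VARCHAR column gets) captures "1" plus
# everything from "10" to "19" -- not the single row that a numeric
# reading of the same bounds would suggest:
first_split = [s for s in order_numbers if "1" <= s < "2"]
print(first_split)
# ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19']
```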

Hmmm @Krish E, I'm not sure I'm getting it right, but in this case, aren't you using a primary key? Or is it in fact a PK that Sqoop is duplicating?

Explorer

It's the PK that Sqoop is duplicating, and this only happens on tables with millions of rows, basically large tables with more than 45 GB of data.

@Krish E, in that case, I guess running this Sqoop job with -m 1 and breaking it into batches wouldn't be an option for you, right?
Do you have any other business key or surrogate key?
Also, we can take a look at the max/min values generated by Sqoop (the bounds) and dig into how many rows each mapper gets (you can see this through the YARN Web UI > App Master ID > Mappers > Logs). That will show whether it's running evenly.
One last thing, just in case, it came to mind now 😄
What about the command below?

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://hostname:3306/jslice_orders \
--username=** -P --table archive_orders --fields-terminated-by '|' \
--lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \
--optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \
--as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \
--target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by "cast(order_number as UNSIGNED) % 16" \ 
--boundary-query "SELECT 0,15"
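The bucketing idea behind that last command can be sketched in plain Python (hypothetical data): the modulo hashes every row into one of 16 buckets, the boundary query pins the split bounds to 0..15, and since each row lands in exactly one bucket, the mapper ranges cannot overlap:

```python
# Hypothetical VARCHAR order numbers, as in the source table.
order_numbers = [str(n) for n in range(1, 101)]

# Emulate --split-by "cast(order_number as UNSIGNED) % 16" together
# with --boundary-query "SELECT 0,15": one bucket per mapper.
buckets = {m: [] for m in range(16)}
for s in order_numbers:
    buckets[int(s) % 16].append(s)

# Every row lands in exactly one bucket: no duplicates, no gaps.
total = sum(len(rows) for rows in buckets.values())
print(total == len(order_numbers))  # True
```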


Hope this helps 🙂

Explorer

Good idea, but the cast in --split-by is still not working, so the above won't work either. And it's not just the one table I'm working on; I will eventually have to implement the solution on 250+ tables. So doing a one-time import with 1 mapper and incremental loads afterwards would be a better solution.

In any case, doing a full dump every day is not scalable.

Got it @Krish E!

Yeah, I was going to say the same: dumping your whole DB usually isn't worth it, at least in the common cases.
If you still intend to use --split-by, take a look at your columns and try to pick another candidate key for the split (like a BK/SK), and comment here to keep our good discussion going!
Otherwise, I'd kindly ask you to accept the answer so other HCC users can find the solution faster, and to open new questions for further issues 🙂

Hope this helps!

Contributor

Hi @elkrish,

Was this resolved? Can you share if you found a solution for this issue?