
Sqoop Import --split-by with sql function

Explorer

There is already a post on this, but the option given there is not working for me and throws an error, and the thread is more than a year old.

previous post https://community.hortonworks.com/questions/146261/sql-function-in-split-by.html

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://hostname:3306/jslice_orders \
--username=** -P --table archive_orders --fields-terminated-by '|' \
--lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \
--optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \
--as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \
--target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by 'cast(order_number as UNSIGNED)'

This fails with an error saying the column was not found while executing the Sqoop command:

INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`cast(order_number as UNSIGNED)`), MAX(`cast(order_number as UNSIGNED)`) FROM `archive_orders`

ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: java.sql.SQLSyntaxErrorException: (conn=472029) Unknown column 'cast(order_number as UNSIGNED)' in 'field list'
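The backticks in that BoundingValsQuery point at the root cause: MySQL treats anything inside backticks as a literal identifier, so the whole cast expression gets looked up as a single column name. A minimal sketch of the effect, using Python's built-in sqlite3 as a stand-in for MySQL (SQLite also accepts backtick-quoted identifiers); the table and column names are taken from this thread, the data is made up:

```python
import sqlite3

# In-memory stand-in for the MySQL table from the thread.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE archive_orders (order_number TEXT)")
con.execute("INSERT INTO archive_orders VALUES ('42')")

# Unquoted, the cast is parsed as an expression and works:
row = con.execute(
    "SELECT MIN(CAST(order_number AS INTEGER)) FROM archive_orders"
).fetchone()
print(row)  # (42,)

# Backtick-quoted -- which is how Sqoop embeds the --split-by value in
# its bounding query -- the whole string is treated as one column name
# and the lookup fails, mirroring MySQL's "Unknown column" error:
try:
    con.execute(
        "SELECT MIN(`CAST(order_number AS INTEGER)`) FROM archive_orders"
    )
except sqlite3.OperationalError as exc:
    print(exc)  # e.g. no such column: CAST(order_number AS INTEGER)
```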

How can I get this working? The PKs on my source tables are all VARCHAR, and the tables are around 50 GB in size.

13 Replies

Hi @Krish E
Have you tried setting this property in your Sqoop command?

-Dorg.apache.sqoop.splitter.allow_text_splitter=true
Then, in this case, you won't need to cast the PK.

PS: I'm not sure whether this property works alongside the other parameters you're passing; I guess it's worth testing first 🙂

Hope this helps!

Explorer

My sqoop command above does have that

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true

Hi @Krish E!
Sorry, I missed that 😞
And have you tried running it without the cast? That parameter should allow you to use --split-by with a VARCHAR column.
Another thing that's intriguing me: I assume this Sqoop job worked in the past, right?
If so, could you check on the DB side whether it's possible to run the following query? (It might surface a cast exception.)

select cast(order_number as UNSIGNED) from archive_orders;

I'm not sure the error is on the Sqoop side. I took a look at the Sqoop GitHub repo and didn't find any exception related to "Unknown column", so my guess is that the JDBC driver got the error back from MySQL and passed it up to Sqoop.


Hope this helps!

Explorer

Thanks for the reply. As I mentioned above, all my columns on the source MySQL tables are VARCHAR, and if I just put the column name in --split-by, it brings in duplicate records.

This is a new setup of Sqoop jobs, so it has never worked before either. The SELECT statement on the DB side does work fine.

Explorer

@vmurakami please see my reply above

Got it @Krish E!
If you swap your --split-by for a --boundary-query, do you still hit the same issue?

--boundary-query "SELECT MIN(cast(order_number as UNSIGNED)), MAX(cast(order_number as UNSIGNED)) FROM archive_orders"
I'm also not sure whether your issue is related to the single quotes:
INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(` <- cast(order_number as UNSIGNED) ->`), MAX(`<-cast(order_number as UNSIGNED)->`) FROM `archive_orders`

Since you're using --split-by alongside -Dorg.apache.sqoop.splitter.allow_text_splitter=true, my guess is that Sqoop is taking the whole cast function as a column name; but again, it's just a guess 🙂

Hope this helps!

Explorer

No, the boundary query does work. But I figured out that even though we split using a cast on a VARCHAR column, once the splits are identified Sqoop internally sorts on the split-by column, which MySQL treats as VARCHAR, and that brings duplicates into the target system.

I found that unless you use an integer column, or do an incremental load (after loading to staging), the duplicates on the target caused by a VARCHAR split-by column won't go away.
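The duplicate behaviour described above is consistent with string vs. numeric ordering: lexicographic order on numeric strings ("10" sorts before "9") differs from numeric order, so split ranges compared as strings carve up the rows differently than their numeric bounds suggest. A small sketch of the mismatch in plain Python, with hypothetical order numbers stored as strings like the VARCHAR column in the thread:

```python
# Hypothetical order numbers stored as VARCHAR, as in the source table.
order_numbers = [str(n) for n in range(1, 21)]  # "1" .. "20"

# Numeric and lexicographic order disagree:
print(sorted(order_numbers, key=int)[:5])  # ['1', '2', '3', '4', '5']
print(sorted(order_numbers)[:5])           # ['1', '10', '11', '12', '13']

# A split predicate such as  '1' <= col < '2'  evaluated as a string
# comparison (which is what a VARCHAR column gets) captures "1" plus
# everything from "10" to "19" -- not the single row that a numeric
# reading of the same bounds would suggest:
first_split = [s for s in order_numbers if "1" <= s < "2"]
print(first_split)
# ['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19']
```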

Hmmm @Krish E, I'm not sure I'm getting it right, but in this case, aren't you using a primary key? Or is it in fact a PK that Sqoop is duplicating?

Explorer

It's the PK that Sqoop is duplicating, and this only happens on tables with millions of rows, basically large tables with more than 45 GB of data.

@Krish E, in that case, I guess running this Sqoop job with -m 1 and breaking it into batches wouldn't be an option for you, right?
Do you have any other business key or surrogate key?
Also, we can take a look at the max/min values generated by Sqoop (the bounds) and dig into how many rows each mapper gets (you can see this through the YARN Web UI > App Master ID > Mappers > Logs). That will show whether it's running evenly.
One last thing, just in case, it came to mind now 😄
What about the command below?

sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect jdbc:mysql://hostname:3306/jslice_orders \
--username=** -P --table archive_orders --fields-terminated-by '|' \
--lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \
--optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \
--as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \
--target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by "cast(order_number as UNSIGNED) % 16" \ 
--boundary-query "SELECT 0,15"
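The bucketing idea behind that last command can be sketched in plain Python (hypothetical data): the modulo hashes every row into one of 16 buckets, the boundary query pins the split bounds to 0..15, and since each row lands in exactly one bucket, the mapper ranges cannot overlap:

```python
# Hypothetical VARCHAR order numbers, as in the source table.
order_numbers = [str(n) for n in range(1, 101)]

# Emulate --split-by "cast(order_number as UNSIGNED) % 16" together
# with --boundary-query "SELECT 0,15": one bucket per mapper.
buckets = {m: [] for m in range(16)}
for s in order_numbers:
    buckets[int(s) % 16].append(s)

# Every row lands in exactly one bucket: no duplicates, no gaps.
total = sum(len(rows) for rows in buckets.values())
print(total == len(order_numbers))  # True
```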


Hope this helps 🙂

Explorer

Good idea, but the cast in --split-by is still not working, so the above won't work either. And it's not just the one table I'm working on; I will eventually have to implement the solution on 250+ tables. So doing a one-time import with 1 mapper and incremental loads afterwards would be a better solution.

In any case, doing a full dump every day is not scalable.

Got it @Krish E!

Yeah, I was going to say the same: dumping your whole DB usually isn't worth it, at least in the common cases.
If you still intend to use --split-by, take a look at your columns and try to pick another candidate key for the split (like a BK/SK), and comment here to keep our good discussion going!
Otherwise, I'd kindly ask you to accept the answer so other HCC users can find the solution faster, and to open new questions for further issues 🙂

Hope this helps!

Contributor

Hi @elkrish,

Was this resolved? Can you share if you found a solution for this issue?