@Krish E in this case, I guess running this Sqoop job with -m 1 and breaking it into batches wouldn't be an option for you, right?
Do you have any other business key (BK) or surrogate key (SK)?
Also, we can look at the min/max values Sqoop generates (the bounds) and dig into how many rows each mapper gets (you can see this in the YARN Web UI > App Master ID > Mappers > Logs). Then we'll see whether the load is spread evenly.
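If you'd rather check the spread on the database side before running a job, something like the sketch below might help. It only prints two SQL statements: one with the same MIN/MAX shape as the bounds query Sqoop runs by default, and one that counts rows per modulo bucket. The function name `split_check_sql` is made up for illustration; the table and column names come from this thread, and 16 buckets is an assumption matching -m 16.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Sqoop): prints SQL to inspect the split
# bounds and the number of rows that would land in each modulo bucket.
split_check_sql() {
  local table="$1" column="$2" buckets="$3"
  # Bounds of the casted column, same MIN/MAX shape as Sqoop's default bounds query.
  echo "SELECT MIN(CAST(${column} AS UNSIGNED)), MAX(CAST(${column} AS UNSIGNED)) FROM ${table};"
  # Row count per bucket; very uneven counts would mean uneven mappers.
  echo "SELECT CAST(${column} AS UNSIGNED) % ${buckets} AS bucket, COUNT(*) AS row_count FROM ${table} GROUP BY bucket ORDER BY bucket;"
}

# Example call with the names used in this thread (16 buckets is an assumption).
split_check_sql archive_orders order_number 16
```

The printed statements can be pasted into a MySQL session, or piped into the client, e.g. `split_check_sql archive_orders order_number 16 | mysql -h hostname -u <user> -p jslice_orders`.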
One last thing that just came to mind :D
What about this command below?
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://hostname:3306/jslice_orders \
  --username=** -P --table archive_orders --fields-terminated-by '|' \
  --lines-terminated-by '\n' --null-non-string "\\\\N" --null-string "\\\\N" --escaped-by '\' \
  --optionally-enclosed-by '\"' --map-column-java dwh_last_modified=String --hive-drop-import-delims \
  --as-parquetfile -m 16 --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec --delete-target-dir \
  --target-dir hdfs:///hive/warehouse/jslice_orders/archive_orders/text3/ --split-by "cast(order_number as UNSIGNED) % 16" \
  --boundary-query "SELECT 0,15"
Hope this helps :)
Good idea. The cast on split-by is still not working, so the command above won't work. Also, this isn't for just one table: I will eventually have to implement the solution on 250+ tables. So doing a one-time import with 1 mapper and then incremental loads afterwards would be a better solution.
Anyway, doing a full dump every day is not scalable.
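A rough sketch of that plan, as a dry run that only prints the Sqoop commands per table so they can be reviewed first. Several things here are assumptions, not facts from this thread: that every table has a timestamp-like `dwh_last_modified` column usable as the check column, the `/data/raw/<table>` target layout, the `sqoop_user` default, and the placeholder last value. The function names are hypothetical too.

```shell
#!/usr/bin/env bash
# Sketch only: prints (does not run) the Sqoop commands for each table.
CONNECT="jdbc:mysql://hostname:3306/jslice_orders"
DB_USER="${DB_USER:-sqoop_user}"   # hypothetical default user name
TARGET_ROOT="/data/raw"            # hypothetical target directory layout

# One-time full import with a single mapper, so no split-by is needed.
initial_import_cmd() {
  local table="$1"
  echo "sqoop import --connect ${CONNECT} --username ${DB_USER} -P --table ${table} --target-dir ${TARGET_ROOT}/${table} -m 1"
}

# Follow-up load that only picks up rows modified after last_value.
# Assumes dwh_last_modified is a timestamp-like column on the table.
incremental_import_cmd() {
  local table="$1" last_value="$2"
  echo "sqoop import --connect ${CONNECT} --username ${DB_USER} -P --table ${table} --target-dir ${TARGET_ROOT}/${table} -m 1 --incremental lastmodified --check-column dwh_last_modified --last-value '${last_value}' --append"
}

# The table list would grow to the 250+ tables; the date is a placeholder.
for table in archive_orders; do
  initial_import_cmd "$table"
  incremental_import_cmd "$table" "2018-01-01 00:00:00"
done
```

Tracking the last value by hand for every table gets tedious, so wrapping the incremental import in a saved job (`sqoop job --create ...`) may be worth a look, since saved jobs keep the last imported value between runs.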
Got it @Krish E!
Yeah, I was going to say the same: dumping your whole DB usually isn't worth it, at least in the common cases.
If you still intend to use split-by, take a look at your columns and try another candidate key for it (like a BK/SK), and comment here to keep our good discussion going!
Otherwise, I'd kindly ask you to accept the answer so other HCC users can find the solution faster, and to open new questions for any further issues. :)
Hope this helps!