Explorer
Posts: 6
Registered: ‎06-13-2015

Avoiding skew and determining optimal number of mappers in SQOOP import.

Hi,

 

If the source table has a primary key, a SQOOP import would generate no skewed data. But what if there is no primary key defined on the table and we have to use the --split-by parameter to split records among multiple mappers?

There is a high chance of skewed data, depending on the column we choose for --split-by.

Could you please help me understand how to avoid skew in such scenarios, and also how to determine the optimal number of mappers for any SQOOP import?

 

Thanks

Posts: 1,826
Kudos: 406
Solutions: 292
Registered: ‎07-31-2013

Re: Avoiding skew and determining optimal number of mappers in SQOOP import.

I'm afraid you'd not be able to avoid that situation. However, you do have some query controls you can use to limit the expanse of the query to only target specific key values (and divide them across map tasks fairly), iterating over which you could run multiple jobs to get the data in (as opposed to one single job).

Is the expected skew very large, such that you envision an issue every time you need to make a full import?
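
To make the idea of several range-limited jobs concrete, here is a rough sketch, not a tested recipe. It only prints one import command per key range so the commands can be reviewed before anything runs. The connection string, table, key column `id`, range bounds, mapper count, and target paths are all placeholder assumptions; `--where`, `--split-by`, `--num-mappers`, and `--target-dir` are ordinary Sqoop import arguments.

```shell
# All values below are placeholders, not taken from this thread.
JDBC_URL="jdbc:mysql://dbhost/myDb"
TABLE="myTable"
KEY="id"
TARGET_BASE="/data/myTable"

# Print (not run) the import command for the half-open key range [lo, hi).
# Each range gets its own target directory so the jobs do not collide.
build_import_cmd() {
  printf 'sqoop import --connect %s --table %s --where "%s >= %s AND %s < %s" --split-by %s --num-mappers 4 --target-dir %s/%s_%s\n' \
    "$JDBC_URL" "$TABLE" "$KEY" "$1" "$KEY" "$2" "$KEY" "$TARGET_BASE" "$1" "$2"
}

# Hypothetical bounds, picked so that each range holds a similar number of rows.
while read -r lo hi; do
  build_import_cmd "$lo" "$hi"
done <<'EOF'
0 1000
1000 1500
1500 100000
EOF
```

Narrow ranges go where the keys are dense and wide ranges where they are sparse, so each job (and each mapper inside it) gets a comparable share of rows.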
New Contributor
Posts: 1
Registered: ‎02-02-2019

Re: Avoiding skew and determining optimal number of mappers in SQOOP import.

When I have had skew in the split-by column, I have used the following approach to break the migration up into multiple jobs. This approach allows you to avoid the skew, parallelize the ingest, and throttle the concurrent work. 

 

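# pseudo code...
# "..." stands for the remaining sqoop arguments (connection, target directory, mapper settings).
do_work(){
  sqoop import \
    ... \
    --query "SELECT * FROM myDb.myTable WHERE order_date = $1 AND \$CONDITIONS"
}

# Make the function visible to the bash processes that xargs starts.
export -f do_work

# One import job per date.
declare -a order_dates=(20190101 20190102 ... 20190131 20190201 ...)

# Run at most 3 imports at a time, passing each date to do_work as $1.
printf "%s\n" "${order_dates[@]}" | xargs --max-procs=3 -I {} bash -c 'do_work "$1"' _ {}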
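
The date list in the pseudo code is elided, so here is one possible way to generate it instead of typing it out. This is a minimal sketch that assumes GNU `date` (its `-d` option is not available in BSD/macOS `date`); the helper name `gen_dates` and the start date are made up for illustration.

```shell
# Print n consecutive dates as YYYYMMDD, beginning at start (YYYY-MM-DD).
# Uses UTC so that daylight-saving changes cannot shift a date.
gen_dates() {
  start="$1"
  n="$2"
  i=0
  while [ "$i" -lt "$n" ]; do
    date -u -d "$start +$i days" +%Y%m%d
    i=$((i + 1))
  done
}

# Example: the first three days of January 2019, one per line.
gen_dates 2019-01-01 3
```

Its output is already one date per line, so it could be piped straight into the xargs command in place of the printf over the array.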

 

I wrote a great blog post on how it works.

use xargs to handle split-by skew in sqoop 

 

Kevin
