Support Questions

Leo_BR · ‎02-10-2017

How to know the degree of parallelism available in my Hadoop Cluster? I'd like to understand the proper/good value for the number of mappers in job sqoop (--num-mappers argument). HDP version: 2.2.0

http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism

mqureshi · ‎02-10-2017

@Leonardo Araujo

Number of mappers is determined by the split size. Use --direct-split-size option to specify how much data one mapper will handle. Use split-by to specify on which column you want to do a split. Following is from Sqoop doc:

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column

View solution in original post

mqureshi · ‎02-10-2017

@Leonardo Araujo

Number of mappers is determined by the split size. Use --direct-split-size option to specify how much data one mapper will handle. Use split-by to specify on which column you want to do a split. Following is from Sqoop doc:

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column

Leo_BR · ‎02-10-2017

Thank you @mqureshi, but I'd like to know if there is some best pratice to calculate a good/proper number for the argument --num-mappers, for instance: 6, 8, 10, 30, 40 and etc ? How to know which of them is the more appropriate for my sqoop job. Thanks Leonardo

mqureshi · ‎02-10-2017

@Leonardo Araujo

Check this link:

https://wiki.apache.org/hadoop/HowManyMapsAndReduces

Target one map job per block. If the file you are reading has five blocks distributed on 3 nodes (or four or five nodes) on five disks, then you should have five mappers, one for each disk.