Created 02-10-2017 01:43 PM
How do I know the degree of parallelism available in my Hadoop cluster? I'd like to understand what a proper/good value is for the number of mappers in a Sqoop job (the --num-mappers argument). HDP version: 2.2.0
http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism
Created 02-10-2017 03:37 PM
The number of mappers is set with the --num-mappers argument; how the rows are divided among them is determined by the split column. Use --split-by to specify which column you want to split on (--direct-split-size applies only to direct-mode imports, where it splits the input stream every n bytes). The following is from the Sqoop doc:
When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by employee_id. Sqoop cannot currently split on multi-column indices.
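To make that concrete, here is a minimal sketch of an import using those options (the connect string, table, and column names are hypothetical placeholders, not from the thread):

sqoop import \
  --connect jdbc:mysql://dbhost/payroll \
  --username dbuser -P \
  --table employees \
  --split-by employee_id \
  --num-mappers 4

With --num-mappers 4 and --split-by employee_id, Sqoop queries the MIN and MAX of employee_id, divides that range into four roughly even slices, and runs one map task per slice, each issuing its own bounded SELECT like the ones quoted above.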
Created 02-10-2017 04:59 PM
Thank you @mqureshi, but I'd like to know if there is a best practice for calculating a good/proper number for the --num-mappers argument, for instance 6, 8, 10, 30, 40, etc. How do I know which of them is most appropriate for my Sqoop job? Thanks, Leonardo
Created 02-10-2017 05:06 PM
Check this link:
https://wiki.apache.org/hadoop/HowManyMapsAndReduces
Target one map task per block. If the file you are reading has five blocks distributed across three (or four, or five) nodes, then you should have five mappers, one for each block.
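As a rough sketch of how to apply that rule (the HDFS path and the Sqoop command below are illustrative assumptions, not from the thread), you can count a file's blocks with fsck and size the job accordingly:

# Report the files and blocks under a path (requires permission to run fsck)
hdfs fsck /user/leonardo/data/employees -files -blocks

# If fsck reports 5 blocks, 5 parallel map tasks is a reasonable starting point
sqoop import \
  --connect jdbc:mysql://dbhost/payroll \
  --username dbuser -P \
  --table employees \
  --split-by employee_id \
  --num-mappers 5

Note that for a Sqoop import there is no HDFS input file to count blocks on, so in practice the ceiling is how many map containers your cluster can run concurrently and, just as importantly, how many parallel connections the source database can tolerate. Sqoop's default of 4 mappers is a common starting point to tune up or down from.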