New Contributor
Posts: 1
Registered: ‎04-28-2018

Number of mappers in sqoop

I know Sqoop has an option to set the number of mappers (the default is 4). In real-world projects, who decides the number of mappers, and how? Do we use the default or some arbitrary number? I have found some theoretical links saying the number of mappers is determined by your hardware and other considerations, but they don't give me a practical way of deciding. Any help on how it's actually done in production would be greatly appreciated.

Posts: 777
Registered: ‎05-16-2016

Re: Number of mappers in sqoop

It's best to first understand the database system in terms of how many connections it can accept, because the degree of parallelism (that is, the number of mappers) depends on it. Also consider the number of cores and processors on the slave nodes, the data block size, the size of the data, and its schema structure before hitting production. It's always good to start with a sample load of data and a low number of mappers, note the time it takes to complete, and then gradually increase or adjust accordingly.
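To illustrate the "start low, measure, then increase" idea, here is a minimal Python sketch. It does not run Sqoop; it only shows the decision logic. The function names, the 10% improvement threshold, and the sample timings are all made up for illustration, so treat them as assumptions and plug in your own measurements.

```python
def cap_mappers(requested, db_connection_budget, free_cores):
    """Cap a requested mapper count by two limits mentioned above:
    the connections the database can spare and the cores available.
    Always returns at least 1 mapper. (Hypothetical helper.)"""
    return max(1, min(requested, db_connection_budget, free_cores))


def pick_mappers(timings, min_gain=0.10):
    """Choose a mapper count from sample-load timings.

    timings  -- dict of {mapper_count: seconds} measured on a sample load
    min_gain -- assumed threshold: keep increasing mappers only while each
                step is at least this fraction faster than the current best
    """
    counts = sorted(timings)
    best = counts[0]
    for n in counts[1:]:
        if timings[n] <= timings[best] * (1 - min_gain):
            best = n       # worthwhile speed-up, accept the higher count
        else:
            break          # gains have flattened out, stop increasing
    return best


# Made-up timings (seconds) from sample loads with 1, 2, 4, 8, 16 mappers.
sample_timings = {1: 400, 2: 210, 4: 120, 8: 115, 16: 118}
chosen = pick_mappers(sample_timings)

# Never exceed what the DBA allows or what the nodes can run.
final = cap_mappers(chosen, db_connection_budget=10, free_cores=8)
```

With these invented numbers the gain flattens after 4 mappers, so going to 8 or 16 would only add load on the database without a meaningful speed-up.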

Posts: 519
Topics: 14
Kudos: 92
Solutions: 45
Registered: ‎09-02-2016

Re: Number of mappers in sqoop



In fact, there is no standard answer to this question, as it is purely based on your business model, cluster size, Sqoop export/import frequency, data volume, hardware capacity, etc.


I can give a few points based on my experience; I hope they help.

1. 75% of the Sqoop scripts (the non-priority ones) will use the default number of mappers, for various reasons; for one, we don't want to use all the available resources for Sqoop alone.

2. Also, we don't want to apply every possible performance tuning method to those non-priority jobs, as that may disturb the RDBMS (source/target) too.

3. Get in touch with the RDBMS owner to find out their non-busy hours, identify the priority Sqoop scripts (based on your business model), and apply the performance tuning methods to the priority scripts based on data volume (not only the row count; hundreds of columns also matter). Repeat this if you have more than one database.

4. Regarding who is responsible: in most cases, if you have a small cluster used by very few teams, the developers and admins can work together, but if you have a very large cluster used by many teams, it is outside the admin's scope. Again, it depends.
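
To make points 1 to 3 a bit more concrete, here is a rough Python sketch that builds a `sqoop import` command line: non-priority jobs keep the default of 4 mappers, and only priority jobs get a tuned value. The function name, JDBC URL, table, and paths are hypothetical; `--connect`, `--table`, `--target-dir`, `--num-mappers`, and `--split-by` are standard Sqoop import arguments, but check them against your Sqoop version.

```python
DEFAULT_MAPPERS = 4  # Sqoop's default, as mentioned in the question


def sqoop_import_cmd(jdbc_url, table, target_dir,
                     priority=False, tuned_mappers=None, split_by=None):
    """Build the argument list for a `sqoop import` job (hypothetical helper).

    Non-priority jobs always use the default mapper count. Priority jobs
    use `tuned_mappers` if one was agreed with the RDBMS owner.
    """
    if priority and tuned_mappers:
        mappers = tuned_mappers
    else:
        mappers = DEFAULT_MAPPERS

    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]
    if split_by:
        # Column Sqoop uses to divide the rows among the mappers.
        cmd += ["--split-by", split_by]
    return cmd


# Non-priority job: stays on the default.
routine = sqoop_import_cmd("jdbc:mysql://dbhost/sales", "audit_log",
                           "/data/audit_log")

# Priority job: tuned value, scheduled in the RDBMS non-busy hours.
critical = sqoop_import_cmd("jdbc:mysql://dbhost/sales", "orders",
                            "/data/orders", priority=True,
                            tuned_mappers=8, split_by="order_id")
```

Each list can be handed to something like `subprocess.run(cmd)` from whatever scheduler you use; the scheduling itself is out of scope here.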