Created on 04-28-2018 09:35 PM - edited 09-16-2022 06:09 AM
I know Sqoop has an option to set the number of mappers (the default is 4). In real-world projects, who decides the number of mappers, and how is it decided? Do we use the default or some arbitrary number? I have seen some theoretical links that say the number of mappers is determined by your hardware and other considerations, but they don't give me a practical way of deciding. Any help on how it's actually done in production would be greatly appreciated.
Created 04-29-2018 11:29 AM
It's best to start by understanding how many connections the database system can accept, because the degree of parallelism (that is, the number of mappers) depends on it. Before going to production, also consider the number of cores and processors on the slave nodes, the data block size, the size of the data, and its schema structure. It's always good to start with a sample load of data using a low number of mappers, note the time it takes to complete, and then gradually increase or adjust the number accordingly.
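As a rough illustration of the approach above, here is a minimal Python sketch. It is not part of Sqoop; the function names, the inputs (`db_connection_budget`, `free_cores`, `data_size_mb`) and the doubling strategy are all assumptions chosen for illustration. It assumes each mapper holds one database connection and occupies one core, and it simply caps the mapper count by the tightest of those limits, then lists a few mapper counts to try from low to high.

```python
import math


def mapper_ceiling(db_connection_budget, free_cores, data_size_mb, block_size_mb=128):
    """Hypothetical upper bound on mappers.

    Assumes one DB connection and one core per mapper, and that there is
    little point in running more mappers than there are data blocks.
    """
    blocks = max(1, math.ceil(data_size_mb / block_size_mb))
    return max(1, min(db_connection_budget, free_cores, blocks))


def ramp_schedule(ceiling, start=1):
    """Mapper counts to try on a sample load: double from `start` up to the ceiling."""
    steps = []
    m = start
    while m < ceiling:
        steps.append(m)
        m *= 2
    steps.append(ceiling)
    return steps


if __name__ == "__main__":
    # Illustrative numbers only: the DBA allows 8 connections, 16 cores are free,
    # and the table is roughly 10 GB.
    ceiling = mapper_ceiling(db_connection_budget=8, free_cores=16, data_size_mb=10240)
    print(ceiling, ramp_schedule(ceiling))
```

You would run the sample load once per value in the schedule, time each run, and stop increasing when the run time stops improving or the database owner reports pressure.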
Created 04-30-2018 04:10 AM
In fact, there is no standard answer to this question, as it is purely based on your business model, cluster size, Sqoop export/import frequency, data volume, hardware capacity, etc.
I can give a few points based on my experience; I hope they help you.
1. About 75% of the Sqoop scripts (the non-priority ones) will use the default number of mappers, for various reasons; mainly, we don't want Sqoop alone to use all the available resources.
2. Also, we don't want to apply every possible performance tuning method to those non-priority jobs, as it may disturb the RDBMS (source/target) too.
3. Get in touch with the RDBMS owner to find out their non-busy hours, identify the priority Sqoop scripts (based on your business model), and apply the performance tuning methods to the priority scripts based on data volume (not only the number of rows; hundreds of columns also matter). Repeat this for each database if you have more than one.
4. Regarding who is responsible: in most cases, if you have a small cluster used by very few teams, the developers and the admin can work together, but if you have a very large cluster used by many teams, it is out of the admin's scope. Again, it depends.
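To make points 1 to 3 concrete, here is a small Python sketch of how a wrapper script might treat priority and non-priority jobs differently. The helper `build_sqoop_import`, its parameters, and the example values are hypothetical; only the Sqoop options themselves (`--connect`, `--table`, `--target-dir`, `--num-mappers`, `--split-by`) are real. Non-priority jobs omit `--num-mappers`, so Sqoop falls back to its default of 4.

```python
def build_sqoop_import(jdbc_url, table, target_dir,
                       priority=False, tuned_mappers=None, split_by=None):
    """Build a `sqoop import` argument list (hypothetical helper).

    Non-priority jobs get no --num-mappers, so the Sqoop default applies.
    Priority jobs use a mapper count agreed with the RDBMS owner.
    """
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir]
    if priority and tuned_mappers:
        args += ["--num-mappers", str(tuned_mappers)]
        if split_by:
            # Column used to divide the table's rows among the mappers.
            args += ["--split-by", split_by]
    return args


if __name__ == "__main__":
    # Placeholder connection string, table names, and paths for illustration.
    url = "jdbc:mysql://db.example.com/sales"
    print(" ".join(build_sqoop_import(url, "audit_log", "/data/audit_log")))
    print(" ".join(build_sqoop_import(url, "orders", "/data/orders",
                                      priority=True, tuned_mappers=8,
                                      split_by="order_id")))
```

The resulting list can be passed to `subprocess.run` from a scheduler that launches the priority jobs during the database's non-busy hours.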