Several extract jobs in Datameer (a rapid ETL/BI tool that sits on top of Hadoop) read data out of Salesforce objects. The largest extract is 1.4 GB (Task object) and the smallest is 96 MB (Account object). Datameer uses a REST-API-based connector: a SOQL query is supplied to the connector and records are fetched accordingly (https://documentation.datameer.com/documentation/display/DAS60/Salesforce).
Datameer compiles the job and hands execution over to the execution framework (Tez).
All of the Salesforce extract jobs run with 1 map task.
There are other extract jobs in Datameer that read data from flat files (50-200 MB) on an SFTP server, and those use between 3 and 5 map tasks.
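For file-based input, Hadoop's FileInputFormat decides the mapper count by carving each file into splits of roughly one block each, which is why a 50-200 MB flat file yields a few map tasks. A minimal sketch of that arithmetic (simplified; the real logic lives in FileInputFormat.getSplits, and the sizes below are illustrative):

```python
import math

def num_file_splits(file_size_bytes,
                    block_size_bytes=128 * 1024 * 1024,
                    min_split=1, max_split=None):
    """Approximate FileInputFormat's split count: split size is the
    block size clamped by the configured min/max split sizes, and the
    file gets one split (one mapper) per split-size chunk."""
    split_size = block_size_bytes
    if max_split is not None:
        split_size = min(split_size, max_split)
    split_size = max(split_size, min_split)
    return max(1, math.ceil(file_size_bytes / split_size))

# A 200 MB file with a 64 MB block size -> 4 map tasks.
print(num_file_splits(200 * 1024 * 1024, block_size_bytes=64 * 1024 * 1024))
```

Note that this only works because a file has a known byte length and can be read from an arbitrary offset; a REST connector has neither property.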
Given that the flat-file extracts run with multiple map tasks, could the issue be related to the SOQL batch size, which only pulls 2,000 records per request and hence results in the allocation of only 1 mapper?
How does an MR program determine the total size of the input when dealing with a source like Salesforce, or for that matter a database?
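For what it's worth, a database-style source has no byte-addressable file to split, so Hadoop's DBInputFormat takes a different route: it runs a COUNT(*) query first, divides the row count by the configured number of mappers, and gives each split a contiguous LIMIT/OFFSET range. A connector that cannot (or does not) obtain a total count up front has no basis for partitioning and falls back to a single split, which would match the single-mapper behavior seen here. A rough sketch of that split arithmetic (simplified from DBInputFormat.getSplits; not Datameer's actual connector code):

```python
def db_splits(total_rows, num_mappers):
    """Mimic DBInputFormat: partition a known row count into
    num_mappers contiguous (start, length) ranges, spreading any
    remainder across the first splits."""
    chunk, remainder = divmod(total_rows, num_mappers)
    splits, start = [], 0
    for i in range(num_mappers):
        length = chunk + (1 if i < remainder else 0)
        # Each range becomes roughly: SELECT ... LIMIT length OFFSET start
        splits.append((start, length))
        start += length
    return splits

# 10 rows over 3 mappers -> [(0, 4), (4, 3), (7, 3)]
print(db_splits(10, 3))
```

The key point is that the split count is driven by a cheap up-front size estimate (row count or byte length), not by the per-request batch size used while actually reading the data.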