Many times when I run a sqoop command, it ignores the number of mappers I tell it to use.
Common examples are:
Also, sqoop sometimes creates files that have no data in them.
Any ideas about these issues?
If we use more than one mapper, does that mean that many connections will be established on the source side?
Will the processing time change if we increase the number of mappers while using --direct mode in an export? Can you explain, please?
Can you try the --query option in sqoop?
sqoop import --driver <your driver name> --connect <CONNECTION STRING>/DATABASE=<DB NAME> --query "select * from <TABLE NAME> where \$CONDITIONS" --fields-terminated-by "," --hive-table <TABLE NAME> --split-by <SPLIT COLUMN> --target-dir '<SOME TMP DIRECTORY>' --hive-import -m <NUMBER OF MAPPERS>
The -m or --num-mappers option is just a hint to the engine about the degree of parallelism to maintain; it is not mandatory that this many tasks are always launched. The mapper count may vary based on your input data. The Sqoop client serializes the data, generates the deserializer, sets the InputFormat, and submits the job to be run. Maybe the InputFormat controls the number of mappers, as happens in normal text file processing. This also answers your second question: some of the launched mappers may not find the start() of the data in their split and will not be run.
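To illustrate how the input data could change the number of splits and leave some mappers with nothing to write, here is a minimal Python sketch. It assumes a simplified model in which the range between the minimum and maximum values of the --split-by column is divided evenly among the requested mappers. The function names split_ranges and rows_per_split are hypothetical, and this is not Sqoop's actual code.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer range [lo, hi] into contiguous ranges.

    Assumption: one range per mapper, but never more ranges than there
    are distinct integer values between lo and hi.
    """
    count = hi - lo + 1
    n = min(num_mappers, count)
    ranges = []
    start = lo
    for i in range(n):
        # Spread any remainder over the first few ranges.
        size = count // n + (1 if i < count % n else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def rows_per_split(values, ranges):
    """Count how many split-column values fall into each range."""
    return [sum(1 for v in values if a <= v <= b) for a, b in ranges]


# Skewed ids: four mappers requested, but two ranges contain no rows.
skewed_ids = [1, 2, 3, 4, 1000]
ranges = split_ranges(min(skewed_ids), max(skewed_ids), 4)
print(rows_per_split(skewed_ids, ranges))  # [4, 0, 0, 1]

# Narrow range: four mappers requested, only two ranges can be formed.
print(split_ranges(10, 11, 4))  # [(10, 10), (11, 11)]
```

Under this model, a skewed split column would explain empty output files, and a split column with a very narrow range would explain fewer tasks than requested.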
If we specify -m [1 or n], then it always launches the number of map tasks that we specified with the -m option.
If we don't specify anything (such as -m 1), then it launches 4 map tasks by default.