Whenever I run a sqoop job with more than one mapper, it always creates files that have no data in them. I understand that in the absence of a uniformly distributed field to split by, the data will be skewed, but that does not explain why there would be files with literally nothing in them...
Here's an example:
sqoop --options-file opt.txt --table table --hive-import --hive-overwrite --hive-database db --num-mappers 8 --split-by FISCPER
This produced files that look like this:
part-m-00000 Size: 102.2GB
part-m-00001 Size: 0.1kB
part-m-00002 Size: 0.1kB
part-m-00003 Size: 0.1kB
part-m-00004 Size: 0.1kB
part-m-00005 Size: 0.1kB
part-m-00006 Size: 0.1kB
part-m-00007 Size: 121.0GB
Each of the 0.1kB files is empty; all of the data is contained in the two large files.
The number of mappers you provide is a hint, not a guarantee. In your case, only two mappers actually do the work. The other six are not allocated and just generate empty files. See this: https://books.google.com/books?id=bxBnjitgIAYC&pg=PT34&lpg=PT34&dq=sqoop+number+of+mappers+hint&sour...
Search for "--num-mappers serves as a hint".
If this is a reasonable response, please vote for it and accept it as the best answer.
The result set indicates that 8 mappers ran. Each mapper produces a part file, so if only two had been allocated you would get only two files. This is more likely due to skew in the split field or to the split-building stage. Can you provide the logs for the sqoop job so we can identify the split points?
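To illustrate what I mean by skew in the split field, here is a rough sketch of how an evenly divided range can leave most mappers with nothing to read. Sqoop retrieves the low and high values of the --split-by column and gives each mapper an evenly sized slice of that range. The rest is assumption on my part: I am guessing that FISCPER holds values of the form YYYYPPP for two fiscal years, and split_ranges is a hypothetical helper that only models the idea, not Sqoop's actual splitter code.

```python
def split_ranges(lo, hi, num_splits):
    """Cut the inclusive range [lo, hi] into num_splits contiguous slices
    of near-equal width (simplified model, not Sqoop's real code)."""
    span = hi - lo + 1
    bounds = [lo + (i * span) // num_splits for i in range(num_splits + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_splits)]


# Assumed data: 12 periods in each of two fiscal years, encoded as YYYYPPP.
values = [year * 1000 + period
          for year in (2015, 2016)
          for period in range(1, 13)]

ranges = split_ranges(min(values), max(values), 8)

# Count how many distinct FISCPER values land in each mapper's slice.
counts = [sum(1 for v in values if start <= v <= end)
          for start, end in ranges]

for (start, end), n in zip(ranges, counts):
    print(f"{start}..{end}: {n} distinct values")
```

Under those assumptions only the first and last slices contain any values, because the numeric gap between 2015012 and 2016001 is far wider than the periods themselves. That would match one large file at each end and six empty files in between, but the job logs are still needed to confirm the real split points.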
I would agree with Simon that the split was uneven. Let's see the logs. Maybe the file had only two big rows :)
The -m or --num-mappers option is just a hint to the engine to maintain that degree of parallelism; it is not mandatory to launch that number of tasks every time. The mapper count may vary based on your input data. The Sqoop client serializes the data, generates the deserializer, sets the InputFormat, and submits the job to be run. Here the InputFormat controls the number of mappers, as it does in normal text file processing. This also answers your second question: some of the launched mappers may not find the start() of the data in their split and will not be run.