Sqoop Importing Files with no data in them

Contributor

Whenever I run a sqoop job with more than one mapper, it always creates files that have no data in them. I understand that in the absence of a uniformly distributed field to split by, the data will be skewed, but that does not explain why there would be files with literally nothing in them...

Here's an example:

sqoop --options-file opt.txt --table table --hive-import --hive-overwrite --hive-database db --num-mappers 8 --split-by FISCPER

This produced files that look like this:

part-m-00000 Size: 102.2GB

part-m-00001 Size: 0.1kB

part-m-00002 Size: 0.1kB

part-m-00003 Size: 0.1kB

part-m-00004 Size: 0.1kB

part-m-00005 Size: 0.1kB

part-m-00006 Size: 0.1kB

part-m-00007 Size: 121.0GB

Each of the 0.1kB files is empty...all of the data is contained in the two large files.

8 REPLIES

Re: Sqoop Importing Files with no data in them

@Josh Persinger

The number of mappers you provide is a hint and is not guaranteed. In your case, only two mappers actually did the work. The other six were not allocated and just generated empty files. See this: https://books.google.com/books?id=bxBnjitgIAYC&pg=PT34&lpg=PT34&dq=sqoop+number+of+mappers+hint&sour...

Search for "--num-mappers serves as a hint".

If this is a reasonable response, please vote for it and accept it as the best answer.

Re: Sqoop Importing Files with no data in them

Contributor

Why are only two mappers doing the work?

Re: Sqoop Importing Files with no data in them

It comes down to the resources available: eight mappers were proposed, but only two were allocated.

Re: Sqoop Importing Files with no data in them

Guru

The result set indicates that 8 mappers ran. Each mapper produces a part file, so if only two had been allocated you would get only two files. This is more likely to do with skew in the split field, or with the stage that builds the splits. Can you provide the logs for the sqoop job so we can identify the split points?
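
To make the skew point concrete, here is a rough Python sketch. It assumes the split column is numeric and that the MIN..MAX range of that column is cut into equal-width intervals, one per mapper. It is not Sqoop's actual splitter code; the function names and the sample values are made up for illustration and are not taken from the FISCPER column.

```python
def split_ranges(lo, hi, num_splits):
    """Cut the integer range [lo, hi] into num_splits half-open ranges
    [start, end) of near-equal width. A simplified stand-in for a
    range-based splitter, not Sqoop's implementation."""
    width = hi - lo + 1
    bounds = [lo + (width * i) // num_splits for i in range(num_splits + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_splits)]


def rows_per_split(row_counts, num_splits):
    """row_counts maps each distinct split-column value to its row count.
    Returns the ranges and the number of rows landing in each one."""
    ranges = split_ranges(min(row_counts), max(row_counts), num_splits)
    totals = [
        sum(n for value, n in row_counts.items() if start <= value < end)
        for start, end in ranges
    ]
    return ranges, totals


# Hypothetical data: 200 distinct values clustered at the two ends of the
# range, 1000 rows each. Nothing falls in the middle of the range.
row_counts = {v: 1000 for v in range(1000, 1100)}
row_counts.update({v: 1000 for v in range(9900, 10000)})

ranges, totals = rows_per_split(row_counts, 8)
for (start, end), total in zip(ranges, totals):
    print(f"[{start}, {end}): {total} rows")
```

With values clustered like this, only the first and last ranges receive rows, while the six ranges in between are empty even though all eight mappers run. Comparing the split points in the job log against the actual MIN and MAX of the split column would show whether the same thing is happening here.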

Re: Sqoop Importing Files with no data in them

I would agree with Simon that the split was uneven. Let's see the logs. Maybe the file had only two big rows :)

Re: Sqoop Importing Files with no data in them

Guru

What sort of data do you have in the column FISCPER? Is it a column with low-cardinality values?

Re: Sqoop Importing Files with no data in them

Contributor

@srai

It has 172 values.

Re: Sqoop Importing Files with no data in them

Contributor

The -m or --num-mappers option is just a hint to the engine to maintain that degree of parallelism; it is not mandatory to always launch that number of tasks. The mapper count may vary based on your input data. The Sqoop client serializes the data, generates the deserializer, sets the InputFormat, and submits the job to be run. Here, the InputFormat controls the number of mappers, just as it does in normal text file processing. This also answers your second question: some of the mappers launched may not find the start() of the data in their split and will not be run.
