How does MapReduce work in a Sqoop import?

In a Sqoop import, how does MapReduce work with key/value pairs when the source is RDBMS tables holding structured data?

Please explain.


Apache Sqoop is open source, so you can check out what it does underneath when curiosity strikes.

Consider an import scenario with text output format.

Here's the mapper used for this:
- Note that the K, V input types are LongWritable and 'SqoopRecord'

The data is supplied to a mapper by its InputFormat, or more specifically, its RecordReader. Sqoop reads from the database using JDBC, and this is implemented as a RecordReader by this class:

Effectively, for a given query boundary (boundaries are decided from the range of some key column and the number of mappers requested at submit time), a JDBC connection reads each record and passes it as the value into the map function, which then writes it out in the desired format. The key in the map task is just a local record counter and is wholly ignored as an input.
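To make the key/value flow concrete, here is a minimal Python sketch of the idea. It is not Sqoop's actual code: `record_reader` and `text_import_map` are hypothetical names, an in-memory list stands in for the JDBC result set, and the comma delimiter is just an assumption for the text output format.

```python
from typing import Iterator, List, Tuple

Row = Tuple[object, ...]

def record_reader(rows: List[Row], lo: int, hi: int,
                  split_col: int = 0) -> Iterator[Tuple[int, Row]]:
    """Hypothetical stand-in for a JDBC-backed RecordReader.

    Yields (key, value) pairs for rows whose split column falls in [lo, hi).
    The key is only a local record counter; the value is the whole record.
    """
    counter = 0
    for row in rows:
        if lo <= row[split_col] < hi:
            yield counter, row
            counter += 1

def text_import_map(key: int, value: Row) -> str:
    """Map step for a text import: the key is ignored, and the record
    is written out as one delimited line."""
    return ",".join(str(field) for field in value)

# Toy table: (id, name). One "mapper" handles ids in [1, 3).
table = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
lines = [text_import_map(k, v) for k, v in record_reader(table, 1, 3)]
print(lines)
```

Each mapper would run the same loop over its own boundary, so the counter restarts at zero in every task, which is why it carries no meaning as an input key.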


Can you point to the code in Cloudera's Sqoop that determines the split range for each mapper from the split-by column?

In an RDBMS the block size is 8 KB, and in Hadoop the block size is 64 MB. In my Sqoop import example the RDBMS table size is 300 MB. So will it be split across 5 mappers? Please confirm.


I think the default block size is 128 MB. In any case, block size is not the factor that determines the number of mappers for Sqoop.


The number of mappers depends on the --num-mappers parameter you specify in the sqoop import (the default is 4). Sqoop splits on the table's primary key by default; if there is no primary key, or you want a different column, specify --split-by <column-name>. Sqoop finds the min and max values of that column and divides the range by --num-mappers. It is best to use the primary key as the split-by column, or any column with high cardinality and evenly distributed values, to ensure your mappers are balanced.
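As a rough illustration of that min/max division, here is a simplified Python sketch for an integer split column. It is an assumption-laden approximation, not Sqoop's real splitter: `split_ranges` and `where_clauses` are hypothetical helpers, and the exact boundary arithmetic in Sqoop may differ.

```python
from typing import List, Tuple

def split_ranges(min_val: int, max_val: int,
                 num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_val, max_val] into contiguous
    inclusive sub-ranges, one per mapper (fewer if the range is tiny)."""
    if num_mappers < 1 or max_val < min_val:
        raise ValueError("need num_mappers >= 1 and max_val >= min_val")
    total = max_val - min_val + 1
    n = min(num_mappers, total)
    base, extra = divmod(total, n)  # the first `extra` ranges get one more value
    ranges = []
    lo = min_val
    for i in range(n):
        hi = lo + base + (1 if i < extra else 0) - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def where_clauses(column: str, ranges: List[Tuple[int, int]]) -> List[str]:
    """Render each range as the kind of boundary condition a mapper's query might use."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]

# Assumed example: ids run from 1 to 1000 and 4 mappers are requested.
for clause in where_clauses("id", split_ranges(1, 1000, 4)):
    print(clause)
```

Note that table size never appears in the calculation: a 300 MB table with 4 mappers still gets 4 ranges. The ranges are equal in key width, not in row count, so a split column with gaps or skewed values leaves some mappers with far more rows than others.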