New Contributor
Posts: 4
Registered: 08-04-2017

How does MapReduce work in a Sqoop import?


In a Sqoop import, how does MapReduce work with key/value pairs when importing RDBMS tables with structured data?

Please explain.

Posts: 1,754
Kudos: 371
Solutions: 279
Registered: 07-31-2013

Re: How does MapReduce work in a Sqoop import?

Apache Sqoop is open source, so you can check out what it does underneath when curiosity strikes.

Consider an import scenario with text output format.

Here's the mapper used for this: https://github.com/cloudera/sqoop/blob/cdh5.15.0-release/src/java/org/apache/sqoop/mapreduce/TextImp...
- Note the K, V input types are LongWritable and 'SqoopRecord'
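For a rough picture of what such a mapper looks like, here is a minimal sketch against the plain Hadoop Mapper API. It is not Sqoop's actual TextImportMapper; Text stands in below for the generated SqoopRecord class.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch of a text-import style mapper. The real TextImportMapper
// takes (LongWritable, SqoopRecord) as its input types; Text stands in here for
// the generated SqoopRecord class.
public class TextImportLikeMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text record, Context context)
      throws IOException, InterruptedException {
    // 'key' is just the record counter supplied by the RecordReader; it is ignored.
    // The record itself is emitted as one delimited line of text output.
    context.write(record, NullWritable.get());
  }
}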

The data is supplied to a mapper by its InputFormat, or more specifically, its RecordReader. Sqoop reads from the DB using JDBC, and it's implemented as a RecordReader by this class: https://github.com/cloudera/sqoop/blob/cdh5.15.0-release/src/java/org/apache/sqoop/mapreduce/db/DBRe...

Effectively, for a given query boundary (boundaries are decided based on some key's range and the number of mappers requested at submit time), a JDBC connection reads each record and passes it as a value into the map function, which then writes it out in the desired format. The key in the map task is just a local record counter and is wholly ignored as an input.
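To make that loop concrete, here is a stripped-down sketch using plain JDBC. It is not Sqoop's actual DBRecordReader/DataDrivenDBRecordReader, and the JDBC URL, table, column, and boundary values are all made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative only: the effective read loop of one map task's split.
// Sqoop's real RecordReader does this through the Hadoop RecordReader API
// and a generated SqoopRecord class.
public class SplitReadLoopSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical split: this mapper was assigned the id range [1, 1000000).
    String boundedQuery =
        "SELECT id, name FROM employees WHERE id >= 1 AND id < 1000000";

    try (Connection conn =
             DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(boundedQuery)) {

      long recordCounter = 0;          // plays the role of the LongWritable key
      while (rs.next()) {
        recordCounter++;
        String row = rs.getLong("id") + "," + rs.getString("name");
        // In a real job this is where map(key, value) would be invoked; the key
        // (recordCounter) is ignored and the row is written out as text.
        System.out.println(row);
      }
    }
  }
}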
Explorer
Posts: 6
Registered: 07-06-2018

Re: How does MapReduce work in a Sqoop import?

Can you point to any code in the Cloudera Sqoop repo that shows how the split range for each mapper is determined from the split-by column?

New Contributor
Posts: 4
Registered: 08-04-2017

Re: How does MapReduce work in a Sqoop import?

In an RDBMS the block size is 8 KB, and in Hadoop the block size is 64 MB. In my Sqoop import example the RDBMS table size is 300 MB, so will it be split across 5 mappers? Please confirm.

Explorer
Posts: 6
Registered: 07-06-2018

Re: How does MapReduce work in a Sqoop import?

I think the default HDFS block size is 128 MB. But in any case, block size is not the factor that determines the number of mappers for Sqoop.

The number of mappers depends on the --num-mappers parameter you specify in the sqoop import, and you also need to mention --split-by <column-name>. Based on the column you provide, Sqoop finds its min and max values and divides that range by --num-mappers. It is best to use the primary key as the split-by column, or any other column with high cardinality, to ensure your mappers are balanced.
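As a rough illustration of that arithmetic, here is a minimal sketch of how a [min, max] range on a numeric split-by column could be carved into per-mapper WHERE clauses. This is not Sqoop's actual splitter code, and the table, column, and min/max values are assumed.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: dividing a [min, max] range on a numeric --split-by column
// into roughly equal per-mapper WHERE clauses.
public class SplitBySketch {
  public static void main(String[] args) {
    long min = 1;            // e.g. result of SELECT MIN(id) FROM employees (assumed)
    long max = 1_000_000;    // e.g. result of SELECT MAX(id) FROM employees (assumed)
    int numMappers = 4;      // --num-mappers 4

    long step = (max - min) / numMappers;   // size of each mapper's slice
    List<String> conditions = new ArrayList<>();

    long lo = min;
    for (int i = 0; i < numMappers; i++) {
      // The last split absorbs any remainder and is inclusive of max.
      long hi = (i == numMappers - 1) ? max : lo + step;
      String upperOp = (i == numMappers - 1) ? "<=" : "<";
      conditions.add("id >= " + lo + " AND id " + upperOp + " " + hi);
      lo = hi;
    }

    // Each condition becomes one mapper's bounded query.
    conditions.forEach(c ->
        System.out.println("SELECT * FROM employees WHERE " + c));
  }
}

If the split-by column is skewed, some slices will contain far more rows than others, which is why a roughly uniform, high-cardinality column such as the primary key is recommended.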
