In sqoop import how mapreduce works

In a Sqoop import, how does MapReduce work with key/value pairs when importing structured data from RDBMS tables?

Please explain.

4 REPLIES

Re: In sqoop import how mapreduce works

Master Guru
Apache Sqoop is open source, so you can check out what it does underneath when curiosity strikes.

Consider an import scenario with text output format.

Here's the mapper used for this: https://github.com/cloudera/sqoop/blob/cdh5.15.0-release/src/java/org/apache/sqoop/mapreduce/TextImp...
- Note that the key and value input types are LongWritable and SqoopRecord

The data is supplied to a mapper by its InputFormat, or more specifically, its RecordReader. Sqoop reads from the DB using JDBC, and this is implemented as a RecordReader by this class: https://github.com/cloudera/sqoop/blob/cdh5.15.0-release/src/java/org/apache/sqoop/mapreduce/db/DBRe...

Effectively, for a given query boundary (boundaries are decided based on some key's range and the number of mappers requested at submit time), a JDBC connection reads each record and passes the records as values into the map function, which then writes them out in the desired format. The key in the map task is just a local record counter and is wholly ignored as an input.
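To make the boundary idea concrete, here is a minimal sketch (not Sqoop's actual code; the function name and clause format are made up) of how an evenly spaced integer split, in the spirit of Sqoop's IntegerSplitter, can turn a key range into one bounded WHERE clause per mapper:

```python
def make_splits(col, lo, hi, num_mappers):
    """Divide the closed range [lo, hi] into num_mappers contiguous
    bounded WHERE clauses, one per mapper."""
    splits = []
    step = (hi - lo + 1) / num_mappers
    start = lo
    for i in range(num_mappers):
        # The last split is pinned to the true upper bound so no rows are missed.
        end = hi if i == num_mappers - 1 else int(lo + step * (i + 1)) - 1
        splits.append(f"{col} >= {start} AND {col} <= {end}")
        start = end + 1
    return splits

for clause in make_splits("id", 1, 1000, 4):
    print(clause)
# id >= 1 AND id <= 250
# id >= 251 AND id <= 500
# id >= 501 AND id <= 750
# id >= 751 AND id <= 1000
```

Each clause becomes the predicate of one mapper's query, so the four mappers read disjoint slices of the table in parallel.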

Re: In sqoop import how mapreduce works

Explorer

Can you point to any code in Cloudera's Sqoop that shows how the split range for each mapper is determined using the split-by column?


Re: In sqoop import how mapreduce works

In an RDBMS the database block size is 8 KB, while in Hadoop the block size is 64 MB. In my Sqoop import example, the RDBMS table size is 300 MB. So will it be split across 5 mappers? Please confirm.

Re: In sqoop import how mapreduce works

Explorer

I think the default HDFS block size is 128 MB. But in any case, block size is not the factor that determines the number of mappers for Sqoop.

 

The number of mappers depends on the --num-mappers parameter you specify in the Sqoop import, and you also need to specify --split-by <column-name>. Based on the column you provide, Sqoop will find the min and max values and divide that range by --num-mappers. It is best to use the primary key as the split-by column, or any column with high cardinality, to ensure your mappers are balanced.
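As a concrete illustration of those two flags (the connection string, database, table, and column names below are all hypothetical):

```shell
# Illustrative only: dbhost, corp, employees, and id are made-up names.
# Sqoop first runs something like: SELECT MIN(id), MAX(id) FROM employees
# then divides that range into 4 bounded queries, one per mapper,
# regardless of the table's size or the HDFS block size.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/corp \
  --username dbuser -P \
  --table employees \
  --split-by id \
  --num-mappers 4 \
  --target-dir /user/hive/imports/employees
```

So a 300 MB table imported with --num-mappers 4 still produces 4 map tasks; the block size only affects how the resulting files are stored in HDFS.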
