
How Sqoop internally works

Expert Contributor

I have the following questions:

1. Does Sqoop create SQL queries internally? If yes, how are they created and executed for multiple mappers?

2. Does Sqoop use a staging node to load the data, or does it load the data directly onto the data nodes? How does this behave for different mappers?

3. How does Sqoop run in parallel with multiple mappers?

Please explain with a simple architecture.

1 ACCEPTED SOLUTION


Most of the answers you are looking for are explained in http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism, but here are my 1-2-3 answers to your questions.

  1. Absolutely, Sqoop builds a SQL query (actually one for each mapper) against the source table it is ingesting from into HDFS. The mappers (four by default, but you can override that) leverage the split-by column: Sqoop tries to build an intelligent set of WHERE clauses so that each mapper gets a logical "slice" of the source table. As an example, if we used three mappers and a split-by column that is an integer whose actual data ranges from 0 to 1,000,000 (i.e. Sqoop can do a pretty easy MIN and MAX call to the DB on the split-by column), then Sqoop's first mapper would try to get values 0-333333, the second would pull 333334-666666, and the last would grab 666667-1000000 (see the sketch after this list).
  2. Nope, Sqoop runs a map-only job, with each mapper (3 in my example above) running a query over a specific range to prevent any kind of overlap. Each mapper then just drops its data into the target-dir HDFS directory in a file named part-m-00000 (well, the 2nd one ends with 00001 and the 3rd with 00002). The composite import is represented by the target-dir HDFS directory as a whole (it basically follows the MapReduce file-naming scheme).
  3. I'm hoping your question about parallelism makes sense now.
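To make that concrete, here is a minimal sketch of the command and of the queries Sqoop would roughly generate for the three-mapper example above (the connection string, table name, and id column are made-up placeholders, not anything from your environment):

  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl_user -P \
    --table orders \
    --split-by id \
    --num-mappers 3 \
    --target-dir /data/orders

  -- Sqoop first asks the database for the boundaries of the split-by column:
  SELECT MIN(id), MAX(id) FROM orders;                        -- say 0 and 1000000

  -- then each mapper runs (roughly) its own non-overlapping slice:
  SELECT * FROM orders WHERE id >= 0      AND id < 333334;    -- mapper 1 -> part-m-00000
  SELECT * FROM orders WHERE id >= 333334 AND id < 666667;    -- mapper 2 -> part-m-00001
  SELECT * FROM orders WHERE id >= 666667 AND id <= 1000000;  -- mapper 3 -> part-m-00002

Together, the three part-m-* files under /data/orders are the imported table.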

I'm hopeful this helps out some. As with everything, some simple testing on your own will help it all make sense. As for an architectural diagram, check out the image (and additional details) at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/using_sqoop_to_mo... which might aid in your understanding. Happy Hadooping!!


6 REPLIES


New Contributor

Very good explanation.

New Contributor

Hi,

Commands entered through the Sqoop command line are associated with map tasks that retrieve data from the external database; those map tasks then place the retrieved data into HDFS/HBase/Hive.

If you have any doubts, click here: https://tekslate.com

@Lester Martin

What algorithms are used in Sqoop while importing data?

Thanks in Advance!

Dada Karade


https://stackoverflow.com/questions/45100487/how-data-is-split-into-part-files-in-sqoop can start to explain more, but ultimately (and thanks to the power of open-source) you'll have to go look for yourself - you can find source code at https://github.com/apache/sqoop. Good luck and happy Hadooping!
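For the default split on an integer column, the core idea is an even-range split between the column's MIN and MAX. Here is a rough sketch of the arithmetic (illustrative only; the actual logic lives in the splitter classes in that repo, e.g. IntegerSplitter, TextSplitter, and DateSplitter under org.apache.sqoop.mapreduce.db, and the table/column names below are made up):

  -- boundary query for --split-by id (same idea as in the accepted answer):
  SELECT MIN(id), MAX(id) FROM orders;   -- say min = 0, max = 1000000

  -- with 4 mappers the range is carved into roughly equal intervals:
  --   interval = (max - min) / 4 = 250000
  --   split 0:  id >= 0       AND id < 250000
  --   split 1:  id >= 250000  AND id < 500000
  --   split 2:  id >= 500000  AND id < 750000
  --   split 3:  id >= 750000  AND id <= 1000000
  -- each split is handed to one mapper and ends up as one part-m-nnnnn file

If the split-by values are not evenly distributed, the splits (and the part files) will be correspondingly skewed, which is why choosing a good --split-by column matters.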