Support Questions

Can Spark SQL replace Sqoop for Data Ingestion?

Contributor
 
2 Replies

RahulSoni

@Sri Kumaran Thiruppathy

I don't think so!

Sqoop and Spark SQL both use JDBC connectivity to fetch data from RDBMS engines, but Sqoop has an edge here since it is purpose-built to migrate data between an RDBMS and HDFS. Every option available in Sqoop has been fine-tuned to get the best performance during data ingestion. Take, for example, the -m option, which controls the number of mappers; this is how you fetch data from the RDBMS in parallel. Can you do the same in Spark SQL? Of course, but the developer would need to take care of the "multithreading" that Sqoop handles automatically.
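
For illustration, here is a minimal sketch of such an import. Everything in it (the MySQL host db-host, the sales database, the orders table, the order_id split column, the etl_user account, and the target directory) is a made-up placeholder, not something from this thread:

```
# Hypothetical Sqoop import: pull the "orders" table with 8 parallel mappers.
# -m 8 sets the mapper count; --split-by tells Sqoop which column to
# range-partition so that each mapper fetches a disjoint slice of rows.
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --split-by order_id \
  -m 8
```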

And the list goes on!

Hope that helps!

New Contributor

@RahulSoni I think you're a bit quick to dismiss Spark + JDBC. There is actually a solution for the "multithreading": Spark will extract the data into different partitions in parallel, just like when you read an HDFS file. You just have to specify the number of partitions for the extracted dataframe and tune the parameters of your job (number of executors, cores per executor, memory per executor).
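
As a rough sketch of what that looks like in PySpark (the host, database, table, credentials, and bound values below are invented placeholders, and the appropriate JDBC driver is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

# Read a hypothetical "orders" table in 8 parallel partitions.
# Spark issues one query per partition, splitting the range of
# partitionColumn between lowerBound and upperBound.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://db-host:3306/sales")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "secret")
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load())
```

The number of partitions, executors, and cores per executor then determines how many of those per-partition queries actually run concurrently.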

 

While Sqoop is easier to use out of the box, the fact that it is based on MapReduce likely means Spark will be superior in some scenarios, and it should be your go-to option when you want to save the data as Parquet or ORC (not supported by Sqoop). I haven't tried it yet, but here's someone who seems quite pleased with how Spark worked for them: https://www.all-things-cloud.com/2018/12/spark-jdbc-instead-of-sqoop.html.
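
Continuing the same rough sketch from above (the output paths are again made-up), writing the extracted dataframe out in a columnar format is a one-liner:

```python
# Persist the dataframe pulled over JDBC as Parquet (or ORC).
# The output paths below are hypothetical examples.
orders.write.mode("overwrite").parquet("/data/warehouse/orders_parquet")
orders.write.mode("overwrite").orc("/data/warehouse/orders_orc")
```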

 

Sqoop is supposed to support Avro, but when I tried to output Avro files it failed with a low-level error in a Java library. I wasn't too impressed by the performance either, although that could be due to bandwidth problems.