About ManneFagerlind

ManneFagerlind · ‎12-03-2019

@RahulSoni I think you're a bit quick to dismiss Spark + JDBC. There is actually a solution for the multithreading: Spark will extract the data into different partitions in parallel, just like when you read an HDFS file. You just have to specify the number of partitions in the extracted DataFrame and tune the parameters for your job (number of executors, cores per executor, memory per executor). While Sqoop is easier to use out of the box, the fact that it is based on MapReduce will likely mean that Spark is superior in some scenarios, and it should be your go-to option when you want to save the data as Parquet or ORC (not supported by Sqoop). I haven't tried it yet, but here's someone who seems quite pleased with how Spark worked for them: https://www.all-things-cloud.com/2018/12/spark-jdbc-instead-of-sqoop.html. Sqoop is supposed to support Avro, but when I tried to output Avro files it failed with a low-level error in a Java library. I wasn't too impressed by the performance either, although that could be due to bandwidth problems.
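To make the partitioned extraction concrete, here is a minimal PySpark sketch of a parallel JDBC read written out as Parquet. It is an illustration under assumptions, not a tested ingestion job: the connection URL, table name, partition column, bounds, and output path are all hypothetical, and the helper names `jdbc_partition_options` and `ingest_to_parquet` are made up for this example. The option keys themselves (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`, `fetchsize`) are standard Spark JDBC data source options.

```python
def jdbc_partition_options(url, table, column, lower, upper,
                           num_partitions, fetch_size=10_000):
    """Build the option map for a partitioned Spark JDBC read.

    Spark splits the range [lower, upper] of `column` into
    `num_partitions` strides and issues one query per partition,
    so the extraction runs in parallel across executors.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower bound must be smaller than upper bound")
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,      # numeric, date, or timestamp column
        "lowerBound": str(lower),       # bounds set the stride, they do not filter rows
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),   # rows fetched per round trip
    }


def ingest_to_parquet(spark, options, output_path):
    """Read the table over JDBC in parallel and save it as Parquet."""
    df = spark.read.format("jdbc").options(**options).load()
    df.write.mode("overwrite").parquet(output_path)


# Hypothetical source database and table, for illustration only.
opts = jdbc_partition_options(
    url="jdbc:postgresql://db.example.com:5432/sales",
    table="public.orders",
    column="order_id",
    lower=1,
    upper=1_000_000,
    num_partitions=8,
)
```

With a running `SparkSession` named `spark`, the call would be `ingest_to_parquet(spark, opts, "/data/orders_parquet")`; credentials would normally be added to the map as the `user` and `password` options, and the JDBC driver jar has to be on the classpath. The executor settings mentioned above correspond to the `spark-submit` flags `--num-executors`, `--executor-cores`, and `--executor-memory`. Swapping `.parquet(...)` for `.orc(...)` writes ORC instead.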

Last Visited	‎12-04-2019 01:22 AM

Member Since	‎12-03-2019 11:43 PM
Posts	1

Cloudera Community

Re: Can Spark SQL replaces Sqoop for Data Ingestio...