Member since
12-03-2019
1
Post
0
Kudos Received
0
Solutions
12-03-2019
11:41 PM
@RahulSoni I think you're a bit quick to dismiss Spark + JDBC. There is actually a solution for the multithreading - Spark will extract the data to different partitions in parallel, just like when your read an HDFS file. You just have to specify the number of partitions in the extracted dataframe and optimize the parameters for your job (number of executors, cores per executor, memory per executor). While sqoop is easier to use out of the box, the fact that it is based on MapReduce will likely mean that Spark is superior in some scenarios, and it should be your go-to option when you want to save the data as Parquet or ORC (not supported by sqoop). I haven't tried yet, but here's someone who seems quite pleased with how Spark worked for them: https://www.all-things-cloud.com/2018/12/spark-jdbc-instead-of-sqoop.html. Sqoop is supposed to support Avro but when I tried to output Avro files it failed with a low-level error in a Java library. I wasn't too impressed by the performance either, although that could be due to bandwidth problems.
... View more