What is the suggested method to extract data from SQL Server to HDFS? Is there any performance advantage in using Spark JDBC over Sqoop import?
Also, per apache.org, the Apache Sqoop project was retired as of June 2021. Is it better to go with the Spark approach for new ETL pipeline development?
Other community members may weigh in with their opinions, but I believe the answer to the first and last questions is, of course, "it depends on the job".
The most suitable use cases for Sqoop center on bulk structured data transfer between RDBMSs and HDFS. Sqoop takes the commands you provide at the CLI and internally generates MapReduce tasks to execute your desired data movement, with HDFS as either the source or destination.
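To make that concrete, a typical Sqoop invocation for this kind of transfer might look like the sketch below. The connection string, credentials, table, and directory names are illustrative placeholders, not values from the original question:

```shell
# Bulk-import a SQL Server table into HDFS, splitting the work
# across 4 parallel MapReduce mappers keyed on the order_id column.
sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=sales" \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/raw/orders
```

Sqoop translates this one command into a MapReduce job; there is no place in it to express any real transformation logic, which is the key limitation discussed below.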
While you can do some of the same types of (simple) things with either Spark or Sqoop, the two are not interchangeable tools. You can do a lot more with Spark because it gives you a full-blown programming language (Scala) along with a set of libraries that form a fairly complete distributed processing framework. The "T" part of ETL is going to be a lot easier to tackle in Spark than in Sqoop, and you will probably encounter tasks that are nearly impossible to complete with Sqoop yet fairly straightforward to address in Spark code, assuming you have the requisite software development background.
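As a sketch of what the Spark approach looks like (connection details, column names, and paths are illustrative assumptions, and you would need the Microsoft JDBC driver on the classpath):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("sqlserver-to-hdfs")
  .getOrCreate()

// Read the source table over JDBC. The partitioning options let Spark
// issue range-bounded queries in parallel, much like Sqoop's mappers.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
  .option("dbtable", "dbo.orders")
  .option("user", "etl_user")
  .option("password", sys.env("DB_PASSWORD"))
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "4")
  .load()

// The "T" step that Sqoop cannot express: arbitrary DataFrame
// transformations before the data lands in HDFS.
val cleaned = orders
  .filter(col("status") =!= "CANCELLED")
  .withColumn("ingest_date", current_date())

// Write to HDFS in a columnar format rather than delimited text.
cleaned.write.mode("overwrite").parquet("hdfs:///data/raw/orders")
```

Note the `partitionColumn`/`lowerBound`/`upperBound`/`numPartitions` options: without them, Spark reads the whole table through a single JDBC connection, which negates most of the parallelism argument.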
While I have not done any performance comparisons between a batch job in Sqoop and an equivalent job written in Spark (and I haven't read anybody else's work on that topic), a bit of logical deduction from first principles would lead me to expect a considerable performance advantage for Spark over Sqoop import, given a sufficiently large data set: Spark can leverage in-memory processing, which should, in theory, outperform MapReduce.
Yes, unfortunately the Sqoop PMC voted in June to retire Sqoop and move the responsibility for its oversight to the Attic. That does not mean that Apache Sqoop as a tool has lost all value. Cloudera still ships it as part of Cloudera Runtime, still fully supports Sqoop and responds to new feature requests coming from customers, and there's no plan to change this. The change in status at Apache could mean that the software has reached maturity "as is" and still has its uses. But end-user development of a complete new ETL pipeline is probably not one of them.
I am following up here on one part of the original question, regarding the Apache Sqoop project being retired, just for the record (which is to say, for the benefit of people who might arrive at this thread via a search engine in the near-to-medium-term future).
I still stand behind what I previously wrote about the relative strengths and weaknesses of using these two tools for extracting data from SQL Server and ingesting it into HDFS, but I do want to clarify that while Sqoop was moved to the Apache Attic in June 2021, the software will continue to be supported by Cloudera and shipped as part of CDP Public Cloud and CDP Private Cloud.
See Cloudera's statement of support on this matter here: