I am considering Sqoop to import/export data between an RDBMS and HDFS. I found the following issues with Sqoop:

- It still uses MapReduce as its execution engine, which is slowly dying.
- Creating the right number of mappers to speed up execution is a tiresome process. Finding a column that is evenly distributed is not easy when you don't have a primary key (e.g. on Netezza) or when the key is a combination of two columns.
Sqoop is still the best tool around when it comes to fetching data from an RDBMS into HDFS. Addressing the issues you listed one by one:
1. Sqoop uses MapReduce, which is slow.
First, Sqoop spawns a map-only job; the sole operation each mapper performs is connecting to your RDBMS over a JDBC connector and fetching data. Second, the use cases Sqoop is designed for are batch oriented, so it is still handsomely efficient at what it does. It is also a tried and tested tool that has matured over time.
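A minimal import makes this concrete (connection string, credentials, table name and paths below are all made-up placeholders):

```shell
# Hypothetical connection details -- adjust to your environment.
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /user/etl/orders
# This launches a map-only job (4 mappers by default), each mapper
# fetching its slice of the table over JDBC and writing it to HDFS;
# no reduce phase runs at all.
```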
2. Setting up the number of mappers is a one-time effort. Yes, you may have to put in some effort to find the best split column for your data-fetch operations, but that effort pays off over time with efficient parallel transfers over the JDBC connection 🙂
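For a table without a usable primary key (the Netezza case you mention), that one-time effort mostly amounts to picking the split column explicitly. A sketch, with made-up table and column names:

```shell
# txn_date is a hypothetical column chosen because its values are spread
# evenly; --num-mappers 8 opens 8 parallel JDBC connections, one per mapper.
sqoop import \
  --connect jdbc:netezza://dw.example.com/analytics \
  --username etl_user -P \
  --table transactions \
  --split-by txn_date \
  --num-mappers 8
```

If no single column splits evenly, Sqoop's `--boundary-query` lets you supply your own min/max query for the split range.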
3. Stick to Sqoop 1. No major distribution has adopted Sqoop 2 yet.
4. Even Spark transfers data from an RDBMS using a JDBC connector, which is essentially what Sqoop has been doing all these years. And in Spark you would need to distribute the data load manually (if you wanted to at all), whereas in Sqoop you achieve the same thing by simply passing -m/--num-mappers.
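To see the difference (hypothetical values again): the four JDBC options you would wire up by hand in Spark collapse to a single flag in Sqoop:

```shell
# Spark: spark.read.jdbc(...) needs partitionColumn, lowerBound,
# upperBound and numPartitions supplied manually to parallelise the read.
# Sqoop: the equivalent is one flag.
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/sales \
  --username etl_user -P \
  --table orders \
  -m 8
```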