Can Spark be a good solution to load CSV/XML data from the filesystem (local FS, not HDFS) and insert/update data in MySQL?
As per my knowledge of Spark, it's mostly used in the opposite scenario, where data is pulled from MySQL to perform analysis.
If not, what other technology/solution would perform better for this use case?
Thanks in advance.
A Hive table with the ACID property enabled will work well for your use case. With the help of the MERGE statement you will be able to perform these inserts/updates easily. In typical data warehousing, SCD Type 2 tables are implemented easily with the help of Hive's MERGE.
As for Spark: it lets you do complex computation with much ease, but when it comes to updating existing records it is better to go with Hive than Spark. Hope it helps!!
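To illustrate the MERGE-based upsert mentioned above, here is a minimal sketch; the table and column names are hypothetical, and the target must be a transactional (ACID) Hive table:

```sql
-- Sketch only: customer / staging_customer and their columns are made-up names.
-- Target table must be transactional (ACID), e.g. stored as ORC with
-- TBLPROPERTIES ('transactional'='true').
MERGE INTO customer AS t
USING staging_customer AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.name, s.city);
```

Rows from the staging table update matching target rows and insert the rest, which is the same upsert semantics you are after.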
Thanks @Bala Vignesh N V for the prompt response.
Actually, the MySQL database won't be replaced as of now.
Hence, we basically need an integrator that would pull files from FTP and push (UPSERT) data into MySQL tables after a few transformations.
Understanding the problem statement here: you just have small files on the local file system and do not want to use any HDFS layer. Spark can do it, but may be overkill for this scenario.
You can start with basic Java JDBC code to get this task accomplished. However, you have Apache NiFi within the HDF stack, which can easily solve this problem for you.
You can simply design a dataflow that monitors your SFTP location for new files, pulls them into NiFi as soon as they land, optionally parses them/applies some transformations, and performs the MySQL operation.
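For the final UPSERT step, MySQL's usual idiom is `INSERT ... ON DUPLICATE KEY UPDATE`; a NiFi PutSQL processor (or plain JDBC) could execute something like the sketch below. The table and column names are hypothetical:

```sql
-- Sketch only: customer and its columns are made-up names.
-- Requires a PRIMARY KEY or UNIQUE index on customer_id for the
-- duplicate-key path to fire; the ? placeholders are bound per record.
INSERT INTO customer (customer_id, name, city)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE
  name = VALUES(name),
  city = VALUES(city);
```

New keys are inserted, existing keys are updated in place, so the flow stays idempotent if a file is re-processed.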
You can use this as a reference.
Additionally, NiFi can also be used as a powerful MySQL CDC option. You can see more details in this 3-part article.
Thanks @Arun A K for this response.
I was not much aware of how NiFi would fit this use case. I will surely follow the articles and try to implement it.
So, can we say that for this use case NiFi wins over Spark in terms of performance, since we have to pull files, transform some data, and then insert/update that data in MySQL?
I would not make a performance comparison of Spark and NiFi here, since for this use case we aren't even talking about running on a cluster. Since the data volume is low and the processing is finite, I would prefer NiFi: the entire development life cycle gets reduced, plus it is UI driven.
Architecture-wise it is not a good option, as Spark is a distributed processing framework (running on multiple nodes), and Spark writing to MySQL on a single node will create a performance bottleneck, since MySQL is not distributed. If you want to make full use of Spark's distributed processing capability, it is advisable to use either a distributed file system or a distributed database.