I have a Spark ETL process which reads a CSV file into an RDD, performs some transformations and data quality checks, converts the result into a DataFrame, and pushes the DataFrame into Phoenix using the Spark-Phoenix integration. Unfortunately, the actual push to Phoenix is extremely slow: the saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55 stage kicks off 120 tasks (which seems quite high for a file of ~3.5GB), and each task then runs for between 15 minutes and 1 hour. I have 16 executors with 4GB of RAM each, which should be more than sufficient for the job. The job has currently been running for over two hours and will probably run for another hour or more, which is very long for pushing only 5.5 million rows.
Does anybody have any insight into a) why this is so slow and b) how to speed it up?
Thanks in advance.
Edit 1: The job has completed in 4 hours.
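To put those numbers in perspective, here is a rough back-of-the-envelope calculation in Python. It is only a sketch built from the figures quoted in this post. It assumes the full 4 hours is dominated by the save stage and that rows and input bytes are spread evenly across the 120 tasks, neither of which has been verified.

```python
# Figures quoted in the post.
rows = 5_500_000   # rows pushed to Phoenix
file_gb = 3.5      # approximate size of the input CSV
tasks = 120        # tasks in the save stage
executors = 16     # executors available to the job
hours = 4          # total runtime (assumed to be dominated by the save stage)

seconds = hours * 3600

# Overall write throughput across the whole cluster.
rows_per_sec = rows / seconds

# Throughput attributable to each executor (assumes all 16 stay busy).
rows_per_sec_per_executor = rows_per_sec / executors

# Work per task (assumes an even distribution across tasks).
rows_per_task = rows / tasks
mb_per_task = file_gb * 1024 / tasks

print(f"overall throughput:      {rows_per_sec:.0f} rows/s")
print(f"per-executor throughput: {rows_per_sec_per_executor:.1f} rows/s")
print(f"rows per task:           {rows_per_task:.0f}")
print(f"input per task:          {mb_per_task:.1f} MB")
```

Under those assumptions the job writes roughly 380 rows per second overall, or about 24 rows per second per executor, and each task handles about 46,000 rows (around 30 MB of input). That is small enough that the time per task points at the write path rather than at the volume of data per task.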
It is the final step (saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55) that is slow. All the previous steps (two subtracts and a distinct) are reasonably fast; saving the data to Phoenix is the slow part.
Unfortunately, the only way I could get this to work was by reverting to batch processing. With Spark Streaming it remained very slow.