Spark-Phoenix Integration very slow

Rising Star

I have a Spark ETL process that reads a CSV file into an RDD, performs some transformations and data quality checks, converts the result into a DataFrame, and pushes the DataFrame into Phoenix using the Spark-Phoenix integration. Unfortunately, actually pushing the data to Phoenix is ridiculously slow: the saveAsNewAPIHadoopFile stage (at DataFrameFunctions.scala:55) kicks off 120 tasks, which seems quite high for a file of ~3.5 GB, and each task then runs for between 15 minutes and an hour. I have 16 executors with 4 GB of RAM each, which should be more than sufficient for the job. The job has currently been running for over two hours and will probably run for another hour or more, which is very long for only 5.5 million rows.
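For reference, the save step is just the stock phoenix-spark call, roughly like this (a minimal sketch; the table name, column names, and Zookeeper quorum are placeholders for my actual values):

import org.apache.phoenix.spark._  // adds saveToPhoenix to DataFrame

// df is the transformed DataFrame; its column names match the Phoenix table's columns.
// This is the path that ends up in saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55.
df.saveToPhoenix(
  "MY_TABLE",                          // placeholder Phoenix table name
  zkUrl = Some("zk-host:2181:/hbase")  // placeholder Zookeeper quorum
)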

Does anybody have any insight into a) why this is so slow and b) how to speed it up?

Thanks in advance.

Edit 1: The job completed in 4 hours.

4 REPLIES

Where is the job taking time?

Rising Star

It is the final step (saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55) that is slow. All the previous steps - two subtracts and a distinct - are reasonably fast. However, saving the data to Phoenix is the slow part.
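In case it helps with debugging: the task count of that final stage follows the partitioning of the DataFrame going into the save, so repartitioning beforehand is one knob to try (a sketch, assuming the DataFrame is called df; the target of 32 partitions is a guess to tune, not a recommendation):

// Fewer, larger tasks before the Phoenix save; 120 small tasks each
// writing through Phoenix can thrash the region servers.
df.repartition(32).saveToPhoenix("MY_TABLE", zkUrl = Some("zk-host:2181:/hbase"))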

Explorer

@Mark Heydenrych I have the same issue - were you able to find an alternative or a solution?

Rising Star

Unfortunately, the only way I could get this to work was by reverting to batch processing. With Spark Streaming it remained very slow.
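Another route some people take when the phoenix-spark writer is too slow is plain JDBC upserts committed in batches from each partition, bypassing the phoenix-spark writer entirely. A hypothetical sketch (table, columns, and the batch size of 1000 are placeholders, not my exact code):

import java.sql.DriverManager

// One connection per partition; executeBatch plus periodic commits
// amortises the round trips. Phoenix recommends autoCommit off for this.
df.rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181:/hbase")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("UPSERT INTO MY_TABLE (ID, COL1) VALUES (?, ?)")
  var count = 0
  rows.foreach { row =>
    stmt.setString(1, row.getString(0))
    stmt.setString(2, row.getString(1))
    stmt.addBatch()
    count += 1
    if (count % 1000 == 0) { stmt.executeBatch(); conn.commit() }
  }
  stmt.executeBatch()
  conn.commit()
  conn.close()
}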
