
Spark-Phoenix Integration very slow


Contributor

I have a Spark ETL process which reads from a CSV file into an RDD, performs some transformations and data quality checks, converts this into a DataFrame and pushes the DataFrame into Phoenix using Spark-Phoenix integration. Unfortunately, actually pushing the data to Phoenix is ridiculously slow - the saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55 portion kicks off 120 tasks (which seems quite high for a file of ~3.5GB). Each task then runs between 15 minutes and 1 hour. I have 16 executors with 4GB of RAM each, which should be more than sufficient for the job. The job has currently been running for over two hours and will probably run for another hour or more, which is very long to push only 5.5 million rows.
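For context, the write step described above typically looks something like the following. This is a minimal sketch using the phoenix-spark connector; the table name, ZooKeeper URL, and input path are placeholders, not taken from the original post:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch of the ETL flow described above; all names are placeholders.
object CsvToPhoenix {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-to-phoenix").getOrCreate()

    // Read the CSV; transformations and data quality checks are elided here.
    val df = spark.read.option("header", "true").csv("/data/input.csv")

    // Push the DataFrame to Phoenix. This is the step that triggers
    // saveAsNewAPIHadoopFile under the hood, running one task per partition.
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite) // the phoenix-spark connector requires Overwrite
      .option("table", "OUTPUT_TABLE") // placeholder table name
      .option("zkUrl", "zkhost:2181")  // placeholder ZooKeeper quorum
      .save()
  }
}
```

The number of write tasks (120 here) corresponds to the number of partitions in the DataFrame at save time, which is why each task handles roughly 30 MB of the ~3.5 GB input.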

Does anybody have any insight into a) why this is so slow and b) how to speed it up?

Thanks in advance.

Edit 1: The job completed in 4 hours.

4 REPLIES

Re: Spark-Phoenix Integration very slow

Where is the job taking time?

Re: Spark-Phoenix Integration very slow

Contributor

It is the final step (saveAsNewAPIHadoopFile at DataFrameFunctions.scala:55) that is slow. All the previous steps - two subtracts and a distinct - are reasonably fast. However, saving the data to Phoenix is the slow part.

Re: Spark-Phoenix Integration very slow

New Contributor

@Mark Heydenrych I have the same issue - were you able to find an alternative or a solution?

Re: Spark-Phoenix Integration very slow

Contributor

Unfortunately the only way I could get this to work was by reverting to batch processing. With Spark streaming it remained very slow.
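A batch-style write along these lines is one way to structure the workaround described above. The partition count and connection details are illustrative assumptions, not values from the thread:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hedged sketch of a single batch write to Phoenix; the coalesce target
// and connection settings are placeholder assumptions for illustration.
def writeBatchToPhoenix(df: DataFrame): Unit = {
  df.coalesce(16) // e.g. one write task per executor, to limit task overhead
    .write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "OUTPUT_TABLE") // placeholder table name
    .option("zkUrl", "zkhost:2181")  // placeholder ZooKeeper quorum
    .save()
}
```

Running this as one batch job avoids paying the per-micro-batch write overhead that a streaming job incurs on every interval.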
