Very Slow Spark-Phoenix Integration

Contributor

I am using the Spark-Phoenix integration to load data from a DataFrame into a Phoenix table. Unfortunately, it is extremely slow: pushing 23 rows of 25 columns each takes 7-8 seconds. This is with two executors, meaning it's actually twice as slow. That makes it unusable in my case, since I plan to use it in a streaming application, where the records that arrive in a 15-second window take up to a minute to load.

When I look at the Spark History Server, I see two really strange things:

Does anybody have experience with how I could improve these loading speeds? Ideally I want to keep using Phoenix in some way, since we have secondary indexes on the table.

Re: Very Slow Spark-Phoenix Integration

Expert Contributor

Hello, by Spark-Phoenix integration do you mean phoenix-spark?

In my experience, phoenix-spark should not be this slow. Which version are you using?

As for the number of tasks: are you loading into a Phoenix table with 200 salt buckets?

Re: Very Slow Spark-Phoenix Integration

Contributor

Hi. Yes, that is what I mean. I am using Spark 1.6.2 and Phoenix 4.7. The Phoenix table has no salt buckets, and only 1 region.

Re: Very Slow Spark-Phoenix Integration

Expert Contributor

@Mark Heydenrych

This is strange. Could it be that your DataFrame (I assume you are using a DataFrame?) is the product of a repartition or shuffle operation somewhere up the DAG?

In my experience, phoenix-spark breaks the data down into as many partitions as there are salt buckets, which is also the number of regions in HBase (correct me if I am wrong).

Re: Very Slow Spark-Phoenix Integration

Contributor

As I understand it, you're correct: the number of partitions should match the number of regions. However, I discovered that my DataFrame was defaulting to 200 partitions, even though it comes from an RDD with only 1 partition. Coalescing into fewer partitions doesn't significantly improve performance.
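
For readers who want to reproduce the partition check and coalesce described above, here is a minimal PySpark-style sketch. It assumes the documented phoenix-spark DataFrame writer (`org.apache.phoenix.spark` with the `table` and `zkUrl` options). The function name `write_to_phoenix`, the `max_partitions` parameter, and the table and ZooKeeper values are hypothetical placeholders, not something taken from the thread. Note that 200 is also the default of `spark.sql.shuffle.partitions`, so a DataFrame with exactly 200 partitions may point to a shuffle (a join or aggregation, for example) somewhere in its plan.

```python
def write_to_phoenix(df, table, zk_url, max_partitions=1):
    """Coalesce a Spark DataFrame if needed, then save it via phoenix-spark.

    df is expected to be a pyspark.sql.DataFrame. table and zk_url are
    placeholders for your own Phoenix table name and ZooKeeper quorum.
    Returns the partition count the DataFrame had before coalescing.
    """
    current = df.rdd.getNumPartitions()

    # coalesce() merges existing partitions without a full shuffle;
    # skip it when the DataFrame is already small enough.
    if current > max_partitions:
        df = df.coalesce(max_partitions)

    # phoenix-spark expects the "overwrite" save mode; rows are upserted
    # into the existing table rather than replacing it.
    (df.write
       .format("org.apache.phoenix.spark")
       .mode("overwrite")
       .option("table", table)
       .option("zkUrl", zk_url)
       .save())
    return current
```

A call might look like `write_to_phoenix(df, "MY_TABLE", "zk-host:2181")`, where both arguments are illustrative. As reported above, coalescing alone did not significantly improve load times in this case, so treat this as a way to rule out partition count rather than as a fix.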
