Created 07-18-2017 12:14 PM
I am using the Spark-Phoenix integration to load data from a DataFrame into a Phoenix table. Unfortunately this is extremely slow: pushing 23 rows of 25 columns each takes 7-8 seconds. This is with two executors, so in terms of total compute it is effectively twice as slow. That makes it unusable in my case, since it is intended for a streaming application: the records arriving in a 15-second window take up to a minute to load.
When I look at the Spark History Server, I see two really strange things:
Does anybody have experience with how I could improve these loading speeds? Ideally I want to keep using Phoenix in some way, since we have secondary indexes on the table.
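For reference, the write path looks roughly like the sketch below; the table name, columns and ZooKeeper URL are placeholders rather than my real values:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

// Minimal sketch of the write path; "MY_TABLE", the column names and the zkUrl
// are placeholders, not my real values.
val sc = new SparkContext(new SparkConf().setAppName("phoenix-write-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq((1L, "a"), (2L, "b"))).toDF("ID", "COL1")

df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)   // phoenix-spark expects Overwrite; rows are UPSERTed
  .options(Map("table" -> "MY_TABLE", "zkUrl" -> "zk-host:2181"))
  .save()
```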
Created 07-19-2017 08:38 AM
Hello, by Spark-Phoenix integration do you mean the phoenix-spark connector?
From my experience with phoenix-spark it should not be this slow. What versions are you using?
Regarding the number of tasks - are you loading into a Phoenix table created with SALT_BUCKETS = 200?
Created 07-19-2017 08:48 AM
Hi. Yes, that is what I mean. I am using Spark 1.6.2 and Phoenix 4.7. The Phoenix table has no salt buckets, and only 1 region.
Created 07-19-2017 03:29 PM
This is strange - could it be that your DataFrame (I suppose you are using a DataFrame?) is the product of some repartition/shuffle operation somewhere up the DAG?
From my experience, phoenix-spark breaks the data down into one partition per salt bucket, which is also the same as the number of regions in HBase (correct me if I am wrong).
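If it helps to check, something along these lines should show the partition count phoenix-spark produces on the read side; the table name and zkUrl are placeholders, and it assumes an existing sqlContext:

```scala
// Assumes an existing sqlContext; table name and zkUrl are placeholders.
val phoenixDF = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "MY_TABLE", "zkUrl" -> "zk-host:2181"))
  .load()

// Roughly one partition per region / salt bucket on the read side.
println(phoenixDF.rdd.partitions.length)
```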
Created 07-20-2017 01:27 PM
As I understand it you're correct: the number of partitions should match the number of regions. However, I discovered that my DataFrame was defaulting to 200 partitions, even though it comes from an RDD with only 1 partition. Coalescing into fewer partitions does not significantly improve performance, though.
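For completeness, what I tried looks roughly like this, reusing the df and sqlContext from my earlier sketch; the shuffle-partitions value is just illustrative:

```scala
// spark.sql.shuffle.partitions defaults to 200, which is where the 200
// partitions come from after any shuffle-producing DataFrame operation.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")   // illustrative value

df.coalesce(1)   // collapse to a single partition before the Phoenix write
  .write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> "MY_TABLE", "zkUrl" -> "zk-host:2181"))
  .save()
```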