I am using the Spark-Phoenix integration to load data from a DataFrame into a Phoenix table. Unfortunately this is extremely slow: pushing 23 rows of 25 columns each takes 7-8 seconds. This is with two executors, so in terms of total executor time it is effectively twice as slow. This makes it unusable in my case, since I plan to use it in a streaming application: the records that arrive in a 15-second window take up to a minute to load.
When I look at the Spark History Server, I see two really strange things:
Does anybody have experience with how I could improve these loading speeds? Ideally I want to keep using Phoenix in some way, since we have secondary indexes on the table.
Hello. By Spark-Phoenix integration, do you mean phoenix-spark?
In my experience, phoenix-spark should not be this slow. What version are you using?
As for the number of tasks: are you loading into a Phoenix table with SALT_BUCKETS = 200?
Hi. Yes, that is what I mean. I am using Spark 1.6.2 and Phoenix 4.7. The Phoenix table has no salt buckets, and only 1 region.
This is strange. Could it be that your DataFrame (I assume you are using a DataFrame?) is the product of a repartition or shuffle operation somewhere up the DAG?
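One quick way to check this is to compare the DataFrame's partition count with the `spark.sql.shuffle.partitions` setting, which defaults to 200 in Spark SQL. The sketch below is only an illustration: `partition_report` is a hypothetical helper name, and it assumes you pass in your own DataFrame and SQLContext.

```python
def partition_report(df, sql_context):
    """Compare a DataFrame's partition count with the shuffle-partition setting.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame and
    `sql_context` the SQLContext it was created from.
    """
    # Number of partitions behind the DataFrame, i.e. tasks per stage.
    num_partitions = df.rdd.getNumPartitions()
    # Spark SQL uses this many partitions after a shuffle (default "200").
    shuffle_partitions = int(
        sql_context.getConf("spark.sql.shuffle.partitions", "200"))
    return {
        "df_partitions": num_partitions,
        "shuffle_partitions": shuffle_partitions,
        # Equal values hint that a shuffle happened upstream.
        "looks_shuffled": num_partitions == shuffle_partitions,
    }
```

If the two numbers match, a join, aggregation, or similar shuffle upstream is a likely cause. `df.explain()` prints the physical plan if you want to look for it.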
As I understand it, you're correct: the number of partitions should match the number of regions. However, I discovered that my DataFrame was defaulting to 200 partitions, even though it comes from an RDD with only 1 partition. When I coalesce into fewer partitions, it doesn't significantly improve performance.
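For reference, the coalesce-then-write step looks roughly like the sketch below. It uses the data source name and the `table` / `zkUrl` options from the phoenix-spark documentation. The helper name `write_to_phoenix`, the table name, and the ZooKeeper URL are placeholders, not the actual values from my job.

```python
def write_to_phoenix(df, table, zk_url, num_partitions=1):
    """Coalesce `df` to a few partitions, then save it through phoenix-spark.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame whose
    columns match the target Phoenix table.
    """
    # coalesce() merges partitions without a full shuffle, so the write
    # stage launches `num_partitions` tasks instead of one per partition.
    (df.coalesce(num_partitions)
       .write
       .format("org.apache.phoenix.spark")
       .mode("overwrite")  # the mode phoenix-spark expects; rows are upserted
       .option("table", table)
       .option("zkUrl", zk_url)
       .save())
```

It would be called as `write_to_phoenix(df, "MY_TABLE", "zkhost:2181", num_partitions=1)`, with one partition to match the single region of the table.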