
Phoenix-Spark filter by DataFrame


I have a DataFrame (dfFilter) like this:


c1_hash || c2
00      || 123
00      || 456
01      || 789

And I have a Phoenix table that I read via Spark SQL, like this:

val dfRead = sqlContext.read.options(Map("table" -> "MYPHOENIXTABLE", "zkUrl" -> "myserver:2181")).format("org.apache.phoenix.spark").load()

c1_hash || c2  || c3
00      || 123 || A
00      || 234 || B

Now I want to filter dfRead by the "complete" dfFilter DataFrame, i.e. keep only those rows of dfRead whose c1_hash and c1 values match a row of dfFilter.

I tried a DataFrame.join, but this appears to trigger a Phoenix full table scan; the pushed filters in the physical plan look like this:

Filter ((C1_HASH#3857 = C1_HASH#3899) && (C1#3858 = C1#3900))
Filter (isnotnull(C1#3858) && isnotnull(C1_HASH#3857))

How can I avoid full table scans here?

The key of my MYPHOENIXTABLE is built on the c1_hash and c1 columns, so Phoenix should be able to do a range scan, right?
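One workaround I am considering (a sketch only, not verified against Phoenix): when dfFilter is small enough to collect to the driver, the key pairs can be turned into an explicit predicate and pushed down via filter, which Phoenix should be able to translate into point/range scans on the (c1_hash, c1) key. The helper name and the column names/types below are my assumptions:

```scala
// Sketch: build an explicit pushdown predicate from the (small) dfFilter
// key set. buildPredicate and the column names are assumptions, not the
// actual schema of MYPHOENIXTABLE.
def buildPredicate(keys: Seq[(String, Int)]): String =
  keys
    .map { case (hash, c1) => s"(c1_hash = '$hash' AND c1 = $c1)" }
    .mkString(" OR ")

// Usage against the Phoenix-backed DataFrame (needs a running Spark
// context, so not executed here):
//   val keys = dfFilter.collect().map(r => (r.getString(0), r.getInt(1)))
//   val dfFiltered = dfRead.filter(buildPredicate(keys))
```

This only helps when dfFilter is small; for a large filter set the collect itself becomes the bottleneck and a broadcast join is probably the better option, even if it scans the table once.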