I have a DataFrame like this:
dfFilters
c1_hash || c1
==================
00      || 123
00      || 456
01      || 789
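
For reference, dfFilters is just an in-memory DataFrame; it can be thought of as being built roughly like this (illustrative only, the real data comes from elsewhere, and I'm assuming c1 is an integer here):

import sqlContext.implicits._

// hypothetical construction, just to show the schema and contents of dfFilters
val dfFilters = Seq(("00", 123), ("00", 456), ("01", 789)).toDF("c1_hash", "c1")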
And I have a Phoenix table that I read with Spark SQL, like this:
val dfRead = sqlContext.read
  .format("org.apache.phoenix.spark")
  .options(Map("table" -> "MYPHOENIXTABLE", "zkUrl" -> "myserver:2181"))
  .load()
dfRead.show()
c1_hash || c1  || c3
==========================
00      || 123 || A
00      || 234 || B
...
Now I want to filter dfRead by the complete dfFilters DataFrame, i.e. keep only those rows of dfRead whose (c1_hash, c1) pair occurs in dfFilters.
I tried a DataFrame.join (roughly as sketched after the plan output below), but this seems to cause a Phoenix full table scan; the pushed filters look like this:
Filter ((C1_HASH#3857 = C1_HASH#3899) && (C1#3858 = C1#3900))
...
Filter (isnotnull(C1#3858) && isnotnull(C1_HASH#3857))
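
The join itself looked roughly like this (a sketch, not necessarily my exact code):

// hypothetical reconstruction of the join that produced the plan above
val dfJoined = dfRead.join(
  dfFilters,
  dfRead("c1_hash") === dfFilters("c1_hash") && dfRead("c1") === dfFilters("c1")
)
dfJoined.explain()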
How can I avoid full table scans here?
The primary key of MYPHOENIXTABLE is built on the c1_hash and c1 columns, so Phoenix should be able to do a range scan here, right?
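
For comparison, what I would expect is something like the behavior of a plain filter on the key columns, which (with hard-coded literals) I understand can be pushed down to Phoenix. A minimal sketch, again assuming c1 is numeric:

// filtering on one literal key pair; the question is how to get the same
// pushdown / range-scan behavior for all pairs contained in dfFilters
val dfOne = dfRead.filter(dfRead("c1_hash") === "00" && dfRead("c1") === 123)
dfOne.explain()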