Phoenix-Spark filter by DataFrame

I have a DataFrame like this:

dfFilters

c1_hash  ||  c1
==================
00       ||  123
00       ||  456
01       ||  789
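
For reference, dfFilters could be built like this (just a minimal sketch with hypothetical types; the real DataFrame is produced upstream in my job, and I assume the spark-shell sqlContext here):

import sqlContext.implicits._

// Small in-memory DataFrame holding the composite-key values I want to keep
val dfFilters = Seq(
  ("00", 123),
  ("00", 456),
  ("01", 789)
).toDF("c1_hash", "c1")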

And I have a Phoenix table that I read via Spark SQL, like this:

val dfRead = sqlContext.read.options(Map("table" -> "MYPHOENIXTABLE", "zkUrl" -> "myserver:2181")).format("org.apache.phoenix.spark").load()

dfRead.show()

c1_hash  ||  c1   ||  c3
==========================
00       ||  123  ||  A
00       ||  234  ||  B
...

Now I want to filter dfRead by the complete dfFilters DataFrame, which means that the columns c1_hash and c1 have to match the values in dfFilters.

I tried the DataFrame.join method, but this seems to cause a Phoenix full table scan; the pushed filters in the plan look like this (the join itself is sketched below the plan):

Filter ((C1_HASH#3857 = C1_HASH#3899) && (C1#3858 = C1#3900))
...
Filter (isnotnull(C1#3858) && isnotnull(C1_HASH#3857))
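
Roughly, the join I tried looks like this (a sketch; the Phoenix columns come back uppercase, dfFilters uses lowercase, and Spark resolves both case-insensitively by default):

// Inner join on both key columns - this produces the plan above, i.e. Phoenix
// scans the whole table and the actual matching happens on the Spark side
val dfResult = dfRead.join(dfFilters,
  dfRead("C1_HASH") === dfFilters("c1_hash") && dfRead("C1") === dfFilters("c1"))
dfResult.explain(true)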

How can I avoid full table scans here?

The key of my MYPHOENIXTABLE is built on the c1_hash and c1 columns, so Phoenix should be able to do a range scan, right?
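
To make concrete what I am hoping for: if I collect dfFilters to the driver and build a literal predicate from its rows, Spark should be able to hand that to the data source as Or/And/EqualTo filters, which Phoenix could then turn into point lookups on the key. A sketch (hypothetical types, and only viable while dfFilters stays small; I would rather avoid collecting at all):

import org.apache.spark.sql.functions.{col, lit}

// Collect the key pairs to the driver (assumes c1_hash is a String and c1 an Int)
val keyPairs = dfFilters.collect().map(r => (r.getString(0), r.getInt(1)))

// OR together one (c1_hash = x AND c1 = y) predicate per row of dfFilters
val keyPredicate = keyPairs
  .map { case (hash, c1) => col("C1_HASH") === lit(hash) && col("C1") === lit(c1) }
  .reduce(_ || _)

val dfFiltered = dfRead.filter(keyPredicate)
dfFiltered.explain(true)  // check whether the predicate shows up under PushedFilters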
