Support Questions
Find answers, ask questions, and share your expertise
Alert: Please see the Cloudera blog for information on the Cloudera Response to CVE-2021-4428

Filtering on Phoenix Table by multiple column values

Expert Contributor

I created a Phoenix table with following command:

  mykey_hash varchar not null,
  mykey varchar not null,
  starttime varchar not null,
  endtime varchar,
  value varchar,
CONSTRAINT pk PRIMARY KEY (mykey_hash, mykey, starttime))
SPLIT ON ('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19')

I filled the table and now I want to read the table using SparkSQL. I need to select on multiple columns (mykey_hash and mykey), and I have many values to filter by.

Let's say I need to filter for the following column values:

mykey_hash	mykey
01		123456.000
01		234567.000
12		585345.233

I created a dataframe for this selection, containing these two colums.

Now I want to select the table data that has these two values in column mykey_hash and mykey.

I'm using the following code:

val dfRead ="table" -> "PHOENIXTABLE", "zkUrl" -> "myserver:2181")).format("org.apache.phoenix.spark").load()
// with listOfKeyHash = [01, 01, 12]
// with listOfKeys = [123456.000, 234567.000, 585345.233]
val dfToJoin"MYKEY_HASH", "MYKEY")
val resultDf = dfRead.join(dfToJoin,Seq("MYKEY_HASH", "MYKEY"))

But this seems to make a full table scan instead of using the values of my primary key to limit the amount of data to scan.

I'm using HDP 2.6.5 with HBase 1.1 and Phoenix 4.7 and Spark 2.3.

Anyone an idea?