Filtering on Phoenix Table by multiple column values


I created a Phoenix table with the following command:

%jdbc(phoenix)
CREATE TABLE IF NOT EXISTS phoenixtable (
  mykey_hash varchar not null,
  mykey varchar not null,
  starttime varchar not null,
  endtime varchar,
  value varchar,
  ...
CONSTRAINT pk PRIMARY KEY (mykey_hash, mykey, starttime))
SPLIT ON ('00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19')

I filled the table and now I want to read it back using Spark SQL. I need to filter on multiple columns (mykey_hash and mykey), and I have many values to filter by.

Let's say I need to filter for the following column values:

mykey_hash	mykey
01		123456.000
01		234567.000
12		585345.233

I created a DataFrame for this selection, containing these two columns.

Now I want to select the table rows whose MYKEY_HASH and MYKEY match these value pairs.

I'm using the following code:

%spark
val dfRead = sqlContext.read.options(Map("table" -> "PHOENIXTABLE", "zkUrl" -> "myserver:2181")).format("org.apache.phoenix.spark").load()
...
// with listOfKeyHash = Seq("01", "01", "12")
// with listOfKeys = Seq("123456.000", "234567.000", "585345.233")
val dfToJoin = listOfKeyHash.zip(listOfKeys).toDF("MYKEY_HASH", "MYKEY")
val resultDf = dfRead.join(dfToJoin, Seq("MYKEY_HASH", "MYKEY"))

But this seems to trigger a full table scan instead of using my primary key values to limit the amount of data scanned.
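
What I would like is for the key pairs to be pushed down as a filter instead. As a sketch of what I mean (untested, and assuming the phoenix-spark data source can push simple equality predicates down), something like this:

%spark
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.lit

// Express the key pairs as one OR'd equality predicate, hoping the
// connector pushes it down to the row key instead of scanning everything.
val cond: Column = listOfKeyHash.zip(listOfKeys)
  .map { case (h, k) => dfRead("MYKEY_HASH") === lit(h) && dfRead("MYKEY") === lit(k) }
  .reduce(_ || _)
val resultDf = dfRead.filter(cond)
resultDf.explain(true) // check whether the predicates show up as PushedFilters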

I'm using HDP 2.6.5 with HBase 1.1, Phoenix 4.7, and Spark 2.3.
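
I also looked at the phoenix-spark predicate option as an alternative. This sketch (untested) assumes the phoenixTableAsDataFrame helper and its predicate parameter as described in the phoenix-spark documentation, and shortens the column list:

%spark
import org.apache.phoenix.spark._

// Build an explicit WHERE clause from the key pairs and hand it to
// Phoenix directly, so the scan can be bounded by the row key.
val predicate = listOfKeyHash.zip(listOfKeys)
  .map { case (h, k) => s"(MYKEY_HASH = '$h' AND MYKEY = '$k')" }
  .mkString(" OR ")
val dfPredicate = sqlContext.phoenixTableAsDataFrame(
  "PHOENIXTABLE",
  Seq("MYKEY_HASH", "MYKEY", "STARTTIME", "ENDTIME", "VALUE"),
  predicate = Some(predicate),
  zkUrl = Some("myserver:2181"))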

Does anyone have an idea?