Phoenix and SparkSQL: Predicate pushdown with dataframe

I have a huge Phoenix table from which I have to select rows by multiple columns and multiple values at once.

The table looks like this:

STARTNO || STARTSTEP || STARTTIME  || OP   || UNIT || ...
==========================================================
123456  || AB        || 2017-01-01 || TEST || MPH  || ...
898473  || AB        || 2017-01-02 || TEST || N    || ...
...

Since the primary key for this Phoenix table consists of [startno, startstep, starttime], selecting on startno is very fast.
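
For context, this is roughly how I read the table via the phoenix-spark plugin (the table name and zkUrl are placeholders for my actual setup), and a single filter on the leading key column is indeed pushed down and fast:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("PhoenixPushdown").getOrCreate()

// Read the Phoenix table via the phoenix-spark plugin.
// "MY_TABLE" and the zkUrl are placeholders for my actual setup.
val dfPhoenixTable = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "MY_TABLE")
  .option("zkUrl", "zkhost:2181")
  .load()

// An equality filter on the leading primary-key column gets pushed down
// to Phoenix and is answered with a fast lookup:
dfPhoenixTable.filter(col("STARTNO") === 123456L).show()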

Now to my problem: I need to select the rows from this table that match both a startno AND a startstep value. Let's say I want to select by 3 startno values and 2 startstep values.

Therefore I created a new DataFrame with these two columns, like this (a sketch of how I build it follows the table):

STARTNO || STARTSTEP
====================
123456  || AB
234567  || AB
345678  || AB
123456  || BC
234567  || BC
345678  || BC
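
For completeness, this is a minimal sketch of how I build that selection DataFrame (I'm assuming STARTNO is a Long and STARTSTEP a String here):

import spark.implicits._

// The six (STARTNO, STARTSTEP) combinations I want to select by
val filterDf = Seq(
  (123456L, "AB"), (234567L, "AB"), (345678L, "AB"),
  (123456L, "BC"), (234567L, "BC"), (345678L, "BC")
).toDF("STARTNO", "STARTSTEP")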

My plan was to join the "Phoenix dataframe" with this "selection dataframe" to retrieve the data quickly as well, since both columns are part of the table's primary key.

But when I run the following statement, Spark SQL / Phoenix seems to perform a full table scan, and predicate pushdown doesn't kick in:

val dfResult = dfPhoenixTable.join(filterDf, Seq("STARTNO", "STARTSTEP"))
dfResult.show()

When I look at the execution plan, I can only find isnotnull filters:

+- Filter (isnotnull(_2#7245) && isnotnull(_1#7244))

How can I get Phoenix to filter on more than one primary-key column and avoid a full table scan? Thank you!
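
One workaround I'm considering (just a sketch, and it assumes the selection set stays small enough to collect to the driver) is to turn the selection DataFrame into one explicit OR-of-equalities filter expression, since plain column predicates are at least candidates for pushdown through the data source API:

import org.apache.spark.sql.functions.col

// Collect the small key set to the driver and build one filter expression:
// (STARTNO = 123456 AND STARTSTEP = 'AB') OR (STARTNO = 234567 AND ...) OR ...
val keyPairs = filterDf.collect().map(r => (r.getLong(0), r.getString(1)))
val condition = keyPairs
  .map { case (no, step) => col("STARTNO") === no && col("STARTSTEP") === step }
  .reduce(_ || _)

val dfResult = dfPhoenixTable.filter(condition)
dfResult.show()

Whether the connector would actually turn that into a skip scan instead of a full scan is something I'd still have to verify in the explain plan, so any pointers are welcome.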

I'm using HDP 2.6.5 with Spark 2.3 and Phoenix 4.7 (HBase 1.1), and I'm reading the Phoenix table via the phoenix-spark plugin.
