
Limiting Phoenix phoenixTableAsRDD and phoenixTableAsDataFrame output to Spark

Super Collaborator

While working through the phoenix-spark hello world examples (which I have running), I wonder whether, and how, you can limit the Phoenix output to the RDD by applying key predicates.

- Is it possible to define a startKey for the underlying HBase table scan?

I know I can turn to the JDBC/SQL path, but I really don't want to go there, since I want to leverage the inherent parallelization of this non-SQL approach.

4 Replies

Re: Limiting Phoenix phoenixTableAsRDD and phoenixTableAsDataFrame output to Spark

Hi @Jasper. I highly recommend reading the Phoenix documentation on the phoenix-spark API.

Here are the relevant high level details:

Although Spark supports connecting directly to JDBC databases, it's only able to parallelize queries by partitioning on a numeric column. It also requires a known lower bound, upper bound, and partition count in order to create split queries.
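For reference, here is a minimal sketch of that JDBC path in Spark (the URL, table name, and bounds below are placeholder values, not from this thread):

// Spark's generic JDBC source can only split the query on a numeric
// column, and needs explicit bounds plus a partition count to do so.
val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:phoenix:zk-host:2181",  // placeholder URL
  "dbtable"         -> "MY_TABLE",                   // placeholder table
  "partitionColumn" -> "ID",                         // must be numeric
  "lowerBound"      -> "0",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "10"
)).load()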

In contrast, the phoenix-spark integration is able to leverage the underlying splits provided by Phoenix in order to retrieve and save data across multiple workers. All that’s required is a database URL and a table name. Optional SELECT columns can be given, as well as pushdown predicates for efficient filtering.
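For example, a minimal phoenix-spark sketch following the pattern in the Phoenix docs (the table, columns, and ZooKeeper URL are placeholders):

import org.apache.phoenix.spark._

// Parallelism comes from the table's Phoenix splits; no numeric
// partition column or explicit bounds are needed.
val df = sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE",                       // placeholder table name
  Seq("ID", "COL1"),                // optional SELECT columns
  predicate = Some("ID > 100"),     // optional pushdown predicate
  zkUrl = Some("zk-host:2181")      // placeholder ZooKeeper quorum
)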

Re: Limiting Phoenix phoenixTableAsRDD and phoenixTableAsDataFrame output to Spark

Super Collaborator

@Randy Gelhausen

Trust me, I know that page. But the thing is, there is no example of such a 'pushdown predicate'; only selecting a subset of columns is explained. My question is about a predicate on the key range for the scan.

Re: Limiting Phoenix phoenixTableAsRDD and phoenixTableAsDataFrame output to Spark

@Jasper

I see your point. For a further (external) example of using the Spark Data Source API to pass a predicate, see this comment.

As a predicate, can you try: "where keyCol1 > X and keyCol2 < Y"? What does the explain plan say if you issue that filter in a normal JDBC query?
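For illustration, a sketch of passing that filter through phoenix-spark (keyCol1/keyCol2, the table, and the ZooKeeper URL are hypothetical; note the predicate parameter takes the bare condition, without the WHERE keyword):

import org.apache.phoenix.spark._

val df = sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE",
  Seq("KEYCOL1", "KEYCOL2"),
  predicate = Some("KEYCOL1 > 100 AND KEYCOL2 < 200"),
  zkUrl = Some("zk-host:2181")
)
// To verify the pushdown, run the equivalent EXPLAIN in sqlline and
// look for a RANGE SCAN over the key instead of a FULL SCAN.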

Re: Limiting Phoenix phoenixTableAsRDD and phoenixTableAsDataFrame output to Spark

New Contributor

You can use a predicate clause to filter:

import org.apache.phoenix.spark._

// srcTable is the Phoenix table name and whereClause the bare filter
// condition (no WHERE keyword); the empty Seq() selects all columns,
// and the predicate is pushed down into the underlying scan.
val tblDF = sqlContext.phoenixTableAsDataFrame(srcTable.toUpperCase, Seq(), predicate = Some(whereClause))