While working through the phoenix-spark hello world examples and getting them running, I wondered whether, and how, you can limit the Phoenix output loaded into the RDD by applying key predicates up front.
- Is it possible to define a startKey for the underlying HBase table scan?
I know I could fall back to the JDBC/SQL path, but I'd rather not, since I want to leverage the inherent parallelization of this non-SQL approach.
Here are the relevant high level details:
Although Spark supports connecting directly to JDBC databases, it’s only able to parallelize queries by partitioning on a numeric column. It also requires a known lower bound, upper bound and partition count in order to create split queries. In contrast, the phoenix-spark integration is able to leverage the underlying splits provided by Phoenix in order to retrieve and save data across multiple workers. All that’s required is a database URL and a table name. Optional SELECT columns can be given, as well as pushdown predicates for efficient filtering.
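For contrast, the JDBC path the quoted docs refer to needs a numeric partition column plus explicit bounds before Spark can split the query; a rough sketch (the URL, table and column names are made up for illustration):

```scala
import java.util.Properties
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???  // obtained from your SparkContext

// Spark splits the query into numPartitions ranges over ID between the bounds.
val jdbcDF = sqlContext.read.jdbc(
  url = "jdbc:phoenix:zk-host:2181",  // illustrative connection URL
  table = "MY_TABLE",
  columnName = "ID",      // partition column, must be numeric
  lowerBound = 0L,
  upperBound = 1000000L,
  numPartitions = 10,
  connectionProperties = new Properties()
)
```

This is exactly the overhead the phoenix-spark integration avoids, since it derives its splits from Phoenix itself.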
Trust me, I know that page. But the thing is, there is no example of such a 'pushdown predicate'; only selecting a subset of columns is shown. My question is about a predicate on the key range of the scan.
You can use a predicate clause to filter:
val tblDF = sqlContext.phoenixTableAsDataFrame(srcTable.toUpperCase, Seq(), predicate = Some(whereClause))
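To flesh out that one-liner, here is a hedged sketch of loading a key-restricted DataFrame. The table name, column names, and key range are made up for illustration; `phoenixTableAsDataFrame` and its `predicate` parameter are the phoenix-spark API, and the predicate string is pushed down to the underlying HBase scan as a WHERE clause, which restricts the key range rather than filtering on the Spark side:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._  // adds phoenixTableAsDataFrame to SQLContext

val sqlContext: SQLContext = ???  // obtained from your SparkContext

// Illustrative key-range predicate on an assumed primary-key column ID.
val whereClause = "ID >= 1000 AND ID < 2000"

val tblDF = sqlContext.phoenixTableAsDataFrame(
  "MY_TABLE",                      // illustrative table name
  Seq("ID", "COL1"),               // optional column projection (Seq() = all)
  predicate = Some(whereClause),   // pushed down into the HBase scan
  conf = new Configuration()       // picks up hbase-site.xml from the classpath
)
```

Because the predicate is on the leading primary-key column, Phoenix can translate it into a bounded scan (effectively a startKey/stopKey) instead of a full table scan.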