<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Read a random sample of data in Apache Spark from Phoenix table - Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375187#M242329</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to read only a finite sample from an Apache Phoenix table into a Spark DataFrame without using primary keys in the where condition.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried using the 'limit' clause in the query, as shown in the code below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;Map&amp;lt;String, String&amp;gt; map = new HashMap&amp;lt;&amp;gt;();&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("url", url);&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("query", "select * from table1 &lt;STRONG&gt;limit 100&lt;/STRONG&gt;");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load();&lt;/SPAN&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following exception occurred:&lt;/P&gt;&lt;P&gt;java.sql.SQLFeatureNotSupportedException: Wildcard in subqueries not supported. at org.apache.phoenix.compile.FromCompiler&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I tried the limit() method of the DataFrame, as shown below.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;map.put("query", "select * from table1");&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load().&lt;STRONG&gt;limit(100)&lt;/STRONG&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this case, Spark first reads all the data into the DataFrame and only then applies the limit. The mentioned 'table1' has millions of rows, so I am getting a timeout exception:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;org.apache.phoenix.exception.PhoenixIOException: callTimeout&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, I want to read a sample of a few records from the Phoenix table in Apache Spark such that the filtering happens on the server side.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone please help with this?&lt;/P&gt;</description>
    <pubDate>Mon, 14 Aug 2023 13:42:15 GMT</pubDate>
    <dc:creator>vaibhavgokhale</dc:creator>
    <dc:date>2023-08-14T13:42:15Z</dc:date>
    <item>
      <title>Read a random sample of data in Apache Spark from Phoenix table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375187#M242329</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to read only a finite sample from an Apache Phoenix table into a Spark DataFrame without using primary keys in the where condition.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried using the 'limit' clause in the query, as shown in the code below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;Map&amp;lt;String, String&amp;gt; map = new HashMap&amp;lt;&amp;gt;();&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("url", url);&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("query", "select * from table1 &lt;STRONG&gt;limit 100&lt;/STRONG&gt;");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load();&lt;/SPAN&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following exception occurred:&lt;/P&gt;&lt;P&gt;java.sql.SQLFeatureNotSupportedException: Wildcard in subqueries not supported. at org.apache.phoenix.compile.FromCompiler&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I tried the limit() method of the DataFrame, as shown below.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;map.put("query", "select * from table1");&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load().&lt;STRONG&gt;limit(100)&lt;/STRONG&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this case, Spark first reads all the data into the DataFrame and only then applies the limit. The mentioned 'table1' has millions of rows, so I am getting a timeout exception:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;org.apache.phoenix.exception.PhoenixIOException: callTimeout&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, I want to read a sample of a few records from the Phoenix table in Apache Spark such that the filtering happens on the server side.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone please help with this?&lt;/P&gt;</description>
      <pubDate>Mon, 14 Aug 2023 13:42:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375187#M242329</guid>
      <dc:creator>vaibhavgokhale</dc:creator>
      <dc:date>2023-08-14T13:42:15Z</dc:date>
    </item>
    <item>
      <title>Re: Read a random sample of data in Apache Spark from Phoenix table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375316#M242388</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105765"&gt;@vaibhavgokhale&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Reading data from Phoenix via the Spark JDBC data source is not the recommended approach [1]. Use the Phoenix Spark Connector API instead [2].&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reference:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;&lt;A href="https://phoenix.apache.org/phoenix_spark.html" target="_blank"&gt;https://phoenix.apache.org/phoenix_spark.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;2.&amp;nbsp;&lt;A href="https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/phoenix-access-data/topics/phoenix-understanding-spark-connector.html" target="_blank"&gt;https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/phoenix-access-data/topics/phoenix-understanding-spark-connector.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2023 10:31:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375316#M242388</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2023-08-17T10:31:41Z</dc:date>
    </item>
    <item>
      <title>Re: Read a random sample of data in Apache Spark from Phoenix table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375400#M242441</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105765"&gt;@vaibhavgokhale&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please accept the answer if you are satisfied with the above solution.&lt;/P&gt;</description>
      <pubDate>Sun, 20 Aug 2023 16:01:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375400#M242441</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2023-08-20T16:01:53Z</dc:date>
    </item>
  </channel>
</rss>

