<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Read a random sample of data in Apache Spark from Phoenix table - Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375187#M242329</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to read only a finite sample from an Apache Phoenix table into a Spark DataFrame without using primary keys in the where condition.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried using the 'limit' clause in the query, as shown in the code below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;Map&amp;lt;String, String&amp;gt; map = new HashMap&amp;lt;&amp;gt;();&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("url", url);&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("query", "select * from table1 &lt;STRONG&gt;limit 100&lt;/STRONG&gt;");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load();&lt;/SPAN&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following exception occurred:&lt;/P&gt;&lt;P&gt;java.sql.SQLFeatureNotSupportedException: Wildcard in subqueries not supported. at org.apache.phoenix.compile.FromCompiler&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I tried the limit() method of the DataFrame, as shown below.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;map.put("query", "select * from table1");&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load().&lt;STRONG&gt;limit(100)&lt;/STRONG&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this case, Spark first reads all the data into the DataFrame and only then applies the limit. The mentioned 'table1' has millions of rows, so I am getting a timeout exception:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;org.apache.phoenix.exception.PhoenixIOException: callTimeout&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, I want to read a sample of a few records from the Phoenix table in Apache Spark such that the filtering happens on the server side.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone please help with this?&lt;/P&gt;</description>
    <pubDate>Mon, 14 Aug 2023 13:42:15 GMT</pubDate>
    <dc:creator>vaibhavgokhale</dc:creator>
    <dc:date>2023-08-14T13:42:15Z</dc:date>
    <item>
      <title>Read a random sample of data in Apache Spark from Phoenix table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375187#M242329</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I want to read only a finite sample from an Apache Phoenix table into a Spark DataFrame without using primary keys in the where condition.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried using the 'limit' clause in the query, as shown in the code below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;Map&amp;lt;String, String&amp;gt; map = new HashMap&amp;lt;&amp;gt;();&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("url", url);&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;map.put("query", "select * from table1 &lt;STRONG&gt;limit 100&lt;/STRONG&gt;");&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load();&lt;/SPAN&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The following exception occurred:&lt;/P&gt;&lt;P&gt;java.sql.SQLFeatureNotSupportedException: Wildcard in subqueries not supported. at org.apache.phoenix.compile.FromCompiler&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I tried the limit() method of the DataFrame, as shown below.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;map.put("query", "select * from table1");&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Dataset&amp;lt;Row&amp;gt; df = spark.read().format("jdbc").options(map).load().&lt;STRONG&gt;limit(100)&lt;/STRONG&gt;;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In this case, Spark first reads all the data into the DataFrame and only then applies the limit. The mentioned 'table1' has millions of rows, so I am getting a timeout exception:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;org.apache.phoenix.exception.PhoenixIOException: callTimeout&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, I want to read a sample of a few records from the Phoenix table in Apache Spark such that the filtering happens on the server side.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone please help with this?&lt;/P&gt;</description>
      <pubDate>Mon, 14 Aug 2023 13:42:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375187#M242329</guid>
      <dc:creator>vaibhavgokhale</dc:creator>
      <dc:date>2023-08-14T13:42:15Z</dc:date>
    </item>
    <item>
      <title>Re: Read a random sample of data in Apache Spark from Phoenix table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375316#M242388</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105765"&gt;@vaibhavgokhale&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Reading data from Phoenix via the Spark JDBC data source is not the recommended approach [1]. Use the Phoenix Spark Connector API instead [2].&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reference:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1.&amp;nbsp;&lt;A href="https://phoenix.apache.org/phoenix_spark.html" target="_blank"&gt;https://phoenix.apache.org/phoenix_spark.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;2.&amp;nbsp;&lt;A href="https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/phoenix-access-data/topics/phoenix-understanding-spark-connector.html" target="_blank"&gt;https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/phoenix-access-data/topics/phoenix-understanding-spark-connector.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Aug 2023 10:31:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375316#M242388</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2023-08-17T10:31:41Z</dc:date>
    </item>
    <item>
      <title>Re: Read a random sample of data in Apache Spark from Phoenix table</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375400#M242441</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/105765"&gt;@vaibhavgokhale&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please accept the answer if you are satisfied with the above solution.&lt;/P&gt;</description>
      <pubDate>Sun, 20 Aug 2023 16:01:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Read-a-random-sample-of-data-in-Apache-Spark-from-Phoenix/m-p/375400#M242441</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2023-08-20T16:01:53Z</dc:date>
    </item>
  </channel>
</rss>

