<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Kudu scan maximize throughput via Spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/63834#M73685</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Can somebody give a hint or a guideline on how to maximize Kudu scan (read from a Kudu table) performance from Spark? I tried a simple DataFrame read, and I also tried creating multiple DataFrames, each with a different filter on one of the primary key columns, then unioning the DataFrames and writing to HDFS. But it seems that each tablet server hands out the data via one scanner, so there are 5 tablet servers, 5 scanners, and 5 tasks in 5 executors.&lt;/P&gt;&lt;P&gt;Is it possible to trigger more scanners via Spark?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 12:45:32 GMT</pubDate>
    <dc:creator>Tomas79</dc:creator>
    <dc:date>2022-09-16T12:45:32Z</dc:date>
    <item>
      <title>Kudu scan maximize throughput via Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/63834#M73685</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Can somebody give a hint or a guideline on how to maximize Kudu scan (read from a Kudu table) performance from Spark? I tried a simple DataFrame read, and I also tried creating multiple DataFrames, each with a different filter on one of the primary key columns, then unioning the DataFrames and writing to HDFS. But it seems that each tablet server hands out the data via one scanner, so there are 5 tablet servers, 5 scanners, and 5 tasks in 5 executors.&lt;/P&gt;&lt;P&gt;Is it possible to trigger more scanners via Spark?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 12:45:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/63834#M73685</guid>
      <dc:creator>Tomas79</dc:creator>
      <dc:date>2022-09-16T12:45:32Z</dc:date>
    </item>
    <item>
      <title>Re: Kudu scan maximize throughput via Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/63935#M73686</link>
      <description>&lt;P&gt;Hi Tomas,&lt;/P&gt;&lt;P&gt;The kudu-spark integration will create one task/executor per Kudu tablet, each with a single scanner. If you want to achieve more parallelism, you can add more tablets/partitions to the Kudu table.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jan 2018 19:41:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/63935#M73686</guid>
      <dc:creator>Dan Burkert</dc:creator>
      <dc:date>2018-01-23T19:41:13Z</dc:date>
    </item>
    <item>
      <title>Re: Kudu scan maximize throughput via Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/69204#M73687</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm trying to access Kudu through Impala and through Spark, and a scan through Impala seems to be 5-6 times faster than through Spark. Through Impala it takes 2.5 minutes to scan the Kudu table, whereas it takes 18 minutes through Spark.&lt;/P&gt;&lt;P&gt;I would like to understand the reason for this.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jun 2018 07:51:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/69204#M73687</guid>
      <dc:creator>arvindkv</dc:creator>
      <dc:date>2018-06-19T07:51:08Z</dc:date>
    </item>
    <item>
      <title>Re: Kudu scan maximize throughput via Spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/69205#M73688</link>
      <description>&lt;P&gt;You did not mention the version of CDH. But I think the problem is that Spark launches many executors to read, and those executors are not co-located with the Kudu tablet servers.&lt;/P&gt;&lt;P&gt;I don't know whether you are just reading/filtering the data, or reading and then writing to Parquet - it depends on how the Spark job is executed.&lt;/P&gt;&lt;P&gt;I also noticed that running multiple Spark jobs against the same table (with different partition filters) did not help either.&lt;/P&gt;</description>
      <pubDate>Tue, 19 Jun 2018 09:00:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Kudu-scan-maximize-throughput-via-Spark/m-p/69205#M73688</guid>
      <dc:creator>Tomas79</dc:creator>
      <dc:date>2018-06-19T09:00:52Z</dc:date>
    </item>
  </channel>
</rss>

