<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Streaming / Hive + Kafka: Only one Worker active in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-Streaming-Hive-Kafka-Only-one-Worker-active/m-p/283938#M210908</link>
    <description>&lt;P&gt;To answer my own question: once I switched the Kafka topic to multiple partitions, Spark used more executors to process the data. Likewise, Hive/Tez creates as many worker containers as the topic has partitions.&lt;/P&gt;</description>
    <pubDate>Mon, 25 Nov 2019 08:49:58 GMT</pubDate>
    <dc:creator>dmueller1607</dc:creator>
    <dc:date>2019-11-25T08:49:58Z</dc:date>
    <item>
      <title>Spark Streaming / Hive + Kafka: Only one Worker active</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Streaming-Hive-Kafka-Only-one-Worker-active/m-p/283931#M210902</link>
      <description>&lt;P&gt;I wrote a Kafka producer that sends simulated data to a Kafka topic (replication factor 3, one partition).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Now I want to access this data using Hive and/or Spark Streaming.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First approach: using an external Hive table with the KafkaStorageHandler:&lt;/P&gt;
&lt;PRE&gt;CREATE EXTERNAL TABLE mydb.kafka_timeseriestest (&lt;BR /&gt;  description string,&lt;BR /&gt;  version int,&lt;BR /&gt;  ts timestamp,&lt;BR /&gt;  varname string,&lt;BR /&gt;  varvalue float&lt;BR /&gt;)&lt;BR /&gt;STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'&lt;BR /&gt;TBLPROPERTIES (&lt;BR /&gt;  "kafka.topic" = "testtopic", &lt;BR /&gt;  "kafka.bootstrap.servers"="server1:6667,server2:6667,server3:6667"&lt;BR /&gt;);&lt;BR /&gt;&lt;BR /&gt;-- e.g. SELECT max(varvalue) from mydb.kafka_timeseriestest; &lt;BR /&gt;-- takes too long, and only one Tez task is running&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Second approach: writing a Spark Streaming app that accesses the Kafka topic:&lt;/P&gt;
&lt;PRE&gt;// started with 10 executors, but only one executor is active&lt;BR /&gt;&lt;BR /&gt;...&lt;BR /&gt;JavaInputDStream&amp;lt;ConsumerRecord&amp;lt;String, String&amp;gt;&amp;gt; stream = KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.&amp;lt;String, String&amp;gt;Subscribe(topics, kafkaParams));&lt;BR /&gt;...&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In both cases, only one Tez/Spark worker is active, so reading all of the data (~500 million entries) takes a very long time. How can I increase the performance? Is the issue caused by the single-partition topic? If so, is there a rule of thumb for determining the number of partitions?&lt;/P&gt;
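&lt;P&gt;On the last question: a common heuristic (an assumption on my part, not an official Kafka rule) is to size the partition count from both the target throughput and the downstream parallelism you want, since each partition is consumed by at most one task per consumer group. A minimal sketch:&lt;/P&gt;

```python
import math

def suggest_partitions(target_mb_per_s, per_partition_mb_per_s, desired_parallelism):
    # Heuristic, not an official formula: enough partitions to sustain the
    # target throughput, and at least one partition per consumer/executor
    # that should read in parallel.
    throughput_based = math.ceil(target_mb_per_s / per_partition_mb_per_s)
    return max(throughput_based, desired_parallelism)

# e.g. 50 MB/s target, roughly 10 MB/s per partition, 10 Spark executors
print(suggest_partitions(50, 10, 10))  # prints 10
```

&lt;P&gt;The throughput figures above (10 MB/s per partition) are illustrative assumptions; measure your own brokers before settling on a count.&lt;/P&gt;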
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm using an HDP 3.1 cluster, running Spark, Hive, and Kafka on multiple nodes:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;dataNode1 - dataNode3: Hive + Spark + Kafka broker&lt;/LI&gt;
&lt;LI&gt;dataNode4 - dataNode8: Hive + Spark&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Mon, 25 Nov 2019 13:47:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Streaming-Hive-Kafka-Only-one-Worker-active/m-p/283931#M210902</guid>
      <dc:creator>dmueller1607</dc:creator>
      <dc:date>2019-11-25T13:47:47Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming / Hive + Kafka: Only one Worker active</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-Streaming-Hive-Kafka-Only-one-Worker-active/m-p/283938#M210908</link>
      <description>&lt;P&gt;To answer my own question: once I switched the Kafka topic to multiple partitions, Spark used more executors to process the data. Likewise, Hive/Tez creates as many worker containers as the topic has partitions.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2019 08:49:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-Streaming-Hive-Kafka-Only-one-Worker-active/m-p/283938#M210908</guid>
      <dc:creator>dmueller1607</dc:creator>
      <dc:date>2019-11-25T08:49:58Z</dc:date>
    </item>
  </channel>
</rss>

