Spark Streaming / Hive + Kafka: Only one Worker active


I wrote a Kafka producer that sends simulated data to a Kafka topic (replication factor 3, one partition).

 

Now, I want to access this data by using Hive and/or Spark Streaming. 

 

First approach: Using an external Hive table with KafkaStorageHandler:

CREATE EXTERNAL TABLE mydb.kafka_timeseriestest (
description string,
version int,
ts timestamp,
varname string,
varvalue float
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
"kafka.topic" = "testtopic",
"kafka.bootstrap.servers"="server1:6667,server2:6667,server3:6667"
);

-- e.g. SELECT max(varvalue) from mydb.kafka_timeseriestest;
-- takes too long, and only one Tez task is running

 

Second approach: Writing a Spark Streaming app that accesses the Kafka topic:

// started with 10 executors, but only one executor is active

...
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
    jssc,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
...

 

In both cases, only one Tez/Spark worker is active. Therefore reading all data (~500 million entries) takes a very long time. How can I increase the performance? Is the issue caused by the one-partition topic? If yes, is there a rule of thumb according to which the number of partitions should be determined?

 

I'm using an HDP 3.1 cluster, running Spark, Hive and Kafka on multiple nodes:

  • dataNode1 - dataNode3: Hive + Spark + Kafka broker
  • dataNode4 - dataNode8: Hive + Spark 
1 ACCEPTED SOLUTION


To answer my own question: Yes, the single-partition topic was the cause. Now that the Kafka topic has multiple partitions, Spark uses more executors to process the data, and Hive/Tez creates as many worker containers as the topic has partitions.
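
On the rule-of-thumb part of the question, here is a rough sketch rather than a definitive answer. A commonly cited heuristic is to take the target throughput t and divide it by the measured per-partition producer throughput p and consumer throughput c, then use max(t/p, t/c) partitions. Since each Kafka partition is read by one Spark task (or one Tez container) at a time, the number of busy workers is capped by the partition count. The class name, method names and all throughput numbers below are hypothetical and only illustrate the arithmetic; measure p and c on your own cluster.

```java
// Hypothetical helper, not part of Kafka, Spark or Hive.
final class PartitionEstimate {
    private PartitionEstimate() {}

    // Heuristic partition count: max(t/p, t/c), rounded up, at least 1.
    // All three arguments use the same unit (e.g. MB/s) and must be positive.
    static int estimatePartitions(double target, double perPartitionProducer,
                                  double perPartitionConsumer) {
        if (target <= 0 || perPartitionProducer <= 0 || perPartitionConsumer <= 0) {
            throw new IllegalArgumentException("throughput values must be positive");
        }
        int forProducers = (int) Math.ceil(target / perPartitionProducer);
        int forConsumers = (int) Math.ceil(target / perPartitionConsumer);
        return Math.max(1, Math.max(forProducers, forConsumers));
    }

    // Upper bound on workers reading concurrently: one reader per partition,
    // so extra executors/containers beyond the partition count stay idle.
    static int activeReaders(int workers, int partitions) {
        return Math.min(workers, partitions);
    }

    public static void main(String[] args) {
        // Assumed numbers: 100 MB/s target, 20 MB/s per partition when
        // producing, 10 MB/s per partition when consuming.
        System.out.println(estimatePartitions(100.0, 20.0, 10.0)); // prints 10
        // The setup from the question: 10 executors, 1 partition.
        System.out.println(activeReaders(10, 1)); // prints 1
    }
}
```

With the one-partition topic from the question, activeReaders(10, 1) gives 1, which matches the single active executor. Keep in mind that the partition count of an existing topic can be increased but not decreased, and that adding partitions changes which partition a given key maps to.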
