Support Questions

cjervis · ‎11-24-2019

I wrote a Kafka producer, that sends some simulated data to a Kafka stream (replication-factor 3, one partition).

Now, I want to access this data by using Hive and/or Spark Streaming.

First approach: Using an external Hive table with KafkaStorageHandler:

CREATE EXTERNAL TABLE mydb.kafka_timeseriestest (
  description string,
  version int,
  ts timestamp,
  varname string,
  varvalue float
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "testtopic", 
  "kafka.bootstrap.servers"="server1:6667,server2:6667,server3:6667"
);

-- e.g. SELECT max(varvalue) from mydb.kafka_timeseriestest; 
-- takes too long, and only one Tez task is running

Second approach: Writing a Spark Streaming app, that accesses the Kafka topic:

// started with 10 executors, but only one executor is active

...
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(jssc, LocationStrategies.PreferConsistent(), ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
...

In both cases, only one Tez/Spark worker is active. Therefore reading all data (~500 million entries) takes a very long time. How can I increase the performance? Is the issue caused by the one-partition topic? If yes, is there a rule of thumb according to which the number of partitions should be determined?

I'm using a HDP 3.1 cluster, running Spark, Hive and Kafka on multiple nodes:

dataNode1 - dataNode3: Hive + Spark + Kafka broker
dataNode4 - dataNode8: Hive + Spark

dmueller1607 · ‎11-25-2019

To answer my own question: Since I'm using multiple partitions for the Kafka topic, Spark uses more executors to process the data. Also Hive/Tez creates as many worker containers as the topic contains partitions.

View solution in original post

dmueller1607 · ‎11-25-2019

To answer my own question: Since I'm using multiple partitions for the Kafka topic, Spark uses more executors to process the data. Also Hive/Tez creates as many worker containers as the topic contains partitions.

Cloudera Community

Support Questions

Spark Streaming / Hive + Kafka: Only one Worker active