Member since: 04-24-2017
Posts: 106
Kudos Received: 13
Solutions: 7
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2291 | 11-25-2019 12:49 AM |
11-25-2019 12:49 AM
1 Kudo
To answer my own question: since I'm now using multiple partitions for the Kafka topic, Spark assigns more executors to process the data, and Hive/Tez launches as many worker containers as the topic has partitions.
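The mechanism behind this answer (at most one consumer task can read a given partition) can be illustrated with a toy round-robin assignment. This is a hedged sketch, not Kafka's actual partition assignor; the `assign` helper and its counts are purely illustrative:

```java
import java.util.*;

public class AssignmentDemo {
    // Toy round-robin assignment of partition ids to consumer ids. With a
    // single partition, only consumer 0 ever receives work, which matches the
    // "only one executor is active" behavior described in the question.
    public static Map<Integer, List<Integer>> assign(int partitions, int consumers) {
        Map<Integer, List<Integer>> byConsumer = new TreeMap<>();
        for (int c = 0; c < consumers; c++) byConsumer.put(c, new ArrayList<>());
        for (int p = 0; p < partitions; p++) byConsumer.get(p % consumers).add(p);
        return byConsumer;
    }

    public static void main(String[] args) {
        // 1 partition, 10 consumers: only consumer 0 gets a partition.
        System.out.println(assign(1, 10).get(0)); // prints [0]
        // 8 partitions, 10 consumers: 8 consumers each get one partition.
        System.out.println(assign(8, 10).get(7)); // prints [7]
    }
}
```

In other words, no matter how many executors or Tez containers are started, parallelism is capped by the partition count.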
11-24-2019 11:18 PM
I wrote a Kafka producer that sends some simulated data to a Kafka topic (replication factor 3, one partition).
Now, I want to access this data by using Hive and/or Spark Streaming.
First approach: Using an external Hive table with KafkaStorageHandler:
```sql
CREATE EXTERNAL TABLE mydb.kafka_timeseriestest (
  description string,
  version int,
  ts timestamp,
  varname string,
  varvalue float
)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "testtopic",
  "kafka.bootstrap.servers" = "server1:6667,server2:6667,server3:6667"
);

-- e.g. SELECT max(varvalue) FROM mydb.kafka_timeseriestest;
-- takes too long, and only one Tez task is running
```
Second approach: Writing a Spark Streaming app, that accesses the Kafka topic:
```java
// started with 10 executors, but only one executor is active
...
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
...
```
In both cases, only one Tez/Spark worker is active, so reading all the data (~500 million entries) takes a very long time. How can I increase performance? Is the issue caused by the single-partition topic? If so, is there a rule of thumb for determining the number of partitions?
I'm using a HDP 3.1 cluster, running Spark, Hive and Kafka on multiple nodes:
dataNode1 - dataNode3: Hive + Spark + Kafka broker
dataNode4 - dataNode8: Hive + Spark
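On the "rule of thumb" question: a commonly cited heuristic (an assumption here, not something stated in this thread) sizes a topic by throughput: take the target throughput T and the measured per-partition producer and consumer throughputs p and c, then use max(ceil(T/p), ceil(T/c)) partitions. A minimal sketch, with all figures being hypothetical measurements:

```java
public class PartitionSizing {
    // Throughput-based sizing heuristic. All arguments are in the same unit
    // (e.g. MB/s); the caller supplies measured or estimated values.
    public static int suggestedPartitions(double targetThroughput,
                                          double perPartitionProducerThroughput,
                                          double perPartitionConsumerThroughput) {
        int byProducer = (int) Math.ceil(targetThroughput / perPartitionProducerThroughput);
        int byConsumer = (int) Math.ceil(targetThroughput / perPartitionConsumerThroughput);
        return Math.max(byProducer, byConsumer);
    }

    public static void main(String[] args) {
        // Example: target 100 MB/s when one partition sustains ~20 MB/s on the
        // producer side and ~25 MB/s on the consumer side -> 5 partitions.
        System.out.println(suggestedPartitions(100, 20, 25)); // prints 5
    }
}
```

In practice people also bound the result by the desired consumer parallelism, since consumers beyond the partition count sit idle.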
11-06-2017 12:49 PM
Hi @Bryan Bende, this tutorial is really helpful, thank you! For me, everything is working except the "SPNEGO" part:

a) When I open "https://myhost.de:9445/nifi" (without a kinit beforehand), my Kerberos client asks for authentication (which looks good). When I enter the principal and password, it continues with step b.

b) When I have already done a "kinit" before opening my browser and entering "https://myhost.de:9445/nifi", I still get the username/password prompt shown in the "Kerberos Login" section.

What am I missing here? I configured the following settings in my Firefox browser:

network.auth.use-sspi = false (Windows only)
network.negotiate-auth.delegation-uris = https://myhost.de:9445
network.negotiate-auth.trusted-uris = https://myhost.de:9445

I tested it on CentOS 7, Ubuntu, and Windows; I always get the login screen instead of it being skipped after the "kinit". Can you help?