Created 03-11-2019 03:32 AM
I have run into a significant performance problem recently.
I have about 30 Spark Streaming applications that read data from Kafka and write it to HDFS. Recently, writing on some Spark executors has become very slow. The amount of data per Spark task is similar, but task durations vary widely: the slowest tasks take about 4 times as long as the fastest ones.
I have checked disk usage, and disk busy time on some hosts is around 80% to 90%.
So my guess is that this is caused by slow HDFS write speed, since my Kafka brokers, HDFS DataNodes, and YARN NodeManagers are located on the same hosts.
Can co-locating them actually affect performance like this?
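For context, each application is roughly of the following shape. This is only a minimal sketch using the spark-streaming-kafka-0-10 direct stream API; the broker addresses, topic name, group id, and output path are illustrative, not the real ones.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs")
    val ssc  = new StreamingContext(conf, Seconds(60))

    // Illustrative consumer settings; the real brokers, topic and group id differ.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092,broker2:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent, // no assumption that executors sit on the brokers
      ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
    )

    // Write each micro-batch out to HDFS; the output path is illustrative.
    stream.map(_.value).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty) {
        rdd.saveAsTextFile(s"hdfs:///data/example-topic/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```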
Created 03-11-2019 07:35 PM
"Can they" - Yes.
"Should they" - I would say no.
Kafka is very memory and disk sensitive. Depending on your use of it, it could even use more I/O than the combination of the DataNode and NodeManager on the same machine.
Personally, I would recommend installing Kafka brokers on dedicated hardware, even separate from the Zookeeper servers it needs, if at all possible.
The Spark executors do not need to run on the Kafka brokers; they should work fine pulling data remotely from the YARN NodeManagers.
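For what it's worth, the spark-streaming-kafka-0-10 direct stream API lets you state this explicitly through its location strategies. A minimal sketch of the two relevant options (the object name is just for illustration):

```scala
import org.apache.spark.streaming.kafka010.{LocationStrategies, LocationStrategy}

object LocationChoice {
  // PreferConsistent spreads Kafka partitions evenly over whatever executors
  // YARN allocates, with no assumption that they sit on the broker hosts.
  val remoteFriendly: LocationStrategy = LocationStrategies.PreferConsistent

  // PreferBrokers is only meant for the special case where executors really
  // do run on the same machines as the Kafka brokers.
  val coLocatedOnly: LocationStrategy = LocationStrategies.PreferBrokers
}
```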
Created 03-12-2019 01:08 AM
Thanks @Jordan Moore
Since "Kafka is very memory and disk sensitive", would you recommend installing the Kafka brokers on virtual machines? I cannot get more dedicated machines for Kafka.
Created 03-18-2019 07:08 PM
@Junfeng Chen, as mentioned, it depends on your use of it. It will run okay in most deployment patterns, and it can run fine in VMs, but of course having dedicated hardware is always preferred.