I recently ran into a significant performance problem.
I have about 30 Spark Streaming applications that read data from Kafka and write it to HDFS. Recently, writes on some Spark executors have become very slow. Each Spark task handles a similar amount of data, but task durations vary widely: the slowest task takes about 4 times as long as the fastest.
I checked disk usage and found that disk utilization (busy time) on some hosts is about 80% to 90%.
So I suspect that slow HDFS writes are the cause, because my Kafka brokers, HDFS DataNodes, and YARN NodeManagers are located on the same hosts.
Does this co-location actually affect performance?
"Can they" - Yes.
"Should they" - I would say no.
Kafka is very memory and disk sensitive. Depending on your use of it, it could even use more I/O than the combination of the DataNode and NodeManager on the same machine.
Personally, I would recommend installing Kafka brokers on dedicated hardware, separate even from the ZooKeeper servers they need, if at all possible.
The Spark executors do not need to run on the same hosts as the Kafka brokers; they should work fine running on the YARN NodeManagers and pulling from Kafka remotely.
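
To check whether the slow tasks really line up with the busy hosts before moving anything, you could group task durations by host. This is only a minimal sketch: it assumes you have already collected (host, duration in seconds) pairs for one stage, for example from the Spark UI, and the function names and sample values below are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_duration_by_host(tasks):
    """Group (host, duration_seconds) pairs by host and return {host: median duration}."""
    by_host = defaultdict(list)
    for host, seconds in tasks:
        by_host[host].append(seconds)
    return {host: median(durations) for host, durations in by_host.items()}


def slow_hosts(tasks, factor=2.0):
    """Return hosts whose median task duration is at least `factor` times
    the median of the fastest host. `factor` is an arbitrary threshold."""
    medians = median_duration_by_host(tasks)
    if not medians:
        return []
    baseline = min(medians.values())
    return sorted(host for host, m in medians.items() if m >= factor * baseline)


# Hypothetical sample data, not taken from the cluster in the question.
sample = [
    ("node1", 10.0), ("node1", 12.0),
    ("node2", 40.0), ("node2", 44.0),
    ("node3", 11.0),
]
print(slow_hosts(sample))
```

If the hosts this flags are the same ones showing 80% to 90% disk utilization, that supports the idea that disk contention from co-located services is the bottleneck, rather than data skew between tasks.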