I have a Spark Streaming application that reads from 4 different Kafka topics, each with 3 partitions. The topics are read at different times (I have 4 pipelines processed in sequence), so my idea was that I need just 3 Spark executors (one for each partition of a topic) with one core each. When I submit the application this way, execution is not parallelized across the executors and the processing time is very high relative to the complexity of the computation. What's wrong with this assumption?
If I run the same application with 4 executors with 4 cores each, the execution is parallelized across all the executors and the processing time is low.
I'm wondering whether there is a best practice for the number of executors and cores per topic/partition when consuming from a Kafka topic with Spark Streaming.
Thanks in advance,
I'm using directStream and the topics are read one by one, so I thought 3 tasks would be enough.
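To make that assumption concrete, here is a rough sketch of the arithmetic behind it. It assumes that the direct stream creates one Spark task per Kafka partition for the read stage and that each task uses one core (the default `spark.task.cpus=1`). The function name `stage_waves` is made up for illustration and is not part of any Spark API.

```python
import math

def stage_waves(kafka_partitions, executors, cores_per_executor):
    """Return (task_slots, waves, idle_slots) for one read stage.

    Assumption: direct stream => one task per Kafka partition,
    and every task occupies exactly one core.
    """
    task_slots = executors * cores_per_executor   # tasks that can run at once
    tasks = kafka_partitions                      # one task per Kafka partition
    waves = math.ceil(tasks / task_slots)         # rounds needed to run all tasks
    idle_slots = max(task_slots - tasks, 0)       # cores with nothing to do
    return task_slots, waves, idle_slots

print(stage_waves(3, 3, 1))  # (3, 1, 0)   3 executors x 1 core
print(stage_waves(3, 4, 4))  # (16, 1, 13) 4 executors x 4 cores
```

By this arithmetic both setups finish the read stage in a single wave, so it only covers the read itself. Stages that follow a shuffle or a repartition can have a different number of tasks, which this sketch does not model.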
The strange thing is that I'm observing different behavior when running the same application on another cluster.
The second cluster is smaller than the first: it has 3 brokers instead of 4. To reach good performance there I need to run the application with 6 executors with 1 core each, and I can see that only 3 executors receive work.
Could the described scenario be related to the architecture of the cluster?
@srowen Are 12 executors really necessary? Surely you just need a total of 12 cores (so you could have 1 executor with 12 cores).
Is this what you mean by "Also, 1 core per executor is generally very low."?
What happens when you have more cores than Kafka partitions? Will it generally run faster?