Support Questions

bdelpizzo · ‎11-17-2017

I have a Spark Streaming application that reads from 4 different Kafka topics, and each topic has 3 partitions. The reads happen at different times (I have 4 pipelines processed in sequence), so my idea was that I need just 3 Spark executors (one for each partition of a topic) with one core each. When I submit the application this way, I can see that execution is not parallelized across the executors, and processing time is very high relative to the complexity of the computation. What's wrong with this assumption?

If I run the same application with 4 executors with 4 cores each, execution is parallelized across all the executors and processing time is low.

I'm wondering whether there are best practices for the number of executors and cores per topic/partition when consuming from a Kafka topic with Spark Streaming.

Thanks in advance,

Beniamino

srowen · ‎11-17-2017

If you have 4 topics with 3 partitions each, then you need 12 executor slots to process fully in parallel. You have only 3 slots. If you are using receiver-based streaming, you may need 1 more, too.

Also, 1 core per executor is generally very low.

Your result is therefore not surprising, and your second config is much more reasonable.
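
To make the slot arithmetic above concrete, here is a minimal sketch in plain Python (no Spark required). It assumes one task per Kafka partition and one CPU per task (the default for spark.task.cpus); the helper names `task_slots` and `waves` are hypothetical and only illustrate the counting, not how Spark schedules work internally.

```python
import math

def task_slots(num_executors: int, cores_per_executor: int, task_cpus: int = 1) -> int:
    """Number of tasks that can run at the same time across all executors."""
    return num_executors * (cores_per_executor // task_cpus)

def waves(num_tasks: int, slots: int) -> int:
    """How many rounds are needed to run num_tasks when only `slots` run at once."""
    return math.ceil(num_tasks / slots)

# Numbers taken from the thread: 4 topics with 3 partitions each.
TOPICS = 4
PARTITIONS_PER_TOPIC = 3
total_partitions = TOPICS * PARTITIONS_PER_TOPIC  # 12 tasks if all topics are read together

# The two configurations described in the question.
for executors, cores in [(3, 1), (4, 4)]:
    slots = task_slots(executors, cores)
    print(
        f"{executors} executors x {cores} cores = {slots} slots, "
        f"{waves(total_partitions, slots)} wave(s) for {total_partitions} partitions"
    )
```

Under these assumptions, the first configuration has 3 slots and needs 4 waves to cover all 12 partitions, while the second has 16 slots and covers them in a single wave.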

bdelpizzo · ‎11-17-2017

I'm using directStream, and the topics are read one by one, so I thought 3 tasks would be enough.

The strange thing is that I'm observing different behavior when running the same application on another cluster.

The second cluster is smaller than the first: it has 3 brokers instead of 4. To get good performance there I need to run the application with 6 executors with 1 core each, and I can see that only 3 executors receive work.

Could the described scenario be related to the architecture of the cluster?

Thanks again,

Beniamino

samthebest · ‎09-02-2018

@srowen Is 12 executors really necessary? Surely you just need a total of 12 cores (so you could have 1 executor with 12 cores).

Is this what you mean by "Also, 1 core per executor is generally very low."?

What happens when you have more cores than Kafka partitions? Will it generally run faster?
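
The distinction between executors and total cores raised here can be sketched as simple arithmetic. The snippet below is plain Python with hypothetical helper names (`layouts`, `idle_slots`). It assumes one CPU per task and one task per Kafka partition, and it only counts slots; it says nothing about memory, GC, or network effects, which also differ between layouts.

```python
from typing import List, Tuple

def layouts(total_slots: int) -> List[Tuple[int, int]]:
    """All (executors, cores_per_executor) pairs that give exactly total_slots slots."""
    return [
        (executors, total_slots // executors)
        for executors in range(1, total_slots + 1)
        if total_slots % executors == 0
    ]

def idle_slots(slots: int, num_tasks: int) -> int:
    """Slots left without a task when there are fewer tasks than slots."""
    return max(0, slots - num_tasks)

# 12 slots can come from 1 executor with 12 cores, 12 executors with 1 core, and so on.
for executors, cores in layouts(12):
    print(f"{executors} executor(s) x {cores} core(s) = 12 slots")

# Assumption: one task per Kafka partition, so 16 slots and 12 partitions leave 4 slots idle
# for that stage unless the data is repartitioned.
print("idle slots with 16 slots and 12 partitions:", idle_slots(16, 12))
```

Every layout listed gives the same 12 concurrent task slots, so by this count alone the number of executors is not what matters. With more slots than partitions, the extra slots simply have no task to run in that stage under the stated assumption.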

Cloudera Community

What's the right number of cores and executors for a spark streaming application?