
Spark-Streaming and dynamic allocation

Contributor

I want to set up a Spark Streaming cluster with dynamic allocation. I tested dynamic allocation by submitting the SparkPi application, and it works fine.

Then I tried my own application: it gets its input from another server, and I use socketTextStream to receive the input data.

The application is very simple:

// Receive lines over a plain TCP socket (a single receiver, running on one executor).
final JavaReceiverInputDStream<String> stream =
	ssc.socketTextStream(host, port, StorageLevels.MEMORY_AND_DISK_SER);

// The map function is the expensive part; it needs a few milliseconds per event.
final JavaDStream<MyObject> mapStream = stream.map(...);

mapStream.foreachRDD(new VoidFunction<JavaRDD<MyObject>>() {
	@Override
	public void call(final JavaRDD<MyObject> rdd) throws Exception {
		// Force evaluation of the batch; the collected results are not used further.
		rdd.collect();
	}
});

The map function is the core of my application; it needs a few milliseconds of computation time per event.

When I increase the number of events, the cluster allocates new containers, and when the event rate slows down, the number of containers decreases again. But the newly allocated containers are never used: the number of completed tasks per executor stays at 0. Because these containers are never used, Spark deallocates them after executorIdleTimeout and immediately allocates new ones, since the workload is still very high. Only the first two containers from the application start do any work.
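For completeness, this is roughly how dynamic allocation is enabled on my side (a sketch only; the min/max executor values are placeholders, and the YARN external shuffle service is additionally configured on the NodeManagers):

import org.apache.spark.SparkConf;

// Sketch of the dynamic-allocation settings (min/max values are placeholders).
final SparkConf conf = new SparkConf()
	.set("spark.dynamicAllocation.enabled", "true")
	.set("spark.shuffle.service.enabled", "true")   // external shuffle service on YARN
	.set("spark.dynamicAllocation.minExecutors", "2")
	.set("spark.dynamicAllocation.maxExecutors", "10")
	.set("spark.dynamicAllocation.executorIdleTimeout", "60s");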

I thought it might help to use Kafka, so that receiving the events is distributed across all containers. (I don't know whether that is the right way to solve my problem.)

With Kafka I ran into another problem: Spark didn't allocate more containers than the number of partitions I have set for my topic.
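Roughly, the Kafka variant looks like this (a sketch, assuming the direct, receiver-less API from spark-streaming-kafka; the broker address and topic name are placeholders):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.kafka.KafkaUtils;

// Direct stream: one Spark partition per Kafka partition of the topic.
final Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", "broker-host:6667"); // placeholder broker

final Set<String> topics = Collections.singleton("my-topic"); // placeholder topic

final JavaPairInputDStream<String, String> kafkaStream = KafkaUtils.createDirectStream(
	ssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
	kafkaParams, topics);

// The rest is the same map / foreachRDD pipeline as above.

With the direct approach, each batch has exactly one Spark partition per Kafka partition, which would match what I'm seeing: the work cannot be spread over more executors than there are topic partitions.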

I use HDP 2.4 to set up my 3-node cluster with:

  • ZooKeeper: 3.4
  • HDFS, YARN, MapReduce2: 2.7
  • Spark: 1.6
  • Kafka: 0.9

Each node has 4 cores and 8 GB RAM.

Can you tell me which option is the right one to solve my problem, and how I can fix it?

Thank you so much for helping me 🙂

1 ACCEPTED SOLUTION

Super Collaborator

Please keep in mind (from the HDP 2.4 docs):

Important: Dynamic Resource Allocation does not work with Spark Streaming.


9 REPLIES

Expert Contributor

Hi @Rene,

Can you also share the values of the following properties? And how much memory and how many cores do your nodes contribute to YARN? (You can see the values in the Resource Manager UI -> Nodes.)

yarn.scheduler.minimum-allocation-mb

yarn.nodemanager.resource.memory-mb

mapreduce.map.memory.mb

mapreduce.map.java.opts.max.heap

mapreduce.map.cpu.vcores

Contributor

Each node contributes 4 GB RAM and 4 cores to YARN. I submit the application with:

--driver-memory 640mb
--executor-memory 640mb

With the overhead, each container uses 1 GB RAM, so altogether 12 containers are possible.

yarn.scheduler.minimum-allocation-mb=512mb
yarn.nodemanager.resource.memory-mb=4096mb
mapreduce.map.memory.mb=1.5GB

The following properties are not defined:

mapreduce.map.java.opts.max.heap
mapreduce.map.cpu.vcores
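For reference, the 12-container figure comes from this back-of-the-envelope calculation (assuming the default executor memory overhead of max(384 MB, 10 % of executor memory)):

640 MB executor memory + 384 MB overhead = 1024 MB
1024 MB is already a multiple of yarn.scheduler.minimum-allocation-mb (512 MB), so each container is 1 GB
3 nodes x 4096 MB = 12288 MB, and 12288 MB / 1024 MB = 12 containers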

Super Collaborator

Please keep in mind (from the HDP 2.4 docs):

Important: Dynamic Resource Allocation does not work with Spark Streaming.

Contributor

Oh no... I didn't see this before. Is this only an HDP 2.4 limitation? According to the CDH docs it seems to be possible.

Super Collaborator

Spark Streaming & Dynamic Resource Allocation is a new feature in Spark 2.0 (https://issues.apache.org/jira/browse/SPARK-12133), so it is not yet available in either HDP or CDH.
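For reference, once a Spark 2 build with this feature is available, it is driven by streaming-specific properties rather than the regular spark.dynamicAllocation.* ones. The property names below are taken from the SPARK-12133 discussion and should be treated as assumptions until verified against the actual release:

import org.apache.spark.SparkConf;

// Assumed property names from SPARK-12133 (verify against your Spark 2 build).
final SparkConf conf = new SparkConf()
	.set("spark.streaming.dynamicAllocation.enabled", "true")
	.set("spark.streaming.dynamicAllocation.minExecutors", "2")   // placeholder value
	.set("spark.streaming.dynamicAllocation.maxExecutors", "10"); // placeholder value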

New Contributor

Hello, HDP 2.6.1 ships Spark 2. Is dynamic resource allocation for streaming jobs working now?
