Member since: 01-10-2017
Posts: 25
Kudos Received: 0
Solutions: 0
03-03-2017
11:27 AM
@srowen I don't think the upstream transformations or fetching the streams are causing any delays, as highlighted in the pic below. The only stage that runs in minutes is foreachRDD, even though there is no code in it. [Screenshot: stage execution times]
03-03-2017
07:59 AM
@srowen I tried with foreachPartition too; it didn't improve the time at all. Also, if creating the connection is the problem, as you said, then how come the empty loop took the same 3 hours? Could you please explain that? Just to isolate the problem and check whether the connection code is the cause of the slowness, I ran the test with an empty loop, and it looks like it's not.
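For reference, the connection-per-partition variant referred to here typically looks like the sketch below (illustrative only, assuming the same tuple structure as in Code Snippet 1 further down; this is not the exact code that was run):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest

requestsWithState.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One client per partition instead of one per record
    val client = new AmazonDynamoDBClient()
    partition.foreach {
      case (tableName, key, update) =>
        try client.updateItem(new UpdateItemRequest(tableName, key, update))
        catch { case e: Exception => println(s"updateItem failed for table $tableName: $e") }
      case null => // skip nulls, as in the original snippet
    }
    client.shutdown()
  }
}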
03-03-2017
07:22 AM
We have a Spark streaming application which ingests data at 10,000 records/sec. We use the foreachRDD operation on our DStream (since Spark doesn't execute anything unless it finds an output operation on the DStream), so we have to use the foreachRDD output operation like this. It takes up to 3 hours to write a single batch of data (10,000 records), which is slow. requestsWithState is a DStream.

Code Snippet 1:

requestsWithState.foreachRDD { rdd =>
  rdd.foreach {
    case (topicsTableName, hashKeyTemp, attributeValueUpdate) =>
      // A new client is created for every record
      val client = new AmazonDynamoDBClient()
      val request = new UpdateItemRequest(topicsTableName, hashKeyTemp, attributeValueUpdate)
      try client.updateItem(request)
      catch {
        case se: Exception => println(s"Error executing updateItem! Table: $topicsTableName, error: $se")
      }
    case null =>
  }
}
So I thought the code inside foreachRDD might be the problem and commented it out to see how much time it takes. To my surprise, even with no code inside foreachRDD, it still runs for 3 hours.

Code Snippet 2:

requestsWithState.foreachRDD { rdd =>
  rdd.foreach { _ =>
    // No code here; it still takes a lot of time (there used to be code, but we removed it to see if it's any faster without it)
  }
}

Please let us know if we are missing anything, or if there is an alternative way to do this; as I understand it, a Spark streaming application will not run without an output operation on the DStream, and at this time I can't use other output operations. Note: to isolate the problem and make sure the DynamoDB code is not the problem, I ran with the empty loop. It looks like foreachRDD is slow on its own when iterating over a huge record set coming in at 10,000 records/sec, and not the DynamoDB code, since the empty foreachRDD and the one with DynamoDB code took the same time.
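A quick way to confirm where the time actually goes (an illustrative diagnostic, not part of the original post) is to force the upstream computation with no per-record work at all; if this alone takes ~3 hours, the cost is in computing requestsWithState itself, not in the foreach body:

requestsWithState.foreachRDD { rdd =>
  val start = System.nanoTime()
  // count() materializes the RDD, running all upstream transformations,
  // but performs no per-record side effects
  val n = rdd.count()
  println(s"Materialized $n records in ${(System.nanoTime() - start) / 1e9} seconds")
}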
Labels: Apache Spark
02-02-2017
11:59 AM
Cluster capacity:
- 1 Master/Driver node: Memory: 24 GB, Cores: 8
- 4 Worker nodes: Memory: 24 GB, Cores: 8 (each)

Yes, we are following the formula as mentioned.
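For concreteness, one common way such a sizing rule plays out on this hardware (an illustration under assumed overheads, not numbers taken from the thread): reserving roughly 1 core and 2 GB per worker for the OS and NodeManager leaves about 7 vcores and 22 GB per node for YARN containers, which matches the yarn.nodemanager.resource.* values shown in the post below. A submit line consistent with that budget might look like:

spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 5 \
  --executor-memory 16g \
  <application-jar>

(--num-executors, --executor-cores, and --executor-memory are standard spark-submit flags; the specific values here are illustrative, chosen so that executor memory plus the default overhead fits within the 22 GB per-node budget.)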
02-02-2017
07:55 AM
@saranvisa Unfortunately we are not using Cloudera Manager; we are using Apache Hadoop 2.7.3 and the YARN that comes along with it. I also made sure yarn-site.xml is updated on all nodes and has the same values. This is what YARN is reflecting [screenshot], and this is what is configured in yarn-site.xml: it is configured for 22 GB and 7 cores, but YARN is only using 16 GB and 6 cores, and I am not sure why.

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hdfs-name-node</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>22528</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>7</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>22528</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>file:///tmp/hadoop/data/nm-local-dir,file:///tmp/hadoop/data/nm-local-dir/filecache,file:///tmp/hadoop/data/nm-local-dir/usercache</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
    <value>500</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
    <value>512</value>
  </property>
</configuration>
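One way to cross-check what each NodeManager actually registered with the ResourceManager (a general diagnostic, not something from this thread) is the YARN CLI:

yarn node -list
yarn node -status <node-id>   # reports the memory/vcore capacity the NM registered

If the reported capacity still shows the old values, the NodeManager on that host is reading a stale yarn-site.xml or was not restarted.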
02-02-2017
06:19 AM
We have an application managed by YARN. When we change yarn-site.xml, those changes are not reflected; the application is still running with the old configuration. We are new to YARN, so any help in this regard would be appreciated. Note: we have already tried restarting YARN using stop-yarn.sh and start-yarn.sh, and also restarted DFS using start-dfs.sh and stop-dfs.sh. We are using Hadoop 2.7.3.
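For what it's worth, with stock Hadoop 2.7.3 the yarn.nodemanager.resource.* settings are read by each NodeManager from its own local yarn-site.xml, so the edited file has to be present on every node before restarting. A per-node restart with the bundled scripts would look roughly like this (illustrative; paths assume a standard tarball install):

# on each worker node, after syncing the updated yarn-site.xml into $HADOOP_CONF_DIR
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager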
Labels: Apache YARN