09-11-2014 08:22 AM
Hi, the worker logs show the following connection errors. Any idea how to resolve them?

AssociationError [akka.tcp://sparkWorker@host1:7078] -> [akka.tcp://sparkExecutor@worker1:33912]:
Error [Association failed with [akka.tcp://sparkExecutor@worker1:33912]]
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@worker1:33912]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: worker1/10.11.11.11:33912]
09-11-2014 03:09 AM
Hi, I am submitting a Spark Streaming job with spark-submit:

spark-submit --class "test.Main" --master yarn-client testjob.jar

But I am hitting the errors below. Please help me resolve them.

14/09/11 05:55:45 ERROR YarnClientClusterScheduler: Lost executor 4 on host1.com: remote Akka client disassociated
14/09/11 05:56:06 ERROR JobScheduler: Error running job streaming job 1410429330000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 4 times, most recent failure: TID 3 on host host2.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 4 times, most recent failure: TID 3 on host host3.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
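One thing I was wondering: could the executors be getting killed for exceeding their memory limits, and would raising the memory settings be a sensible first experiment? A rough sketch of what I mean (the sizes are placeholders I made up, not tested values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Placeholder sizes -- guesses to experiment with, not recommendations.
val conf = new SparkConf()
  .setAppName("test.Main")
  .set("spark.executor.memory", "2g")   // heap for each YARN executor
val ssc = new StreamingContext(conf, Seconds(10))
// The driver heap has to be set before the driver JVM starts, so it goes on
// the command line instead: spark-submit --driver-memory 2g ...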
Labels:
- Apache Spark
- Apache YARN
09-10-2014 08:04 AM
Hi, I am joining my streaming data with data that is already present in HDFS. When I run the code in the Scala shell it works fine and the data gets joined, but when I compile the same code in Eclipse to build a jar, the join no longer compiles. Please suggest how to solve this. The failing part is:

val streamkv = streamrecs.map(_.split("~")).map(r => (r(0), (r(5), r(6))))
val HDFSlines = sc.textFile("/user/Rest/sample.dat").map(_.split("~")).map(r => (r(1), r(0)))
val streamwindow = streamkv.window(Minutes(1))
val join = streamwindow.transform(joinRDD => { joinRDD.join(HDFSlines) })

In the transform step I get the error:

value join is not a member of org.apache.spark.rdd.RDD[(String, (String, String))]

The same code works fine in the Scala shell. I have imported the following packages:

import scala.io.Source
import java.io._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.api.java.function._
import org.apache.spark.streaming.api._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.streaming.dstream.PairDStreamFunctions
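From reading the API docs, join comes from PairRDDFunctions through an implicit conversion defined in the SparkContext companion object, so perhaps the compiled version is simply missing import org.apache.spark.SparkContext._ (which the shell provides automatically). A minimal sketch of what I plan to try, assuming streamrecs and sc are already set up:

import org.apache.spark.SparkContext._   // implicit RDD -> PairRDDFunctions conversion
import org.apache.spark.streaming.Minutes

val streamkv = streamrecs.map(_.split("~")).map(r => (r(0), (r(5), r(6))))
val HDFSlines = sc.textFile("/user/Rest/sample.dat").map(_.split("~")).map(r => (r(1), r(0)))
// transform hands each windowed batch over as an RDD, where join should now resolve
val join = streamkv.window(Minutes(1)).transform(rdd => rdd.join(HDFSlines))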
Labels:
- Apache Spark
- HDFS
09-05-2014 07:56 AM
I really appreciate all the answers you have given today; they clarify a lot, thanks! Just one final question: I believe collect(), take() and print() are the only functions that put load on the driver. Is my understanding correct, or is there other documentation on this?
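To make the question concrete, this is the rough classification I have in my head (the paths are made up; please correct me if the grouping is wrong):

val rdd = sc.textFile("/user/usera/somefile")   // hypothetical input
// Results sized by the data -- these land in driver memory:
val all = rdd.collect()   // whole RDD as a local Array
val few = rdd.take(10)    // only the first 10 elements
val one = rdd.first()     // a single element
// Results that stay small no matter how big the RDD is:
val n = rdd.count()       // one Long
rdd.saveAsTextFile("/user/usera/out")   // written by the executors, bypasses the driver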
09-05-2014 05:47 AM
Thanks a lot for the explanation! So what happens when my final result data set is large? Will I always be restricted by the memory of the driver when retrieving the output?
09-05-2014 04:21 AM
I am calling .collect() just to display the data in the file. So what is the right way to distribute the data across the workers instead of overloading the driver, and then access it? Should I use some function other than .collect()? And how can we cache a 1 GB file in memory in Spark? Please let me know.
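To check that I understand, is the pattern below roughly what you mean? (A sketch; the paths and sizes are made up.)

val big = sc.textFile("/user/usera/somebigfile")   // hypothetical ~1 GB input
big.cache()                     // marks it for caching in worker memory
println(big.count())            // the first action materializes the cache on the executors
big.take(20).foreach(println)   // inspect a few rows without collecting everything
big.saveAsTextFile("/user/usera/bigout")   // large results go back to HDFS, not to the driver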
09-05-2014 03:15 AM
Even "please copy all of the results into memory on the driver" ie - .collect() statement for a 250 MB file is supposed to work right in a Spark Cluster with 20 Worker Nodes of 512 MB each. Is the data not supposed to be distributed across all the worker nodes? If not , are we limited by the memory available node? Or is there any other way in Spark to handle this 250 MB file and bigger files and cache them?
09-05-2014 01:23 AM
Hi, thanks for the quick response. On the first error: every worker has 512 MB of RAM allocated to it, and I am working on a Spark gateway node trying to load a 250 MB file. My understanding was that Spark would distribute this workload across its workers (more than 20 of them, each with 512 MB of RAM), but I am getting an error instead. What do you think can be done here; should I increase the RAM allocated to each worker? In that case, am I not limited by the RAM of a single worker node even though I have a large number of worker nodes? Will this occur even if I use spark-submit instead of spark-shell? On the second error: I will look into the logs, but this one also seemed memory related. On the third error: HDFS health is fine, and the data nodes have 16 GB of RAM each. What I meant was that every Spark worker has 512 MB of RAM and I have more than 20 workers. Thanks.
09-04-2014 11:07 PM
Hi, I am working on a Spark cluster with more than 20 worker nodes, each with 512 MB of memory, and I am trying to access files in HDFS from Spark. I am running into errors even with files of around 250 MB, both with and without caching. No other Spark processes are running while I do this, and no upper limit is set for Spark resources in YARN. I am accessing Spark through the shell with "spark-shell --master yarn-client". Below are the error messages and the corresponding code. Can you provide some direction on how to solve this? Thanks!

Error 1
-------
val k = sc.textFile("/user/usera/dira/masterbigd")
k.collect()
14/09/04 09:55:47 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 6 on host host1.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Error 2
-------
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Master(sor: String,accnum: Int, accnumkey: String)
val masterrdd = sc.textFile("/user/usera/dira/masterbigd").map(_.split(",")).map(k => Master(k(0),k(1).toInt,k(2)))
masterrdd.registerAsTable("masterTBL")
val entrdd = sql("Select sor, accnum, accnumkey from masterTBL")
entrdd.collect()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: TID 6 on host host2.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Error 3 (caching used)
----------------------
val k = sc.textFile("/user/usera/dira/masterbigd")
val cached = k.cache()
cached.collect()
java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.122.123.14:42968 remote=/10.123.123.345:1004]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.security.SaslInputStream.readMoreData(SaslInputStream.java:96)
at org.apache.hadoop.security.SaslInputStream.read(SaslInputStream.java:201)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
14/09/04 10:55:31 WARN DFSClient: Error Recovery for block BP-822007753-10.25.36.66-1400458079994:blk_1074486472_783590 in pipeline 10.123.123.32:1004, 10.123.123.33:1004, 10.123.123.34:1004: bad datanode 10.123.123.35:1004
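For what it is worth, I was also wondering whether avoiding collect() entirely would sidestep the driver memory problem; is something like this the right direction? (A sketch; the 2g driver heap and 10-row sample are arbitrary guesses.)

// Optionally launch the shell with a larger driver heap:
//   spark-shell --master yarn-client --driver-memory 2g

val k = sc.textFile("/user/usera/dira/masterbigd")
k.cache()                     // caches partitions in executor memory, not on the driver
println(k.count())            // materializes the cache without shipping rows back
k.take(10).foreach(println)   // pull only a small sample to the driver for inspection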
Labels:
- Apache Hadoop
- Apache Spark
- Apache YARN
- HDFS
- Security
08-13-2014 11:05 PM
Thanks for the solution. I will try the available options and report back with feedback.