Member since
01-22-2014
62
Posts
0
Kudos Received
0
Solutions
12-18-2014
05:32 AM
Hi, I have a local repository pointing to which I want to do the CM5 installation. I have created the local.repo file in /etc/yum.repos.d and given the repo path there (and it is accessible). I execute the command ./cloudera-manager-installer.bin --skip_repo_package=1 to install cloudera manager from the local repo. But in the cloudera manager, when I procedd with the installation, the installation fails as a new cloudera-manager.repo file is created in the /etc/yum.repos.d directory everytime and it is pointing to the archive.cloudera.com site. Hence my installation is failing with the below message. Please help to solve this. Repository cloudera-manager is listed more than once in the configuration
http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5.2.0/repodata/repomd.xml: [Errno -1] Error importing repomd.xml for cloudera-manager: Damaged repomd.xml file
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: cloudera-manager. Please verify its path and try again
... View more
Labels:
- Labels:
-
Cloudera Manager
10-07-2014
11:20 PM
When I execute the following in yarn-client mode its working fine and giving the result properly, but when i try to run in Yarn-cluster mode i am getting error spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client /home/abc/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0.jar 10 The above code works fine, but when i execute the same code in yarn cluster mode i amgetting the following error. 14/10/07 09:40:24 INFO Client: Application report from ASM:
application identifier: application_1412117173893_1150
appId: 1150
clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }
appDiagnostics:
appMasterHost: N/A
appQueue: root.default
appMasterRpcPort: -1
appStartTime: 1412689195537
yarnAppState: ACCEPTED
distributedFinalState: UNDEFINED
appTrackingUrl: http://spark.abcd.com:8088/proxy/application_1412117173893_1150/
appUser: abc
14/10/07 09:40:25 INFO Client: Application report from ASM:
application identifier: application_1412117173893_1150
appId: 1150
clientToAMToken: null
appDiagnostics: Application application_1412117173893_1150 failed 2 times due to AM Container for appattempt_1412117173893_1150_000002 exited with exitCode: 1 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:511)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
main : command provided 1
main : user is abc
main : requested yarn user is abc
Container exited with a non-zero exit code 1
.Failing this attempt.. Failing the application.
appMasterHost: N/A
appQueue: root.default
appMasterRpcPort: -1
appStartTime: 1412689195537
yarnAppState: FAILED
distributedFinalState: FAILED
appTrackingUrl: spark.abcd.com:8088/cluster/app/application_1412117173893_1150
appUser: abc Where may be the problem? sometimes when i try to execute in yarn-cluster mode i am getting the following , but i dint see any result 14/10/08 01:51:57 INFO Client: Application report from ASM:
application identifier: application_1412117173893_1442
appId: 1442
clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }
appDiagnostics:
appMasterHost: spark.abcd.com
appQueue: root.default
appMasterRpcPort: 0
appStartTime: 1412747485673
yarnAppState: FINISHED
distributedFinalState: SUCCEEDED
appTrackingUrl: http://spark.abcd.com:8088/proxy/application_1412117173893_1442/A
appUser: abc Thanks
... View more
Labels:
- Labels:
-
Apache Spark
09-15-2014
04:51 AM
I am joining two datasets , first one coming from stream and second one which is in HDFS. After joining the two datasets , I need to apply filter on the joined datasets, but here I am facing as issue. Please assist to resolve. I am using the code below, val streamkv = streamrecs.map(_.split("~")).map(r => ( r(0), (r(5), r(6))))
val HDFSlines = sc.textFile("/user/Rest/sample.dat").map(_.split("~")).map(r
=> ( r(1), (r(0) r(3),r(4),)))
val streamwindow = streamkv.window(Minutes(1)) val join1 = streamwindow.transform(joinRDD => { joinRDD.join(HDFSlines)} ) I am getting the following error, when I use the filter val tofilter = join1.filter {
| case (_, (_, _),(_,_,device)) =>
| device.contains("iPhone")
| }.count() error: constructor cannot be instantiated to expected type; found : (T1, T2, T3) required: (String, ((String, String), (String, String, String))) case (_, (_, _),(_,_,device)) => How can I solve this error?.
... View more
Labels:
- Labels:
-
Apache Spark
09-12-2014
06:35 AM
Hi - Does it make a difference if I use a "--master yarn-client" or " --master yarn-cluster" for this error in "spark-submit" since yarn-client uses a local driver?
... View more
09-12-2014
06:10 AM
By latest do you mean the version 1.1.0? So does the version 1.0.0 that comes with CDH5.1 does not have this feature?
... View more
09-12-2014
03:09 AM
Hi, I am streaming data in Spark and doing a join operation with a batch file in HDFS. I am joining one window of the stream with HDFS. I want to calculate the time taken to do this join (for each window) using the below code, but it did not work. (the output was 0 always). I am using the Spark-Shell for this code. Any suggestions on how to achieve this? Thanks! val jobstarttime = System.currentTimeMillis();
val ssc = new StreamingContext(sc, Seconds(60))
val streamrecs = ssc.socketTextStream("10.11.12.13", 5549)
val streamkv = streamrecs.map(_.split("~")).map(r => ( r(0), (r(5), r(6))))
val streamwindow = streamkv.window(Minutes(2))
val HDFSlines = sc.textFile("/user/batchdata").map(_.split("~")).map(r => ( r(1), (r(0))))
val outfile = new PrintWriter(new File("//home//user1//metrics1" ))
val joinstarttime = System.currentTimeMillis();
val join1 = streamwindow.transform(joinRDD => { joinRDD.join(HDFSlines)} )
val joinsendtime = System.currentTimeMillis();
val jointime = (joinsendtime - joinstarttime)/1000
val J = jointime.toString()
val J1 = "\n Time taken for Joining is " + J
outfile.write(J1)
join1.print()
val savestarttime = System.currentTimeMillis();
join1.saveAsTextFiles("/user/joinone5")
val savesendtime = System.currentTimeMillis();
val savetime = (savesendtime - savestarttime)/1000
val S = savetime.toString()
val S1 = "\n Time taken for Saving is " + S
outfile.write(S1)
ssc.start()
outfile.close()
ssc.awaitTermination()
... View more
Labels:
- Labels:
-
Apache Spark
09-11-2014
11:38 PM
Thanks. Please clarify the below - What is the port range that I need to ask the admin team to open on each worker node? And what are these ports used for, Spark Workers already use the port 7078 right? Are these random ports opened for each spark job ?
... View more
09-11-2014
08:22 AM
Hi, The worker logs show the following connection erros - Any idea how to resolve? AssociationError [akka.tcp://sparkWorker@host1:7078] -> [akka.tcp://sparkExecutor@worker1:33912]:
Error [Association failed with [akka.tcp://sparkExecutor@worker1:33912]]
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@worker1:33912]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: worker1/10.11.11.11:33912]
... View more
09-11-2014
03:09 AM
Hi, I am submitting a Spark Streaming job using spark-submit. "spark-submit --class "test.Main" --master yarn-client testjob.jar" But I am facing the below errors. Please assist to resolve. 14/09/11 05:55:45 ERROR YarnClientClusterScheduler: Lost executor 4 on host1.com: remote Akka client disassociated
14/09/11 05:56:06 ERROR JobScheduler: Error running job streaming job 1410429330000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 4 times, most recent failure: TID 3 on host host2.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 4 times, most recent failure: TID 3 on host host3.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
... View more
Labels:
- Labels:
-
Apache Spark
-
Apache YARN
09-10-2014
08:04 AM
Hi, I am joining my streming data with data which is already present in HDFS. When i use scala shell its working fine, and the data is getting joined. But when i try to compile the same code in eclipse to make as a jar, the joining part is not working. Please give some suggestion to solve the issue. I am facing the error in the following part.. val streamkv = streamrecs.map(_.split("~")).map(r => ( r(0), (r(5), r(6)))) val HDFSlines = sc.textFile("/user/Rest/sample.dat").map(_.split("~")).map(r => ( r(1), (r(0)))) val streamwindow = streamkv.window(Minutes(1)) val join = streamwindow.transform(joinRDD => { joinRDD.join(HDFSlines)} ) In this step i am getting error that Value join is not a member of org.apache.spark.rdd.RDD(String,(String,String))) I used the same code in scala-shell, there its working fine.. I have imported all the necessary packages as below - import scala.io.Source import java.io._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.api.java.function._ import org.apache.spark.streaming._ import org.apache.spark.streaming.api._ import org.apache.spark.streaming.StreamingContext._ import StreamingContext._ import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark._ import org.apache.spark.rdd.PairRDDFunctions import org.apache.spark.streaming.dstream.PairDStreamFunctions
... View more
Labels:
- Labels:
-
Apache Spark
-
HDFS
09-05-2014
07:56 AM
I really appreciate all the answers given by you today! It clarifies a lot ! Thanks! Just one final question - I believe collect() , take() and print() are there only functions that put load upon the driver? Is my understanding correct? Or is there any other documentation on this?
... View more
09-05-2014
05:47 AM
Thanks a lot for the explanation! So, what if there is a situation where in my final result data set is of big size? Will I be always restrcted by the memory of the driver while getting the output?
... View more
09-05-2014
04:21 AM
I am calling .collect() just to display to see the data in the file. So, what is the method to distribute the data to the workers instead of overloading the driver and then access the data? Should I use any other function that .collect() ? How can we cache a 1 GB file into memory in Spark? Please let me know.
... View more
09-05-2014
03:15 AM
Even "please copy all of the results into memory on the driver" ie - .collect() statement for a 250 MB file is supposed to work right in a Spark Cluster with 20 Worker Nodes of 512 MB each. Is the data not supposed to be distributed across all the worker nodes? If not , are we limited by the memory available node? Or is there any other way in Spark to handle this 250 MB file and bigger files and cache them?
... View more
09-05-2014
01:23 AM
Hi, Thanks for the quick response. For ths first error - Every worker has a 512 MB RAM allocated to it and I am working on a Spark gateway node and trying to load a 250 MB file.My understanding was that Spark would distribute this data workload across its workers (more than 20 workers each with 512 MB RAM).But I am getting an error. What do you think can be done here, should I increase the RAM allocated to each worker? In that case am I not limiited by the RAM on a single worker node even though I have have large number of worker nodes? Will this occuer even if I use Spark-Submit instead of Spark-Shell? For the Second Error - I will look into the logs for this, but this one was also related to Memory. For the third error - HDFS health is fine and the dates nodes have 16 GB RAM each. What i meant was every Spark worker has 512 MB RAM and I have more than 20 workers. Thanks.
... View more
09-04-2014
11:07 PM
Hi, I am working on a spark cluster with more than 20 worker nodes and each node with a memory of 512 MB. I am trying to acces file in HDFS in Spark. I am facing issues even when accessing files of size around 250 MB using spark (both with and without caching). No other Spark processes are running when I am doing this and no upper limit is set for Spark Resources in YARN. I am accessing spark through spark shell using the command “spark-shell --master yarn-client” Below are the error messages and the corresponding code. Can you provide some direction on how to solve this? Thanks! Error -1
--------
val k = sc.textFile("/user/usera/dira/masterbigd")
k.collect()
14/09/04 09:55:47 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:178)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: TID 6 on host host1.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Error - 2
-----------
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Master(sor: String,accnum: Int, accnumkey: String)
val masterrdd = sc.textFile("/user/usera/dira/masterbigd").map(_.split(",")).map(k => Master(k(0),k(1).toInt,k(2)))
masterrdd.registerAsTable("masterTBL")
val entrdd = sql("Select sor, accnum, accnumkey from masterTBL")
entrdd.collect()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:1 failed 4 times, most recent failure: TID 6 on host host2.com failed for unknown reason
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Error - 3 (Caching used)
------------------------
val k = sc.textFile("/user/usera/dira/masterbigd")
val cached = k.cache()
cached.collect()
java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.122.123.14:42968 remote=/10.123.123.345:1004]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.read(DataInputStream.java:149)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.security.SaslInputStream.readMoreData(SaslInputStream.java:96)
at org.apache.hadoop.security.SaslInputStream.read(SaslInputStream.java:201)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1986)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
14/09/04 10:55:31 WARN DFSClient: Error Recovery for block BP-822007753-10.25.36.66-1400458079994:blk_1074486472_783590 in pipeline 10.123.123.32:1004, 10.123.123.33:1004, 10.123.123.34:1004: bad datanode 10.123.123.35:1004
... View more
Labels:
- Labels:
-
Apache Hadoop
-
Apache Spark
-
Apache YARN
-
HDFS
-
Security
08-13-2014
11:05 PM
Thanks for the solution.Will try the options available and give the feedback..
... View more
08-13-2014
08:22 AM
Hi, I am using Spark Streaming to write data into HDFS using the below program. import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
import org.apache.spark.streaming.StreamingContext._
import StreamingContext._
val ssc8 = new StreamingContext(sc, Seconds(60))
val lines8 = ssc8.socketTextStream("IP", 19900) port alone need to be changed
lines8.saveAsTextFiles("hdfs://nameservice1/user/sample4") need to be changed
lines8.print()
ssc8.start()
ssc8.awaitTermination() The output gets written to HDFS as a directory. But inside the directory, there are a huge number of part files (Mine is a cluster with 30+ spark nodes). Please let me know about the below. Thanks! 1. I want to control the number of files created in the directory in HDFS. Is there a way to do it (by controlling based on number of events or size of data etc) as we do it in Flume. 2. The files started getting created in HDFS only after I killed the program. Is there a way to write the files as soon as the streams are read? Below is a sample of the files created. drwx--x--- - USER1 USER1 0 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/_temporary
-rw------- 3 USER1 USER1 706 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00000
-rw------- 3 USER1 USER1 1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00001
-rw------- 3 USER1 USER1 1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00002
-rw------- 3 USER1 USER1 1418 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00003
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00004
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00005
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00006
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00007
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00008
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00009
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00010
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00011
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00012
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00013
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00014
-rw------- 3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00015
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00016
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00017
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00018
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00019
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00020
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00021
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00022
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00023
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00024
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00025
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00026
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00027
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00028
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00029
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00030
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00031
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00032
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00033
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00034
-rw------- 3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00035
... View more
Labels:
- Labels:
-
Apache Flume
-
Apache Spark
-
HDFS
08-04-2014
11:14 PM
Hi, I came to know that Spark does not work in Stand alone mode in a Kerberos enabled cluster but works only in YARN mode. But I was not able to figure out why? Anyone has an explanation or documentation related to this? Thanks, Arun
... View more
Labels:
- Labels:
-
Apache Spark
-
Apache YARN
-
Kerberos
08-03-2014
02:48 AM
Hi, I am using Spark 1.0.0 in a CDH 5.1 VM. I am trying to use the Spark-Shell (Scala). But I am getting the below error regularly and not able to use thecontect variable Please assist. Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_55)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
scala> val text = sc.textfile("/user/cloudera/test1")
<console>:12: error: value textfile is not a member of org.apache.spark.SparkContext
val text = sc.textfile("/user/cloudera/test1")
... View more
Labels:
04-07-2014
12:01 AM
Apache Spark uses the derby database in background and hence only one instance of the 'Spark-Shell' can be connected at any time. Is there any way to configure mysql or any other RDMS and is there any configuration document?
... View more
04-02-2014
09:38 PM
But I have not manually requested memory anywhere or set any parameter. So, is there a way to control this? Thanks!
... View more
04-02-2014
07:22 AM
Hi, I am unable to run even the sample vertification job in Scala. The worker node status is howing as alive with cores 4 (0 Used) and memory 6.7 GB (0.0 B Used). But I am repeatedly getting the below error. Could you please assist? WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
... View more
03-27-2014
05:15 AM
I removed the version I installed and used the one available in CDH-5, it worked. Thanks!
... View more
03-25-2014
04:24 AM
Hi, I have installed Spark in Stand alone mode in CDH-5 cluster. But when I start the Spark Master, I am getting the following error - Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/deploy/master/Master
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.master.Master
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.master.Master. Program will exit. I have provided the folder '/usr/lib/spark' in class path and also set the variable ' SPARK_LIBRARY_PATH=/usr/lib/spark' in the file spark-env.sh. Still I am facing this error. I installed SPARK using yum. Could you please assist? Thanks!
... View more
03-24-2014
06:48 AM
Hi, I am a newbie to Apache Spark. I have installed CDH-5 using parcels (Beta 2 Version) and installed Spark also As per the Spark installation documentation, http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_Spark_prerequisites.html#../CDH5-Installation-Guide/../CDH5-Installation-Guide/cdh5ig_Spark_configuring.html, it is said, " Note: The current version of CDH 5 does not support running Spark on YARN. The current version of Spark does work in a secure cluster." So, if YARN in CDH-5 does not support Spark, how do we run Spark in CDH-5? Please let me know and also proivde any documentation if available. Thanks!
... View more
Labels: