About ArunShell

ArunShell · ‎12-18-2014

Hi, I have a local repository from which I want to install CM5. I have created a local.repo file in /etc/yum.repos.d and given the repo path there (and it is accessible). To install Cloudera Manager from the local repo, I run:

    ./cloudera-manager-installer.bin --skip_repo_package=1

However, when I proceed with the installation in Cloudera Manager, it fails because a new cloudera-manager.repo file is created in the /etc/yum.repos.d directory every time, and it points to the archive.cloudera.com site. The installation therefore fails with the message below. Please help me solve this.

    Repository cloudera-manager is listed more than once in the configuration
    http://archive.cloudera.com/cm5/redhat/6/x86_64/cm/5.2.0/repodata/repomd.xml: [Errno -1] Error importing repomd.xml for cloudera-manager: Damaged repomd.xml file
    Trying other mirror.
    Error: Cannot retrieve repository metadata (repomd.xml) for repository: cloudera-manager. Please verify its path and try again

ArunShell · ‎10-07-2014

When I run the following in yarn-client mode, it works fine and gives the correct result:

    spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client /home/abc/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0.jar 10

But when I run the same job in yarn-cluster mode, I get the following error:

    14/10/07 09:40:24 INFO Client: Application report from ASM:
         application identifier: application_1412117173893_1150
         appId: 1150
         clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }
         appDiagnostics:
         appMasterHost: N/A
         appQueue: root.default
         appMasterRpcPort: -1
         appStartTime: 1412689195537
         yarnAppState: ACCEPTED
         distributedFinalState: UNDEFINED
         appTrackingUrl: http://spark.abcd.com:8088/proxy/application_1412117173893_1150/
         appUser: abc
    14/10/07 09:40:25 INFO Client: Application report from ASM:
         application identifier: application_1412117173893_1150
         appId: 1150
         clientToAMToken: null
         appDiagnostics: Application application_1412117173893_1150 failed 2 times due to AM Container for appattempt_1412117173893_1150_000002 exited with exitCode: 1 due to: Exception from container-launch:
    org.apache.hadoop.util.Shell$ExitCodeException:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:511)
        at org.apache.hadoop.util.Shell.run(Shell.java:424)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    main : command provided 1
    main : user is abc
    main : requested yarn user is abc
    Container exited with a non-zero exit code 1
    .Failing this attempt.. Failing the application.
         appMasterHost: N/A
         appQueue: root.default
         appMasterRpcPort: -1
         appStartTime: 1412689195537
         yarnAppState: FAILED
         distributedFinalState: FAILED
         appTrackingUrl: spark.abcd.com:8088/cluster/app/application_1412117173893_1150
         appUser: abc

Where might the problem be? Sometimes when I run in yarn-cluster mode I get the following, but I don't see any result:

    14/10/08 01:51:57 INFO Client: Application report from ASM:
         application identifier: application_1412117173893_1442
         appId: 1442
         clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }
         appDiagnostics:
         appMasterHost: spark.abcd.com
         appQueue: root.default
         appMasterRpcPort: 0
         appStartTime: 1412747485673
         yarnAppState: FINISHED
         distributedFinalState: SUCCEEDED
         appTrackingUrl: http://spark.abcd.com:8088/proxy/application_1412117173893_1442/A
         appUser: abc

Thanks

ArunShell · ‎09-15-2014

I am joining two datasets: the first comes from a stream and the second is in HDFS. After joining the two datasets, I need to apply a filter to the joined dataset, but I am facing an issue. Please help me resolve it. I am using the code below:

    val streamkv = streamrecs.map(_.split("~")).map(r => (r(0), (r(5), r(6))))
    val HDFSlines = sc.textFile("/user/Rest/sample.dat").map(_.split("~")).map(r => (r(1), (r(0), r(3), r(4))))
    val streamwindow = streamkv.window(Minutes(1))
    val join1 = streamwindow.transform(joinRDD => { joinRDD.join(HDFSlines) })

When I apply the filter

    val tofilter = join1.filter {
      case (_, (_, _),(_,_,device)) =>
        device.contains("iPhone")
    }.count()

I get the following error:

    error: constructor cannot be instantiated to expected type;
     found   : (T1, T2, T3)
     required: (String, ((String, String), (String, String, String)))
           case (_, (_, _),(_,_,device)) =>

How can I solve this error?
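The `required:` type in the error above suggests that each joined record is a pair whose second element is itself a pair of tuples, so the pattern likely needs one more level of parentheses around the two inner tuples. Below is a minimal sketch of that pattern, assuming a plain Scala `Seq` as a stand-in for the joined stream; the object name, the type alias, and the sample values are made up for illustration.

```scala
object JoinedFilterSketch {
  // Shape taken from the "required:" line of the compiler error:
  // (key, ((streamField1, streamField2), (hdfsField1, hdfsField2, device)))
  type Joined = (String, ((String, String), (String, String, String)))

  // Hypothetical sample records standing in for the join output.
  val joined: Seq[Joined] = Seq(
    ("k1", (("a", "b"), ("x", "y", "iPhone 5"))),
    ("k2", (("c", "d"), ("x", "y", "Android")))
  )

  // The value side of a join is ONE tuple holding both sides,
  // so the two inner tuples are wrapped in an extra pair of parentheses.
  val iPhones: Seq[Joined] = joined.filter {
    case (_, ((_, _), (_, _, device))) => device.contains("iPhone")
  }

  def main(args: Array[String]): Unit =
    println(iPhones.map(_._1))
}
```

The same `case` pattern should apply inside `join1.filter { ... }`. Note that on a DStream, `count()` returns another DStream of per-batch counts rather than a single number, so the result still needs an output operation such as `print()`.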

ArunShell · ‎09-12-2014

Hi, does it make a difference for this error whether I use "--master yarn-client" or "--master yarn-cluster" in "spark-submit", given that yarn-client uses a local driver?

ArunShell · ‎09-12-2014

By latest, do you mean version 1.1.0? So version 1.0.0, which comes with CDH5.1, does not have this feature?

ArunShell · ‎09-12-2014

Thanks!

ArunShell · ‎09-12-2014

Thanks! By Spark Streaming UI, do you mean the Spark Master UI?

ArunShell · ‎09-12-2014

Hi, I am streaming data in Spark and joining it with a batch file in HDFS. I join one window of the stream with the HDFS data. I want to measure the time taken by this join (for each window) using the code below, but it did not work (the output was always 0). I am running this code in the Spark shell. Any suggestions on how to achieve this? Thanks!

    val jobstarttime = System.currentTimeMillis();
    val ssc = new StreamingContext(sc, Seconds(60))
    val streamrecs = ssc.socketTextStream("10.11.12.13", 5549)
    val streamkv = streamrecs.map(_.split("~")).map(r => ( r(0), (r(5), r(6))))
    val streamwindow = streamkv.window(Minutes(2))
    val HDFSlines = sc.textFile("/user/batchdata").map(_.split("~")).map(r => ( r(1), (r(0))))
    val outfile = new PrintWriter(new File("//home//user1//metrics1" ))

    val joinstarttime = System.currentTimeMillis();
    val join1 = streamwindow.transform(joinRDD => { joinRDD.join(HDFSlines)} )
    val joinsendtime = System.currentTimeMillis();
    val jointime = (joinsendtime - joinstarttime)/1000
    val J = jointime.toString()
    val J1 = "\n Time taken for Joining is " + J
    outfile.write(J1)
    join1.print()

    val savestarttime = System.currentTimeMillis();
    join1.saveAsTextFiles("/user/joinone5")
    val savesendtime = System.currentTimeMillis();
    val savetime = (savesendtime - savestarttime)/1000
    val S = savetime.toString()
    val S1 = "\n Time taken for Saving is " + S
    outfile.write(S1)

    ssc.start()
    outfile.close()
    ssc.awaitTermination()
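A likely reason for the 0 is that `transform(...)` and `saveAsTextFiles(...)` only declare what should happen for each batch; the work itself runs later, after `ssc.start()`. The timestamps therefore bracket a declaration, and the integer division by 1000 rounds the few elapsed milliseconds down to 0. The following is a plain-Scala analogy of that effect, not Spark code: a lazy collection view stands in for the DStream, and `timed` and `slowStep` are hypothetical helpers introduced only for this sketch.

```scala
object LazyTimingSketch {
  // Runs `body` and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Stand-in for an expensive per-record step (hypothetical; sleeps 20 ms).
  def slowStep(x: Int): Int = { Thread.sleep(20); x * 2 }

  def main(args: Array[String]): Unit = {
    val records = (1 to 10).toList

    // Declaring a lazy pipeline does no work yet, so this measures close to 0 ms,
    // much like timing the line that only calls transform(...) on a DStream.
    val (pipeline, declareMs) = timed(records.view.map(slowStep))

    // Forcing the pipeline is where the work happens; this is the span to time.
    val (sum, runMs) = timed(pipeline.sum)

    println(s"declare: $declareMs ms, run: $runMs ms, sum: $sum")
  }
}
```

In Spark Streaming, the analogous approach would be to take the timestamps inside an output operation such as `foreachRDD`, around an action such as `count()`, because that closure runs once per batch after the context has started. Reporting milliseconds instead of whole seconds also avoids sub-second durations being truncated to 0.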

ArunShell · ‎09-12-2014

This makes sense - thanks!

ArunShell · ‎09-11-2014

Thanks. Please clarify the following: What port range do I need to ask the admin team to open on each worker node? And what are these ports used for, given that Spark Workers already use port 7078? Are these random ports opened for each Spark job?

Last Visited	‎06-08-2015 12:25 PM

Member Since	‎01-22-2014 04:58 AM
Posts	62

Cloudera Community

Unable to use the --skip_repo_package=1 option

Issue on running spark application in Yarn-cluster...

Using filter in joined dataset in Spark?

Re: Akka Error while running Spark Jobs

Re: Metrics for a Spark Streaming Operation

Re: Metrics for a Spark Streaming Operation

Re: Metrics for a Spark Streaming Operation

Metrics for a Spark Streaming Operation

Re: Akka Error while running Spark Jobs

Re: Akka Error while running Spark Jobs