Member since
01-22-2014
62
Posts
0
Kudos Received
0
Solutions
09-05-2018
03:33 AM
Thanks for this clarification. I had the same query regarding memory issues while loading data, and here you cleared up my doubt about loading files from HDFS. I have a similar question, but where the source is a local server or cloud storage and the data size is larger than the driver memory (let's say the file is 1 GB and the driver memory is 250 MB). If I run the command val file_rdd = sc.textFile("/path or local or S3"), will Spark load the data, or will it throw an exception as you mentioned above? Also, is there a way to print the driver's available memory in the terminal? Many Thanks, Siddharth Saraf
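On the last point, one way to inspect the driver's heap is to query the JVM runtime from spark-shell (or any Scala REPL). This is a general JVM sketch, not a Spark-specific API; the numbers depend on your JVM flags (e.g. --driver-memory):

```scala
// Sketch: ask the JVM running the driver about its heap limits.
// maxMemory is the upper bound the heap can grow to; freeMemory is heap
// currently unused. Values vary with JVM configuration.
val runtime = Runtime.getRuntime
val maxMb = runtime.maxMemory / (1024 * 1024)
val freeMb = runtime.freeMemory / (1024 * 1024)
println(s"Driver max heap: $maxMb MB, free now: $freeMb MB")
```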
02-13-2017
08:10 PM
Thank you, you are right: once I created the kadmin user on each Linux machine, I could submit the task successfully!
12-18-2014
12:54 PM
1 Kudo
It might be easier to just install the packages yourself. See Path B documentation here: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/installation_installation.html
11-19-2014
01:53 AM
It looks like you asked for more resources than you configured YARN to offer, so check how much you can allocate in YARN and how much Spark asked for. I don't know about the ERROR; it may be a red herring. Please have a look at http://spark.apache.org/docs/latest/ for pretty good Spark docs.
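For reference, the two YARN settings that most often cap what Spark can request are sketched below. The property names are the standard YARN ones; the values are illustrative only and should be tuned to your hardware:

```xml
<!-- yarn-site.xml: illustrative values, adjust for your cluster -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest single container YARN will grant -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value> <!-- total memory each NodeManager offers -->
</property>
```

If Spark's requested executor memory (plus overhead) exceeds the maximum allocation, the request cannot be satisfied.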
09-15-2014
05:19 AM
Your signature is just a little bit off. The result of a join is not a triple, but a tuple whose second element is a tuple. You have: (_, (_, _),(_,_,device)) but I think you need: (_, ((_, _),(_,_,device)))
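To illustrate with plain Scala (no Spark needed, and the field values here are made up): a join of RDD[(K, V)] with RDD[(K, W)] yields elements of type (K, (V, W)), so the value side is one nested tuple and the pattern needs the extra parentheses:

```scala
// Element shape produced by joining RDD[(K, (String, Int))]
// with RDD[(K, (String, Int, String))]. Values are hypothetical.
val joined: (Int, ((String, Int), (String, Int, String))) =
  (42, (("user", 7), ("2014-09-15", 3, "phone")))

// The corrected pattern: key, then a pair of the two joined values.
val device = joined match {
  case (_, ((_, _), (_, _, d))) => d
}
println(device)
```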
09-12-2014
06:49 AM
It will make a difference insofar as the driver program will run either out on the cluster (yarn-cluster) or locally (yarn-client). The same issue remains -- the processes need to talk to each other on certain ports. But it affects where the driver is and that affects what machine's ports need to be open. For example, if your ports are all open within your cluster, I expect that yarn-cluster works directly.
09-12-2014
06:23 AM
I believe it was added in 1.1, yes. I don't have a streaming app driver handy, so maybe double-check -- you will see an obvious Streaming tab if it's there. Without guaranteeing anything, I think the next CDH will have 1.1, and at any time you can run your own Spark jobs with any version under YARN.
09-10-2014
08:14 AM
I think you imported just about everything except the one thing you need to get implicit conversions that unlock the functions in PairRDDFunctions, which is where join() is defined. You need:

import org.apache.spark.SparkContext._

In the shell this is imported by default.
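As a plain-Scala sketch of the mechanism (hypothetical names, not Spark's actual classes): an import can bring an implicit conversion into scope, and that conversion is what makes extra methods like join() appear on a type:

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for PairRDDFunctions: extra operations on pairs.
class PairOps(xs: Seq[(Int, String)]) {
  def joinKeys(ys: Seq[(Int, String)]): Seq[(Int, (String, String))] =
    for ((k, v) <- xs; (k2, w) <- ys if k == k2) yield (k, (v, w))
}

object Implicits {
  // Analogue of what importing SparkContext._ brings into scope.
  implicit def toPairOps(xs: Seq[(Int, String)]): PairOps = new PairOps(xs)
}

import Implicits._
val a = Seq((1, "left"), (2, "lonely"))
val b = Seq((1, "right"))
val joined = a.joinKeys(b) // compiles only because toPairOps is in scope
println(joined)
```

Without the import of Implicits._, the call to joinKeys would not compile, which mirrors the "join is not a member" errors seen without SparkContext._.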
08-13-2014
11:05 PM
Thanks for the solution. I will try the available options and give feedback.
08-05-2014
03:42 AM
2 Kudos
Why? In a kerberized environment, you need to integrate with Kerberos to access resources. The Spark project hasn't implemented anything like that itself. YARN works with Kerberos, so Spark can work with Kerberos by leveraging YARN. Maybe part of the answer is: why would it be necessary, if it already works through YARN?