Member since: 09-17-2013
Posts: 63
Kudos Received: 5
Solutions: 0
12-18-2015
09:52 PM
I am using a Spark standalone cluster, and below are my spark-env properties:
export SPARK_EXECUTOR_INSTANCES=432
export SPARK_EXECUTOR_CORES=24
export SPARK_EXECUTOR_MEMORY=36G
export SPARK_DRIVER_MEMORY=24G
I have 6 worker nodes. When I run a job that works on very large files and performs joins, it gets stuck and fails. I can see 6 executors for the job, each with 24 GB.
Could you please provide links or details to help me tune this and to understand the concepts of worker nodes and executors? I went through a Cloudera blog post, but that is more about YARN; I need this for a Spark standalone cluster.
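For reference, here is a minimal sketch of how I understand per-application resources are set in standalone mode (assuming spark.executor.memory and spark.cores.max are the relevant keys there, since SPARK_EXECUTOR_INSTANCES appears to be a YARN-oriented setting; the values are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only, not a recommendation for this cluster.
// In standalone mode, spark.executor.memory sets the heap per executor and
// spark.cores.max caps the total cores the application may take cluster-wide;
// the scheduler then spreads executors over the available workers.
val conf = new SparkConf()
  .setAppName("join-job")
  .set("spark.executor.memory", "24g")   // memory per executor
  .set("spark.cores.max", "72")          // total cores across all workers
val sc = new SparkContext(conf)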
Labels: Apache Spark
12-08-2015
04:41 PM
Hi, I need to create an Oozie workflow that executes a shell script. The shell script has a curl command that downloads a specific file from a download link. Since commands in the shell script can only recognize HDFS directories, how can I execute the script? Sample code:
curl -o ~/test.jar http://central.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar
hdfs dfs -copyFromLocal ~/test.jar /user/sr/test2
Labels: Apache Oozie, HDFS
08-26-2015
03:37 AM
Hi, my cluster uses the Capacity Scheduler, but in my MapReduce program I have used the configuration below to set the scheduler class to the Fair Scheduler:
conf.set("yarn.resourcemanager.scheduler.class", "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
1) Does this change really take effect? Since the configuration and resources are designed for the Capacity Scheduler, does it really change anything for my application?
2) What happens if someone submits another application at the same time? Does this change have any impact on the other jobs? As far as my understanding goes, no.
Thank you, Srini.
Labels: Apache Hadoop, Apache YARN, MapReduce
08-26-2015
03:04 AM
How about using cogroup? Spark's cogroup can work on three RDDs at once. Below is the Scala cogroup signature I checked; it says it can combine two other RDDs, other1 and other2, at the same time:
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]
For each key k in this, other1, or other2, it returns an RDD that contains a tuple with the list of values for that key in this, other1, and other2. I cannot try this myself as I do not have a Spark setup at the office; otherwise I would love to. After the cogroup, you can apply mapValues and merge the three sequences. Thank you.
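A rough sketch of what I have in mind (untested, since I do not have a Spark setup here; the sample data and keys are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // brings in the pair-RDD functions such as cogroup

val sc = new SparkContext(new SparkConf().setAppName("three-way-cogroup"))

// Placeholder pair RDDs keyed the same way; replace with the real data sets.
val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (3, "b3")))
val c = sc.parallelize(Seq((1, "c1"), (2, "c2")))

// cogroup over three RDDs groups the values from all of them per key;
// mapValues can then merge the three groups into one collection.
val merged = a.cogroup(b, c).mapValues {
  case (as, bs, cs) => as ++ bs ++ cs
}
merged.collect().foreach(println)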
08-18-2015
09:06 AM
I tried the Spark Scala code below and got the output shown underneath. I tried to pass inputs to the script, but it did not receive them, and when I used collect, the print statement from the script appeared twice.
My simple and very basic Perl script first:
#!/usr/bin/perl
print("arguments $ARGV[0] \n");   # just print the arguments
My Spark code:
object PipesExample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val distScript = "/home/srinivas/test.pl"
    sc.addFile(distScript)
    val rdd = sc.parallelize(Array("srini"))
    val piped = rdd.pipe(Seq(SparkFiles.get("test.pl")))
    println(" output " + piped.collect().mkString(" "))
  }
}
The output looked like this:
output arguments arguments
1) What mistake did I make that caused it to fail to receive the arguments?
2) Why did it execute twice?
If this looks too basic, my apologies. I am trying to understand this as well as I can and want to clear my doubts.
Labels: Apache Spark
08-05-2015
11:08 PM
I want to understand the exact reason functions must be serializable in Spark and, if possible, the scenarios where serialization can cause issues. As far as my understanding goes, to ensure seamless, side-effect-free parallel processing, instead of sending the data as in the imperative paradigm, the function is sent to the nodes and the data is processed in parallel there. Is my thought above correct? From what I have studied, functional programming is a very good way forward for parallel/concurrent programming, so I thought this was the reason. Since we are passing functions, is there also a security reason behind requiring them to be serializable? Thanks in advance.
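To make my mental model concrete, here is a small sketch of the scenario I am thinking of (the Multiplier classes are made up purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

class Multiplier(val factor: Int)                                   // not Serializable
class SerializableMultiplier(val factor: Int) extends Serializable  // can be shipped

object ClosureExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo"))
    val rdd = sc.parallelize(1 to 10)

    val bad = new Multiplier(2)
    // rdd.map(x => x * bad.factor).collect()
    // would fail with "Task not serializable": the closure captures bad,
    // and the closure (with everything it references) must be serialized
    // so it can be sent to the executors.

    val good = new SerializableMultiplier(2)
    rdd.map(x => x * good.factor).collect()   // works: the captured object can be shipped
  }
}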
Labels: Apache Spark, Security
08-05-2015
12:31 PM
I am trying to verify cogroup, join, and groupByKey for pair RDDs. I could check them with the Spark Java API, but I cannot do it in a Scala project. Below is the simple code I tried; let me know where I made a mistake.
object PairsCheck {
  def main(args: Array[String]) = {
    val conf = new SparkConf
    val sc = new SparkContext(conf)
    val lines = sc.textFile("/home/test1.txt")
    val lines2 = sc.textFile("/home/test2.txt")
    val words = lines.flatMap { x => x.split("\\W+") }
    val words2 = lines2.flatMap { x => x.split("\\W+") }
    val pairs: RDD[(Int, String)] = words.map { case (x) => (x.length(), x) }
    val pairs2: RDD[(Int, String)] = words2.map { case (x) => (x.length(), x) }
    import org.apache.spark.SparkContext._
    // --> Here I tried to call the join/cogroup functions that apply to pair RDDs,
    // but could not. If I call join, it throws an error.
  }
}
Thank you in advance.
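For reference, these are the kinds of calls I am expecting to work once the pair-RDD implicits are in scope (a sketch only; perhaps I am missing something such as the org.apache.spark.rdd.RDD import):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._   // PairRDDFunctions: join, cogroup, groupByKey

// Assuming pairs and pairs2 are the RDD[(Int, String)] values built above.
val joined: RDD[(Int, (String, String))] = pairs.join(pairs2)
val cogrouped: RDD[(Int, (Iterable[String], Iterable[String]))] = pairs.cogroup(pairs2)
val grouped: RDD[(Int, Iterable[String])] = pairs.groupByKey()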
Labels: Apache Spark
08-05-2015
10:25 AM
I was going back over some information about Spark in the documentation and came across the point below: "A Spark persist call on its own does not force evaluation." Is that correct?
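A small sketch of how I read that statement (assuming the usual lazy-evaluation behaviour and an existing SparkContext sc; the file path is a placeholder):

val data = sc.textFile("/home/test1.txt").map(_.length)
val cached = data.persist()   // lazy: nothing is computed or cached yet
cached.count()                // first action: computes the RDD and fills the cache
cached.count()                // later actions reuse the cached partitions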
Labels: Apache Spark
07-20-2015
09:29 PM
Hi, I am trying a solution in which I have to take files recursively from a folder, perform a few calculations, and send the required data to an output folder. There is a small dependency on the date of the file. I am using MultipleOutputs for now to include the date in the file name. But it would be very good if there were a way to create subfolders named by date in the output path and move the respective files into them. One way is to use FileSystem to create the required directories and write the files directly, bypassing the MapReduce output. But is there a way I can make MapReduce itself write into the required subfolder? It might not be possible, but I just wanted to put the thought out there.
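Roughly what I am imagining, written as a Scala sketch against the Hadoop MultipleOutputs API (the DatedReducer class and the idea of carrying the date in the key are hypothetical; as far as I know, a '/' in baseOutputPath makes MultipleOutputs create the subdirectory):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

// Hypothetical reducer that writes each record under a date-named subdirectory
// of the job output path, e.g. <output>/2015-07-20/part-r-00000.
class DatedReducer extends Reducer[Text, Text, Text, Text] {
  private var mos: MultipleOutputs[Text, Text] = _

  override def setup(context: Reducer[Text, Text, Text, Text]#Context): Unit =
    mos = new MultipleOutputs[Text, Text](context)

  override def reduce(key: Text, values: java.lang.Iterable[Text],
                      context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    val date = key.toString.take(10)   // hypothetical: date encoded at the start of the key
    val it = values.iterator()
    while (it.hasNext) {
      // The '/' in baseOutputPath is what creates the per-date subfolder.
      mos.write(key, it.next(), date + "/part")
    }
  }

  override def cleanup(context: Reducer[Text, Text, Text, Text]#Context): Unit =
    mos.close()
}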
Labels: MapReduce
07-20-2015
09:21 PM
Thank you Harsh