Member since: 09-17-2013
Posts: 63
Kudos Received: 5
Solutions: 0
12-18-2015
09:52 PM
I am using a Spark standalone cluster, and below are my spark-env properties:
export SPARK_EXECUTOR_INSTANCES=432
export SPARK_EXECUTOR_CORES=24
export SPARK_EXECUTOR_MEMORY=36G
export SPARK_DRIVER_MEMORY=24G
I have 6 worker nodes. When I run a job that works on very large files and performs joins, it gets stuck and fails. I can see 6 executors for the job, each with 24 GB.
Could you please provide links or details to help me tune this and to understand the concepts of worker nodes and executors? I went through a Cloudera blog post, but that is more about YARN; I need this for a Spark standalone cluster.
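For reference, here is a minimal sketch of how I understand per-application resources are set in standalone mode (assuming spark.executor.memory and spark.cores.max are the relevant keys there, since SPARK_EXECUTOR_INSTANCES appears to be a YARN-oriented setting; the values are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only, not a recommendation for this cluster.
// In standalone mode, spark.executor.memory sets the heap per executor and
// spark.cores.max caps the total cores the application may take cluster-wide;
// the scheduler then spreads executors over the available workers.
val conf = new SparkConf()
  .setAppName("join-job")
  .set("spark.executor.memory", "24g")   // memory per executor
  .set("spark.cores.max", "72")          // total cores across all workers
val sc = new SparkContext(conf)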
Labels: Apache Spark
12-08-2015
04:41 PM
Hi, I need to create an Oozie workflow that executes a shell script. The shell script has a curl command that downloads a specific file from a download link. Since commands in the shell script can only recognize HDFS directories, how can I execute the script? Sample code:
curl -o ~/test.jar http://central.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar
hdfs dfs -copyFromLocal ~/test.jar /user/sr/test2
Labels: Apache Oozie, HDFS
08-26-2015
03:37 AM
Hi, my cluster uses the Capacity Scheduler, but in my MapReduce program I have used the configuration below to set the scheduler class to the Fair Scheduler:
conf.set("yarn.resourcemanager.scheduler.class", "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
1) Does this change really take effect? Since the configuration and resources are designed for the Capacity Scheduler, does it really change anything for my application?
2) What happens if someone submits another application at the same time? Does this change have any impact on the other jobs? As far as my understanding goes, no.
Thank you, Srini.
Labels: Apache Hadoop, Apache YARN, MapReduce
08-26-2015
03:04 AM
How about using cogroup? Spark's cogroup can work on three RDDs at once. Below is the Scala cogroup signature I checked; it says it can combine two other RDDs, other1 and other2, at the same time:
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]
For each key k in this, other1, or other2, it returns an RDD that contains a tuple with the list of values for that key in this, other1, and other2. I cannot try this myself as I do not have a Spark setup at the office; otherwise I would love to. After the cogroup, you can apply mapValues and merge the three sequences. Thank you.
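A rough sketch of what I have in mind (untested, since I do not have a Spark setup here; the sample data and keys are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // brings in the pair-RDD functions such as cogroup

val sc = new SparkContext(new SparkConf().setAppName("three-way-cogroup"))

// Placeholder pair RDDs keyed the same way; replace with the real data sets.
val a = sc.parallelize(Seq((1, "a1"), (2, "a2")))
val b = sc.parallelize(Seq((1, "b1"), (3, "b3")))
val c = sc.parallelize(Seq((1, "c1"), (2, "c2")))

// cogroup over three RDDs groups the values from all of them per key;
// mapValues can then merge the three groups into one collection.
val merged = a.cogroup(b, c).mapValues {
  case (as, bs, cs) => as ++ bs ++ cs
}
merged.collect().foreach(println)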
08-18-2015
09:06 AM
I tried the Spark Scala code below and got the output shown underneath. I tried to pass inputs to the script, but it did not receive them, and when I used collect, the print statement from the script appeared twice.
My simple and very basic Perl script first:
#!/usr/bin/perl
print("arguments $ARGV[0] \n");   # just print the arguments
My Spark code:
object PipesExample {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val distScript = "/home/srinivas/test.pl"
    sc.addFile(distScript)
    val rdd = sc.parallelize(Array("srini"))
    val piped = rdd.pipe(Seq(SparkFiles.get("test.pl")))
    println(" output " + piped.collect().mkString(" "))
  }
}
The output looked like this:
output arguments arguments
1) What mistake did I make that caused it to fail to receive the arguments?
2) Why did it execute twice?
If this looks too basic, my apologies. I am trying to understand this as well as I can and want to clear my doubts.
Labels: Apache Spark
08-05-2015
11:08 PM
I want to understand the exact reason functions must be serializable in Spark and, if possible, the scenarios where serialization can cause issues. As far as my understanding goes, to ensure seamless, side-effect-free parallel processing, instead of sending the data as in the imperative paradigm, the function is sent to the nodes and the data is processed in parallel there. Is my thought above correct? From what I have studied, functional programming is a very good way forward for parallel/concurrent programming, so I thought this was the reason. Since we are passing functions, is there also a security reason behind requiring them to be serializable? Thanks in advance.
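To make my mental model concrete, here is a small sketch of the scenario I am thinking of (the Multiplier classes are made up purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

class Multiplier(val factor: Int)                                   // not Serializable
class SerializableMultiplier(val factor: Int) extends Serializable  // can be shipped

object ClosureExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("closure-demo"))
    val rdd = sc.parallelize(1 to 10)

    val bad = new Multiplier(2)
    // rdd.map(x => x * bad.factor).collect()
    // would fail with "Task not serializable": the closure captures bad,
    // and the closure (with everything it references) must be serialized
    // so it can be sent to the executors.

    val good = new SerializableMultiplier(2)
    rdd.map(x => x * good.factor).collect()   // works: the captured object can be shipped
  }
}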
Labels: Apache Spark, Security
08-05-2015
12:31 PM
I am trying to verify cogroup, join, and groupByKey for pair RDDs. I could check them with the Spark Java API, but I cannot do it in a Scala project. Below is the simple code I tried; let me know where I made a mistake.
object PairsCheck {
  def main(args: Array[String]) = {
    val conf = new SparkConf
    val sc = new SparkContext(conf)
    val lines = sc.textFile("/home/test1.txt")
    val lines2 = sc.textFile("/home/test2.txt")
    val words = lines.flatMap { x => x.split("\\W+") }
    val words2 = lines2.flatMap { x => x.split("\\W+") }
    val pairs: RDD[(Int, String)] = words.map { case (x) => (x.length(), x) }
    val pairs2: RDD[(Int, String)] = words2.map { case (x) => (x.length(), x) }
    import org.apache.spark.SparkContext._
    // --> Here I tried to call the join/cogroup functions that apply to pair RDDs,
    // but could not. If I call join, it throws an error.
  }
}
Thank you in advance.
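For reference, these are the kinds of calls I am expecting to work once the pair-RDD implicits are in scope (a sketch only; perhaps I am missing something such as the org.apache.spark.rdd.RDD import):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._   // PairRDDFunctions: join, cogroup, groupByKey

// Assuming pairs and pairs2 are the RDD[(Int, String)] values built above.
val joined: RDD[(Int, (String, String))] = pairs.join(pairs2)
val cogrouped: RDD[(Int, (Iterable[String], Iterable[String]))] = pairs.cogroup(pairs2)
val grouped: RDD[(Int, Iterable[String])] = pairs.groupByKey()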
Labels: Apache Spark
08-05-2015
10:25 AM
I was going back over some information about Spark in the documentation and came across the point below: "A Spark persist call on its own does not force evaluation." Is that correct?
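A small sketch of how I read that statement (assuming the usual lazy-evaluation behaviour and an existing SparkContext sc; the file path is a placeholder):

val data = sc.textFile("/home/test1.txt").map(_.length)
val cached = data.persist()   // lazy: nothing is computed or cached yet
cached.count()                // first action: computes the RDD and fills the cache
cached.count()                // later actions reuse the cached partitions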
Labels: Apache Spark
07-20-2015
09:29 PM
Hi, I am trying a solution in which I have to take files recursively from a folder, perform a few calculations, and send the required data to an output folder. There is a small dependency on the date of the file. I am using MultipleOutputs for now to include the date in the file name. But it would be very good if there were a way to create subfolders named by date in the output path and move the respective files into them. One way is to use FileSystem to create the required directories and write the files directly, bypassing the MapReduce output. But is there a way I can make MapReduce itself write into the required subfolder? It might not be possible, but I just wanted to put the thought out there.
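Roughly what I am imagining, written as a Scala sketch against the Hadoop MultipleOutputs API (the DatedReducer class and the idea of carrying the date in the key are hypothetical; as far as I know, a '/' in baseOutputPath makes MultipleOutputs create the subdirectory):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs

// Hypothetical reducer that writes each record under a date-named subdirectory
// of the job output path, e.g. <output>/2015-07-20/part-r-00000.
class DatedReducer extends Reducer[Text, Text, Text, Text] {
  private var mos: MultipleOutputs[Text, Text] = _

  override def setup(context: Reducer[Text, Text, Text, Text]#Context): Unit =
    mos = new MultipleOutputs[Text, Text](context)

  override def reduce(key: Text, values: java.lang.Iterable[Text],
                      context: Reducer[Text, Text, Text, Text]#Context): Unit = {
    val date = key.toString.take(10)   // hypothetical: date encoded at the start of the key
    val it = values.iterator()
    while (it.hasNext) {
      // The '/' in baseOutputPath is what creates the per-date subfolder.
      mos.write(key, it.next(), date + "/part")
    }
  }

  override def cleanup(context: Reducer[Text, Text, Text, Text]#Context): Unit =
    mos.close()
}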
Labels: MapReduce
07-20-2015
09:21 PM
Thank you Harsh