03-10-2016 01:55 PM
How did you resolve the issue?
@Sridhar Babu M Since cores per container are controlled by the YARN configuration, I believe you will need to set the number of executors and the number of cores per executor based on that configuration to control how many executors and cores get scheduled. So if you set YARN to allocate 1 core per container and you want two cores for the job, ask for 2 executors with 1 core each from spark-submit. That should give you two containers with 1 executor each. I don't think YARN will give you an executor with 2 cores if a container can only have 1 core. But if you can have up to 8 cores per container, then you can ask for, say, 8 executors with 1 core each or 4 executors with 2 cores each. Of course, you can continue to add executors as long as your YARN queue has capacity for more containers.

# Run on a YARN cluster
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2G \
  --num-executors 2 \
  --executor-cores 1 \
  /path/to/examples.jar
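To make the arithmetic above concrete, here is a minimal sketch in plain Scala. `ExecutorSizing`, `plan` and `flags` are hypothetical names, not part of Spark or YARN. It assumes you already know the maximum number of cores YARN allows per container, and it picks the largest per-executor core count that fits in a container and divides the total evenly. That preference for fewer, larger executors is only one possible policy.

```scala
// Hypothetical helper for illustration only; not a Spark or YARN API.
object ExecutorSizing {

  /** Returns (numExecutors, executorCores) for a desired total core count. */
  def plan(totalCores: Int, maxCoresPerContainer: Int): (Int, Int) = {
    require(totalCores > 0 && maxCoresPerContainer > 0, "core counts must be positive")
    // Largest core count that fits in one container and divides totalCores evenly.
    // 1 always qualifies, so the filtered range is never empty.
    val executorCores = (1 to math.min(totalCores, maxCoresPerContainer))
      .filter(totalCores % _ == 0)
      .max
    (totalCores / executorCores, executorCores)
  }

  /** Renders the plan as the two spark-submit flags used in the command above. */
  def flags(totalCores: Int, maxCoresPerContainer: Int): String = {
    val (numExecutors, executorCores) = plan(totalCores, maxCoresPerContainer)
    s"--num-executors $numExecutors --executor-cores $executorCores"
  }
}
```

With a 1-core container limit and 2 cores wanted, `ExecutorSizing.flags(2, 1)` gives `--num-executors 2 --executor-cores 1`, which matches the spark-submit example. With a 2-core limit and 8 cores wanted, `ExecutorSizing.plan(8, 2)` gives `(4, 2)`.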
Hi Babu - The more common approach is to write out a new file. HDFS is essentially an append-only system, so creating a new file that is a derivative of the original is a very common practice. You can write an MR program to output a file, or use a Hive query to write the query results to a new location. For example:

INSERT OVERWRITE DIRECTORY '/user/me/output' SELECT UPPER(myColumn) FROM myTable;

This would create one or more new files containing the modified data, which has the same effect as an update. In this case, we're upper-casing the myColumn column of the myTable table.
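The same "write a derivative instead of updating in place" pattern can be sketched in plain Scala against the local filesystem. This is only an analogy for what the Hive query does on HDFS. `DerivedFile`, its method names and the comma-separated layout are assumptions made for the illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source

// Hypothetical local-filesystem sketch; the Hive query above does the HDFS equivalent.
object DerivedFile {

  /** Upper-cases one comma-separated column of a line, leaving the others as-is. */
  def upperCaseColumn(line: String, columnIndex: Int): String = {
    val fields = line.split(",", -1) // -1 keeps trailing empty fields
    if (columnIndex >= 0 && columnIndex < fields.length)
      fields(columnIndex) = fields(columnIndex).toUpperCase
    fields.mkString(",")
  }

  /** Reads `input`, transforms every line, and writes the result to `output`.
    * The input file is never modified. */
  def writeDerived(input: Path, output: Path, columnIndex: Int): Unit = {
    val source = Source.fromFile(input.toFile, "UTF-8")
    val lines = try source.getLines().toList finally source.close()
    val transformed = lines.map(upperCaseColumn(_, columnIndex))
    Files.write(output, transformed.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8))
  }
}
```

The original stays intact, so downstream jobs can be switched to the new output once it has been verified.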
Hi @Sridhar Babu, Apparently there is a compatibility issue with the 2.11:1.3.0 and 2.11:1.4.0 builds of the library. Please use version com.databricks:spark-csv_2.10:1.4.0 instead.
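For context, the `_2.10` or `_2.11` suffix in the artifact name is the Scala binary version the package was built for, and it has to match the Scala version your Spark build uses. A minimal sketch of that check follows. `ScalaSuffixCheck` and its methods are hypothetical names, and it only handles Scala 2.x style suffixes.

```scala
// Hypothetical helper for illustration; not part of Spark or spark-csv.
object ScalaSuffixCheck {

  // Matches artifact names that end in a Scala 2.x binary suffix such as _2.10.
  private val Suffixed = """.*_(\d+\.\d+)""".r

  /** Extracts the Scala binary version from a group:artifact:version coordinate. */
  def scalaBinaryVersion(coordinate: String): Option[String] =
    coordinate.split(":") match {
      case Array(_, Suffixed(version), _) => Some(version)
      case _                              => None
    }

  /** Reduces a full Scala version such as 2.10.5 to its binary version 2.10. */
  def binaryOf(fullVersion: String): String =
    fullVersion.split("\\.").take(2).mkString(".")

  /** True when the coordinate's suffix matches the given runtime Scala version. */
  def matches(coordinate: String, runtimeScalaVersion: String): Boolean =
    scalaBinaryVersion(coordinate).exists(_ == binaryOf(runtimeScalaVersion))
}
```

In spark-shell you could pass `scala.util.Properties.versionNumberString` as the runtime version, for example `ScalaSuffixCheck.matches("com.databricks:spark-csv_2.10:1.4.0", scala.util.Properties.versionNumberString)`.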
@Sridhar Babu M You can see the details of what Spark is doing by clicking on the application master in the Resource Manager UI. The application master link for the Spark job takes you to the Spark UI and shows you the job in detail. You may have to make sure that the Spark History Server is running in Ambari, or the page may come up blank.

If you actually need to change the value in the file, then you will need to export the resulting DataFrame to a file. The save function that is part of the DataFrame class creates one file for each partition. If you need a single file, convert back to an RDD and use coalesce(1) to get everything down to a single partition so you get one file.

Make sure that you add the dependency, either in Zeppelin:

%dep
z.load("com.databricks:spark-csv_2.10:1.4.0")

or in spark-shell:

spark-shell --packages com.databricks:spark-csv_2.10:1.4.0

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
case class Person(name: String, age: Int)
var personRDD = sc.textFile("/user/spark/people.txt")
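var personDF = personRDD.map(x=>x.split(",")).map(x=>Person(x(0),x(1).trim.toInt)).toDF()
personDF.registerTempTable("people")
var peopleDF = sqlContext.sql("SELECT * FROM people")
// Add 2 to Justin's age and leave every other row unchanged
var agedPerson = peopleDF.map(x=>if(x.getAs[String]("name")=="Justin"){Person(x.getAs[String]("name"), x.getAs[Int]("age")+2)}else{Person(x.getAs[String]("name"), x.getAs[Int]("age"))}).toDF()
agedPerson.registerTempTable("people")
var agedPeopleDF = sqlContext.sql("SELECT * FROM people")
// Writes one CSV part file per partition
agedPeopleDF.select("name", "age").write.format("com.databricks.spark.csv").mode(SaveMode.Overwrite).save("agedPeople")
// For a single file, coalesce to one partition first (the output path below is illustrative)
var agedPeopleRDD = agedPeopleDF.rdd
agedPeopleRDD.coalesce(1).saveAsTextFile("agedPeopleSingleFile")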
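If you want to check the update rule without a cluster, here is a minimal sketch of the same per-row logic on a plain Scala collection. `AgeUpdate` and its helpers are hypothetical names, and the sample lines you feed it are your own test data, not the contents of any real file.

```scala
// Hypothetical stand-alone version of the per-row update; no Spark required.
object AgeUpdate {

  case class Person(name: String, age: Int)

  /** Parses a "name,age" line, tolerating spaces around the age. */
  def parse(line: String): Person = {
    val fields = line.split(",")
    Person(fields(0), fields(1).trim.toInt)
  }

  /** Returns a copy with `years` added when the name matches, else the same person. */
  def bumpAge(person: Person, targetName: String, years: Int): Person =
    if (person.name == targetName) person.copy(age = person.age + years) else person

  /** Renders a person back to a "name,age" line. */
  def toCsv(person: Person): String = s"${person.name},${person.age}"
}
```

For example, `Seq("Andy, 30", "Justin, 19").map(AgeUpdate.parse).map(AgeUpdate.bumpAge(_, "Justin", 2)).map(AgeUpdate.toCsv)` yields `Seq("Andy,30", "Justin,21")`. The same functions could be passed to `map` on the RDD, since they do not depend on anything cluster-specific.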
Glad it worked out.
