Support Questions

Find answers, ask questions, and share your expertise

Distributed processing of a DataFrame with PySpark

Expert Contributor

Hello,

I would like to know by what method (or line of code) I can verify that my processing is actually executed on all the nodes of my cluster with PySpark?

Thank you kindly for your help.

Here is my code:

from pyspark.sql import Row

# Read the file from HDFS and build one Row per line
rdd = sc.textFile('hdfs:../personne.txt')
rdd_split = rdd.map(lambda x: x.split(','))
rdd_people = rdd_split.map(lambda x: Row(name=x[0], age=int(x[1]), ca=int(x[2])))

# Create a DataFrame and register it for SQL queries
df_people = sqlContext.createDataFrame(rdd_people)
df_people.registerTempTable("people")

# Note: collect() pulls all rows back to the driver
df_people.collect()


1 ACCEPTED SOLUTION


@Andrew Sears's answer is correct. Once you bring up the Spark UI for a running application (http://{driver-node}:4040), you can navigate to the Executors tab, which shows detailed statistics about the driver and each executor, as in the screenshot below; for completed applications, the same view is available from the Spark History Server. Note that when running Hortonworks Data Platform (HDP), you can get there from the Spark service page by clicking "Quick Links" and then the "Spark History Server UI" button. From there, find your specific job under "App ID".

[Screenshot: Executors tab — 4217-sparkhistoryserver-executors.png]


4 REPLIES

Contributor

If you are looking for a way to monitor the job and determine which nodes it ran on, how many executors were used, and so on, you can see this in the Spark Web UI at http://<sparkhost>:4040.

http://spark.apache.org/docs/latest/monitoring.html

http://stackoverflow.com/questions/35059608/pyspark-on-cluster-make-sure-all-nodes-are-used

cheers,

Andrew

Expert Contributor

Thank you very much

Expert Contributor

A very big thank you
