Support Questions

Find answers, ask questions, and share your expertise

Distributed processing of a DataFrame with PySpark

Expert Contributor

Hello,

I would like to know by what method (or line of code) I can verify that my processing is actually executed on all the nodes of my cluster with PySpark?

Thank you kindly for your help.

Here is my code:

from pyspark.sql import Row

# Read the file from HDFS and build one Row per line
rdd = sc.textFile('hdfs:../personne.txt')
rdd_split = rdd.map(lambda x: x.split(','))
rdd_people = rdd_split.map(lambda x: Row(name=x[0], age=int(x[1]), ca=int(x[2])))

# Create a DataFrame and register it for SQL queries
df_people = sqlContext.createDataFrame(rdd_people)
df_people.registerTempTable("people")

# Note: collect() pulls all rows back to the driver
df_people.collect()


1 ACCEPTED SOLUTION


@Andrew Sears's answer is correct. Once you bring up the Spark UI for a running application (http://{driver-node}:4040), you can navigate to the Executors tab, which shows detailed statistics about the driver and each executor, as in the screenshot below; for completed applications, the same view is available from the Spark History Server. Note that when running Hortonworks Data Platform (HDP), you can get there from the Spark service page by clicking "Quick Links" and then the "Spark History Server UI" button. From there, find your specific job under "App ID".

[Screenshot: Executors tab — 4217-sparkhistoryserver-executors.png]


4 REPLIES

Contributor

If you are looking for a way to monitor the job and determine which nodes it ran on, how many executors were used, and so on, you can see this in the Spark Web UI at http://<sparkhost>:4040.

http://spark.apache.org/docs/latest/monitoring.html

http://stackoverflow.com/questions/35059608/pyspark-on-cluster-make-sure-all-nodes-are-used

cheers,

Andrew

Expert Contributor

Thank you very much

Expert Contributor

A very big thank you
