Spark-submit suddenly doesn't recognize modules on the cluster

Explorer

Hi,

 

We've been experiencing a peculiar issue over the past few days: our Python scripts started failing when run with spark-submit. At first glance, our logs show the scripts hitting syntax errors wherever we use code related to imported modules, but further troubleshooting showed that the modules themselves are the issue. When we comment out the pieces of code that throw syntax errors, we get import errors instead ("No module named...").

 

We have two versions of Python on our cluster, but spark-submit still appears to be using the correct one, the version with all our modules installed. Our scripts run just fine through pyspark; for some reason, however, spark-submit does not recognize the imported modules when we run scripts through it.

 

What is more, YARN doesn't seem to recognize these jobs as failed. They are not logged at all, probably because they crash as soon as we start importing modules, so we have no YARN logs for these jobs.

 

Any insight would be greatly appreciated.

 

Thanks.

1 ACCEPTED SOLUTION

Super Collaborator

Hi @imule 

 

Add the following parameters to your spark-submit command:

 

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=<python3_path>
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=<python3_path>

 

Note:

1. Ensure that <python3_path> exists on all nodes.

2. Ensure that the required modules are installed on each node.
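
One way to check both notes from inside a running job is to probe the executors directly. The following is only a rough sketch and not part of the original answer: the helper names `probe` and `collect_node_report` are hypothetical, it assumes an existing pyspark SparkContext is passed in as `sc`, and the functions are assumed to be defined in the submitted script itself so that Spark can ship them to the executors.

```python
import importlib
import socket
import sys


def probe(modules):
    """Report this process's host, interpreter, and any modules that fail to import."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    version = "%d.%d.%d" % tuple(sys.version_info[:3])
    return (socket.gethostname(), sys.executable, version, tuple(missing))


def collect_node_report(sc, modules, num_tasks=32):
    """Run probe() in num_tasks Spark tasks and return the distinct results.

    `sc` is assumed to be an existing pyspark SparkContext. Tasks are not
    guaranteed to land on every node, so treat the result as a sample.
    """
    modules = list(modules)
    rdd = sc.parallelize(range(num_tasks), num_tasks)
    return rdd.map(lambda _: probe(modules)).distinct().collect()
```

In a submitted script you could call `collect_node_report(spark.sparkContext, ["numpy", "pandas"])` and print the result; the module names here are placeholders for whatever your job actually imports. Each tuple shows a host name, the interpreter path used there, its version, and the modules that could not be imported.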


3 REPLIES


Master Mentor

@imule 

Firstly, I would like to ask whether there were any recent changes on the cluster, e.g. patching or RPM updates?
If spark-submit was running successfully before, the failure could be linked to the Python version.


On the edge/gateway node, check which Python version is active, then deactivate the conda environment and check again:

# python -V
Python 3.7.7
# conda deactivate
# python -V
Python 2.7.5

 

Then try relaunching your spark-submit job.
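
To see which interpreter spark-submit actually picks up, a short report at the very top of the submitted script can help. This is a minimal sketch under assumptions, not something from the thread: the function names `interpreter_report` and `require_python3` are made up for illustration, and the code deliberately avoids Python 3-only syntax so that it still runs if Python 2 is the interpreter being used.

```python
import os
import sys


def interpreter_report(environ=None):
    """Describe the interpreter running this script and the PySpark-related env vars."""
    environ = os.environ if environ is None else environ
    return {
        "executable": sys.executable,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
        "PYSPARK_PYTHON": environ.get("PYSPARK_PYTHON"),
        "PYSPARK_DRIVER_PYTHON": environ.get("PYSPARK_DRIVER_PYTHON"),
    }


def require_python3(version_info=None):
    """Exit with a readable message when the script is not running under Python 3."""
    version_info = sys.version_info if version_info is None else version_info
    if version_info[0] < 3:
        sys.exit("Expected Python 3 but running under %s" % sys.executable)


# Write to stderr so the report appears in the spark-submit console output.
sys.stderr.write("%r\n" % (interpreter_report(),))
require_python3()
```

Note that a syntax error in the same file is raised when the file is compiled, before any of it runs, so this guard only produces a clearer message if the Python 3-only code lives in modules imported after it.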

Community Manager

@imule, has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.



Regards,

Vidya Sargur,
Community Manager

