
Local PySpark in virtualenv does not connect to data lake

New Contributor

Hello

 

Inside a Cloudera default installation with Spark I create and activate a Python Virtual Environment with all the libraries that I need.

The only problem I have is with the PySpark library inside the virtualenv: it cannot connect to tables in the data lake (e.g., Parquet tables) to run queries.

 

When I use the default PySpark library OUTSIDE the virtualenv, in the default Cloudera installation with Spark, I have no problems running the queries; it works.

Can you please help me with a solution to use PySpark inside the Python virtualenv and run queries against tables in the data lake?

Thanks!!


4 REPLIES

Community Manager

@hightek2699 Welcome to our community! To help you get the best possible answer, I have tagged in our Spark expert @RangaReddy, who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.

Master Collaborator

Hi @hightek2699 

 

Could you please share the exact issue you are seeing when running inside the virtual environment? Please also provide the steps you have followed.

New Contributor

Good morning 

 

@RangaReddy, thanks for your help. The exact issue is with the PySpark library inside the virtualenv: it cannot connect to tables in the data lake (e.g., Parquet tables) to run queries. The error message is "Path does not exist".

 

(screenshot attached showing the "Path does not exist" error)

 

When I use the default PySpark library OUTSIDE the virtualenv, in the default Cloudera installation with Spark, I have no problems running the queries; it works.
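
A minimal sketch of what seems to be happening, assuming the pip-installed PySpark runs in local mode without the cluster's Hadoop configuration (the table path and namenode address below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venv-test").getOrCreate()

# With pip-installed PySpark the cluster's Hadoop config is usually not
# picked up, so the default filesystem is the local one and this raises
# "Path does not exist" even though the table exists in HDFS:
# spark.read.parquet("/warehouse/tablespace/my_table")

# Fully qualifying the URI targets HDFS explicitly; the namenode
# host/port are placeholders for the cluster's actual values:
df = spark.read.parquet("hdfs://namenode:8020/warehouse/tablespace/my_table")
df.show()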

 

 

I have changed this configuration so that it points back to the default Spark installation, where the binary files are located:

 

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
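
A quick way to check, inside the virtualenv, whether that environment variable and the PySpark module that Python actually imports agree (the paths in the comments are illustrative):

import os
import pyspark

print(os.environ.get("SPARK_HOME"))  # e.g. /opt/cloudera/parcels/CDH/lib/spark
print(pyspark.__file__)              # pip install: .../site-packages/pyspark/__init__.py
print(pyspark.__version__)           # compare with the cluster's Spark version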

 

 

The steps that I follow are:

 

1. Create the environment with the desired version of Python

    python3.6 -m venv <environment_name>

 

2. Activate the created environment

    source <environment_name>/bin/activate

 

3. pip install pyspark

 

 

I referred to this documentation, but I can't get my local installation of PySpark working:


A Case for Isolated Virtual Environments with PySpark - inovex GmbH

Thanks!

Master Collaborator

Hi @hightek2699 

Don't install PySpark manually using the pip install command. Use the Cloudera-provided PySpark.
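
A minimal sketch of one way to do that from inside the virtualenv, assuming the default CDH parcel layout (the paths, py4j zip name, and table path below are placeholders that may differ per cluster):

import glob
import os
import sys

SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark"
os.environ["SPARK_HOME"] = SPARK_HOME
os.environ["PYSPARK_PYTHON"] = sys.executable  # run Python workers with the venv's interpreter

# Put the cluster's PySpark and its bundled py4j on the import path
# instead of a pip-installed copy:
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*-src.zip"))[0])

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("venv-with-cluster-spark").getOrCreate()
spark.read.parquet("/warehouse/tablespace/my_table").show()  # hypothetical table path

Alternatively, launch the job with the cluster's spark-submit and set PYSPARK_PYTHON to the virtualenv's interpreter so the executors use the same environment.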