
Local PySpark in virtualenv does not connect to data lake

New Contributor

Hello

 

On a default Cloudera installation with Spark, I create and activate a Python virtual environment with all the libraries I need.

The only problem I have is with the PySpark library inside the virtualenv: it does not connect to tables in the data lake, such as Parquet tables, so I cannot run queries.

When I use the default PySpark library OUTSIDE the virtualenv, in the default Cloudera installation with Spark, I have no problems running the queries; it works.

Can you please help me with a solution for using PySpark inside the Python virtualenv and querying tables in the data lake?

Thanks!!


4 REPLIES

Community Manager

@hightek2699 Welcome to our community! To help you get the best possible answer, I have tagged our Spark expert @RangaReddy, who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Master Collaborator

Hi @hightek2699 

 

Could you please share the exact issue you are seeing when running inside the virtual environment, and also provide the steps you have followed?

New Contributor

Good morning 

 

@RangaReddy, thanks for your help. The exact issue is that the PySpark library inside the virtualenv does not connect to tables in the data lake, such as Parquet tables, when I run queries; the error message is "Path does not exist".

 

(screenshot: the query failing with "Path does not exist")

 

When I use the default PySpark library OUTSIDE the virtualenv, in the default Cloudera installation with Spark, I have no problems running the queries; it works.
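To illustrate the difference, here is a minimal check I can run inside the activated virtualenv (assuming nothing beyond a default SparkSession). The pip-installed PySpark starts in local mode, so unqualified paths are looked up on the local filesystem instead of HDFS:

    # Inside the activated virtualenv: the pip copy defaults to local mode,
    # so a path like /warehouse/my_table is looked up on the local disk.
    python -c "from pyspark.sql import SparkSession; print(SparkSession.builder.getOrCreate().sparkContext.master)"
    # prints: local[*]   (outside the venv, on CDH, this is typically yarn)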

 

 

I have changed this configuration, but it returns to the default Spark configuration, where the binary files are located:

 

export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
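As far as I can tell, exporting SPARK_HOME does not change which module Python imports; inside the virtualenv, import pyspark still resolves to the pip-installed copy in site-packages:

    export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark

    # Still prints the venv's site-packages copy, not the parcel's python libs
    python -c "import pyspark; print(pyspark.__file__)"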

 

 

The steps that I follow are:

 

1. Create the environment with the desired version of Python

    python3.6 -m venv <environment_name>

 

2. Activate the created environment

    source <environment_name>/bin/activate

 

3. Install PySpark inside the environment

    pip install pyspark
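Put together in one shell session (the interpreter name, here python3.6, may differ on your system):

    # 1. Create the environment with the desired Python version
    python3.6 -m venv <environment_name>

    # 2. Activate it
    source <environment_name>/bin/activate

    # 3. Install PySpark from PyPI
    pip install pyspark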

 

 

I referred to this documentation, but I cannot get my local installation of PySpark working:


A Case for Isolated Virtual Environments with PySpark - inovex GmbH

Thanks!

Master Collaborator

Hi @hightek2699 

Don't install PySpark manually with pip install; use the Cloudera-provided PySpark instead.
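A minimal sketch of what that can look like, assuming the default CDH parcel layout (the py4j archive name depends on your Spark version, and the venv name here is just an example):

    # Create and activate a venv WITHOUT pip-installing pyspark into it
    python3.6 -m venv pyspark_env
    source pyspark_env/bin/activate

    # Point Python at the cluster's own PySpark instead
    export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
    PY4J_ZIP=$(ls "$SPARK_HOME"/python/lib/py4j-*-src.zip)
    export PYTHONPATH="$SPARK_HOME/python:$PY4J_ZIP:$PYTHONPATH"

    # Make executors use the venv's interpreter as well
    export PYSPARK_PYTHON="$VIRTUAL_ENV/bin/python"

    # Sanity check: should now print the parcel's copy, not site-packages
    python -c "import pyspark; print(pyspark.__file__)"

Because this uses the parcel's Spark, it inherits the cluster's Hadoop configuration, so unqualified paths resolve against HDFS and the "Path does not exist" error should go away. If you must keep a pip-installed copy instead, pinning pyspark to the cluster's Spark version and exporting HADOOP_CONF_DIR (typically /etc/hadoop/conf) may also work, but the Cloudera-provided copy is the safer route.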