
Jupyter Notebook PySpark kernel using an older pip version


I am using a Jupyter Notebook provided by an AWS managed service called EMR Studio. My understanding is that these notebooks are hosted on the EC2 instances I provision as part of my EMR cluster; specifically, the PySpark kernel runs on the task nodes.

Currently, when I run `sc.list_packages()` I see that pip is at version 9.0.1, whereas if I SSH onto the master node and run `pip list` I see that pip is at version 20.2.2. I have issues running `sc.install_pypi_package()` due to the older pip version in the notebook.

In a notebook cell, if I run `import pip` and then `pip`, I see that the module is located at:

<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/'>
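For reference, a diagnostic cell along these lines (plain Python only, no EMR-specific calls) shows both the version and the on-disk location of the interpreter and of pip in one place:

```python
import sys
import pip

# Where this kernel's Python and pip actually live; inside the YARN
# container these paths point into the application's working directory.
print("python:", sys.executable)
print("pip version:", pip.__version__)
print("pip location:", pip.__file__)
```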

I am assuming this is most likely inside a virtualenv of some sort, running as a YARN application on the task node? I am unsure of this, and I have no concrete evidence of how the virtualenv is provisioned, if there is one.
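Assuming it is a virtualenv, a plain-Python check like the following sketch can confirm that from inside the kernel without any EMR-specific APIs:

```python
import sys

# Old-style virtualenv (15.x era) sets sys.real_prefix; the stdlib venv and
# newer virtualenv instead make sys.prefix differ from sys.base_prefix.
in_virtualenv = (
    hasattr(sys, "real_prefix")
    or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
)
print("in virtualenv:", in_virtualenv)
print("sys.prefix:", sys.prefix)
```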

If I run `sc.uninstall_package('pip')` and then `sc.list_packages()`, I see pip at version 20.2.2, which is what I want to start with. The module path is the same as previously mentioned.

How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?

If I import a package like numpy I see that the module is located at a different location from where pip is. Any reason for this?
<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/'>
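My understanding is that Python resolves each import by walking `sys.path` in order, so pip and numpy can legitimately load from different site-packages directories when both directories are on the path. A small sketch to list where each package would be loaded from (numpy is guarded in case it is not importable):

```python
import importlib.util
import sys

# find_spec reports the file a module would be loaded from, without importing it.
for name in ("pip", "numpy"):
    spec = importlib.util.find_spec(name)
    print(name, "->", spec.origin if spec else "not importable")

# The search order that decides which copy of a package wins:
print(sys.path[:5])
```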

As for pip 9.0.1, the only reference I can find at the moment is in `/lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl`. One directory up I see a file called `virtualenv-15.1.0-py2.7.egg-info`, which, if I `cat` it, states that it upgrades to pip 9.0.1. I tried removing the pip 9.0.1 wheel file and replacing it with a pip 20.2.2 wheel, which caused issues with the PySpark kernel being able to provision properly. There is also a `` file which references `__version__ = "15.1.0"`.
