
PYSPARK import pandas

Explorer

Hello, 

 

I am using the Cloudera VM 5.4.2.

 

I started PySpark with the command:

PYSPARK_DRIVER_PYTHON=ipython pyspark

Then I tried to import pandas:

import pandas as pd

and I got the following error:

Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: import pandas as pd
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-af55e7023913> in <module>()
----> 1 import pandas as pd

/usr/lib/python2.6/site-packages/pandas-0.18.0-py2.6-linux-x86_64.egg/pandas/__init__.py in <module>()
     20
     21 # numpy compat
---> 22 from pandas.compat.numpy_compat import *
     23
     24 try:

/usr/lib/python2.6/site-packages/pandas-0.18.0-py2.6-linux-x86_64.egg/pandas/compat/__init__.py in <module>()
    296     return wrapper
    297
--> 298 from collections import OrderedDict, Counter
    299
    300 if PY3:

ImportError: cannot import name OrderedDict

In [2]:

 

Why can't I import pandas?

 

 

Thanks in advance

Carlota Vina
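
Editor's note: the last frame of the traceback fails on `from collections import OrderedDict, Counter`. Both names were added to the standard library in Python 2.7, and the session above is running Python 2.6.6, so the failure points at the interpreter version rather than at a missing pandas install. A minimal sketch for checking this from inside the PySpark shell follows; the helper names `missing_collections_names` and `interpreter_report` are made up for illustration and are not part of pandas or PySpark.

```python
import collections
import sys


def missing_collections_names(names=("OrderedDict", "Counter")):
    """Return the requested names that this interpreter's collections module lacks."""
    return [name for name in names if not hasattr(collections, name)]


def interpreter_report():
    """Summarise which interpreter is running and which names it is missing."""
    return {
        "executable": sys.executable,
        "version": tuple(sys.version_info[:3]),
        "missing": missing_collections_names(),
    }


if __name__ == "__main__":
    report = interpreter_report()
    print("interpreter: %s" % report["executable"])
    print("version: %d.%d.%d" % report["version"])
    print("missing from collections: %s" % report["missing"])
```

If the "missing" list is not empty, the shell is running an interpreter older than 2.7, and the import on line 298 of pandas/compat/__init__.py cannot succeed there.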

6 REPLIES

Master Collaborator

The simplest explanation is that pandas isn't installed, of course. It's not part of Python. Consider using the Anaconda parcel to lay down a Python distribution for use with Pyspark that contains many commonly-used packages like pandas.

Explorer

Thanks for the reply.

I installed Anaconda3 with the following commands:

 

sudo yum install -y spark-core spark-master spark-worker spark-history-server spark-python


wget http://repo.continuum.io/archive/Anaconda3-4.0.0-Linux-x86_64.sh


bash Anaconda3-4.0.0-Linux-x86_64.sh

 

But I still can't import pandas.

Thanks in advance

Carlota Vina

Explorer

Hello,

 

After installing Anaconda3, I have pandas 0.18.0 and Python 3.5.

 

But when I start PySpark, the Python version is still 2.6.6:

 

 PYSPARK_DRIVER_PYTHON=ipython pyspark
Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.

 

Could this be the cause of the error?

Thanks in advance

Carlota Vina

Master Collaborator

Installing Anaconda doesn't make Pyspark use it. You would have to tell Pyspark to do so. I was referring to the Anaconda parcel for CDH, which does the setup, not the generic Anaconda distribution.
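
Editor's note: "telling PySpark" which interpreter to use is normally done through environment variables. PYSPARK_PYTHON selects the Python executable, and PYSPARK_DRIVER_PYTHON (already used in the question) selects the front end for the driver shell. The sketch below only builds such an environment. The helper name `pyspark_env` and the Anaconda path are assumptions for illustration, and the chosen interpreter still has to be a version that the installed Spark release supports.

```python
import os


def pyspark_env(worker_python, driver_python=None, base_env=None):
    """Return a copy of the environment with the PySpark interpreter variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = worker_python
    if driver_python is not None:
        env["PYSPARK_DRIVER_PYTHON"] = driver_python
    return env


if __name__ == "__main__":
    # Hypothetical install location; replace with the real path on your machine.
    env = pyspark_env("/path/to/anaconda/bin/python", driver_python="ipython")
    for key in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        print("%s=%s" % (key, env[key]))
    # To start the shell with these settings, one option would be:
    #   import subprocess; subprocess.call(["pyspark"], env=env)
```

The same effect can be had by exporting the two variables in the shell before running pyspark, which is what the command in the original question does for the driver.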

Explorer

Hello,

 

I have a .py file and I want to execute it instruction by instruction. Could you explain how to do this?

Thanks in advance

Carlota Vina

Contributor

 

I would advise using IPython's debugger, ipdb. It lets you run a script one statement at a time.

http://quant-econ.net/py/ipython.html#debugging

https://docs.python.org/3/library/pdb.html
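
Editor's note: a minimal sketch of stepping through code with the standard-library pdb, which offers the same basic commands as ipdb. The function `running_total` and its `debug` flag are invented for this example. Calling it with debug=True drops into the debugger prompt, where `n` runs the next line, `s` steps into a call, `p name` prints a variable, and `c` continues to the end.

```python
import pdb


def running_total(values, debug=False):
    """Sum the values, optionally pausing in the debugger before the loop."""
    if debug:
        pdb.set_trace()  # execution pauses here; type n, s, p or c at the prompt
    total = 0
    for value in values:
        total += value
    return total


if __name__ == "__main__":
    # Pass debug=True to step through the loop interactively.
    print(running_total([1, 2, 3]))
```

An existing script can also be stepped through without editing it by running `python -m pdb your_script.py`, which stops before the first statement.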

 

Finally, regarding the other statements above: when you are using Anaconda's ipython, remember to set the environment variable PYSPARK_PYTHON to the location of ipython (e.g. /usr/bin/ipython) so that PySpark knows where to find it.

 

Good luck.