PYSPARK import pandas

avatar
Explorer

Hello,

I am working with the Cloudera VM 5.4.2.

I started PySpark with the command

PYSPARK_DRIVER_PYTHON=ipython pyspark

Then I tried to import pandas

import pandas as pd

and I got the following error:

Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)

SparkContext available as sc, HiveContext available as sqlContext.

In [1]: import pandas as pd
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-af55e7023913> in <module>()
----> 1 import pandas as pd

/usr/lib/python2.6/site-packages/pandas-0.18.0-py2.6-linux-x86_64.egg/pandas/__init__.py in <module>()
     20
     21 # numpy compat
---> 22 from pandas.compat.numpy_compat import *
     23
     24 try:

/usr/lib/python2.6/site-packages/pandas-0.18.0-py2.6-linux-x86_64.egg/pandas/compat/__init__.py in <module>()
    296     return wrapper
    297
--> 298 from collections import OrderedDict, Counter
    299
    300 if PY3:

ImportError: cannot import name OrderedDict

In [2]:

 

Why can't I import pandas?

 

 

Thanks in advance

Carlota Vina

6 REPLIES

avatar

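pandas is not part of the Python standard library, so it has to be installed for the interpreter that Pyspark actually runs. In your traceback pandas 0.18.0 is found, but the import fails on "from collections import OrderedDict, Counter": the interpreter is Python 2.6.6, and both classes were only added to the collections module in Python 2.7. Consider using the Anaconda parcel to lay down a Python distribution for use with Pyspark that contains many commonly-used packages like pandas.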
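A quick way to see which interpreter the shell is really using, and whether it has the classes the traceback complains about, is to ask Python directly. The following is only a minimal sketch; the helper name interpreter_report is made up for illustration. It relies on the fact that OrderedDict and Counter were added to the collections module in Python 2.7, so on a 2.6 interpreter both checks should come back False.

```python
import sys
import collections


def interpreter_report():
    """Summarise the running interpreter and whether it provides the
    collections classes named in the failing import."""
    return {
        "executable": sys.executable,              # path of the running Python
        "version": tuple(sys.version_info[:3]),    # e.g. (2, 6, 6)
        "has_OrderedDict": hasattr(collections, "OrderedDict"),
        "has_Counter": hasattr(collections, "Counter"),
    }


report = interpreter_report()
for key in sorted(report):
    print("%s: %s" % (key, report[key]))
```

Running this inside the pyspark shell, rather than in a separate terminal, matters here: it reports on the interpreter PySpark started, which may not be the one a fresh installation put first on the PATH.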

avatar
Explorer

Thanks for the reply.

I installed Anaconda3. These are the commands I ran:

sudo yum install -y spark-core spark-master spark-worker spark-history-server spark-python

wget http://repo.continuum.io/archive/Anaconda3-4.0.0-Linux-x86_64.sh

bash Anaconda3-4.0.0-Linux-x86_64.sh

But I still can't import pandas.

Thanks in advance

Carlota Vina

avatar
Explorer

Hello,

After installing Anaconda3 I have pandas 0.18.0 and Python 3.5.

But when I start PySpark, the Python version is still 2.6.6:

PYSPARK_DRIVER_PYTHON=ipython pyspark
Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.

Could this be the cause of the error?

Thanks in advance

Carlota Vina

avatar

Installing Anaconda doesn't make Pyspark use it. You would have to tell Pyspark to do so. I was referring to the Anaconda parcel for CDH, which does the setup, not the generic Anaconda distribution.
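For a manual setup, "telling Pyspark" generally means setting two environment variables before the shell starts: PYSPARK_PYTHON for the interpreter used by the job, and PYSPARK_DRIVER_PYTHON for the driver shell. The sketch below only builds such an environment. The helper pyspark_env and the path /opt/anaconda/bin/python are assumptions for illustration, not the actual install location on the VM, and the chosen interpreter still has to be a Python version that the installed Spark release supports.

```python
import os


def pyspark_env(worker_python, driver_python=None, base_env=None):
    """Return a copy of the environment with PySpark's interpreter
    variables set. Nothing is launched and os.environ is not modified."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = worker_python
    if driver_python is not None:
        env["PYSPARK_DRIVER_PYTHON"] = driver_python
    return env


# Hypothetical location; replace with wherever Anaconda's python really lives.
ANACONDA_PYTHON = "/opt/anaconda/bin/python"

# base_env={} keeps the printed result small for the demonstration.
env = pyspark_env(ANACONDA_PYTHON, driver_python="ipython", base_env={})
print(env)

# To start the shell with these settings (requires pyspark on the PATH):
# import subprocess
# subprocess.call(["pyspark"], env=pyspark_env(ANACONDA_PYTHON, "ipython"))
```

The driver and the job interpreter should be the same Python version, so the ipython named here would need to be the one that belongs to the same installation.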

avatar
Explorer

Hello,

I have a .py file and I want to execute it one statement at a time. Could you explain how to do this?

Thanks in advance

Carlota Vina

avatar
Contributor

 

I would advise using IPython's debugger, ipdb. It lets you step through a script one statement at a time.

http://quant-econ.net/py/ipython.html#debugging

https://docs.python.org/3/library/pdb.html
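To give a feel for what stepping looks like, here is a small sketch using the standard-library pdb module described in the second link (ipdb offers the same commands with IPython niceties). It assumes Python 3. The debugger commands are fed from a string so the example runs unattended; in an interactive session you would type next, p values and continue at the (Pdb) prompt yourself. The function total is a made-up stand-in for the script being debugged.

```python
import io
import pdb


def total(values):
    """Toy function standing in for the script being debugged."""
    running = 0
    for v in values:
        running += v
    return running


# Commands a user would normally type at the (Pdb) prompt:
#   next      -> execute the current line and stop at the following one
#   p values  -> print the value of the variable `values`
#   continue  -> run to the end without further stops
commands = io.StringIO("next\np values\ncontinue\n")
transcript = io.StringIO()

# readrc=False ignores any personal .pdbrc so the run is reproducible.
debugger = pdb.Pdb(stdin=commands, stdout=transcript, readrc=False)
result = debugger.runcall(total, [1, 2, 3])

print(transcript.getvalue())   # what the debugger session displayed
print("result:", result)
```

For a whole file, the usual interactive entry point is to start it under the debugger with python -m pdb followed by the script name, then use the same commands.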

 

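Finally, regarding the posts above: when you use Anaconda with PySpark, remember to set the environment variable PYSPARK_PYTHON to the location of Anaconda's python executable so PySpark knows which interpreter to use. If you also want IPython as the driver shell, set PYSPARK_DRIVER_PYTHON to the location of ipython (e.g. /usr/bin/ipython).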

 

Good luck.