Support Questions

Carlota · 07-05-2016

Hello,

I am working with the Cloudera VM 5.4.2.

I started PySpark with the command:

PYSPARK_DRIVER_PYTHON=ipython pyspark

Then I tried to import pandas:

import pandas as pd

I get the following error:

Using Python version 2.6.6 (r266:84292, Feb 22 2013 00:00:18)

SparkContext available as sc, HiveContext available as sqlContext.

In [1]: import pandas as pd

---------------------------------------------------------------------------

ImportError Traceback (most recent call last)

<ipython-input-1-af55e7023913> in <module>()

----> 1 import pandas as pd

/usr/lib/python2.6/site-packages/pandas-0.18.0-py2.6-linux-x86_64.egg/pandas/__init__.py in <module>()

20

21 # numpy compat

---> 22 from pandas.compat.numpy_compat import *

23

24 try:

/usr/lib/python2.6/site-packages/pandas-0.18.0-py2.6-linux-x86_64.egg/pandas/compat/__init__.py in <module>()

296 return wrapper

297

--> 298 from collections import OrderedDict, Counter

299

300 if PY3:

ImportError: cannot import name OrderedDict

In [2]:

Why can't I import Pandas?

Thanks in advance

Carlota Vina

srowen · 07-05-2016

The simplest explanation is that pandas isn't installed, of course. It's not part of Python. Consider using the Anaconda parcel to lay down a Python distribution for use with Pyspark that contains many commonly-used packages like pandas.
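One detail in the traceback is worth checking before reinstalling anything: the failing line is `from collections import OrderedDict, Counter`, and it is raised from a pandas 0.18.0 egg under /usr/lib/python2.6/site-packages. Both `OrderedDict` and `Counter` were added to `collections` in Python 2.7, so the interpreter version (2.6.6 here) may be the actual problem rather than a missing package. A minimal sketch for checking this from inside the PySpark shell follows; the helper name `describe_interpreter` is made up for illustration.

```python
import sys


def describe_interpreter():
    """Report which Python is running and whether OrderedDict is importable."""
    try:
        # OrderedDict was added to collections in Python 2.7,
        # so this import fails on Python 2.6.
        from collections import OrderedDict  # noqa: F401
        has_ordered_dict = True
    except ImportError:
        has_ordered_dict = False
    return {
        "executable": sys.executable,            # path of the running interpreter
        "version": tuple(sys.version_info[:3]),  # e.g. (2, 6, 6)
        "has_ordered_dict": has_ordered_dict,
    }


info = describe_interpreter()
print(info)
```

If `has_ordered_dict` comes back `False`, the shell is running on an interpreter older than 2.7, and this pandas build cannot be imported there regardless of how it was installed.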

Carlota · 07-05-2016

Thanks for the reply.

I installed Anaconda3:

sudo yum install -y spark-core spark-master spark-worker spark-history-server spark-python

wget http://repo.continuum.io/archive/Anaconda3-4.0.0-Linux-x86_64.sh

bash Anaconda3-4.0.0-Linux-x86_64.sh

But I still can't import pandas.

Thanks in advance

Carlota Vina

Carlota · 07-05-2016

Hello,

With Anaconda3 installed, I have pandas 0.18.0 and Python 3.5.

But when I start PySpark, the Python version is still 2.6.6:

PYSPARK_DRIVER_PYTHON=ipython pyspark
Python 2.6.6 (r266:84292, Feb 22 2013, 00:00:18)
Type "copyright", "credits" or "license" for more information.

IPython 1.2.1 -- An enhanced Interactive Python.

Could this be the cause of the error?

Thanks in advance

Carlota Vina

srowen · 07-05-2016

Installing Anaconda doesn't make Pyspark use it. You would have to tell Pyspark to do so. I was referring to the Anaconda parcel for CDH, which does the setup, not the generic Anaconda distribution.
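For a manual install, "telling PySpark" is normally done through two environment variables that Spark reads at launch: PYSPARK_PYTHON (the interpreter for the driver and the workers) and PYSPARK_DRIVER_PYTHON (the driver only, which is what the original command sets to ipython). The sketch below only builds such an environment. The Anaconda path is a hypothetical placeholder, not a path taken from this thread, and the helper name `pyspark_env` is invented for illustration. Note also that Python 3 support was added in Spark 1.4; the thread does not state which Spark version the VM runs, so that is worth checking before pointing PySpark at Python 3.5.

```python
import os


def pyspark_env(python_path, driver_python=None, base_env=None):
    """Return a copy of the environment with the PySpark interpreter variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Interpreter used by the driver and the workers.
    env["PYSPARK_PYTHON"] = python_path
    # Interpreter used by the driver only; falls back to the same Python.
    env["PYSPARK_DRIVER_PYTHON"] = driver_python or python_path
    return env


# Hypothetical install location; substitute the real path of your Anaconda python.
env = pyspark_env(
    "/opt/anaconda3/bin/python",
    driver_python="ipython",
    base_env={"PATH": "/usr/bin"},
)
print(env)

# To actually start the shell, pass the full environment instead:
#   import subprocess
#   subprocess.call(["pyspark"], env=pyspark_env("/opt/anaconda3/bin/python", "ipython"))
```

The driver and the workers should run the same Python minor version, so the ipython used for the driver should come from the same installation as PYSPARK_PYTHON.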

Carlota · 07-05-2016

Hello,

I have a .py file and I want to execute it one instruction at a time. Could you explain how to do this?

Thanks in advance

Carlota Vina

MVERVUURT · 07-06-2016

I would advise using ipdb, the IPython debugger. It lets you run a script one statement at a time.

* http://quant-econ.net/py/ipython.html#debugging

* https://docs.python.org/3/library/pdb.html
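As a rough illustration of stepping, here is a small sketch using the standard-library pdb, which ipdb mirrors. The function `total` and its `debug` flag are made up for the example. With `debug=True` execution pauses at the breakpoint, where `n` runs the next line, `s` steps into a call, `p v` prints a variable, and `c` continues.

```python
import pdb


def total(values, debug=False):
    """Sum the values, optionally pausing in the debugger on each iteration."""
    result = 0
    for v in values:
        if debug:
            # Execution stops here and waits for debugger commands.
            pdb.set_trace()
        result += v
    return result


# With debug left off the function runs normally.
print(total([1, 2, 3]))  # prints 6
```

To step through a whole file without editing it, `python -m pdb myscript.py` starts it under the debugger from the first line, and inside IPython `%run -d myscript.py` does the same (`myscript.py` is a placeholder name).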

Finally, regarding the posts above: when using Anaconda's ipython, remember to set the environment variable PYSPARK_DRIVER_PYTHON to the location of ipython (e.g. /usr/bin/ipython), and PYSPARK_PYTHON to the matching python interpreter, so that PySpark knows which ones to use.

Good luck.
