Reply
Explorer
Posts: 7
Registered: ‎12-05-2017
Accepted Solution

error using Pandas within PySpark transformation code

I am getting below error when using Pandas Dataframes inside PySpark transformation code. But when I use Pandas dataframes anywhere outside PySpark transformation, it works without any problem.

 

Error: 

    ImportError: No module named indexes.base

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)  

    ...

 

Error points towards the line where I am calling RDD.map() transformation.

 

Sample code below: 

 

from pyspark.context import SparkContext
import pandas

 

CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A','B','C'])

 

def processPeriods(period):
    global accum
    accum+=1
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1- (1- temp_vars['prepay_probability'] ) **12 )
    #return (100 * (1-0.000008) **12)
    return temp_vars['CPR']

 

nr_periods=5
sc = SparkContext.getOrCreate()
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)

rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
CPR_loans.append( rdd_list )

 

 

Please suggest how can I make it work?

Thanks a lot. 

Explorer
Posts: 7
Registered: ‎12-05-2017

Re: error using Pandas within PySpark transformation code

Can someone please help me solve this issue. It is blocking our progress.

Highlighted
Cloudera Employee
Posts: 481
Registered: ‎08-11-2014

Re: error using Pandas within PySpark transformation code

This looks like a mismatch between the version of pandas Spark uses and that you have on the driver, and whatever is installed with the workers on the executors.

Announcements