Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

error using Pandas within PySpark transformation code

Explorer

I am getting the error below when using pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside of a PySpark transformation, they work without any problem.

Error: 

    ImportError: No module named indexes.base

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)  

    ...

The error points to the line where I call the RDD.map() transformation.

Sample code below:

from pyspark.context import SparkContext
import pandas

CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A', 'B', 'C'])

def processPeriods(period):
    global accum
    accum += 1
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1 - (1 - temp_vars['prepay_probability']) ** 12)
    #return (100 * (1-0.000008) **12)
    return temp_vars['CPR']

nr_periods = 5
sc = SparkContext.getOrCreate()
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)

# The ImportError above is raised when this map/collect runs:
rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
CPR_loans.append(rdd_list)

Please suggest how I can make this work.

Thanks a lot. 

1 ACCEPTED SOLUTION

Master Collaborator

This looks like a mismatch between the pandas version on the driver (the one your Spark application uses) and the pandas version installed on the executors. pandas reorganized its internal index modules around the 0.20 release, so a DataFrame pickled under one version can fail to unpickle under another with exactly this kind of ImportError.
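
One quick way to confirm this is to compare the pandas version the driver sees with the versions the executors report. This is a minimal sketch; the helper name worker_pandas_version and the parallelism of 4 are illustrative, not from the original post:

import pandas
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# pandas version in the driver's Python environment
print "driver pandas:", pandas.__version__

def worker_pandas_version(_):
    # import inside the function so it runs in the executor's Python
    import pandas
    return pandas.__version__

# distinct pandas versions reported by the executor Python processes
print "executor pandas:", set(sc.parallelize(range(4), 4).map(worker_pandas_version).collect())

If the two differ, installing the same pandas release on every node, or pointing the PYSPARK_PYTHON environment variable at one consistent Python environment across the cluster, should clear up the unpickling failure.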


2 REPLIES

avatar
Explorer

Can someone please help me solve this issue? It is blocking our progress.
