
error using Pandas within PySpark transformation code

Explorer

I am getting the error below when I use pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside a PySpark transformation, they work without any problem.

Error: 

    ImportError: No module named indexes.base

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)  

    ...

 

The error points to the line where I call the RDD.map() transformation.

 

Sample code below: 

 

from pyspark.context import SparkContext
import pandas

CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A', 'B', 'C'])

def processPeriods(period):
    global accum
    accum += 1
    # temp_vars is a pandas DataFrame created on the driver; it is captured
    # in this function's closure and shipped to the executors
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1 - (1 - temp_vars['prepay_probability']) ** 12)
    #return (100 * (1-0.000008) **12)
    return temp_vars['CPR']

nr_periods = 5
sc = SparkContext.getOrCreate()
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)

# the ImportError above is raised at this map() call
rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
# note: DataFrame.append returns a new DataFrame, so this result is discarded
CPR_loans.append(rdd_list)

 

 

Please suggest how I can make this work.

Thanks a lot.

1 ACCEPTED SOLUTION

Re: error using Pandas within PySpark transformation code

Master Collaborator

This looks like a mismatch between the pandas version on your driver, where Spark pickles the function you pass to map(), and the pandas version installed on the worker nodes where the executors run.
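
If this diagnosis is right, a quick way to confirm it is to compare the pandas version on the driver with whatever the executors import. A minimal sketch (the helper name worker_pandas_version is just illustrative):

from pyspark.context import SparkContext
import pandas

sc = SparkContext.getOrCreate()

def worker_pandas_version(_):
    # imported on the executor, not the driver
    import pandas
    return pandas.__version__

print("driver pandas:   " + pandas.__version__)
versions = sc.parallelize(range(4), 4).map(worker_pandas_version).collect()
print("executor pandas: " + str(set(versions)))

If the two versions differ, installing the same pandas release on every worker node, or pointing PYSPARK_PYTHON at one shared environment on all nodes, should clear the ImportError. Around pandas 0.20 the internal pandas.indexes package was moved to pandas.core.indexes, so a DataFrame pickled under one version can fail to unpickle under the other with exactly this "No module named indexes.base" error.

You can also sidestep the problem entirely by not capturing pandas objects in the closure: compute plain Python numbers inside the transformation and build the DataFrame on the driver afterwards. A sketch under that assumption (the accumulator and the loans column are omitted for brevity; cpr_for_period is an illustrative name):

def cpr_for_period(period):
    # plain Python arithmetic only - nothing from pandas is pickled to the workers
    prepay_probability = 0.000008
    return 100 * (1 - (1 - prepay_probability) ** 12)

cpr_values = periodListRDD.map(cpr_for_period).collect()
CPR_loans = pandas.DataFrame({"CPR": cpr_values})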

2 REPLIES

Re: error using Pandas within PySpark transformation code

Explorer

Can someone please help me solve this issue? It is blocking our progress.
