<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>error using Pandas within PySpark transformation code - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/62653#M72518</link>
    <description>Forum thread: ImportError when using pandas DataFrames inside a PySpark transformation; pandas works without problems outside the transformation.</description>
    <pubDate>Fri, 16 Sep 2022 12:37:30 GMT</pubDate>
    <dc:creator>toamitjain</dc:creator>
    <dc:date>2022-09-16T12:37:30Z</dc:date>
    <item>
      <title>error using Pandas within PySpark transformation code</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/62653#M72518</link>
      <description>&lt;P&gt;I am getting the error below when using pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside a &lt;SPAN&gt;PySpark transformation, they work without any problem.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Error:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; ImportError: No module named indexes.base&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)&lt;BR /&gt;&amp;nbsp; &amp;nbsp; at org.apache.spark.api.python.PythonRunner$$anon$1.&amp;lt;init&amp;gt;(PythonRDD.scala:234)&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; ...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The error points to the line where I call the RDD.map() transformation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sample code below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;from pyspark.context import SparkContext&lt;BR /&gt;import pandas&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])&lt;BR /&gt;temp_vars = pandas.DataFrame(columns=['A','B','C'])&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;def processPeriods(period):&lt;BR /&gt;&amp;nbsp; &amp;nbsp; global accum&lt;BR /&gt;&amp;nbsp; &amp;nbsp; accum+=1&lt;BR /&gt;&amp;nbsp; &amp;nbsp; temp_vars['prepay_probability'] = 0.000008&lt;BR /&gt;&amp;nbsp; &amp;nbsp; temp_vars['CPR'] = 100 * (1- (1- temp_vars['prepay_probability'] ) **12 )&lt;BR /&gt;&amp;nbsp; &amp;nbsp; #return (100 * (1-0.000008) **12)&lt;BR /&gt;&amp;nbsp; &amp;nbsp; return temp_vars['CPR']&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;nr_periods=5&lt;BR /&gt;sc = SparkContext.getOrCreate()&lt;BR /&gt;periodListRDD = sc.parallelize(range(1, nr_periods))&lt;BR /&gt;accum = sc.accumulator(0)&lt;/P&gt;&lt;P&gt;rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()&lt;BR /&gt;print "rdd_list = ", rdd_list&lt;BR /&gt;CPR_loans.append( rdd_list )&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please suggest how I can make this work?&lt;/P&gt;&lt;P&gt;Thanks a lot.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 12:37:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/62653#M72518</guid>
      <dc:creator>toamitjain</dc:creator>
      <dc:date>2022-09-16T12:37:30Z</dc:date>
    </item>
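As a side note on the quoted code itself (independent of the ImportError): one way to sidestep pandas-serialization issues entirely is to keep pandas out of the mapped function, return plain Python objects from the executors, and build the DataFrame once on the driver after `collect()`. The sketch below follows that idea under the assumption that the goal is one CPR value per period; the names `process_period` and the `"period"` column are mine, not from the thread, and the Spark wiring is shown only in comments.

```python
import pandas

def process_period(period, prepay_probability=0.000008):
    # CPR formula from the question: annualize a monthly prepayment probability.
    # Plain floats and dicts pickle cleanly to executors; no pandas needed here.
    cpr = 100 * (1 - (1 - prepay_probability) ** 12)
    return {"period": period, "CPR": cpr}

# In Spark this would be:
#   rows = sc.parallelize(range(1, nr_periods)).map(process_period).collect()
# Shown here with a local loop so the sketch runs without a cluster:
rows = [process_period(p) for p in range(1, 5)]

# Build the DataFrame once, on the driver, from the collected rows.
CPR_loans = pandas.DataFrame(rows, columns=["period", "CPR"])
```

Note that `CPR_loans.append(rdd_list)` in the original code would also be a no-op as written, since `DataFrame.append` returns a new frame rather than modifying in place; constructing the frame from the collected rows avoids that pitfall too.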
    <item>
      <title>Re: error using Pandas within PySpark transformation code</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63001#M72519</link>
      <description>&lt;P&gt;Can someone please help me solve this issue? It is blocking our progress.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Dec 2017 11:43:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63001#M72519</guid>
      <dc:creator>toamitjain</dc:creator>
      <dc:date>2017-12-22T11:43:56Z</dc:date>
    </item>
    <item>
      <title>Re: error using Pandas within PySpark transformation code</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63013#M72520</link>
      <description>&lt;P&gt;This looks like a mismatch between the version of pandas on the driver (the one your code is pickled against) and whatever version is installed on the worker nodes where the executors run.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Dec 2017 17:18:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63013#M72520</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2017-12-22T17:18:53Z</dc:date>
    </item>
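The version mismatch suggested in the answer above can be checked directly: have each executor report the pandas version its Python actually imports and compare it with the driver's. This is a minimal sketch assuming a live `SparkContext` (`sc`, e.g. from a pyspark shell); `executor_pandas_version` and `check_versions` are hypothetical helper names, not Spark API.

```python
import pandas

def executor_pandas_version(_):
    # Runs on an executor: report the pandas version that worker's Python imports.
    import pandas as worker_pandas
    return worker_pandas.__version__

def check_versions(sc, partitions=4):
    # Compare the driver's pandas with the set of versions seen across executors.
    driver_version = pandas.__version__
    executor_versions = set(
        sc.parallelize(range(partitions), partitions)
          .map(executor_pandas_version)
          .collect()
    )
    return driver_version, executor_versions

# Usage, e.g. in a pyspark shell:
#   driver_version, executor_versions = check_versions(sc)
# If executor_versions != {driver_version}, a DataFrame pickled on the driver can
# fail to unpickle on a worker with an error like "No module named indexes.base",
# because pandas moved its internal module layout between releases.
```

Aligning the environments (same pandas on every node, or shipping a consistent Python environment with the job) is the usual fix once a mismatch is confirmed.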
  </channel>
</rss>