Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

error using Pandas within PySpark transformation code

Explorer

I am getting the error below when using pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside of a PySpark transformation, they work without any problem.

Error: 

    ImportError: No module named indexes.base

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)  

    ...

The error points to the line where I call the RDD.map() transformation.

Sample code below:

from pyspark.context import SparkContext
import pandas

CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A', 'B', 'C'])

def processPeriods(period):
    global accum
    accum += 1
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1 - (1 - temp_vars['prepay_probability']) ** 12)
    #return (100 * (1-0.000008) **12)
    return temp_vars['CPR']

nr_periods = 5
sc = SparkContext.getOrCreate()
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)

# The ImportError above is raised when this map/collect runs:
rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
CPR_loans.append(rdd_list)

Please suggest how I can make this work.

Thanks a lot. 

1 ACCEPTED SOLUTION

Master Collaborator

This looks like a mismatch between the pandas version on the driver (the one your Spark application uses) and the pandas version installed on the executors. pandas reorganized its internal index modules around the 0.20 release, so a DataFrame pickled under one version can fail to unpickle under another with exactly this kind of ImportError.
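
One quick way to confirm this is to compare the pandas version the driver sees with the versions the executors report. This is a minimal sketch; the helper name worker_pandas_version and the parallelism of 4 are illustrative, not from the original post:

import pandas
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()

# pandas version in the driver's Python environment
print "driver pandas:", pandas.__version__

def worker_pandas_version(_):
    # import inside the function so it runs in the executor's Python
    import pandas
    return pandas.__version__

# distinct pandas versions reported by the executor Python processes
print "executor pandas:", set(sc.parallelize(range(4), 4).map(worker_pandas_version).collect())

If the two differ, installing the same pandas release on every node, or pointing the PYSPARK_PYTHON environment variable at one consistent Python environment across the cluster, should clear up the unpickling failure.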


2 REPLIES

avatar
Explorer

Can someone please help me solve this issue? It is blocking our progress.
