Posts: 10
Registered: 12-05-2017
Accepted Solution

error using Pandas within PySpark transformation code

I am getting the error below when using pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside a PySpark transformation, they work without any problem.



    ImportError: No module named indexes.base

    at org.apache.spark.api.python.PythonRunner$$anon$
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)  



The error points to the line where I am calling the transformation.


Sample code below: 


from pyspark.context import SparkContext
import pandas


CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A','B','C'])


def processPeriods(period):
    global accum
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1- (1- temp_vars['prepay_probability'] ) **12 )
    #return (100 * (1-0.000008) **12)
    return temp_vars['CPR']


sc = SparkContext.getOrCreate()
nr_periods = 12  # placeholder; assumed to be defined elsewhere in the original code
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)

rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
CPR_loans = CPR_loans.append(rdd_list)  # DataFrame.append returns a new frame
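For context on why the error surfaces only inside the transformation: Spark pickles `processPeriods` together with the global `temp_vars` DataFrame it references, ships the bytes to the executors, and unpickles them there with whatever pandas the executors have. The module path of pandas Index classes changed around pandas 0.20 (`pandas.indexes.base` became `pandas.core.indexes.base`), so a DataFrame pickled under one version may fail to unpickle under another with exactly this `ImportError`. A minimal local sketch of the round trip Spark performs:

```python
import pickle
import pandas

# A DataFrame like the one the closure captures.
temp_vars = pandas.DataFrame(columns=['A', 'B', 'C'])

# What Spark does under the hood: serialize on the driver...
payload = pickle.dumps(temp_vars)

# ...and deserialize on the executor. With matching pandas versions this
# works; with mismatched versions it can raise ImportError: No module
# named indexes.base.
restored = pickle.loads(payload)
print(list(restored.columns))
```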



Please suggest how I can make this work.

Thanks a lot. 

Posts: 10
Registered: 12-05-2017

Re: error using Pandas within PySpark transformation code

Can someone please help me solve this issue? It is blocking our progress.

Cloudera Employee
Posts: 481
Registered: ‎08-11-2014

Re: error using Pandas within PySpark transformation code

This looks like a mismatch between the version of pandas you have on the driver (the one Spark uses when it serializes your code) and whatever version is installed with the Python workers on the executors.