Created on 12-10-2017 01:09 PM - edited 09-16-2022 05:37 AM
I am getting the error below when using pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside a PySpark transformation, they work without any problem.
Error:
ImportError: No module named indexes.base
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
...
The error points to the line where I call the RDD.map() transformation.
Sample code below:
from pyspark.context import SparkContext
import pandas

CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])
temp_vars = pandas.DataFrame(columns=['A', 'B', 'C'])

def processPeriods(period):
    global accum
    accum += 1
    temp_vars['prepay_probability'] = 0.000008
    temp_vars['CPR'] = 100 * (1 - (1 - temp_vars['prepay_probability']) ** 12)
    #return (100 * (1-0.000008) **12)
    return temp_vars['CPR']

nr_periods = 5
sc = SparkContext.getOrCreate()
periodListRDD = sc.parallelize(range(1, nr_periods))
accum = sc.accumulator(0)
rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()
print "rdd_list = ", rdd_list
CPR_loans.append(rdd_list)
Please suggest how I can make this work.
Thanks a lot.
Created 12-22-2017 03:43 AM
Can someone please help me solve this issue? It is blocking our progress.
Created 12-22-2017 09:18 AM
This looks like a mismatch between the version of pandas Spark uses on the driver and whatever version is installed on the worker nodes where the executors run.
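One way to confirm this is to compare the pandas version the driver sees with the version the executors actually import. A minimal sketch, assuming sc is the SparkContext from your snippet:

import pandas
print "driver pandas:", pandas.__version__

def executor_pandas_version(_):
    # Imported inside the task, so this reports the executors' pandas.
    import pandas
    return pandas.__version__

# Run one lightweight task per default partition; distinct() collapses duplicates.
executor_versions = sc.parallelize(range(sc.defaultParallelism)) \
    .map(executor_pandas_version) \
    .distinct() \
    .collect()
print "executor pandas:", executor_versions

If the versions differ, installing the same pandas release on every node, or pointing the whole application at a single Python environment (for example via the PYSPARK_PYTHON environment variable or the spark.pyspark.python config), should resolve the ImportError. pandas has moved its index modules between releases (pandas.indexes became pandas.core.indexes in 0.20, for example), so a DataFrame pickled by one version can reference modules that do not exist in another, and your closure captures the global temp_vars DataFrame, which gets pickled on the driver and unpickled on every executor.

You could also sidestep pickling pandas objects entirely by returning plain Python values from the map function and building the DataFrame on the driver after collect(); a rough sketch along the lines of your snippet:

def processPeriods(period):
    prepay_probability = 0.000008
    # Same formula as the original snippet, on plain floats.
    return 100 * (1 - (1 - prepay_probability) ** 12)

cpr_values = sc.parallelize(range(1, nr_periods)).map(processPeriods).collect()
CPR_loans = pandas.DataFrame({"CPR": cpr_values})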