<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>error using Pandas within PySpark transformation code - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/62653#M72518</link>
    <description>Forum thread: ImportError when using pandas DataFrames inside a PySpark transformation; pandas works without problems outside the transformation.</description>
    <pubDate>Fri, 16 Sep 2022 12:37:30 GMT</pubDate>
    <dc:creator>toamitjain</dc:creator>
    <dc:date>2022-09-16T12:37:30Z</dc:date>
    <item>
      <title>error using Pandas within PySpark transformation code</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/62653#M72518</link>
      <description>&lt;P&gt;I am getting the error below when using pandas DataFrames inside PySpark transformation code. When I use pandas DataFrames anywhere outside a &lt;SPAN&gt;PySpark transformation, they work without any problem.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Error:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; ImportError: No module named indexes.base&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)&lt;BR /&gt;&amp;nbsp; &amp;nbsp; at org.apache.spark.api.python.PythonRunner$$anon$1.&amp;lt;init&amp;gt;(PythonRDD.scala:234)&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; ...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The error points to the line where I call the RDD.map() transformation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sample code below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;from pyspark.context import SparkContext&lt;BR /&gt;import pandas&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;CPR_loans = pandas.DataFrame(columns=["CPR", "loans"])&lt;BR /&gt;temp_vars = pandas.DataFrame(columns=['A','B','C'])&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;def processPeriods(period):&lt;BR /&gt;&amp;nbsp; &amp;nbsp; global accum&lt;BR /&gt;&amp;nbsp; &amp;nbsp; accum+=1&lt;BR /&gt;&amp;nbsp; &amp;nbsp; temp_vars['prepay_probability'] = 0.000008&lt;BR /&gt;&amp;nbsp; &amp;nbsp; temp_vars['CPR'] = 100 * (1- (1- temp_vars['prepay_probability'] ) **12 )&lt;BR /&gt;&amp;nbsp; &amp;nbsp; #return (100 * (1-0.000008) **12)&lt;BR /&gt;&amp;nbsp; &amp;nbsp; return temp_vars['CPR']&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;nr_periods=5&lt;BR /&gt;sc = SparkContext.getOrCreate()&lt;BR /&gt;periodListRDD = sc.parallelize(range(1, nr_periods))&lt;BR /&gt;accum = sc.accumulator(0)&lt;/P&gt;&lt;P&gt;rdd_list = periodListRDD.map(lambda period: processPeriods(period)).collect()&lt;BR /&gt;print "rdd_list = ", rdd_list&lt;BR /&gt;CPR_loans.append( rdd_list )&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please suggest how I can make this work?&lt;/P&gt;&lt;P&gt;Thanks a lot.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 12:37:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/62653#M72518</guid>
      <dc:creator>toamitjain</dc:creator>
      <dc:date>2022-09-16T12:37:30Z</dc:date>
    </item>
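As a side note on the quoted code itself (independent of the ImportError): one way to sidestep pandas-serialization issues entirely is to keep pandas out of the mapped function, return plain Python objects from the executors, and build the DataFrame once on the driver after `collect()`. The sketch below follows that idea under the assumption that the goal is one CPR value per period; the names `process_period` and the `"period"` column are mine, not from the thread, and the Spark wiring is shown only in comments.

```python
import pandas

def process_period(period, prepay_probability=0.000008):
    # CPR formula from the question: annualize a monthly prepayment probability.
    # Plain floats and dicts pickle cleanly to executors; no pandas needed here.
    cpr = 100 * (1 - (1 - prepay_probability) ** 12)
    return {"period": period, "CPR": cpr}

# In Spark this would be:
#   rows = sc.parallelize(range(1, nr_periods)).map(process_period).collect()
# Shown here with a local loop so the sketch runs without a cluster:
rows = [process_period(p) for p in range(1, 5)]

# Build the DataFrame once, on the driver, from the collected rows.
CPR_loans = pandas.DataFrame(rows, columns=["period", "CPR"])
```

Note that `CPR_loans.append(rdd_list)` in the original code would also be a no-op as written, since `DataFrame.append` returns a new frame rather than modifying in place; constructing the frame from the collected rows avoids that pitfall too.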
    <item>
      <title>Re: error using Pandas within PySpark transformation code</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63001#M72519</link>
      <description>&lt;P&gt;Can someone please help me solve this issue? It is blocking our progress.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Dec 2017 11:43:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63001#M72519</guid>
      <dc:creator>toamitjain</dc:creator>
      <dc:date>2017-12-22T11:43:56Z</dc:date>
    </item>
    <item>
      <title>Re: error using Pandas within PySpark transformation code</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63013#M72520</link>
      <description>&lt;P&gt;This looks like a mismatch between the version of pandas on the driver (the one your code is pickled against) and whatever version is installed on the worker nodes where the executors run.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Dec 2017 17:18:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/error-using-Pandas-within-PySpark-transformation-code/m-p/63013#M72520</guid>
      <dc:creator>srowen</dc:creator>
      <dc:date>2017-12-22T17:18:53Z</dc:date>
    </item>
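The version mismatch suggested in the answer above can be checked directly: have each executor report the pandas version its Python actually imports and compare it with the driver's. This is a minimal sketch assuming a live `SparkContext` (`sc`, e.g. from a pyspark shell); `executor_pandas_version` and `check_versions` are hypothetical helper names, not Spark API.

```python
import pandas

def executor_pandas_version(_):
    # Runs on an executor: report the pandas version that worker's Python imports.
    import pandas as worker_pandas
    return worker_pandas.__version__

def check_versions(sc, partitions=4):
    # Compare the driver's pandas with the set of versions seen across executors.
    driver_version = pandas.__version__
    executor_versions = set(
        sc.parallelize(range(partitions), partitions)
          .map(executor_pandas_version)
          .collect()
    )
    return driver_version, executor_versions

# Usage, e.g. in a pyspark shell:
#   driver_version, executor_versions = check_versions(sc)
# If executor_versions != {driver_version}, a DataFrame pickled on the driver can
# fail to unpickle on a worker with an error like "No module named indexes.base",
# because pandas moved its internal module layout between releases.
```

Aligning the environments (same pandas on every node, or shipping a consistent Python environment with the job) is the usual fix once a mismatch is confirmed.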
  </channel>
</rss>