I am facing a lot of issues integrating/adding Pyspark dataframes to existing Pandas code.
1) If I convert Pandas dataframes to Pyspark dataframes, multiple operations do not translate well since Pyspark dataframes do not seem to be as rich as Pandas dataframes.
2) If I choose to use Pyspark dataframes and Pandas to handle different datasets within the same code, Pyspark transformations (like map) do not seem to work at all when the function called through map contains any pandas dataframes.
I have existing Python code that uses pandas, numpy and Impala, and it works fine on a single machine. My initial attempt to translate the entire code to Spark dataframes failed, since Spark dataframes do not support many operations that Pandas does.
Now I am trying to apply Pyspark to the existing code to benefit from its distributed computation. I am using Spark 2.1.0 (Cloudera parcel) and the Anaconda distribution with Python 2.7.14.
Are Pyspark and Pandas certified to work together? Are there any good references where I can find documentation and examples of using them together?
Your responses will be highly appreciated.
No need to ping. As far as I know, nobody certifies pandas-Spark integration. We support Pyspark, which has minimal integration with pandas (e.g. the toPandas method). If there were a Pyspark-side issue we'd try to fix it, but we don't support pandas.
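For what it's worth, one pattern that may help with point 2 is to build the pandas dataframe inside the function that Spark calls, one partition at a time, rather than referencing a driver-side pandas dataframe from within map. The following is only a rough sketch and not a supported recipe: the column names "key" and "value", the helper names summarize_partition and run_on_spark, and the sample data are all made up for illustration, and it assumes pandas is installed on every executor node.

```python
import pandas as pd


def summarize_partition(records):
    """Sum the hypothetical 'value' column per 'key' for one partition.

    `records` is an iterator of plain dicts; pandas is only used locally,
    inside the task, so no pandas object has to be shipped from the driver.
    """
    records = list(records)
    if not records:
        # Spark can hand over empty partitions; return nothing for those.
        return iter([])
    pdf = pd.DataFrame(records)
    totals = pdf.groupby("key")["value"].sum()
    # Convert back to plain Python tuples so Spark can serialize the output.
    return iter([(k, float(v)) for k, v in zip(totals.index, totals.values)])


def run_on_spark():
    """Illustrative driver code; needs a working Spark installation."""
    # Imported lazily so summarize_partition can be tried without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas-per-partition").getOrCreate()
    # Made-up sample data, converted from pandas to a Spark dataframe.
    sdf = spark.createDataFrame(
        pd.DataFrame({"key": ["a", "b", "a"], "value": [1.0, 2.0, 3.0]})
    )
    partials = (
        sdf.rdd
        .map(lambda row: row.asDict())        # Row -> dict
        .mapPartitions(summarize_partition)   # pandas runs per partition
    )
    # A key may appear in several partitions, so combine the partial sums.
    return dict(partials.reduceByKey(lambda x, y: x + y).collect())
```

The per-partition function can be checked on a single machine by passing it an ordinary iterator of dicts before involving Spark at all. Going the other way, toPandas collects the whole Spark dataframe onto the driver, so it is best kept for results that are already small.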