<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Spark R and Python Libraries in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146661#M19970</link>
    <description>&lt;P&gt;It appears Spark does not leverage the full R and Python libraries. I would like to understand why. Any feedback?&lt;/P&gt;</description>
    <pubDate>Thu, 18 Feb 2016 05:26:12 GMT</pubDate>
    <dc:creator>sunile_manjee</dc:creator>
    <dc:date>2016-02-18T05:26:12Z</dc:date>
    <item>
      <title>Spark R and Python Libraries</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146661#M19970</link>
      <description>&lt;P&gt;It appears Spark does not leverage the full R and Python libraries. I would like to understand why. Any feedback?&lt;/P&gt;</description>
      <pubDate>Thu, 18 Feb 2016 05:26:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146661#M19970</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-02-18T05:26:12Z</dc:date>
    </item>
    <item>
      <title>Re: Spark R and Python Libraries</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146662#M19971</link>
      <description>&lt;P&gt;Spark provides PySpark, a Python API that acts as a wrapper around Spark's Scala-based libraries. It also provides a REPL interface for the Python interpreter. If you launch pySpark, you will be able to import whatever Python libraries you have installed locally, i.e. Python imports should work. Specifically (from the docs):&lt;/P&gt;&lt;P&gt;PySpark requires Python 2.6 or higher. PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. We have not tested PySpark with Python 3 or with alternative Python interpreters, such as &lt;A href="http://pypy.org/"&gt;PyPy&lt;/A&gt; or &lt;A href="http://www.jython.org/"&gt;Jython&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;By default, PySpark requires &lt;CODE&gt;python&lt;/CODE&gt; to be available on the system &lt;CODE&gt;PATH&lt;/CODE&gt; and uses it to run programs; an alternate Python executable may be specified by setting the &lt;CODE&gt;PYSPARK_PYTHON&lt;/CODE&gt; environment variable in &lt;CODE&gt;conf/spark-env.sh&lt;/CODE&gt; (or &lt;CODE&gt;.cmd&lt;/CODE&gt; on Windows).&lt;/P&gt;&lt;P&gt;All of PySpark’s library dependencies, including &lt;A href="http://py4j.sourceforge.net/"&gt;Py4J&lt;/A&gt;, are bundled with PySpark and automatically imported.&lt;/P&gt;&lt;P&gt;Standalone PySpark applications should be run using the &lt;CODE&gt;bin/pyspark&lt;/CODE&gt; script, which automatically configures the Java and Python environment using the settings in &lt;CODE&gt;conf/spark-env.sh&lt;/CODE&gt; or &lt;CODE&gt;.cmd&lt;/CODE&gt;. The script automatically adds the &lt;CODE&gt;pyspark&lt;/CODE&gt; package to the &lt;CODE&gt;PYTHONPATH&lt;/CODE&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Feb 2016 06:46:58 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146662#M19971</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2016-02-18T06:46:58Z</dc:date>
    </item>
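The reply above notes that an alternate Python executable can be selected via the PYSPARK_PYTHON environment variable in conf/spark-env.sh. A minimal sketch of that fragment follows; the interpreter path is a hypothetical example and must point at a Python actually installed on every node:

```shell
# conf/spark-env.sh -- sketch only.
# When PYSPARK_PYTHON is unset, PySpark falls back to whichever
# `python` is found on the system PATH.
# The path below is a hypothetical example; substitute your own
# interpreter (it must exist on every worker node).
export PYSPARK_PYTHON=/usr/local/bin/python2.7
```

Any locally installed libraries visible to that interpreter can then be imported inside pySpark sessions, as described in the quoted docs.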
    <item>
      <title>Re: Spark R and Python Libraries</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146663#M19972</link>
      <description>&lt;P&gt;You mean the R functions you can use on SparkR dataframes? The problem here is that R functions used on those dataframes need to be translated into Spark functions; otherwise they would not run in parallel inside the engine. So only a subset is available.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Feb 2016 07:11:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146663#M19972</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-18T07:11:07Z</dc:date>
    </item>
    <item>
      <title>Re: Spark R and Python Libraries</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146664#M19973</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/168/bleonhardi.html" nodeid="168"&gt;@Benjamin Leonhardi&lt;/A&gt; Please excuse my lack of expertise in spark.  If only a subset of R functions are available due to the translation into spark functions, what are the alternativeS to run the R functions which do not translate into spark functions?&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2016 02:18:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146664#M19973</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-02-19T02:18:04Z</dc:date>
    </item>
    <item>
      <title>Re: Spark R and Python Libraries</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146665#M19974</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1486/smanjee.html" nodeid="1486"&gt;@Sunile Manjee&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It depends. You can run any R function, but only a subset is supported directly on the dataframe. R functions are normally not parallelized so to have true parallel aggregations he needs to translate them into Spark code.&lt;/P&gt;&lt;P&gt;- You can always filter first in Spark and then copy your sparkr dataframe into a local normal R data frame using as.data.frame. &lt;/P&gt;&lt;P&gt;- Other similar tools support the execution of R code on rows/groups of data inside the cluster ( groupApply, TableApply, RowApply in other mapreduce frameworks ) however I do not see a way to do that in Spark they do not seem to have an R library distributed to every node but I might be wrong others can correct me.&lt;/P&gt;&lt;P&gt;- You always have the option to directly execute R from Scala and then do the grouping yourself but that will be a lot of effort &lt;/P&gt;&lt;P&gt;&lt;A href="https://cran.r-project.org/web/packages/rscala/"&gt;https://cran.r-project.org/web/packages/rscala/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2016 20:32:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-R-and-Python-Libraries/m-p/146665#M19974</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-02-19T20:32:41Z</dc:date>
    </item>
  </channel>
</rss>

