Member since: 05-06-2014
Posts: 14
Kudos Received: 3
Solutions: 3

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1998 | 07-08-2016 05:04 PM |
| | 3314 | 07-06-2016 02:42 AM |
| | 2044 | 07-04-2016 08:54 AM |
07-08-2016
05:04 PM
1 Kudo
The easiest way I know to get Spark working with IPython and the Jupyter Notebook is to set the following two environment variables, as described in the book "Learning Spark": IPYTHON=1 and IPYTHON_OPTS="notebook". Afterwards, run ./bin/pyspark. NB: you can pass additional Jupyter options through IPYTHON_OPTS; a quick search will turn them up.
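For what it's worth, the same launch can also be scripted; below is a minimal sketch in Python that just exports the two variables and calls bin/pyspark. The Spark home path is a hypothetical placeholder, and it assumes a Spark 1.x layout where bin/pyspark honours IPYTHON and IPYTHON_OPTS.

    # Hypothetical sketch: launch PySpark under the Jupyter Notebook by exporting
    # IPYTHON=1 and IPYTHON_OPTS="notebook" before calling bin/pyspark.
    import os
    import subprocess

    SPARK_HOME = "/opt/spark"  # hypothetical installation path
    env = dict(os.environ, IPYTHON="1", IPYTHON_OPTS="notebook")
    subprocess.run([os.path.join(SPARK_HOME, "bin", "pyspark")], env=env, check=True)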
07-07-2016
07:55 AM
I would recommend the SparkR package, which works similarly to the dplyr package. I find it a lot easier to use than RHadoop, which is still based on MapReduce under the hood, and the big data community is moving rapidly towards Spark. For more information about SparkR, please see the Cloudera community post below: https://community.cloudera.com/t5/Data-Science-and-Machine/Spark-R-in-Cloudera-5-3-0/td-p/37706
07-07-2016
12:26 AM
I am not sure whether what you want to achieve is possible yet with different virtual envs on the master and worker nodes. However, you could try creating virtual envs at the same location on all nodes using Ansible or Puppet. Then modify spark-env.sh, which is executed on every node when a Spark job runs: activate the desired virtual env there and set the environment variable PYSPARK_PYTHON to the location of the Python interpreter in that env (see the sketch below). An alternative could be to run YARN with Docker containers, although that requires some research to get working; the idea would be to have the Spark driver and executors running in Docker containers that ship the desired Python libraries. Fingers crossed 😉
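As a rough illustration, here is a sketch that sets PYSPARK_PYTHON from the driver script instead of spark-env.sh; it assumes a virtualenv exists at the same (hypothetical) path on every node, e.g. provisioned with Ansible or Puppet as described above.

    # Hypothetical sketch: /opt/venvs/myenv is assumed to exist on every node.
    import os
    os.environ["PYSPARK_PYTHON"] = "/opt/venvs/myenv/bin/python"  # hypothetical path

    from pyspark import SparkConf, SparkContext

    def interpreter(_):
        # runs on the executors; reports which Python binary each task uses
        import sys
        return sys.executable

    sc = SparkContext(conf=SparkConf().setAppName("venv-check"))
    print(sc.parallelize(range(2), 2).map(interpreter).collect())
    sc.stop()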
07-06-2016
08:19 AM
It may be a bit of a long shot, but you could mount the directories of the remote server on your local server using Samba and then copy the files to HDFS from the command line.
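For illustration only, a minimal sketch of the copy step, assuming the share is already mounted at a hypothetical /mnt/remote and the standard hdfs CLI is available:

    # Hypothetical sketch: push files from a locally mounted Samba share into HDFS.
    import subprocess

    subprocess.run(
        ["hdfs", "dfs", "-put", "/mnt/remote/data", "/user/cloudera/data"],
        check=True,
    )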
07-06-2016
08:13 AM
I would advise using IPython's debugger, ipdb, which lets you step through every statement one at a time:
* http://quant-econ.net/py/ipython.html#debugging
* https://docs.python.org/3/library/pdb.html
Finally, regarding the other statements above: when you use Anaconda's IPython, remember to set the environment variable PYSPARK_PYTHON to the location of ipython (e.g. /usr/bin/ipython) so PySpark knows where to find it. Good luck.
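A minimal, self-contained example of stepping through a function with ipdb (the function and values are just an illustration):

    # Execution pauses at set_trace(); at the prompt use n (next), s (step),
    # p <expr> (print an expression) and c (continue).
    import ipdb

    def running_total(values):
        total = 0
        for v in values:
            total += v
        return total

    ipdb.set_trace()
    print(running_total([1, 2, 3]))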
07-06-2016
02:42 AM
1 Kudo
(1) I would start by loading the SparkR package into RStudio so you can make use of it. See the following link, under the heading "Using SparkR from RStudio": https://github.com/apache/spark/tree/master/R
(2) Now you are ready to run through the following tutorial. However, instead of reading the data from HDFS, load it from your local file system: http://www.r-bloggers.com/a-first-look-at-spark/
(3) Study the SparkR Guide to gain more in-depth knowledge: http://spark.apache.org/docs/latest/sparkr.html
(4) Study Spark itself (DataFrames, RDDs, etc.), for example with the O'Reilly book "Learning Spark". I find that it always helps to understand how something works under the hood. The same holds for SparkR: you can easily find videos on YouTube that explain how it works under the hood, especially the distributed character of SparkR + Spark.
07-04-2016
08:54 AM
1 Kudo
If I may take a different approach to your problem, I would use Spark for the job: load the data of each file into a separate Spark DataFrame, add a new column with the desired value, and write everything back to HDFS, preferably in a format such as Parquet compressed with Snappy. See the sketch below.
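A minimal PySpark sketch of that flow, using the Spark 2.x DataFrame API and assuming CSV-like input; the paths, column name and values are hypothetical placeholders:

    # One DataFrame per input file, a constant column added, result written back
    # to HDFS as Snappy-compressed Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("add-column-write-parquet").getOrCreate()

    inputs = [("/data/in/file1.csv", "value1"),   # (path, desired value) placeholders
              ("/data/in/file2.csv", "value2")]

    for path, value in inputs:
        df = spark.read.csv(path, header=True)         # load one file into its own DataFrame
        df = df.withColumn("extra_col", F.lit(value))  # add the new column with the desired value
        out = path.replace("/in/", "/out/").replace(".csv", ".parquet")
        df.write.mode("overwrite").option("compression", "snappy").parquet(out)

    spark.stop()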
07-10-2014
02:19 AM
I got a reply on JIRA that hbase-site.xml is read from the HBASE_CLASSPATH. I have set this environment variable in hbase-env.sh, but it has not worked successfully yet. "hbase-site.xml is read from the HBASE_CLASSPATH."
07-08-2014
02:30 AM
Thanks for the tip! This indeed works, but the solution we are searching for is to have hbase-site.xml picked up automatically for our different environments. By the way, I submitted a bug to the HBase JIRA: https://issues.apache.org/jira/browse/HBASE-11478. Let's hope we can fix this quickly.
07-03-2014
09:35 AM
Dear Experts,

We have written a Pig Java UDF which fetches records one at a time from HBase:

    Configuration config = HBaseConfiguration.create();
    // This instantiates an HTable object that connects you to the table
    HTable table = new HTable(config, hbaseTable);
    // Get the HBase row for the corresponding rowkey
    Get hbaseRow = new Get(Bytes.toBytes(rowkey));
    Result resultRow = table.get(hbaseRow);

However, when we run and use our UDF in Pig, it cannot find hbase-site.xml and looks for our zookeeperQuorum on localhost instead of what is specified in the config file. When we use the Piggybank HBaseStorage, we don't have any problems. I have tried setting PIG_CLASSPATH, PIG_OPTS, etc., but it doesn't work. I would appreciate your help!

Greetings, Mark
06-02-2014
01:51 AM
Hi,

Could you try it with the following? Here I added an extra slash:

    records = load 'hdfs:///localhost:8020/user/cloudera/Employee_pig' AS (A:chararray, B:chararray, C:chararray, D:chararray);

Or with:

    records = load '/user/cloudera/Employee_pig';

Cheers, Mark
06-02-2014
01:49 AM
Is it possible to write a UDF that works as follows?

    B = load ...
    A = PigUDF(B)

I keep getting the following error: "Cannot expand macro 'PigUDF'. Reason: Macro must be defined before expansion." Thank you.
05-20-2014
12:16 AM
Dear Experts,

I have read that external jars or libraries used in Java Pig UDFs can be registered in the following ways:
* REGISTER ../path/library.jar
* pig -Dpig.additional.jars=/local/path/to/your.jar

However, I was writing my SHA1 UDF based on Apache Commons Codec and wasn't successful in registering it: I got an error that the method sha1Hex could not be found. I had the following Maven dependencies:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.pig</groupId>
      <artifactId>piggybank</artifactId>
      <version>0.12.1</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.9</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.3.2</version>
    </dependency>

Maybe the issue lies in dependency conflicts: both Apache commons-lang3 and commons-codec are also contained in the piggybank dependency, but those versions are pretty old. Eventually I ended up deleting the two Apache dependencies and using the libs/jars already contained in Piggybank. Why wasn't I able to REGISTER my external jars? How can I register them successfully? Thanks for your help!

Greetings, Mark
05-07-2014
12:02 AM
Dear Experts,

Would you happen to have successful cases where you override HBase timestamps with your own timestamps? Throughout the documentation this is not advised; however, we are importing historical data into HBase. I would be grateful if you have a step-by-step guide or could help us determine the consequences. Does an HBase scan using a timerange query perform a full table scan?

Thank you, Mark