Installing and Exploring Spark 2.0 with Jupyter Notebook and Anaconda Python on your laptop
# Download the Anaconda installer (Python 3 or Python 2; only one is needed.
# On macOS, swap Linux-x86_64 for MacOSX-x86_64 in the file name.)
HW12256:~ usr000$ wget http://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
HW12256:~ usr000$ wget http://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh

# Run the installer and follow the prompts:
HW12256:~ usr000$ bash Anaconda3-4.2.0-Linux-x86_64.sh
HW12256:~ usr000$ bash Anaconda2-4.2.0-Linux-x86_64.sh
HW12256:~ usr000$ which python
/Users/usr000/anaconda/bin/python
HW12256:~ usr000$ echo $PATH
/Users/usr000/anaconda/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
HW12256:~ usr000$ python --version
Python 3.5.2 :: Anaconda 4.1.1 (x86_64)
HW12256:~ usr000$ python
Python 3.5.2 |Anaconda 4.1.1 (x86_64)| (default, Jul  2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print("Python version: {}".format(sys.version))
Python version: 3.5.2 |Anaconda 4.1.1 (x86_64)| (default, Jul  2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
>>> from datetime import datetime
>>> print('current date and time: {}'.format(datetime.now()))
current date and time: 2016-12-29 09:46:32.393985
>>> print('current date and time: {}'.format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
current date and time: 2016-12-29 09:51:33
>>> exit()
HW12256:bin usr000$ tar -xvzf spark-2.0.2-bin-hadoop2.7.tgz
HW12256:bin usr000$ ln -s ~/bin/sparks/spark-2.0.2-bin-hadoop2.7 ~/bin/spark2
HW12256:bin usr000$ ls -lru
total 16
drwxr-xr-x  5 usr000  staff  170 Dec 28 10:39 sparks
lrwxr-xr-x  1 usr000  staff   50 Dec 28 10:39 spark2 -> /Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
lrwxr-xr-x  1 usr000  staff   51 May 23  2016 spark -> /Users/usr000/bin/sparks/spark-1.6.1-bin-hadoop2.6/
HW12256:bin usr000$ cd spark2
HW12256:spark2 usr000$ ls -lru
total 112
drwxr-xr-x@   3 usr000  staff    102 Jan  1  1970 yarn
drwxr-xr-x@  24 usr000  staff    816 Jan  1  1970 sbin
drwxr-xr-x@  10 usr000  staff    340 Dec 28 10:30 python
drwxr-xr-x@  38 usr000  staff   1292 Jan  1  1970 licenses
drwxr-xr-x@ 208 usr000  staff   7072 Dec 28 10:30 jars
drwxr-xr-x@   4 usr000  staff    136 Jan  1  1970 examples
drwxr-xr-x@   5 usr000  staff    170 Jan  1  1970 data
drwxr-xr-x@   9 usr000  staff    306 Dec 28 10:27 conf
drwxr-xr-x@  24 usr000  staff    816 Dec 28 10:30 bin
-rw-r--r--@   1 usr000  staff    120 Dec 28 10:25 RELEASE
-rw-r--r--@   1 usr000  staff   3828 Dec 28 10:25 README.md
drwxr-xr-x@   3 usr000  staff    102 Jan  1  1970 R
-rw-r--r--@   1 usr000  staff  24749 Dec 28 10:25 NOTICE
-rw-r--r--@   1 usr000  staff  17811 Dec 28 10:25 LICENSE
HW12256:spark2 usr000$
# export SPARK_HOME
HW12256:spark2 usr000$ export SPARK_HOME=/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
HW12256:spark2 usr000$ echo $SPARK_HOME
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7

# Run Spark Pi example in Scala
HW12256:spark2 usr000$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --executor-cores 1 $SPARK_HOME/examples/jars/spark-examples*.jar 5

# Python command
HW12256:spark2 usr000$ ./bin/spark-submit --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/src/main/python/pi.py 10

# Scala example (with output)
HW12256:spark2 usr000$ export SPARK_HOME=/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
HW12256:spark2 usr000$ echo $SPARK_HOME
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
HW12256:spark2 usr000$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --executor-cores 1 $SPARK_HOME/examples/jars/spark-examples*.jar 5
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/29 11:40:53 INFO SparkContext: Running Spark version 2.0.2
16/12/29 11:40:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
16/12/29 11:40:55 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.851288 s
Pi is roughly 3.1390942781885562
16/12/29 11:40:55 INFO SparkUI: Stopped Spark web UI at http://000.000.0.0:4040
...
16/12/29 11:40:55 INFO SparkContext: Successfully stopped SparkContext
16/12/29 11:40:55 INFO ShutdownHookManager: Shutdown hook called
16/12/29 11:40:55 INFO ShutdownHookManager: Deleting directory /private/var/folders/1r/8qylt4bj4h59b3h_1xq_nsw00000gp/T/spark-35b67f21-1d52-4dee-9c75-7e9d9c153ada
HW12256:spark2 usr000$
HW12256:spark2 usr000$ ./bin/spark-submit examples/src/main/python/pi.py 10
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/29 11:27:33 INFO SparkContext: Running Spark version 2.0.2
16/12/29 11:27:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
16/12/29 11:27:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/29 11:27:36 INFO DAGScheduler: Job 0 finished: reduce at /Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/examples/src/main/python/pi.py:43, took 1.199257 s
Pi is roughly 3.138360
16/12/29 11:27:36 INFO SparkUI: Stopped Spark web UI at http://000.000.0.0:4040
16/12/29 11:27:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
...
16/12/29 11:27:36 INFO SparkContext: Successfully stopped SparkContext
16/12/29 11:27:37 INFO ShutdownHookManager: Shutdown hook called
16/12/29 11:27:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/1r/8qylt4bj4h59b3h_1xq_nsw00000gp/T/spark-eb12faa9-b7ff-4556-9538-45ddcdc6797b
16/12/29 11:27:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/1r/8qylt4bj4h59b3h_1xq_nsw00000gp/T/spark-eb12faa9-b7ff-4556-9538-45ddcdc6797b/pyspark-ba9947c5-dbea-4edc-9c4c-c2c316e6caba
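For orientation, the bundled pi.py does a Monte-Carlo estimate of Pi. Below is a minimal standalone sketch of the same idea, not the shipped script; the file name pi_sketch.py and the app name are made up here, and it can be submitted the same way (./bin/spark-submit pi_sketch.py 10).

# pi_sketch.py -- a hypothetical, minimal Monte-Carlo Pi estimate
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PiSketch").getOrCreate()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # Sample a point in the square [-1, 1] x [-1, 1] and test whether it lands in the unit circle
        x, y = random() * 2 - 1, random() * 2 - 1
        return 1 if x * x + y * y <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()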
HW12256:spark2 usr000$ ./bin/pyspark
Python 2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/29 12:25:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> import os
>>> print(os.getcwd())
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
>>> import re
>>> from operator import add
>>> wordcounts_in = sc.textFile('README.md').flatMap(lambda l: re.split('\W+', l.strip())).filter(lambda w: len(w)>0).map(lambda w: (w,1)).reduceByKey(add).map(lambda (a,b): (b,a)).sortByKey(ascending = False)
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
>>> wordcounts_in.take(10)
[(23, u'the'), (18, u'Spark'), (14, u'to'), (13, u'run'), (11, u'for'), (11, u'apache'), (11, u'spark'), (11, u'and'), (11, u'org'), (8, u'a')]
>>> wordcounts_in = sc.textFile('README.md').flatMap(lambda l: re.split('\W+', l.strip())).filter(lambda w: len(w)>0).map(lambda w: (w,1)).reduceByKey(add).map(lambda (a,b): (b,a)).sortByKey(ascending = False).map(lambda (a,b): (b,a))
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
>>> wordcounts_in.take(10)
[(u'the', 23), (u'Spark', 18), (u'to', 14), (u'run', 13), (u'for', 11), (u'apache', 11), (u'spark', 11), (u'and', 11), (u'org', 11), (u'a', 8)]
>>> exit()
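Note that the lambdas above rely on tuple unpacking (lambda (a,b): ...), which Python 2 accepts but Python 3 rejects. A Python 3-friendly sketch of the same word count, indexing into the tuple instead of unpacking it, would look like this:

import re
from operator import add

wordcounts = (sc.textFile('README.md')
                .flatMap(lambda l: re.split(r'\W+', l.strip()))
                .filter(lambda w: len(w) > 0)
                .map(lambda w: (w, 1))
                .reduceByKey(add)
                .map(lambda kv: (kv[1], kv[0]))      # (word, count) -> (count, word) so we can sort by count
                .sortByKey(ascending=False)
                .map(lambda cw: (cw[1], cw[0])))     # back to (word, count)
wordcounts.take(10)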
# Launch Jupyter Notebook against Spark 1.6 (the ~/bin/spark symlink), pulling in the external spark-csv package:
HW12256:~ usr000$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /Users/usr000/bin/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0

# Launch Jupyter Notebook against Spark 2.0 (the ~/bin/spark2 symlink); CSV reading is built in:
HW12256:~ usr000$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /Users/usr000/bin/spark2/bin/pyspark
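These environment variables only affect the shell they are typed in. As an alternative not covered in the original walkthrough, the third-party findspark package lets a normally launched Jupyter kernel pick up a Spark installation; a rough sketch, assuming findspark has been pip-installed and using the spark2 symlink created earlier:

# Sketch only: findspark adds Spark's python/ directories to sys.path,
# so pyspark can be imported from a regular Jupyter kernel.
import findspark
findspark.init('/Users/usr000/bin/spark2')   # path to the Spark 2.0.2 symlink

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook-test").getOrCreate()
print(spark.version)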
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf
# from pyspark.sql import SQLContext

spark = SparkSession\
    .builder\
    .appName("example-spark")\
    .config("spark.sql.crossJoin.enabled", "true")\
    .getOrCreate()

# In Spark 2.0 the SparkContext already lives inside the session; calling
# SparkContext() again here would fail with "Cannot run multiple SparkContexts at once".
sc = spark.sparkContext
# sqlContext = SQLContext(sc)
spark.sparkContext.appName

# Configuration is accessible using RuntimeConfig:
from py4j.protocol import Py4JError

try:
    spark.conf.get("some.conf")
except Py4JError as e:
    pass
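The same RuntimeConfig interface can also set options at runtime and read them back; a small sketch, where spark.some.option is just a placeholder key:

spark.conf.set("spark.some.option", "some-value")   # set a runtime option
print(spark.conf.get("spark.some.option"))          # reads back "some-value"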
HW12256:spark2 usr000$ ./bin/pyspark
Python 2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/29 20:41:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x101e9c850>
>>> sc._conf.getAll()
[(u'spark.app.id', u'local-1483040488671'), (u'spark.sql.catalogImplementation', u'hive'), (u'spark.rdd.compress', u'True'), (u'spark.serializer.objectStreamReset', u'100'), (u'spark.master', u'local[*]'), (u'spark.executor.id', u'driver'), (u'spark.submit.deployMode', u'client'), (u'hive.metastore.warehouse.dir', u'file:/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/spark-warehouse'), (u'spark.driver.port', u'57764'), (u'spark.app.name', u'PySparkShell'), (u'spark.driver.host', u'000.000.0.0')]
>>> spark
<pyspark.sql.session.SparkSession object at 0x102df9b50>
>>> spark.sparkContext
<pyspark.context.SparkContext object at 0x101e9c850>
>>> spark.sparkContext.appName
u'PySparkShell'
>>> from pyspark.sql.functions import *
>>> spark.range(1, 7, 2).collect()
16/12/29 20:58:32 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/12/29 20:58:32 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
[Row(id=1), Row(id=3), Row(id=5)]
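Since pyspark.sql.functions is already imported above, the same range can also be transformed with column expressions; a small illustrative sketch (the alias id_times_2 is made up for this example):

>>> from pyspark.sql.functions import col
>>> df = spark.range(1, 7, 2)
>>> df.select(col("id"), (col("id") * 2).alias("id_times_2")).show()
# prints a two-column table with rows (1, 2), (3, 6), (5, 10)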
#
# Spark 2.0 and Spark 1.6 compatible CSV read
#
formatPackage = "csv" if sc.version > '1.6' else "com.databricks.spark.csv"
df = sqlContext.read.format(formatPackage).options(header='true', delimiter='|').load("s00_dat/dataframe_sample.csv")
df.printSchema()
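In Spark 2.0 the same file can also be read with the built-in CSV reader on the SparkSession, with no external package; a sketch using the same sample path and delimiter as above:

# Built-in CSV support (Spark 2.0+): read directly through spark.read
df2 = spark.read \
    .option("header", "true") \
    .option("delimiter", "|") \
    .csv("s00_dat/dataframe_sample.csv")
df2.printSchema()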
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
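To move the iris data into Spark, the pandas DataFrame can be handed to createDataFrame; a sketch where the simplified column names and the variable sdf are assumptions added here, not part of the original notebook:

# Rename columns because the sklearn names contain spaces and parentheses,
# which are awkward to reference in Spark SQL.
df_spark_ready = df.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width',
})
df_spark_ready['species'] = df_spark_ready['species'].astype(str)  # Categorical -> plain strings

sdf = spark.createDataFrame(df_spark_ready)
sdf.printSchema()
sdf.groupBy('species').count().show()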