Member since: 04-11-2016
Posts: 38
Kudos Received: 13
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
| 49985 | 01-04-2017 11:43 PM
| 4087 | 09-05-2016 04:07 PM
| 10660 | 09-05-2016 03:50 PM
| 2447 | 08-30-2016 08:15 PM
| 4054 | 08-30-2016 01:01 PM
09-17-2021
03:19 AM
I tried to install elasticsearch-6.4.2 on my cluster (HDP 3.1, Ambari 2.7.3). The installation completed successfully, but the service could not start, and the following error was encountered:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/3.1/services/ELASTICSEARCH/package/scripts/es_master.py", line 168, in <module>
ESMaster().execute()
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 352, in execute
method(env)
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 1011, in restart
self.start(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.1/services/ELASTICSEARCH/package/scripts/es_master.py", line 153, in start
self.configure(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.1/services/ELASTICSEARCH/package/scripts/es_master.py", line 86, in configure
group=params.es_group
File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
self.env.run()
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
self.run_action(resource, action)
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
provider_action()
File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 123, in action_create
content = self._get_content()
File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 160, in _get_content
return content()
File "/usr/lib/ambari-agent/lib/resource_management/core/source.py", line 52, in __call__
return self.get_content()
File "/usr/lib/ambari-agent/lib/resource_management/core/source.py", line 144, in get_content
rendered = self.template.render(self.context)
File "/usr/lib/ambari-agent/lib/ambari_jinja2/environment.py", line 891, in render
return self.environment.handle_exception(exc_info, True)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.1/services/ELASTICSEARCH/package/templates/elasticsearch.master.yml.j2", line 93, in top-level template code
action.destructive_requires_name: {{action_destructive_requires_name}}
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/config_dictionary.py", line 73, in __getattr__
raise Fail("Configuration parameter '" + self.name + "' was not found in configurations dictionary!")
resource_management.core.exceptions.Fail: Configuration parameter 'hostname' was not found in configurations dictionary!
I modified the discovery.zen.ping.unicast.hosts property in elastic-config.xml and the hostname property in elasticsearch-env.xml. However, it still could not start and the same error was encountered. Do you have any idea?
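For reference, a minimal sketch of how such a parameter is typically resolved in an Ambari service's params.py; the config type and key below are assumptions inferred from the error, not the actual stack scripts:
# Hypothetical excerpt of the custom ELASTICSEARCH service's params.py.
# The Fail above is raised when a key referenced here is missing from the
# named config type, so the property must exist in the exact *-env / *-config
# type that params.py reads it from.
from resource_management.libraries.script.script import Script
from resource_management.libraries.functions.default import default

config = Script.get_config()

# Direct access fails with "Configuration parameter 'hostname' was not found..."
# if 'hostname' is not defined in the elasticsearch-env config type:
hostname = config['configurations']['elasticsearch-env']['hostname']

# A more tolerant pattern falls back to a default instead of failing hard:
hostname = default('/configurations/elasticsearch-env/hostname', 'localhost')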
04-21-2021
03:45 AM
Sorry, the maximum is 8060 characters.
05-26-2020
11:52 PM
@VidyaSargur Thank you for the response and the suggestion. I will create a new thread for my problem.
Edit:
I have created my new question here
Thanks and regards,
Wasif
03-25-2020
05:31 AM
Is it possible to define a STRUCT element whose name starts with an @ sign, e.g. "@site": "Los Angeles"? We can live with the column actually showing up as site rather than @site. If we can't do it in HiveQL syntax, we will have to preprocess the JSON to remove the @ sign, which would be annoying but doable.
05-30-2018
12:11 PM
@amit nandi Can you provide step-by-step instructions on how to install Anaconda for HDP?
12-26-2018
07:19 AM
@Aditya Sirna Can you please suggest how to remove the params? I tried, but I was unable to save the configuration and restart Storm.
09-27-2017
01:31 PM
1 Kudo
A machine learning model learns from data. As new incremental data arrives, the model needs to be updated. A machine learning model factory ensures that while a model is deployed in production, continuous learning also happens on the incremental new data ingested in the production environment. As the deployed model's performance decays, a newly trained and serialized model needs to be deployed in its place. An A/B test between the deployed model and the newly trained model scores both, so the performance of the deployed model can be evaluated against the incrementally trained one.
In order to build a machine learning model factory, we first have to establish a robust road to production. The foundational framework is to establish three environments: DEV, TEST and PROD.
1 - DEV: A development environment where the data scientists have their own data puddle to perform data exploration, profile the data, develop the machine learning features, build the model, train and test it on a limited subset, and then commit the code to git to transport it to the next stages. To scale and tune the learning of the model, we establish a DEV validation environment, where model learning is scaled with as much historical data as possible and tuned.
2 - TEST: A pre-production environment where we run the machine learning models through integration tests and ready the move to production in two branches:
2a - Model deployment: the trained, serialized machine learning model is deployed in the production environment.
2b - Continuous training: the machine learning model goes through continuous training on incremental data.
3 - PROD: The production environment, where live data is ingested. In production, a deployment server hosts the serialized trained model. The deployed model exposes a REST API to deliver predictions on live data queries.
The ML model code is running in production ingesting incremental live data and getting continuously trained.
The performance of both the deployed model and the continuously trained model is measured. If the deployed model shows decay in prediction performance, it is swapped for a newer serialized version of the continuously trained model.
Model performance can be tracked by closing the loop with user feedback and counting true positives, false positives, true negatives and false negatives. This choreography of training and deploying machine learning models in production is the heart of the ML model factory. The road to production depicts the journey of building machine learning models across the DEV/TEST/PROD environments. A minimal sketch of the swap decision is shown below.
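A minimal sketch of that champion/challenger check, assuming both models expose a scikit-learn-style predict() and that user feedback provides labelled examples:
# Champion/challenger check: assumes both models follow the scikit-learn
# predict() convention and that user feedback gives us ground-truth labels.
from sklearn.metrics import f1_score

def should_promote_challenger(champion, challenger, X_feedback, y_feedback, min_gain=0.02):
    """Return True if the continuously trained model beats the deployed one."""
    champion_f1 = f1_score(y_feedback, champion.predict(X_feedback), average='weighted')
    challenger_f1 = f1_score(y_feedback, challenger.predict(X_feedback), average='weighted')
    # Promote only when the challenger improves on the champion by a margin,
    # to avoid flapping between models on noise.
    return challenger_f1 >= champion_f1 + min_gain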
12-15-2017
08:33 AM
And how do we implement this? What are the steps to install it on an existing HDP cluster?
12-31-2016
10:45 PM
1 Kudo
Installing and Exploring Spark 2.0 with Jupyter Notebook and Anaconda Python on your laptop
1-Objective
2-Installing Anaconda Python
3-Checking Python Install
4-Installing Spark
5-Checking Spark Install
6-Launching Jupyter Notebook with PySpark 2.0.2
7-Exploring PySpark 2.0.2
a.Spark Session
b.Read CSV
i.Spark 2.0 and Spark 1.6
ii.Pandas
c.Pandas DataFrames, Spark DataSets, DataFrames and RDDs
d.Machine Learning Pipeline
i.SciKit Learn
ii.Spark MLLib, ML
8-Conclusion
1-Objective
It is often useful to have Python with the Jupyter notebook installed on your laptop in order to quickly develop and test code ideas or explore some data. Adding Apache Spark to this also allows you to prototype ideas and exploratory data pipelines before hitting a Hadoop cluster or paying for Amazon Web Services.
We leverage the power of the Python ecosystem with libraries such as Numpy (scientific computing library of high-level mathematical functions to operate on arrays and matrices), SciPy (SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation), Pandas (high performance data structure and data analysis library to build complex data transformation flows), Scikit-Learn (library that implements a range of machine learning, preprocessing, cross-validation and visualization algorithms), NLTK (Natural Language Tool Kit to process text data, libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries)…
We also leverage the strengths of Spark, including Spark SQL and Spark MLlib/ML.
2-Installing Anaconda Python
We install Continuum’s Anaconda distribution by downloading the install script from the Continuum website. https://www.continuum.io/downloads
The advantage of the Anaconda distribution is that a lot of the essential Python packages come bundled.
You do not have to struggle with synchronizing all the dependencies.
We use the following command to download the install script for Python 3.5.
HW12256:~ usr000$ wget http://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
If you wish to install Python 2.7, the following download is recommended.
HW12256:~ usr000$ wget http://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh
Accordingly, in the terminal, issue the following bash command to launch the install.
Python 3.5 version
HW12256:~ usr000$ bash Anaconda3-4.2.0-Linux-x86_64.sh
Python 2.7 version
HW12256:~ usr000$ bash Anaconda2-4.2.0-Linux-x86_64.sh
In the following steps, we are using Python 3.5 as the base environment.
3-Checking Python Install
In order to check the Python install, we issue the following commands in the terminal.
HW12256:~ usr000$ which python
/Users/usr000/anaconda/bin/python
HW12256:~ usr000$ echo $PATH
/Users/usr000/anaconda/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
HW12256:~ usr000$ python --version
Python 3.5.2 :: Anaconda 4.1.1 (x86_64)
HW12256:~ usr000$ python
Python 3.5.2 |Anaconda 4.1.1 (x86_64)| (default, Jul 2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print("Python version: {}".format(sys.version))
Python version: 3.5.2 |Anaconda 4.1.1 (x86_64)| (default, Jul 2 2016, 17:52:12)
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
>>> from datetime import datetime
>>> print('current date and time: {}'.format(datetime.now()))
current date and time: 2016-12-29 09:46:32.393985
>>> print('current date and time: {}'.format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
current date and time: 2016-12-29 09:51:33
>>> exit()
Anaconda Python includes a package manager called ‘conda’ which can list and update the existing libraries available in the current system.
HW12256:~ usr000$ conda info
Current conda install:
platform : osx-64
conda version : 4.2.12
conda is private : False
conda-env version : 4.2.12
conda-build version : 0+unknown
python version : 3.5.2.final.0
requests version : 2.10.0
root environment : /Users/usr000/anaconda (writable)
default environment : /Users/usr000/anaconda
envs directories : /Users/usr000/anaconda/envs
package cache : /Users/usr000/anaconda/pkgs
channel URLs : https://repo.continuum.io/pkgs/free/osx-64
https://repo.continuum.io/pkgs/free/noarch
https://repo.continuum.io/pkgs/pro/osx-64
https://repo.continuum.io/pkgs/pro/noarch
config file : None
offline mode : False
HW12256:~ usr000$ conda list
4-Installing Spark
To install Spark, we download the pre-built Spark tarball spark-2.0.2-bin-hadoop2.7.tgz from http://spark.apache.org/downloads.html and move it to your target Spark directory.
Untar the tarball in your chosen directory
HW12256:bin usr000$ tar -xzvf spark-2.0.2-bin-hadoop2.7.tgz
Create symlink to spark2 directory
HW12256:bin usr000$ ln -s ~/bin/sparks/spark-2.0.2-bin-hadoop2.7 ~/bin/spark2
5-Checking Spark Install
Check the directories created under Spark 2
HW12256:bin usr000$ ls -lru
total 16
drwxr-xr-x    5 usr000 staff   170 Dec 28 10:39 sparks
lrwxr-xr-x    1 usr000 staff    50 Dec 28 10:39 spark2 -> /Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
lrwxr-xr-x    1 usr000 staff    51 May 23  2016 spark -> /Users/usr000/bin/sparks/spark-1.6.1-bin-hadoop2.6/
HW12256:bin usr000$ cd spark2
HW12256:spark2 usr000$ ls -lru
total 112
drwxr-xr-x@    3 usr000 staff   102 Jan  1  1970 yarn
drwxr-xr-x@   24 usr000 staff   816 Jan  1  1970 sbin
drwxr-xr-x@   10 usr000 staff   340 Dec 28 10:30 python
drwxr-xr-x@   38 usr000 staff  1292 Jan  1  1970 licenses
drwxr-xr-x@  208 usr000 staff  7072 Dec 28 10:30 jars
drwxr-xr-x@    4 usr000 staff   136 Jan  1  1970 examples
drwxr-xr-x@    5 usr000 staff   170 Jan  1  1970 data
drwxr-xr-x@    9 usr000 staff   306 Dec 28 10:27 conf
drwxr-xr-x@   24 usr000 staff   816 Dec 28 10:30 bin
-rw-r--r--@    1 usr000 staff   120 Dec 28 10:25 RELEASE
-rw-r--r--@    1 usr000 staff  3828 Dec 28 10:25 README.md
drwxr-xr-x@    3 usr000 staff   102 Jan  1  1970 R
-rw-r--r--@    1 usr000 staff 24749 Dec 28 10:25 NOTICE
-rw-r--r--@    1 usr000 staff 17811 Dec 28 10:25 LICENSE
HW12256:spark2 usr000$
Running SparkPi example in local mode.
Scala command
# export SPARK_HOME
HW12256:spark2 usr000$ export SPARK_HOME=/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
HW12256:spark2 usr000$ echo $SPARK_HOME
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
# Run Spark PI example in Scala
HW12256:spark2 usr000$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --executor-cores 1 $SPARK_HOME/examples/jars/spark-examples*.jar 5
Python command
HW12256:spark2 usr000$ ./bin/spark-submit --driver-memory 512m --executor-memory 512m --executor-cores 1 examples/src/main/python/pi.py 10
Scala example
HW12256:spark2 usr000$ export SPARK_HOME=/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
HW12256:spark2 usr000$ echo $SPARK_HOME
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
HW12256:spark2 usr000$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory 512m --executor-cores 1 $SPARK_HOME/examples/jars/spark-examples*.jar 5
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/29 11:40:53 INFO SparkContext: Running Spark version 2.0.2
16/12/29 11:40:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
16/12/29 11:40:55 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.851288 s
Pi is roughly 3.1390942781885562
16/12/29 11:40:55 INFO SparkUI: Stopped Spark web UI at http://000.000.0.0:4040
...
16/12/29 11:40:55 INFO SparkContext: Successfully stopped SparkContext
16/12/29 11:40:55 INFO ShutdownHookManager: Shutdown hook called
16/12/29 11:40:55 INFO ShutdownHookManager: Deleting directory /private/var/folders/1r/8qylt4bj4h59b3h_1xq_nsw00000gp/T/spark-35b67f21-1d52-4dee-9c75-7e9d9c153ada
HW12256:spark2 usr000$
Python example
HW12256:spark2 usr000$ ./bin/spark-submit examples/src/main/python/pi.py 10
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/29 11:27:33 INFO SparkContext: Running Spark version 2.0.2
16/12/29 11:27:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
16/12/29 11:27:36 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/29 11:27:36 INFO DAGScheduler: Job 0 finished: reduce at /Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/examples/src/main/python/pi.py:43, took 1.199257 s
Pi is roughly 3.138360
16/12/29 11:27:36 INFO SparkUI: Stopped Spark web UI at http://000.000.0.0:4040
16/12/29 11:27:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
...
16/12/29 11:27:36 INFO SparkContext: Successfully stopped SparkContext
16/12/29 11:27:37 INFO ShutdownHookManager: Shutdown hook called
16/12/29 11:27:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/1r/8qylt4bj4h59b3h_1xq_nsw00000gp/T/spark-eb12faa9-b7ff-4556-9538-45ddcdc6797b
16/12/29 11:27:37 INFO ShutdownHookManager: Deleting directory /private/var/folders/1r/8qylt4bj4h59b3h_1xq_nsw00000gp/T/spark-eb12faa9-b7ff-4556-9538-45ddcdc6797b/pyspark-ba9947c5-dbea-4edc-9c4c-c2c316e6caba
Wordcount program using PySpark
HW12256:spark2 usr000$ ./bin/pyspark
Python 2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/29 12:25:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> import os
>>> print(os.getcwd())
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7
>>> import re
>>> from operator import add
>>> wordcounts_in = sc.textFile('README.md').flatMap(lambda l: re.split('\W+', l.strip())).filter(lambda w: len(w)>0).map(lambda w: (w,1)).reduceByKey(add).map(lambda (a,b): (b,a)).sortByKey(ascending = False)
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
>>> wordcounts_in.take(10)
[(23, u'the'), (18, u'Spark'), (14, u'to'), (13, u'run'), (11, u'for'), (11, u'apache'), (11, u'spark'), (11, u'and'), (11, u'org'), (8, u'a')]
>>> wordcounts_in = sc.textFile('README.md').flatMap(lambda l: re.split('\W+', l.strip())).filter(lambda w: len(w)>0).map(lambda w: (w,1)).reduceByKey(add).map(lambda (a,b): (b,a)).sortByKey(ascending = False).map(lambda (a,b): (b,a))
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
>>> wordcounts_in.take(10)
[(u'the', 23), (u'Spark', 18), (u'to', 14), (u'run', 13), (u'for', 11), (u'apache', 11), (u'spark', 11), (u'and', 11), (u'org', 11), (u'a', 8)]
>>> exit()
6-Launching Jupyter Notebook with PySpark
When launching the Jupyter Notebook with Spark 1.6.*, we used to add the --packages com.databricks:spark-csv_2.11:1.4.0 parameter to the command, as the CSV package was not natively part of Spark.
HW12256:~ usr000$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /Users/usr000/bin/spark/bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0
In the case of Spark 2.0.*, we do not need the spark-csv --packages parameter, as CSV support is part of the standard Spark 2.0 library.
HW12256:~ usr000$ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook' PYSPARK_PYTHON=python3 /Users/usr000/bin/spark2/bin/pyspark
7-Exploring PySpark 2.0.2
We will explore the new features of Spark 2.0.2 using PySpark, contrasting where appropriate with previous versions of Spark and with pandas. In the case of the machine learning pipeline, we will contrast Spark MLlib/ML with scikit-learn.
a.Spark Session
Spark 2.0 introduces SparkSession. SparkSession is the single entry point for interacting with Spark functionality. It replaces and encapsulates the SQLContext, HiveContext and StreamingContext for more unified access to the DataFrame and Dataset APIs. The SQLContext, HiveContext and StreamingContext still exist under the hood in Spark 2.0 for continuity with legacy Spark code.
The Spark session has to be created explicitly when using the spark-submit command. An example of how to do that:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf
# from pyspark.sql import SQLContext
spark = SparkSession\
    .builder\
    .appName("example-spark")\
    .config("spark.sql.crossJoin.enabled", "true")\
    .getOrCreate()

# Reuse the context created by the session instead of instantiating a new one
sc = spark.sparkContext
# sqlContext = SQLContext(sc)
When typing 'pyspark' at the terminal, PySpark automatically creates the Spark context sc.
A SparkSession is automatically generated and available as 'spark'.
The application name can be accessed through the SparkContext:
spark.sparkContext.appName

# Configuration is accessible using RuntimeConfig:
from py4j.protocol import Py4JError
try:
    spark.conf.get("some.conf")
except Py4JError as e:
    pass
The following session outlines the available Spark context sc as well as the new Spark session available under the name "spark", which folds the previous SQLContext, HiveContext and StreamingContext into one unified entry point.
sqlContext, HiveContext, StreamingContext still exist to ensure continuity with legacy code in Spark.
HW12256:spark2 usr000$ ./bin/pyspark
Python 2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/12/29 20:41:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 2.7.10 (default, Jul 30 2016 19:40:32)
SparkSession available as 'spark'.
>>> sc
<pyspark.context.SparkContext object at 0x101e9c850>
>>> sc._conf.getAll()
[(u'spark.app.id', u'local-1483040488671'), (u'spark.sql.catalogImplementation', u'hive'), (u'spark.rdd.compress', u'True'), (u'spark.serializer.objectStreamReset', u'100'), (u'spark.master', u'local[*]'), (u'spark.executor.id', u'driver'), (u'spark.submit.deployMode', u'client'), (u'hive.metastore.warehouse.dir', u'file:/Users/usr000/bin/sparks/spark-2.0.2-bin-hadoop2.7/spark-warehouse'), (u'spark.driver.port', u'57764'), (u'spark.app.name', u'PySparkShell'), (u'spark.driver.host', u'000.000.0.0')]
>>> spark
<pyspark.sql.session.SparkSession object at 0x102df9b50>
>>> spark.sparkContext
<pyspark.context.SparkContext object at 0x101e9c850>
>>> spark.sparkContext.appName
u'PySparkShell'
>>> from pyspark.sql.functions import *
>>> spark.range(1, 7, 2).collect()
16/12/29 20:58:32 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/12/29 20:58:32 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
[Row(id=1), Row(id=3), Row(id=5)]
b.Read CSV
We describe how to easily access CSV files from Spark and from pandas, loading them into dataframes for data exploration, manipulation and mining.
i.Spark 2.0 & Spark 1.6
We can create a spark dataframe directly from reading the csv file.
In order to remain compatible with the previous format, we include a conditional switch in the format statement.
## Spark 2.0 and Spark 1.6 compatible read csv
formatPackage = "csv" if sc.version > '1.6' else "com.databricks.spark.csv"
df = sqlContext.read.format(formatPackage).options(header='true', delimiter='|').load("s00_dat/dataframe_sample.csv")
df.printSchema()
ii.Pandas
We can create the iris pandas dataframe from the dataset that ships with sklearn.
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
c.Dataframes
i.Pandas DataFrames
Pandas dataframes, in conjunction with visualization libraries such as matplotlib and seaborn, give us some nice insights into the data; a small example follows.
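A minimal sketch, assuming the iris dataframe df built above and that seaborn is installed (it ships with Anaconda):
# Quick visual exploration of the iris dataframe built earlier.
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots coloured by species show which feature pairs
# separate the classes well.
sns.pairplot(df, hue='species')
plt.show()

# Simple pandas summaries complement the plots.
print(df.groupby('species').mean())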
ii.Spark DataSets, Spark DataFrames and Spark RDDs
Spark DataFrames and Spark RDDs are the fundamental data structures that allow us to manipulate and interact with the various Spark libraries.
Spark Datasets are more relevant for Scala developers and give the ability to create typed DataFrames. A short sketch of moving between these representations follows.
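A minimal sketch, assuming the SparkSession spark and the pandas iris dataframe df from the previous sections:
# Move the same data between pandas, Spark DataFrames and RDDs.
# Cast the categorical species column to plain strings before handing it to Spark.
pdf = df.assign(species=df['species'].astype(str))

spark_df = spark.createDataFrame(pdf)        # pandas -> Spark DataFrame
spark_df.printSchema()
spark_df.groupBy('species').count().show()   # DataFrame API

iris_rdd = spark_df.rdd                      # DataFrame -> RDD of Rows
print(iris_rdd.take(2))

back_to_pandas = spark_df.toPandas()         # Spark DataFrame -> pandas
print(back_to_pandas.head())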
d.Machine Learning
i.SciKit Learn
We demonstrate a random forest machine learning pipeline using scikit-learn in the IPython notebook, along the lines of the sketch below.
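A minimal sketch of such a pipeline on the iris data, assuming scikit-learn as bundled with Anaconda (the notebook itself is not reproduced here):
# Random forest classification on iris with a scikit-learn pipeline.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

pipeline = Pipeline([
    ('scale', StandardScaler()),   # not required for trees, shown as a pipeline stage
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, pipeline.predict(X_test)))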
ii.Spark MLLib, Spark ML
We demonstrate a random forest machine learning pipeline using Spark MLlib and Spark ML; a sketch with the DataFrame-based Spark ML API follows.
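A minimal sketch with the DataFrame-based Spark ML API, assuming the iris data is already loaded into a Spark DataFrame iris_df with a numeric label column and hypothetical feature columns f0 to f3:
# Spark ML random forest pipeline (DataFrame-based API, Spark 2.0).
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Assemble the hypothetical feature columns f0..f3 into a single vector column.
assembler = VectorAssembler(inputCols=['f0', 'f1', 'f2', 'f3'], outputCol='features')
rf = RandomForestClassifier(labelCol='label', featuresCol='features', numTrees=100)
pipeline = Pipeline(stages=[assembler, rf])

train_df, test_df = iris_df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(labelCol='label', metricName='accuracy')
print('test accuracy:', evaluator.evaluate(predictions))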
8-Conclusion
Spark and the Jupyter Notebook, using the Anaconda Python distribution, provide a very powerful development environment on your laptop.
It allows quick exploration of data mining, machine learning and visualization in a flexible, easy-to-use environment.
We have described the installation of the Jupyter Notebook and Spark, a few data processing pipelines, and a machine learning classification using random forests.
03-14-2019
04:28 PM
sc.version or spark-submit --version
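A minimal illustration from a running PySpark shell, where sc and spark are predefined:
# Check the Spark version from a live PySpark session.
print(sc.version)      # version from the SparkContext
print(spark.version)   # version from the SparkSession (Spark 2.x)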