Member since
07-10-2018
63
Posts
71
Kudos Received
0
Solutions
05-14-2019
12:05 PM
My blood recently turned from green to blue (after the Hortonworks-Cloudera merger) and I couldn't be more excited to play with new toys. What I am particularly excited about is Cloudera Data Science Workbench (CDSW). But, as in everything I do, I am very lazy. So here is a quick tutorial to install Altus Director and use it to deploy a CDH 5.15 + CDSW cluster.
Step 1: Install Altus Director
There are many ways to do this, but the one I chose was the AWS install, detailed here: https://www.cloudera.com/documentation/director/latest/topics/director_aws_setup_client.html The installation documentation is very well done, but here are the important excerpts.
Create a VPC for your Altus instance
Follow the documentation. A few important points:
In the name of laziness, I also recommend adding a 0-65535 rule from your personal IP.
Your VPC should have an internet gateway associated with it (you could do without one, but that would require you to manually pull the CM/CDH software down and create internal repositories within your subnet).
Do not forget to open all traffic to your security group as described in the documentation. Your deployment will not work otherwise.
Launch a Red Hat 7.3 instance
You can either search the community AMIs or use this one: ami-6871a115
Install Altus Director
Connect to your EC2 instance:
ssh -i your_file.pem ec2-user@your_instance_ip
Install the JDK and wget:
sudo yum install java-1.8.0-openjdk
sudo yum install wget
Install and start the Altus Director server and client:
cd /etc/yum.repos.d/
sudo wget "http://archive.cloudera.com/director6/6.1/redhat7/cloudera-director.repo"
sudo yum install cloudera-director-server cloudera-director-client
sudo service cloudera-director-server start
sudo systemctl disable firewalld
sudo systemctl stop firewalld
Connect to Altus Director
Go to http://your_instance_ip:7189/ and connect with admin/admin
Step 2: Modify the Director configuration file
The CDSW cluster configuration can be found here: https://github.com/cloudera/director-scripts/blob/master/configs/aws.cdsw.conf
Modify the configuration file to use:
Your AWS accessKeyId/secretAccessKey
Your AWS region
Your AWS subnetId (same as the one you created for your Director instance)
Your AWS securityGroupsIds (same as the one you created for your Director instance)
Your private key path (e.g. /home/ec2-user/field.pem)
Your AWS image (e.g. ami-6871a115)
(See the configuration sketch at the end of this post.)
Step 3: Launch the cluster via the Director client
Go to the EC2 instance where Director is installed, and load your modified configuration file as well as the appropriate key. Finally, run the following:
cloudera-director bootstrap-remote your_configuration_file.conf \
--lp.remote.username=admin \
--lp.remote.password=admin
Step 4: Access Cloudera Manager
You can follow the bootstrapping of the cluster both on the command line and in the Director interface; once done, you can connect to Cloudera Manager at:
http://your_manager_instance_ip:7180/
Step 5: Configure the CDSW domain with your IP
Cloudera Data Science Workbench uses DNS. The correct approach is to set up a wildcard DNS record, as described here. However, for testing purposes I used nip.io. The only parameter to change is the Cloudera Data Science Workbench Domain, from cdsw.my-domain.com (as the conf file sets it) to cdsw.[YOUR_AWS_PUBLIC_IP].nip.io, as depicted below:
Restart the CDSW service, and you should then be able to access CDSW by clicking on the CDSW Web UI link. Register for a new account and you will have access to CDSW.
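For reference, here is a sketch of the kind of values Step 2 asks you to change in aws.cdsw.conf. The key names follow Cloudera's example file, but this is illustrative placeholders only, not a complete working configuration; edit each key wherever it appears in the file:
accessKeyId: "YOUR_ACCESS_KEY_ID"            # your AWS access key
secretAccessKey: "YOUR_SECRET_ACCESS_KEY"    # your AWS secret key
region: "us-east-1"                          # the region hosting the VPC you created for Director
subnetId: "subnet-xxxxxxxx"                  # same subnet as your Director instance
securityGroupsIds: "sg-xxxxxxxx"             # same security group as your Director instance
privateKey: "/home/ec2-user/field.pem"       # path to your .pem on the Director host
image: "ami-6871a115"                        # the RHEL 7.3 AMI used above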
04-23-2019
06:59 PM
1 Kudo
Introduction
Overview
Cloudera/Hortonworks today offers one of the most comprehensive data management platforms, with components covering everything from data flow management to governed, distributed data science workloads. With so many toys to play with, I thought I'd share an easy way to set up a simple cluster that, using Cloudbreak, deploys the following main components on Azure:
Hortonworks Data Platform 3.1
Hortonworks Data Flow 3.3
Data Platform Search 4.0
Cloudera Data Science Workbench 1.5
Note: This is not a production-ready setup, but merely a first step toward customizing your deployment using the Cloudera toolkit.
Pre-Requisites
Account on Azure with permission to assign roles
Cloudbreak 2.9 (you can always set one up on your machine using: https://community.hortonworks.com/articles/194076/using-vagrant-and-virtualbox-to-create-a-local-ins.html)
Tutorial steps
Step 1: Setup Azure Credentials
Step 2: Setup blueprint and cluster extensions
Step 3: Create cluster
Step 1: Setup Azure Credentials
Find your Azure subscription and tenant ID
To find your subscription ID, go to the search box and look for "subscription"; you should find it as such:
For the tenant ID, use the Azure AD Directory ID:
Setup your credentials in Cloudbreak
This part is extremely well documented in Cloudbreak's documentation portal: https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-credential-azure/content/cb_create-app-based-credential.html
Note: Because of IT restrictions on my side, I chose to use an app-based credential setup, but if you have enough privileges, Cloudbreak creates the app and assigns roles automagically for you.
Step 2: Setup blueprint and cluster extensions
Blueprint
First upload the blueprint below:
{
"Blueprints": {
"blueprint_name": "edge-to-ai-3.1",
"stack_name": "HDP",
"stack_version": "3.1"
},
"configurations": [
{
"yarn-site": {
"properties": {
"yarn.nodemanager.resource.cpu-vcores": "6",
"yarn.nodemanager.resource.memory-mb": "60000",
"yarn.scheduler.maximum-allocation-mb": "14"
}
}
},
{
"hdfs-site": {
"properties": {
"dfs.cluster.administrators": "hdfs"
}
}
},
{
"capacity-scheduler": {
"properties": {
"yarn.scheduler.capacity.maximum-am-resource-percent": "0.4",
"yarn.scheduler.capacity.root.capacity": "67",
"yarn.scheduler.capacity.root.default.capacity": "67",
"yarn.scheduler.capacity.root.default.maximum-capacity": "67",
"yarn.scheduler.capacity.root.llap.capacity": "33",
"yarn.scheduler.capacity.root.llap.maximum-capacity": "33",
"yarn.scheduler.capacity.root.queues": "default,llap"
}
}
},
{
"ranger-hive-audit": {
"properties": {
"xasecure.audit.destination.hdfs.file.rollover.sec": "300"
},
"properties_attributes": {}
}
},
{
"hive-site": {
"hive.exec.compress.output": "true",
"hive.merge.mapfiles": "true",
"hive.metastore.dlm.events": "true",
"hive.metastore.transactional.event.listeners": "org.apache.hive.hcatalog.listener.DbNotificationListener",
"hive.repl.cm.enabled": "true",
"hive.repl.cmrootdir": "/apps/hive/cmroot",
"hive.repl.rootdir": "/apps/hive/repl",
"hive.server2.tez.initialize.default.sessions": "true",
"hive.server2.transport.mode": "http"
}
},
{
"hive-interactive-env": {
"enable_hive_interactive": "true",
"hive_security_authorization": "Ranger",
"num_llap_nodes": "1",
"num_llap_nodes_for_llap_daemons": "1",
"num_retries_for_checking_llap_status": "50"
}
},
{
"hive-interactive-site": {
"hive.exec.orc.split.strategy": "HYBRID",
"hive.llap.daemon.num.executors": "5",
"hive.metastore.rawstore.impl": "org.apache.hadoop.hive.metastore.cache.CachedStore",
"hive.stats.fetch.bitvector": "true"
}
},
{
"spark2-defaults": {
"properties": {
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://%HOSTGROUP::master1%:9083",
"spark.hadoop.hive.zookeeper.quorum": "{{zookeeper_quorum_hosts}}",
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://{{zookeeper_quorum_hosts}}:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EC2.INTERNAL"
},
"properties_attributes": {}
}
},
{
"gateway-site": {
"properties": {
"gateway.path": "{{cluster_name}}"
},
"properties_attributes": {}
}
},
{
"admin-topology": {
"properties": {
"content": "\n \n\n \n\n \n authentication\n ShiroProvider\n true\n \n sessionTimeout\n 30\n \n \n main.ldapRealm\n org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm\n \n \n main.ldapRealm.userDnTemplate\n uid={0},ou=people,dc=hadoop,dc=apache,dc=org\n \n \n main.ldapRealm.contextFactory.url\n ldap://54.219.163.9:33389\n \n \n main.ldapRealm.contextFactory.authenticationMechanism\n simple\n \n \n urls./**\n authcBasic\n \n \n\n \n authorization\n AclsAuthz\n true\n \n \n\n \n KNOX\n \n\n "
},
"properties_attributes": {}
}
},
{
"ranger-admin-site": {
"properties": {
"ranger.jpa.jdbc.url": "jdbc:postgresql://localhost:5432/ranger"
},
"properties_attributes": {}
}
},
{
"ranger-env": {
"properties": {
"is_solrCloud_enabled": "true",
"keyadmin_user_password": "{{{ general.password }}}",
"ranger-atlas-plugin-enabled": "Yes",
"ranger-hdfs-plugin-enabled": "Yes",
"ranger-hive-plugin-enabled": "Yes",
"ranger-knox-plugin-enabled": "Yes",
"ranger_admin_password": "{{{ general.password }}}",
"rangertagsync_user_password": "{{{ general.password }}}",
"rangerusersync_user_password": "{{{ general.password }}}"
},
"properties_attributes": {}
}
},
{
"ams-hbase-site": {
"properties": {
"hbase.cluster.distributed": "true",
"hbase.rootdir": "file:///hadoopfs/fs1/metrics/hbase/data"
}
}
},
{
"atlas-env": {
"properties": {
"atlas.admin.password": "admin",
"atlas_solr_shards": "2",
"content": "\n # The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path\n export JAVA_HOME={{java64_home}}\n\n # any additional java opts you want to set. This will apply to both client and server operations\n {% if security_enabled %}\n export ATLAS_OPTS=\"{{metadata_opts}} -Djava.security.auth.login.config={{atlas_jaas_file}}\"\n {% else %}\n export ATLAS_OPTS=\"{{metadata_opts}}\"\n {% endif %}\n\n # metadata configuration directory\n export ATLAS_CONF={{conf_dir}}\n\n # Where log files are stored. Defatult is logs directory under the base install location\n export ATLAS_LOG_DIR={{log_dir}}\n\n # additional classpath entries\n export ATLASCPPATH={{metadata_classpath}}\n\n # data dir\n export ATLAS_DATA_DIR={{data_dir}}\n\n # pid dir\n export ATLAS_PID_DIR={{pid_dir}}\n\n # hbase conf dir\n export HBASE_CONF_DIR=\"/etc/ams-hbase/conf\"\n\n # Where do you want to expand the war file. By Default it is in /server/webapp dir under the base install dir.\n export ATLAS_EXPANDED_WEBAPP_DIR={{expanded_war_dir}}\n export ATLAS_SERVER_OPTS=\"-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$ATLAS_LOG_DIR/atlas_server.hprof -Xloggc:$ATLAS_LOG_DIRgc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps\"\n {% if java_version == 8 %}\n export ATLAS_SERVER_HEAP=\"-Xms{{atlas_server_xmx}}m -Xmx{{atlas_server_xmx}}m -XX:MaxNewSize={{atlas_server_max_new_size}}m -XX:MetaspaceSize=100m -XX:MaxMetaspaceSize=512m\"\n {% else %}\n export ATLAS_SERVER_HEAP=\"-Xms{{atlas_server_xmx}}m -Xmx{{atlas_server_xmx}}m -XX:MaxNewSize={{atlas_server_max_new_size}}m -XX:MaxPermSize=512m\"\n {% endif %}\n",
"hbase_conf_dir": "/etc/ams-hbase/conf"
}
}
},
{
"kafka-broker": {
"properties": {
"default.replication.factor": "1",
"offsets.topic.replication.factor": "1"
},
"properties_attributes": {}
}
},
{
"hbase-env": {
"properties": {
"phoenix_sql_enabled": "true"
},
"properties_attributes": {}
}
},
{
"druid-common": {
"properties": {
"druid.extensions.loadList": "[\"postgresql-metadata-storage\", \"druid-datasketches\", \"druid-hdfs-storage\", \"druid-kafka-indexing-service\", \"ambari-metrics-emitter\"]",
"druid.indexer.logs.directory": "/user/druid/logs",
"druid.indexer.logs.type": "hdfs",
"druid.metadata.storage.connector.connectURI": "jdbc:postgresql://%HOSTGROUP::master1%:5432/druid",
"druid.metadata.storage.connector.password": "druid",
"druid.metadata.storage.connector.user": "druid",
"druid.metadata.storage.type": "postgresql",
"druid.selectors.indexing.serviceName": "druid/overlord",
"druid.storage.storageDirectory": "/user/druid/data",
"druid.storage.type": "hdfs"
},
"properties_attributes": {}
}
},
{
"druid-overlord": {
"properties": {
"druid.indexer.runner.type": "remote",
"druid.indexer.storage.type": "metadata",
"druid.port": "8090",
"druid.service": "druid/overlord"
},
"properties_attributes": {}
}
},
{
"druid-middlemanager": {
"properties": {
"druid.indexer.runner.javaOpts": "-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dhdp.version={{stack_version}} -Dhadoop.mapreduce.job.classloader=true",
"druid.port": "8091",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.service": "druid/middlemanager",
"druid.worker.capacity": "3"
},
"properties_attributes": {}
}
},
{
"druid-coordinator": {
"properties": {
"druid.coordinator.merge.on": "false",
"druid.port": "8081"
},
"properties_attributes": {}
}
},
{
"druid-historical": {
"properties": {
"druid.port": "8083",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.server.maxSize": "300000000000",
"druid.service": "druid/historical"
},
"properties_attributes": {}
}
},
{
"druid-broker": {
"properties": {
"druid.broker.http.numConnections": "5",
"druid.cache.type": "local",
"druid.port": "8082",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.service": "druid/broker"
},
"properties_attributes": {}
}
},
{
"druid-router": {
"properties": {},
"properties_attributes": {}
}
},
{
"superset": {
"properties": {
"SECRET_KEY": "{{{ general.password }}}",
"SUPERSET_DATABASE_TYPE": "sqlite"
},
"properties_attributes": {}
}
},
{
"nifi-ambari-config": {
"nifi.max_mem": "4g",
"nifi.security.encrypt.configuration.password": "{{{ general.password }}}",
"nifi.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"nifi-properties": {
"nifi.security.user.login.identity.provider": "",
"nifi.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"nifi-registry-ambari-config": {
"nifi.registry.security.encrypt.configuration.password": "{{{ general.password }}}"
}
},
{
"nifi-registry-properties": {
"nifi.registry.db.password": "{{{ general.password }}}",
"nifi.registry.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"registry-common": {
"properties": {
"adminPort": "7789",
"database_name": "registry",
"jar.storage": "/hdf/registry",
"jar.storage.hdfs.url": "hdfs://localhost:9090",
"jar.storage.type": "local",
"port": "7788",
"registry.schema.cache.expiry.interval": "3600",
"registry.schema.cache.size": "10000",
"registry.storage.connector.connectURI": "jdbc:mysql://localhost:3306/registry",
"registry.storage.connector.password": "registry",
"registry.storage.connector.user": "registry",
"registry.storage.query.timeout": "30",
"registry.storage.type": "mysql"
},
"properties_attributes": {}
}
},
{
"hbase-site": {
"properties": {
"hbase.bucketcache.combinedcache.enabled": "true",
"hbase.bucketcache.ioengine": "file:/hbase/cache",
"hbase.bucketcache.size": "24000",
"hbase.defaults.for.version.skip": "true",
"hbase.hregion.max.filesize": "21474836480",
"hbase.hregion.memstore.flush.size": "536870912",
"hbase.region.server.rpc.scheduler.factory.class": "org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory",
"hbase.regionserver.global.memstore.size": "0.4",
"hbase.regionserver.handler.count": "60",
"hbase.regionserver.wal.codec": "org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec",
"hbase.rootdir": "/apps/hbase",
"hbase.rpc.controllerfactory.class": "org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory",
"hbase.rs.cacheblocksonwrite": "true",
"hfile.block.bloom.cacheonwrite": "true",
"hfile.block.cache.size": "0.4",
"hfile.block.index.cacheonwrite": "true",
"phoenix.functions.allowUserDefinedFunctions": "true",
"phoenix.query.timeoutMs": "60000"
},
"properties_attributes": {}
}
},
{
"hbase-env": {
"properties": {
"hbase_java_io_tmpdir": "/tmp",
"hbase_log_dir": "/var/log/hbase",
"hbase_master_heapsize": "1024m",
"hbase_pid_dir": "/var/run/hbase",
"hbase_regionserver_heapsize": "16384m",
"hbase_regionserver_shutdown_timeout": "30",
"hbase_regionserver_xmn_max": "16384",
"hbase_regionserver_xmn_ratio": "0.2",
"hbase_user": "hbase",
"hbase_user_nofile_limit": "32000",
"hbase_user_nproc_limit": "16000",
"phoenix_sql_enabled": "true"
},
"properties_attributes": {}
}
}
],
"host_groups": [
{
"cardinality": "1",
"components": [
{
"name": "RANGER_TAGSYNC"
},
{
"name": "RANGER_USERSYNC"
},
{
"name": "RANGER_ADMIN"
},
{
"name": "KNOX_GATEWAY"
},
{
"name": "HIVE_SERVER"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "DRUID_OVERLORD"
},
{
"name": "DRUID_COORDINATOR"
},
{
"name": "DRUID_ROUTER"
},
{
"name": "DRUID_BROKER"
},
{
"name": "SECONDARY_NAMENODE"
},
{
"name": "HISTORYSERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "REGISTRY_SERVER"
},
{
"name": "NIFI_REGISTRY_MASTER"
},
{
"name": "DATANODE"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "INFRA_SOLR_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "master1"
},
{
"cardinality": "1+",
"components": [
{
"name": "NODEMANAGER"
},
{
"name": "DATANODE"
},
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "NIFI_MASTER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "worker"
},
{
"cardinality": "1",
"components": [
{
"name": "ATLAS_SERVER"
},
{
"name": "HBASE_MASTER"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "DRUID_HISTORICAL"
},
{
"name": "DRUID_MIDDLEMANAGER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "DATANODE"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "LOGSEARCH_SERVER"
},
{
"name": "NAMENODE"
},
{
"name": "SUPERSET"
},
{
"name": "NIFI_CA"
},
{
"name": "INFRA_SOLR"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "HBASE_MASTER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "master2"
},
{
"name": "cdsw_worker",
"cardinality": "1+",
"components": [
{
"name": "SPARK2_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "NODEMANAGER"
},
{
"name": "DATANODE"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "NIFI_MASTER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "TEZ_CLIENT"
}
]
}
],
"settings": [
{
"recovery_settings": [
{
"recovery_enabled": "false"
}
]
}
]
}
Recipes
Pre-Ambari-start recipe to set up metastores
#!/usr/bin/env bash
# Initialize metastores
yum install -y https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
yum install -y postgresql96-server
yum install -y postgresql96-contrib
/usr/pgsql-9.6/bin/postgresql96-setup initdb
sed -i 's,#port = 5432,port = 5433,g' /var/lib/pgsql/9.6/data/postgresql.conf
echo '' > /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'local all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid 0.0.0.0/0 trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid ::/0 trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
systemctl enable postgresql-9.6.service
systemctl start postgresql-9.6.service
echo "CREATE DATABASE streamsmsgmgr;" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "CREATE USER streamsmsgmgr WITH PASSWORD 'streamsmsgmgr';" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "GRANT ALL PRIVILEGES ON DATABASE streamsmsgmgr TO streamsmsgmgr;" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "CREATE DATABASE druid;" | sudo -u postgres psql -U postgres
echo "CREATE DATABASE ranger;" | sudo -u postgres psql -U postgres
echo "CREATE DATABASE registry;" | sudo -u postgres psql -U postgres
echo "CREATE USER druid WITH PASSWORD 'druid';" | sudo -u postgres psql -U postgres
echo "CREATE USER registry WITH PASSWORD 'registry';" | sudo -u postgres psql -U postgres
echo "CREATE USER rangerdba WITH PASSWORD 'rangerdba';" | sudo -u postgres psql -U postgres
echo "CREATE USER rangeradmin WITH PASSWORD 'ranger';" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE druid TO druid;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE registry TO registry;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE ranger TO rangerdba;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE ranger TO rangeradmin;" | sudo -u postgres psql -U postgres
#ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar
if [[ $(cat /etc/system-release|grep -Po Amazon) == "Amazon" ]]; then
echo '' > /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'local all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry 0.0.0.0/0 trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry ::/0 trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
sudo -u postgres /usr/pgsql-9.5/bin/pg_ctl -D /var/lib/pgsql/9.5/data/ reload
else
echo '' > /var/lib/pgsql/data/pg_hba.conf
echo 'local all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry 0.0.0.0/0 trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry ::/0 trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/data/pg_hba.conf
sudo -u postgres pg_ctl -D /var/lib/pgsql/data/ reload
fi
yum remove -y mysql57-community*
yum remove -y mysql56-server*
yum remove -y mysql-community*
rm -Rvf /var/lib/mysql
yum install -y epel-release
yum install -y libffi-devel.x86_64
ln -s /usr/lib64/libffi.so.6 /usr/lib64/libffi.so.5
yum install -y mysql-connector-java*
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
if [ "$(cat /etc/system-release | grep -Po Amazon)" == "Amazon" ]; then
yum install -y mysql56-server
service mysqld start
else
yum localinstall -y https://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
systemctl start mysqld.service
fi
chkconfig --add mysqld
chkconfig mysqld on
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-client/lib/mysql-connector-java.jar
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-server2-hive2/lib/mysql-connector-java.jar
mysql --execute="CREATE DATABASE druid DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE registry DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE streamline DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE streamsmsgmgr DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'das'@'localhost' IDENTIFIED BY 'dasuser'"
mysql --execute="CREATE USER 'das'@'%' IDENTIFIED BY 'dasuser'"
mysql --execute="CREATE USER 'ranger'@'localhost' IDENTIFIED BY 'ranger'"
mysql --execute="CREATE USER 'ranger'@'%' IDENTIFIED BY 'ranger'"
mysql --execute="CREATE USER 'rangerdba'@'localhost' IDENTIFIED BY 'rangerdba'"
mysql --execute="CREATE USER 'rangerdba'@'%' IDENTIFIED BY 'rangerdba'"
mysql --execute="CREATE USER 'registry'@'localhost' IDENTIFIED BY 'registry'"
mysql --execute="CREATE USER 'registry'@'%' IDENTIFIED BY 'registry'"
mysql --execute="CREATE USER 'streamsmsgmgr'@'localhost' IDENTIFIED BY 'streamsmsgmgr'"
mysql --execute="CREATE USER 'streamsmsgmgr'@'%' IDENTIFIED BY 'streamsmsgmgr'"
mysql --execute="CREATE USER 'druid'@'%' IDENTIFIED BY 'druid'"
mysql --execute="CREATE USER 'streamline'@'%' IDENTIFIED BY 'streamline'"
mysql --execute="CREATE USER 'streamline'@'localhost' IDENTIFIED BY 'streamline'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON streamline.* TO 'streamline'@'%' WITH GRANT OPTION"
mysql --execute="CREATE DATABASE beast_mode_db DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'bmq_user'@'localhost' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="CREATE USER 'bmq_user'@'%' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%' WITH GRANT OPTION"
mysql --execute="FLUSH PRIVILEGES"
mysql --execute="COMMIT"
#remount tmpfs to ensure NOEXEC is disabled
if grep -Eq '^[^ ]+ /tmp [^ ]+ ([^ ]*,)?noexec[, ]' /proc/mounts; then
echo "/tmp found as noexec, remounting..."
mount -o remount,size=10G /tmp
mount -o remount,exec /tmp
else
echo "/tmp not found as noexec, skipping..."
fi
Pre-Ambari-start recipe to grow the root volume for the CDSW worker
#!/usr/bin/env bash
# WARNING: This script is only for RHEL7 on Azure
# growing the /dev/sda2 partition
sed -e 's/\s*\([\+0-9a-zA-Z]*\).*/\1/' << EOF | fdisk /dev/sda
d # delete
2 # delete partition 2
n # new
p # partition
2 # partition 2
# default
# default
w # write the partition table
q # and we're done
EOF
reboot
Post-cluster-install recipe to set up CDSW
#!/usr/bin/env bash
# WARNING: This script is only for RHEL7 on Azure
# growing the /dev/sda2 partition
xfs_growfs /dev/sda2
# Some of these installs may be unnecessary but are included for completeness against documentation
yum -y install nfs-utils libseccomp lvm2 bridge-utils libtool-ltdl ebtables rsync policycoreutils-python ntp bind-utils nmap-ncat openssl e2fsprogs redhat-lsb-core socat selinux-policy-base selinux-policy-targeted
# CDSW wants a pristine IPTables setup
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t mangle -F
iptables -F
iptables -X
# set java_home on centos7
#echo 'export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")' >> /etc/profile
#export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")
echo 'export JAVA_HOME=/usr/lib/jvm/java' >> /etc/profile
export JAVA_HOME='/usr/lib/jvm/java'
# Fetch public IP
export MASTER_IP=$(hostname --ip-address)
# Fetch public FQDN for Domain
export DOMAIN=$(curl https://ipv4.icanhazip.com)
cd /hadoopfs/
mkdir cdsw
# Install CDSW
#wget -q --no-check-certificate https://s3.eu-west-2.amazonaws.com/whoville/v2/temp.blob
#mv temp.blob cloudera-data-science-workbench-1.5.0.818361-1.el7.centos.x86_64.rpm
wget -q https://archive.cloudera.com/cdsw1/1.5.0/redhat7/yum/RPMS/x86_64/cloudera-data-science-workbench-1.5.0.849870-1.el7.centos.x86_64.rpm
yum install -y cloudera-data-science-workbench-1.5.0.849870-1.el7.centos.x86_64.rpm
# Install Anaconda
curl -Ok https://repo.anaconda.com/archive/Anaconda2-5.2.0-Linux-x86_64.sh
chmod +x ./Anaconda2-5.2.0-Linux-x86_64.sh
./Anaconda2-5.2.0-Linux-x86_64.sh -b -p /anaconda
# create unix user
useradd tutorial
echo "tutorial-password" | passwd --stdin tutorial
su - hdfs -c 'hdfs dfs -mkdir /user/tutorial'
su - hdfs -c 'hdfs dfs -chown tutorial:hdfs /user/tutorial'
# CDSW Setup
sed -i "s@MASTER_IP=\"\"@MASTER_IP=\"${MASTER_IP}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@JAVA_HOME=\"/usr/java/default\"@JAVA_HOME=\"$(echo ${JAVA_HOME})\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DOMAIN=\"cdsw.company.com\"@DOMAIN=\"${DOMAIN}.xip.io\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DOCKER_BLOCK_DEVICES=\"\"@DOCKER_BLOCK_DEVICES=\"${DOCKER_BLOCK}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@APPLICATION_BLOCK_DEVICE=\"\"@APPLICATION_BLOCK_DEVICE=\"${APP_BLOCK}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DISTRO=\"\"@DISTRO=\"HDP\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@ANACONDA_DIR=\"\"@ANACONDA_DIR=\"/anaconda/bin\"@g" /etc/cdsw/config/cdsw.conf
# CDSW will break default Amazon DNS on 127.0.0.1:53, so we use a different IP
sed -i "s@nameserver 127.0.0.1@nameserver 169.254.169.253@g" /etc/dhcp/dhclient-enter-hooks
cdsw init
echo "CDSW will shortly be available on ${DOMAIN}"
# after the init, we wait until we are able to create the tutorial user
export respCode=404
while (( $respCode != 201 ))
do
sleep 10
export respCode=$(curl -iX POST http://${DOMAIN}.xip.io/api/v1/users/ -H 'Content-Type: application/json' -d '{"email":"tutorial@tutorial.com","name":"tutorial","username":"tutorial","password":"tutorial-password","type":"user","admin":true}' | grep HTTP | awk '{print $2}')
done
exit 0
Note: this script uses xip.io and hacks into Unix to create the user and Hadoop folders; this is not a recommendation for production!
Management packs
You will need two management packs for this setup, using the URLs detailed below:
HDF mpack: http://s3.amazonaws.com/dev.hortonworks.com/HDF/centos7/3.x/BUILDS/3.3.1.0-10/tars/hdf_ambari_mp/hdf-ambari-mpack-3.3.1.0-10.tar.gz
Search mpack: http://public-repo-1.hortonworks.com/HDP-SOLR/hdp-solr-ambari-mp/solr-service-mpack-4.0.0.tar.gz
Step 3: Create cluster
This step uses Cloudbreak's Create Cluster wizard and is pretty self-explanatory by following the screenshots, but I will add specific parameters in text form for convenience.
Note: Do not forget to toggle the advanced mode when running the wizard (top of the screen).
General Configuration
Image Settings
Hardware and Storage
Note: Make sure to use 100 GB as the root volume size for CDSW.
Network and Availability
Cloud Storage
Cluster Extensions
External Sources
Gateway Configuration
Network Security Groups
Security
Result
After the cluster is created, you should have access to the following screen in Cloudbreak:
You can now access Ambari via the link provided, and CDSW using http://[CDSW_WORKER_PUBLIC_IP].xip.io
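As a quick sanity check, not part of the recipes above, you can SSH to the cdsw_worker node once the post-install recipe has finished and confirm CDSW came up:
cdsw status                      # built-in CDSW health check; roles should eventually report as ready
cat /etc/cdsw/config/cdsw.conf   # verify MASTER_IP, DOMAIN and JAVA_HOME were substituted by the recipe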
03-06-2019
12:30 PM
4 Kudos
Introduction
Before the official release of Cloudera Data Flow (formerly Hortonworks Data Flow), you may want to play with NiFi and Hive on CDH. However, because CDH 5 uses a fork of Hive 1.1, the HiveQL processors and controller services included in the official Apache release will not work, so you need to build your own, as explained in this article: Connecting NiFi to CDH Hive. That article is awesome, but it does not focus on Kerberos/SSL; since I had to do the configuration myself, I thought I would share the knowledge.
Note: You could use a DBCP connection to connect to Cloudera Hive, but it will not allow you to use the proper authentication.
Pre-Requisites
To connect to Hive with SSL and Kerberos, you will need the following:
A running version of NiFi (I used Apache 1.9 in this example)
A Kerberized CDH cluster with Hive on SSL (I used CDH 5.15 in this example)
The certificate to add to your keystore for the SSL connection
A keytab for a specific user authorized in the cluster
The krb5 configuration file from the cluster
hive-site.xml, core-site.xml and hdfs-site.xml from your cluster
NiFi processors and services compiled for Hive 1.1 on CDH (can be compiled as described in the article linked above)
Step 1: Add the certificate to the Java truststore
The goal of this step is to add your certificate to the Java cacerts used to run NiFi. To import your certificate, run the following command:
keytool -importcert -alias HS2server -keystore [LOCATION_OF_CACERTS] -file [LOCATION_OF_YOUR_CERTIFICATE]
I'm running on macOS, so my cacerts is under /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/lib/security/cacerts, so I ran:
keytool -importcert -alias HS2server -keystore /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/lib/security/cacerts -file /Users/pvidal/Documents/customers/quest/config/tls/rootCA.pem
Step 2: Prepare NiFi
Note: This step requires a NiFi restart, so I suggest stopping NiFi before following these instructions and starting it again afterwards.
Add the krb5 conf file to the NiFi properties
Go to your NiFi conf folder and modify the nifi.properties file to add the following:
nifi.kerberos.krb5.file=[LOCATION_OF_YOUR_KRB5.CONF]
Load the processors and services into NiFi
Go to your NiFi lib folder and add the necessary NARs; I added the following:
-rwxr-xr-x@ 1 pvidal admin 14800 Mar 5 16:39 nifi-hive-services-api-nar-1.9.0.1.0.0.0-49.nar
-rwxr-xr-x@ 1 pvidal admin 164674666 Mar 5 16:39 nifi-hive_1_1-nar-1.9.0.1.0.0.0-49.nar
Step 3: Configure NiFi
Note: Remember to restart NiFi before this step.
Configure a KeytabCredentialsService
Go to your controller services and add a new KeytabCredentialsService. Configure the service as follows:
Kerberos Keytab: [LOCATION_OF_YOUR_KEYTAB]
Kerberos Principal: [NAME_OF_YOUR_PRINCIPAL]
Enable the service.
Configure a Hive_1_1ConnectionPool
Go to your controller services and add a new Hive_1_1ConnectionPool (from the NAR you imported). Configure the service as follows:
Database Connection URL: jdbc:hive2://[YOUR_HIVE_HOST]:10000/default;principal=hive/_HOST@[YOUR_DOMAIN,SAME AS PRINCIPAL];ssl=true
Hive Configuration Resources: [LOCATION_OF_HIVE_SITE.XML],[LOCATION_OF_CORE_SITE.XML],[LOCATION_OF_HDFS_SITE.XML]
Kerberos Credentials Service: [YOUR_KEYTABCREDENTIALSSERVICE]
Enable the service.
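Before wiring these controller services into a flow, it can save time to confirm the keytab and krb5.conf actually work from the host running NiFi; a quick sketch using standard Kerberos tools (paths are placeholders):
export KRB5_CONFIG=[LOCATION_OF_YOUR_KRB5.CONF]                 # point the tools at the cluster's krb5.conf
kinit -kt [LOCATION_OF_YOUR_KEYTAB] [NAME_OF_YOUR_PRINCIPAL]    # obtain a ticket with the keytab
klist                                                           # the principal should show a valid ticket
kdestroy                                                        # clean up the test ticket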
Configure a simple flow
I configured a simple flow that only contains:
A SelectHive_1_1QL processor (from the NAR I imported)
A Convert Avro to JSON processor (to make the output readable)
A log message
The only bit of configuration I had to do was reference the Hive_1_1ConnectionPool I created earlier, as depicted below:
Note: With the official release of CDF, all of this will be MUCH simpler, with no need for the NAR import. If you're not excited about it, I am!
02-14-2019
02:45 PM
Fair point @Aggelos Karalias!
02-13-2019
09:24 PM
4 Kudos
I have been playing quite a bit with CDSW lately. Here is a quick article on how to set up a CDSW project in Scala that connects to an external RDBMS.
Step 1: Create a new CDSW project
Using the CDSW UI, create a new Scala project:
Step 2: Reference the external jar in your spark-defaults.conf
Open your project and edit your spark-defaults.conf to add an external jar:
spark.jars=http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar
Step 3: Create a simple Scala file to connect to the DB
Create a new file and add this code to it:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://[YOUR_SERVER_IP]:3306/[YOUR_DB]").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "[YOUR_TABLE]").option("user", "[YOUR_USER]").option("password", "[YOUR_PWD]").load()
df.show()
Step 4: Run your application
Launch a session and run your code:
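As a closing note, not in the original post: for larger tables, the same JDBC source can spread the read across executors with Spark's standard partitioning options. A hedged sketch, with an illustrative numeric column and bounds:
val partitionedDf = sqlContext.read.format("jdbc").
  option("url", "jdbc:mysql://[YOUR_SERVER_IP]:3306/[YOUR_DB]").
  option("driver", "com.mysql.jdbc.Driver").
  option("dbtable", "[YOUR_TABLE]").
  option("user", "[YOUR_USER]").option("password", "[YOUR_PWD]").
  option("partitionColumn", "id").                    // assumes a numeric column named id
  option("lowerBound", "1").option("upperBound", "100000").
  option("numPartitions", "4").                       // four parallel JDBC reads
  load()
partitionedDf.show()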
01-31-2019
06:16 PM
4 Kudos
Introduction
Finally, we get to put all this together! This is less of a tutorial and more of a showcase of the capabilities of this solution. Please refer to the previous articles to build this on your own.
The meat of this article is showcasing the React/NodeJS application you can build to communicate with NiFi, Cloudbreak and your ephemeral cluster running Zeppelin and Spark.
Architecture
Below is a high-level architecture of the solution:
You can find all the code for this here: https://github.com/paulvid/bmq-app
The code includes:
Latest Zeppelin Prediction Notebook
All Nifi templates
All Blueprints/Recipes for Cloudbreak
The source code of the app
Agenda
This article is divided into the following sections:
Section 1: Monitoring your current data with Nifi
Section 2: Launching an ephemeral cluster with Cloudbreak
Section 3: Running a prediction model with Spark Zeppelin
Section 1: Monitoring your current data with Nifi
Section 2: Launching an ephemeral cluster with Cloudbreak
Section 3: Running a prediction model with Spark Zeppelin
01-31-2019
05:13 PM
3 Kudos
Introduction
It's been a while, but I'm finally finishing this series on Beast Mode Quotient.
This article is fairly straightforward, as we look at how to build the infrastructure for model training using Zeppelin and Spark.
Pre-Requisites
In order to run this install, you will have to have deployed a BMQ ephemeral cluster as detailed in this article and this repository.
Moreover, you will have to have gathered data on your fitness, as explained in part 1 of this series.
Based on this data, I created 3 indexes:
Intensity Index: How intense was the workout I did that day (based on distance, pace and elevation)
Fatigue Index: How much rest I had on that day (based on hours of sleep and rest heart rate)
BMQ Index: How my BMQ fares compared to the max BMQ (5)
You can then agglomerate these 3 indexes, having BMQ and Fatigue on the same day correlating to the Intensity Index of the previous day. All data will be shared in the last article of the series.
You will also notice that the model uses parameterized sleep and rest HR. The whole flow is to be revealed in part 4 🙂
Agenda
This tutorial is divided into the following sections:
Section 1: Create a mysql interpreter for JDBC in Zeppelin
Section 2: Create training set for BMQ prediction
Section 3: Create a prediction model
Section 4: Save the results in a table
Section 1: Create a mysql interpreter for JDBC in Zeppelin
Log in to Zeppelin, then go to the top-right corner > Admin > Interpreter and edit the jdbc interpreter. Add the following parameters:
mysql.driver: com.mysql.jdbc.Driver
mysql.url: jdbc:mysql://localhost:3306/beast_mode_db
mysql.user: bmq_user
mysql.password: Be@stM0de
Add artifact: mysql:mysql-connector-java:5.1.38
Restart the interpreter and you should be good to go.
Section 2: Create training set for BMQ prediction
Create a new note called BMQ Predictions and add the following code, using the jdbc interpreter you just built.
Delete existing training tables if any
%jdbc(mysql)
drop table if exists training_set
Create training set based on fatigue and intensity indexes
%jdbc(mysql)
create table training_set as (
select @rowid:=@rowid+1 as rowid, bmq_index.date, bmq_index, fatigue_index, intensity_index
from bmq_index, fatigue_index, intensity_index, (select @rowid:=0) as init
where bmq_index.date = fatigue_index.date
and date_sub(bmq_index.date, INTERVAL 1 DAY) = intensity_index.date
order by bmq_index.date asc)
View Data
%jdbc(mysql)
select * from training_set
Delete existing prediction tables if any
%jdbc(mysql)
drop table if exists prediction
Create a table we want to apply the algorithm against
%jdbc(mysql)
create table prediction as (
select date(training_date) as date, estimated_intensity_index,
round((
(1-((select (sleep_hours*60) from PREDICTION_PARAMETERS)/(select max(TOTAL_MINUTES_ASLEEP) from SLEEP_HISTORY)))*0.6 +
(1-((select min(REST_HR) from HEALTH_HISTORY)/(select rest_hr from PREDICTION_PARAMETERS)))*0.4
) *100,2) as estimated_fatigue_index,
0.0 as predicted_bmq
from training_plan)
View Data
%jdbc(mysql)
select * from prediction
Section 3: Create a prediction model
DISCLAIMER: This model needs to be worked on; the purpose of this article is to establish the principal architecture, not to give the final, most-tuned model, as I plan on improving it. This part uses the Spark interpreter of Zeppelin to vectorize, normalize and train a model.
Create dataframe from MySQL tables:
%spark2
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/beast_mode_db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "training_set").option("user", "bmq_user").option("password", "Be@stM0de").load()
df.show()
%spark2
val target_df = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/beast_mode_db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "prediction").option("user", "bmq_user").option("password", "Be@stM0de").load()
target_df.show()
Vectorize Dataframes
%spark2
val assembler1 = new VectorAssembler().
setInputCols(Array( "fatigue_index","intensity_index")).
setOutputCol("features").
transform(df)
assembler1.show()
%spark2
val assembler2 = new VectorAssembler().
setInputCols(Array( "estimated_fatigue_index","estimated_intensity_index")).
setOutputCol("features").
transform(target_df)
assembler2.show()
Normalize Dataframes
%spark2
val normalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(2.0)
.transform(assembler1)
normalizer.show()
%spark2
val targetNormalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(2.0)
.transform(assembler2)
targetNormalizer.show()
Train and evaluate Model
%spark2
val Array(trainingData, testData) = normalizer.randomSplit(Array(0.7, 0.3))
%spark2
val lr = new LinearRegression()
.setLabelCol("bmq_index")
.setFeaturesCol("normFeatures")
.setMaxIter(10)
.setRegParam(1.0)
.setElasticNetParam(1.0)
val lrModel = lr.fit(trainingData)
lrModel.transform(testData).select("features","normFeatures", "bmq_index", "prediction").show()
%spark2
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
%spark2
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}") %spark2
val targetTable = lrModel.transform(targetNormalizer).select("date", "estimated_intensity_index", "estimated_fatigue_index", "prediction")
targetTable.show()
Section 4: Save the results in a table
Finally, we will take the results of the prediction and put them in a table called BMQ_PREDICTIONS for later use:
Rename the dataframe to match the target column names
%spark2
val newNames = Seq("date", "estimated_intensity_index", "estimated_fatigue_index", "predicted_bmq")
val targetTableRenamed = targetTable.toDF(newNames: _*)
Delete target table if exists
%jdbc(mysql)
drop table if exists BMQ_PREDICTIONS
Write data
%spark2
val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.jdbc.Driver")
prop.setProperty("user", "bmq_user")
prop.setProperty("password", "Be@stM0de")
targetTableRenamed.write.mode("append").jdbc("jdbc:mysql://localhost:3306/beast_mode_db", "BMQ_PREDICTIONS", prop)
View data
%jdbc(mysql)
select * from BMQ_PREDICTIONS
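As an optional follow-up, not part of the original notebook: once new BMQ entries land in bmq_index, you can eyeball how the predictions track reality with a small comparison query (assumes the bmq_index table used in Section 2):
%jdbc(mysql)
select p.date, p.predicted_bmq, b.bmq_index as actual_bmq
from BMQ_PREDICTIONS p
join bmq_index b on p.date = b.date
order by p.date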
01-17-2019
06:10 PM
1 Kudo
Here is a self-explanatory tutorial. The official Cloudbreak documentation explains how to retrieve your Cloudbreak token via curl or Python (see here). I have been playing with Node.js on my side, so here is how to do it in Node.js using request:
process.env["NODE_TLS_REJECT_UNAUTHORIZED"] = 0;
var request = require("request");
var options = { method: 'POST',
url: 'https://[YOUR_CB_URL]/identity/oauth/authorize',
qs:
{ response_type: 'token',
client_id: 'cloudbreak_shell',
'scope.0': 'openid',
source: 'login',
redirect_uri: 'http://cloudbreak.shell',
accept: 'application/x-www-form-urlencoded' },
headers:
{ 'Content-Type': 'application/x-www-form-urlencoded',
accept: 'application/x-www-form-urlencoded' },
body: 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PASSWORD]"}' };
request(options, function (error, response, body) {
if (error) throw new Error(error);
const querystring = require('querystring');
console.log(querystring.parse(response.headers['location'])['access_token']);
});
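If you need the token elsewhere in an application, one option is to wrap the same call in a small helper that resolves with the token; a sketch reusing the exact options object defined above:
// hypothetical helper; assumes `request` and the `options` object from the snippet above are in scope
function getCloudbreakToken(options) {
  return new Promise(function (resolve, reject) {
    request(options, function (error, response) {
      if (error) return reject(error);
      const querystring = require('querystring');
      // as above, the token comes back in the `location` response header
      resolve(querystring.parse(response.headers['location'])['access_token']);
    });
  });
}
// usage: getCloudbreakToken(options).then(function (token) { console.log(token); });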
01-04-2019
03:18 PM
3 Kudos
DISCLAIMER: This is a method that is not recommended for any production use, only for development purposes. Only official Cloudbreak versions are supported, and upgrading to a non-supported version could cause unforeseen issues and loss of official Hortonworks support. Moreover, this method only works for upgrades, not downgrades; downgrades would result in data loss.
Recently, I have been playing with Cloudbreak and have needed to upgrade my local version. While this follows the principles detailed in the Hortonworks documentation, I thought I'd share a quick step-by-step guide on how to upgrade to any version of Cloudbreak.
1. Stop Cloudbreak on your machine:
cbd kill
2. Go to https://mvnrepository.com/artifact/com.sequenceiq/cloudbreak and find the version you need (e.g. 2.8.1-rc.48)
3. On the VM where Cloudbreak is running, navigate to the directory where your Profile file is located. For example:
cd /var/lib/cloudbreak-deployment/
4. Run the following commands to download the binary:
export CBD_VERSION=2.8.1-rc.48
curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_${CBD_VERSION}_$(uname)_x86_64.tgz | tar -xz -C /bin cbd
5. Verify the version:
cbd version
6. Regenerate assets:
cbd regenerate
7. Restart Cloudbreak:
cbd start
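As an optional extra check, not part of the original steps, the deployer also ships a built-in diagnostic you can run after the restart; treat the exact checks and output as version-dependent:
cbd doctor    # runs the deployer's built-in deployment checks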
12-21-2018
03:07 PM
3 Kudos
Introduction
Continuing my series on Beast Mode Quotient, let's automate the creation and termination of data-science-ready clusters. As always, since this is step 2 of a series of articles, this tutorial depends on my
previous article.
Pre-Requisites
In order to run this tutorial, you will need to have a Cloudbreak instance available. There are plenty of good tutorials out there; I recommend
this one
Agenda
This tutorial is divided into the following sections:
Section 1: Create a blueprint & recipes to run a minimal data science ephemeral cluster
Section 2: Add blueprint and recipes via Cloudbreak interface
Section 3: Automate cluster launch and terminate clusters
Section 1: Create a blueprint & recipes to run a minimal data science ephemeral cluster
A Cloudbreak blueprint has 3 parts:
Part 1: Blueprint details
Part 2: Services Configuration
Part 3: Host Components configuration
Here are the details of each part for our blueprint.
Blueprint details
This part is fairly simple; we want to run an HDP 3.1 cluster, and I'm naming it bmq-data-science.
"Blueprints": {
"blueprint_name": "bmq-data-science",
"stack_name": "HDP",
"stack_version": "3.1"
}
Host Components configuration
Similarly, I configured one Host with all the components needed for my services
"host_groups": [
{
"name": "master",
"cardinality": "1",
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "NAMENODE"
},
{
"name": "SECONDARY_NAMENODE"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "HISTORYSERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "ZEPPELIN_MASTER"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "DATANODE"
},
{
"name": "HIVE_SERVER"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "NODEMANAGER"
},
{
"name": "MAPREDUCE2_CLIENT"
}
]
}
]
Services Configuration
For our purposes, I want to create a minimum cluster that will run YARN, HDFS, HIVE, SPARK and ZEPPELIN (plus all the necessary compute engines behind it). I therefore configured these services according to the Cloudbreak examples that are available in the default Cloudbreak blueprints:
"configurations": [
{
"yarn-site": {
"properties": {
"yarn.nodemanager.resource.cpu-vcores": "6",
"yarn.nodemanager.resource.memory-mb": "23296",
"yarn.scheduler.maximum-allocation-mb": "23296"
}
}
},
{
"core-site": {
"properties_attributes": {},
"properties": {
"fs.s3a.threads.max": "1000",
"fs.s3a.threads.core": "500",
"fs.s3a.max.total.tasks": "1000",
"fs.s3a.connection.maximum": "1500"
}
}
},
{
"capacity-scheduler": {
"properties": {
"yarn.scheduler.capacity.root.queues": "default",
"yarn.scheduler.capacity.root.capacity": "100",
"yarn.scheduler.capacity.root.maximum-capacity": "100",
"yarn.scheduler.capacity.root.default.capacity": "100",
"yarn.scheduler.capacity.root.default.maximum-capacity": "100"
}
}
},
{
"spark2-defaults": {
"properties_attributes": {},
"properties": {
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://%HOSTGROUP::master%:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EC2.INTERNAL",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://%HOSTGROUP::master%:9083",
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.hadoop.hive.zookeeper.quorum": "%HOSTGROUP::master%:2181"
}
}
},
{
"hive-site": {
"hive.metastore.warehouse.dir": "/apps/hive/warehouse",
"hive.exec.compress.output": "true",
"hive.merge.mapfiles": "true",
"hive.server2.tez.initialize.default.sessions": "true",
"hive.server2.transport.mode": "http",
"hive.metastore.dlm.events": "true",
"hive.metastore.transactional.event.listeners": "org.apache.hive.hcatalog.listener.DbNotificationListener",
"hive.repl.cm.enabled": "true",
"hive.repl.cmrootdir": "/apps/hive/cmroot",
"hive.repl.rootdir": "/apps/hive/repl"
}
},
{
"hdfs-site": {
"properties_attributes": {},
"properties": {
}
}
}
]
You can find the complete blueprint and other recipes on my GitHub, here.
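For orientation, and not as a complete file, the three parts above slot into a single blueprint JSON with the same overall shape as the Edge-to-AI blueprint from my earlier article; an abbreviated sketch:
{
"Blueprints": { "blueprint_name": "bmq-data-science", "stack_name": "HDP", "stack_version": "3.1" },
"configurations": [ ...the services configuration shown above... ],
"host_groups": [ ...the master host group shown above... ],
"settings": [ { "recovery_settings": [ { "recovery_enabled": "false" } ] } ]
}
(The settings block mirrors the one used in the earlier blueprint; adjust as needed.)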
Creating a recipe
Recipes in Cloudbreak are very useful and allow you to run scripts before or after a cluster is launched. In this example, I created a PRE-AMBARI-START recipe that creates the appropriate Postgres and MySQL services on my master box, as well as recreating the DB for my BMQ analysis:
#!/bin/bash
# Cloudbreak 2.7.2 / Ambari 2.7.0 - something installs pgsql95, so remove it first
yum remove -y postgresql95*
# Install pgsql96
yum install -y https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
yum install -y postgresql96-server
yum install -y postgresql96-contrib
/usr/pgsql-9.6/bin/postgresql96-setup initdb
sed -i 's,#port = 5432,port = 5433,g' /var/lib/pgsql/9.6/data/postgresql.conf
systemctl enable postgresql-9.6.service
systemctl start postgresql-9.6.service
yum remove -y mysql57-community*
yum remove -y mysql56-server*
yum remove -y mysql-community*
rm -Rvf /var/lib/mysql
yum install -y epel-release
yum install -y libffi-devel.x86_64
ln -s /usr/lib64/libffi.so.6 /usr/lib64/libffi.so.5
yum install -y mysql-connector-java*
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
if [ "$(cat /etc/system-release | grep -Po Amazon)" == "Amazon" ]; then
yum install -y mysql56-server
service mysqld start
else
yum localinstall -y https://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
systemctl start mysqld.service
fi
chkconfig --add mysqld
chkconfig mysqld on
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-client/lib/mysql-connector-java.jar
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-server2-hive2/lib/mysql-connector-java.jar
mysql --execute="CREATE DATABASE beast_mode_db DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'bmq_user'@'localhost' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="CREATE USER 'bmq_user'@'%' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%' WITH GRANT OPTION"
mysql --execute="FLUSH PRIVILEGES"
mysql --execute="COMMIT"
Section 2: Add blueprint and recipes via Cloudbreak interface
This part is super simple. Follow the User interface to load the files you just created, as depicted below
Adding a blueprint
Adding a recipe
Section 3: Automate cluster launch and terminate clusters
This is where the fun begins. For this part I created two scripts for launching and terminating the ephemeral cluster (these will then be called by the BMQ app). Both scripts rely on the cb CLI, which you can download from your Cloudbreak instance; before running them, point the CLI at your instance (see the sketch below).
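A minimal sketch of that one-time CLI setup; the flag names are from memory of the Cloudbreak 2.x cb CLI, so treat them as assumptions and confirm with cb --help on your version:
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/                      # same location used by the scripts below
cb configure --server https://192.168.56.100 --username pvidal@hortonworks.com   # stores the target Cloudbreak server and user for later cb commands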
Launch Cluster Script
As you will see below, the script is divided into the following parts:
Part 1: Reference the location of the cb CLI for ease of use
Part 2: Dump the contents of my long-lasting cluster into a recipe that will reload them
Part 3: Use the Cloudbreak API to add the recipe to my Cloudbreak instance
Part 4: Launch the cluster via the cb CLI
#!/bin/bash
###############################
# 0. Initializing environment #
###############################
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
###################################################
# 1. Dumping current data and adding it to recipe #
###################################################
rm -rf poci-bmq-data-science.sh >/dev/null 2>&1
echo "mysql -u bmq_user -pBe@stM0de beast_mode_db --execute=\"""$(mysqldump -u bm_user -pHWseftw33# beast_mode_db 2> /dev/null)""\"" >> poci-bmq-data-science.sh
##################################
# 2. Adding recipe to cloudbreak #
##################################
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"pvidal@hortonworks.com","password":"HWseftw33#"}' "https://192.168.56.100/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
echo $TOKEN
ENCODED_RECIPE=$(base64 poci-bmq-data-science.sh)
curl -X DELETE https://192.168.56.100/cb/api/v1/recipes/user/poci-bmq-data-science -H "Authorization: Bearer $TOKEN" -k
curl -X POST https://192.168.56.100/cb/api/v1/recipes/user -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d " {
\"name\": \"poci-bmq-data-science\",
\"description\": \"Recipe loading BMQ data post BMQ cluster launch\",
\"recipeType\": \"POST_CLUSTER_INSTALL\",
\"content\": \"$POST_CLUSTER_INSTALL\"
}" -k
########################
# 3. Launching cluster #
########################
cb cluster create --cli-input-json tp-bmq-data-science.json --name bmq-data-science-$(date +%s)
Terminate Cluster Script
This script is much simpler; it uses cb cli to list the clusters running and terminate them:
#!/bin/bash
###############################
# 0. Initializing environment #
###############################
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
################################################
# 1. Get a list of clusters and terminate them #
################################################
cb cluster list | grep Name | awk -F \" '{print $4}' | while read cluster; do
echo "Terminating ""$cluster""..."
cb cluster delete --name $cluster
done
Conclusion
With this framework, I'm able to launch and terminate clusters in a matter of minutes, as depicted below. The next step will be to model a way to calculate accurate predictions for BMQ!