04-23-2019
06:59 PM
1 Kudo
Introduction
Cloudera/Hortonworks today offers one of the most comprehensive data management platforms, with components covering everything from data flow management to governed, distributed data science workloads. With so many toys to play with, I thought I'd share an easy way to set up a simple cluster that deploys, using Cloudbreak, the following main components on the Azure cloud:
Hortonworks Data Platform 3.1
Hortonworks Data Flow 3.3
Data Platform Search 4.0
Cloudera Data Science Workbench 1.5
Note: This is not a production-ready setup, but merely a first step toward customizing your deployment using the Cloudera toolkit.
Pre-Requisites
An Azure account with permission to assign roles
Cloudbreak 2.9 (you can always set one up on your machine using: https://community.hortonworks.com/articles/194076/using-vagrant-and-virtualbox-to-create-a-local-ins.html)
Tutorial steps
Step 1: Set up Azure credentials
Step 2: Set up blueprint and cluster extensions
Step 3: Create the cluster
Step 1: Set up Azure credentials
Find your Azure subscription and tenant ID
To find your subscription ID, go to the search box and look for "subscription"; you should find it as such:
For the tenant ID, use the Azure AD Directory ID:
Set up your credentials in Cloudbreak
This part is extremely well documented in Cloudbreak's documentation portal: https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-credential-azure/content/cb_create-app-based-credential.html. Note: Because of IT restrictions on my side, I chose to use an app-based credential setup, but if you have enough privileges, Cloudbreak creates the app and assigns roles automagically for you.
Step 2: Set up blueprint and cluster extensions
Blueprint
First, upload the blueprint below:
{
"Blueprints": {
"blueprint_name": "edge-to-ai-3.1",
"stack_name": "HDP",
"stack_version": "3.1"
},
"configurations": [
{
"yarn-site": {
"properties": {
"yarn.nodemanager.resource.cpu-vcores": "6",
"yarn.nodemanager.resource.memory-mb": "60000",
"yarn.scheduler.maximum-allocation-mb": "14"
}
}
},
{
"hdfs-site": {
"properties": {
"dfs.cluster.administrators": "hdfs"
}
}
},
{
"capacity-scheduler": {
"properties": {
"yarn.scheduler.capacity.maximum-am-resource-percent": "0.4",
"yarn.scheduler.capacity.root.capacity": "67",
"yarn.scheduler.capacity.root.default.capacity": "67",
"yarn.scheduler.capacity.root.default.maximum-capacity": "67",
"yarn.scheduler.capacity.root.llap.capacity": "33",
"yarn.scheduler.capacity.root.llap.maximum-capacity": "33",
"yarn.scheduler.capacity.root.queues": "default,llap"
}
}
},
{
"ranger-hive-audit": {
"properties": {
"xasecure.audit.destination.hdfs.file.rollover.sec": "300"
},
"properties_attributes": {}
}
},
{
"hive-site": {
"hive.exec.compress.output": "true",
"hive.merge.mapfiles": "true",
"hive.metastore.dlm.events": "true",
"hive.metastore.transactional.event.listeners": "org.apache.hive.hcatalog.listener.DbNotificationListener",
"hive.repl.cm.enabled": "true",
"hive.repl.cmrootdir": "/apps/hive/cmroot",
"hive.repl.rootdir": "/apps/hive/repl",
"hive.server2.tez.initialize.default.sessions": "true",
"hive.server2.transport.mode": "http"
}
},
{
"hive-interactive-env": {
"enable_hive_interactive": "true",
"hive_security_authorization": "Ranger",
"num_llap_nodes": "1",
"num_llap_nodes_for_llap_daemons": "1",
"num_retries_for_checking_llap_status": "50"
}
},
{
"hive-interactive-site": {
"hive.exec.orc.split.strategy": "HYBRID",
"hive.llap.daemon.num.executors": "5",
"hive.metastore.rawstore.impl": "org.apache.hadoop.hive.metastore.cache.CachedStore",
"hive.stats.fetch.bitvector": "true"
}
},
{
"spark2-defaults": {
"properties": {
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://%HOSTGROUP::master1%:9083",
"spark.hadoop.hive.zookeeper.quorum": "{{zookeeper_quorum_hosts}}",
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://{{zookeeper_quorum_hosts}}:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EC2.INTERNAL"
},
"properties_attributes": {}
}
},
{
"gateway-site": {
"properties": {
"gateway.path": "{{cluster_name}}"
},
"properties_attributes": {}
}
},
{
"admin-topology": {
"properties": {
"content": "\n \n\n \n\n \n authentication\n ShiroProvider\n true\n \n sessionTimeout\n 30\n \n \n main.ldapRealm\n org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm\n \n \n main.ldapRealm.userDnTemplate\n uid={0},ou=people,dc=hadoop,dc=apache,dc=org\n \n \n main.ldapRealm.contextFactory.url\n ldap://54.219.163.9:33389\n \n \n main.ldapRealm.contextFactory.authenticationMechanism\n simple\n \n \n urls./**\n authcBasic\n \n \n\n \n authorization\n AclsAuthz\n true\n \n \n\n \n KNOX\n \n\n "
},
"properties_attributes": {}
}
},
{
"ranger-admin-site": {
"properties": {
"ranger.jpa.jdbc.url": "jdbc:postgresql://localhost:5432/ranger"
},
"properties_attributes": {}
}
},
{
"ranger-env": {
"properties": {
"is_solrCloud_enabled": "true",
"keyadmin_user_password": "{{{ general.password }}}",
"ranger-atlas-plugin-enabled": "Yes",
"ranger-hdfs-plugin-enabled": "Yes",
"ranger-hive-plugin-enabled": "Yes",
"ranger-knox-plugin-enabled": "Yes",
"ranger_admin_password": "{{{ general.password }}}",
"rangertagsync_user_password": "{{{ general.password }}}",
"rangerusersync_user_password": "{{{ general.password }}}"
},
"properties_attributes": {}
}
},
{
"ams-hbase-site": {
"properties": {
"hbase.cluster.distributed": "true",
"hbase.rootdir": "file:///hadoopfs/fs1/metrics/hbase/data"
}
}
},
{
"atlas-env": {
"properties": {
"atlas.admin.password": "admin",
"atlas_solr_shards": "2",
"content": "\n # The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path\n export JAVA_HOME={{java64_home}}\n\n # any additional java opts you want to set. This will apply to both client and server operations\n {% if security_enabled %}\n export ATLAS_OPTS=\"{{metadata_opts}} -Djava.security.auth.login.config={{atlas_jaas_file}}\"\n {% else %}\n export ATLAS_OPTS=\"{{metadata_opts}}\"\n {% endif %}\n\n # metadata configuration directory\n export ATLAS_CONF={{conf_dir}}\n\n # Where log files are stored. Defatult is logs directory under the base install location\n export ATLAS_LOG_DIR={{log_dir}}\n\n # additional classpath entries\n export ATLASCPPATH={{metadata_classpath}}\n\n # data dir\n export ATLAS_DATA_DIR={{data_dir}}\n\n # pid dir\n export ATLAS_PID_DIR={{pid_dir}}\n\n # hbase conf dir\n export HBASE_CONF_DIR=\"/etc/ams-hbase/conf\"\n\n # Where do you want to expand the war file. By Default it is in /server/webapp dir under the base install dir.\n export ATLAS_EXPANDED_WEBAPP_DIR={{expanded_war_dir}}\n export ATLAS_SERVER_OPTS=\"-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$ATLAS_LOG_DIR/atlas_server.hprof -Xloggc:$ATLAS_LOG_DIRgc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps\"\n {% if java_version == 8 %}\n export ATLAS_SERVER_HEAP=\"-Xms{{atlas_server_xmx}}m -Xmx{{atlas_server_xmx}}m -XX:MaxNewSize={{atlas_server_max_new_size}}m -XX:MetaspaceSize=100m -XX:MaxMetaspaceSize=512m\"\n {% else %}\n export ATLAS_SERVER_HEAP=\"-Xms{{atlas_server_xmx}}m -Xmx{{atlas_server_xmx}}m -XX:MaxNewSize={{atlas_server_max_new_size}}m -XX:MaxPermSize=512m\"\n {% endif %}\n",
"hbase_conf_dir": "/etc/ams-hbase/conf"
}
}
},
{
"kafka-broker": {
"properties": {
"default.replication.factor": "1",
"offsets.topic.replication.factor": "1"
},
"properties_attributes": {}
}
},
{
"hbase-env": {
"properties": {
"phoenix_sql_enabled": "true"
},
"properties_attributes": {}
}
},
{
"druid-common": {
"properties": {
"druid.extensions.loadList": "[\"postgresql-metadata-storage\", \"druid-datasketches\", \"druid-hdfs-storage\", \"druid-kafka-indexing-service\", \"ambari-metrics-emitter\"]",
"druid.indexer.logs.directory": "/user/druid/logs",
"druid.indexer.logs.type": "hdfs",
"druid.metadata.storage.connector.connectURI": "jdbc:postgresql://%HOSTGROUP::master1%:5432/druid",
"druid.metadata.storage.connector.password": "druid",
"druid.metadata.storage.connector.user": "druid",
"druid.metadata.storage.type": "postgresql",
"druid.selectors.indexing.serviceName": "druid/overlord",
"druid.storage.storageDirectory": "/user/druid/data",
"druid.storage.type": "hdfs"
},
"properties_attributes": {}
}
},
{
"druid-overlord": {
"properties": {
"druid.indexer.runner.type": "remote",
"druid.indexer.storage.type": "metadata",
"druid.port": "8090",
"druid.service": "druid/overlord"
},
"properties_attributes": {}
}
},
{
"druid-middlemanager": {
"properties": {
"druid.indexer.runner.javaOpts": "-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dhdp.version={{stack_version}} -Dhadoop.mapreduce.job.classloader=true",
"druid.port": "8091",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.service": "druid/middlemanager",
"druid.worker.capacity": "3"
},
"properties_attributes": {}
}
},
{
"druid-coordinator": {
"properties": {
"druid.coordinator.merge.on": "false",
"druid.port": "8081"
},
"properties_attributes": {}
}
},
{
"druid-historical": {
"properties": {
"druid.port": "8083",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.server.maxSize": "300000000000",
"druid.service": "druid/historical"
},
"properties_attributes": {}
}
},
{
"druid-broker": {
"properties": {
"druid.broker.http.numConnections": "5",
"druid.cache.type": "local",
"druid.port": "8082",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.service": "druid/broker"
},
"properties_attributes": {}
}
},
{
"druid-router": {
"properties": {},
"properties_attributes": {}
}
},
{
"superset": {
"properties": {
"SECRET_KEY": "{{{ general.password }}}",
"SUPERSET_DATABASE_TYPE": "sqlite"
},
"properties_attributes": {}
}
},
{
"nifi-ambari-config": {
"nifi.max_mem": "4g",
"nifi.security.encrypt.configuration.password": "{{{ general.password }}}",
"nifi.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"nifi-properties": {
"nifi.security.user.login.identity.provider": "",
"nifi.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"nifi-registry-ambari-config": {
"nifi.registry.security.encrypt.configuration.password": "{{{ general.password }}}"
}
},
{
"nifi-registry-properties": {
"nifi.registry.db.password": "{{{ general.password }}}",
"nifi.registry.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"registry-common": {
"properties": {
"adminPort": "7789",
"database_name": "registry",
"jar.storage": "/hdf/registry",
"jar.storage.hdfs.url": "hdfs://localhost:9090",
"jar.storage.type": "local",
"port": "7788",
"registry.schema.cache.expiry.interval": "3600",
"registry.schema.cache.size": "10000",
"registry.storage.connector.connectURI": "jdbc:mysql://localhost:3306/registry",
"registry.storage.connector.password": "registry",
"registry.storage.connector.user": "registry",
"registry.storage.query.timeout": "30",
"registry.storage.type": "mysql"
},
"properties_attributes": {}
}
},
{
"hbase-site": {
"properties": {
"hbase.bucketcache.combinedcache.enabled": "true",
"hbase.bucketcache.ioengine": "file:/hbase/cache",
"hbase.bucketcache.size": "24000",
"hbase.defaults.for.version.skip": "true",
"hbase.hregion.max.filesize": "21474836480",
"hbase.hregion.memstore.flush.size": "536870912",
"hbase.region.server.rpc.scheduler.factory.class": "org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory",
"hbase.regionserver.global.memstore.size": "0.4",
"hbase.regionserver.handler.count": "60",
"hbase.regionserver.wal.codec": "org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec",
"hbase.rootdir": "/apps/hbase",
"hbase.rpc.controllerfactory.class": "org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory",
"hbase.rs.cacheblocksonwrite": "true",
"hfile.block.bloom.cacheonwrite": "true",
"hfile.block.cache.size": "0.4",
"hfile.block.index.cacheonwrite": "true",
"phoenix.functions.allowUserDefinedFunctions": "true",
"phoenix.query.timeoutMs": "60000"
},
"properties_attributes": {}
}
},
{
"hbase-env": {
"properties": {
"hbase_java_io_tmpdir": "/tmp",
"hbase_log_dir": "/var/log/hbase",
"hbase_master_heapsize": "1024m",
"hbase_pid_dir": "/var/run/hbase",
"hbase_regionserver_heapsize": "16384m",
"hbase_regionserver_shutdown_timeout": "30",
"hbase_regionserver_xmn_max": "16384",
"hbase_regionserver_xmn_ratio": "0.2",
"hbase_user": "hbase",
"hbase_user_nofile_limit": "32000",
"hbase_user_nproc_limit": "16000",
"phoenix_sql_enabled": "true"
},
"properties_attributes": {}
}
}
],
"host_groups": [
{
"cardinality": "1",
"components": [
{
"name": "RANGER_TAGSYNC"
},
{
"name": "RANGER_USERSYNC"
},
{
"name": "RANGER_ADMIN"
},
{
"name": "KNOX_GATEWAY"
},
{
"name": "HIVE_SERVER"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "DRUID_OVERLORD"
},
{
"name": "DRUID_COORDINATOR"
},
{
"name": "DRUID_ROUTER"
},
{
"name": "DRUID_BROKER"
},
{
"name": "SECONDARY_NAMENODE"
},
{
"name": "HISTORYSERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "REGISTRY_SERVER"
},
{
"name": "NIFI_REGISTRY_MASTER"
},
{
"name": "DATANODE"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "INFRA_SOLR_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "master1"
},
{
"cardinality": "1+",
"components": [
{
"name": "NODEMANAGER"
},
{
"name": "DATANODE"
},
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "NIFI_MASTER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "worker"
},
{
"cardinality": "1",
"components": [
{
"name": "ATLAS_SERVER"
},
{
"name": "HBASE_MASTER"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "DRUID_HISTORICAL"
},
{
"name": "DRUID_MIDDLEMANAGER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "DATANODE"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "LOGSEARCH_SERVER"
},
{
"name": "NAMENODE"
},
{
"name": "SUPERSET"
},
{
"name": "NIFI_CA"
},
{
"name": "INFRA_SOLR"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "HBASE_MASTER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "master2"
},
{
"name": "cdsw_worker",
"cardinality": "1+",
"components": [
{
"name": "SPARK2_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "NODEMANAGER"
},
{
"name": "DATANODE"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "NIFI_MASTER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "TEZ_CLIENT"
}
]
}
],
"settings": [
{
"recovery_settings": [
{
"recovery_enabled": "false"
}
]
}
]
}
Recipes
Pre-Ambari-start recipe to set up the metastores:
#!/usr/bin/env bash
# Initialize MetaStores
yum install -y https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
yum install -y postgresql96-server
yum install -y postgresql96-contrib
/usr/pgsql-9.6/bin/postgresql96-setup initdb
sed -i 's,#port = 5432,port = 5433,g' /var/lib/pgsql/9.6/data/postgresql.conf
echo '' > /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'local all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid 0.0.0.0/0 trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid ::/0 trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
systemctl enable postgresql-9.6.service
systemctl start postgresql-9.6.service
echo "CREATE DATABASE streamsmsgmgr;" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "CREATE USER streamsmsgmgr WITH PASSWORD 'streamsmsgmgr';" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "GRANT ALL PRIVILEGES ON DATABASE streamsmsgmgr TO streamsmsgmgr;" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "CREATE DATABASE druid;" | sudo -u postgres psql -U postgres
echo "CREATE DATABASE ranger;" | sudo -u postgres psql -U postgres
echo "CREATE DATABASE registry;" | sudo -u postgres psql -U postgres
echo "CREATE USER druid WITH PASSWORD 'druid';" | sudo -u postgres psql -U postgres
echo "CREATE USER registry WITH PASSWORD 'registry';" | sudo -u postgres psql -U postgres
echo "CREATE USER rangerdba WITH PASSWORD 'rangerdba';" | sudo -u postgres psql -U postgres
echo "CREATE USER rangeradmin WITH PASSWORD 'ranger';" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE druid TO druid;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE registry TO registry;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE ranger TO rangerdba;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE ranger TO rangeradmin;" | sudo -u postgres psql -U postgres
#ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar
if [[ $(cat /etc/system-release|grep -Po Amazon) == "Amazon" ]]; then
echo '' > /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'local all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry 0.0.0.0/0 trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry ::/0 trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
sudo -u postgres /usr/pgsql-9.5/bin/pg_ctl -D /var/lib/pgsql/9.5/data/ reload
else
echo '' > /var/lib/pgsql/data/pg_hba.conf
echo 'local all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry 0.0.0.0/0 trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry ::/0 trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/data/pg_hba.conf
sudo -u postgres pg_ctl -D /var/lib/pgsql/data/ reload
fi
yum remove -y mysql57-community*
yum remove -y mysql56-server*
yum remove -y mysql-community*
rm -Rvf /var/lib/mysql
yum install -y epel-release
yum install -y libffi-devel.x86_64
ln -s /usr/lib64/libffi.so.6 /usr/lib64/libffi.so.5
yum install -y mysql-connector-java*
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
if [ $(cat /etc/system-release|grep -Po Amazon) == Amazon ]; then
yum install -y mysql56-server
service mysqld start
else
yum localinstall -y https://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
systemctl start mysqld.service
fi
chkconfig --add mysqld
chkconfig mysqld on
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-client/lib/mysql-connector-java.jar
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-server2-hive2/lib/mysql-connector-java.jar
mysql --execute="CREATE DATABASE druid DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE registry DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE streamline DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE streamsmsgmgr DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'das'@'localhost' IDENTIFIED BY 'dasuser'"
mysql --execute="CREATE USER 'das'@'%' IDENTIFIED BY 'dasuser'"
mysql --execute="CREATE USER 'ranger'@'localhost' IDENTIFIED BY 'ranger'"
mysql --execute="CREATE USER 'ranger'@'%' IDENTIFIED BY 'ranger'"
mysql --execute="CREATE USER 'rangerdba'@'localhost' IDENTIFIED BY 'rangerdba'"
mysql --execute="CREATE USER 'rangerdba'@'%' IDENTIFIED BY 'rangerdba'"
mysql --execute="CREATE USER 'registry'@'localhost' IDENTIFIED BY 'registry'"
mysql --execute="CREATE USER 'registry'@'%' IDENTIFIED BY 'registry'"
mysql --execute="CREATE USER 'streamsmsgmgr'@'localhost' IDENTIFIED BY 'streamsmsgmgr'"
mysql --execute="CREATE USER 'streamsmsgmgr'@'%' IDENTIFIED BY 'streamsmsgmgr'"
mysql --execute="CREATE USER 'druid'@'%' IDENTIFIED BY 'druid'"
mysql --execute="CREATE USER 'streamline'@'%' IDENTIFIED BY 'streamline'"
mysql --execute="CREATE USER 'streamline'@'localhost' IDENTIFIED BY 'streamline'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON streamline.* TO 'streamline'@'%' WITH GRANT OPTION"
mysql --execute="CREATE DATABASE beast_mode_db DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'bmq_user'@'localhost' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="CREATE USER 'bmq_user'@'%' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%' WITH GRANT OPTION"
mysql --execute="FLUSH PRIVILEGES"
mysql --execute="COMMIT"
#remount tmpfs to ensure NOEXEC is disabled
if grep -Eq '^[^ ]+ /tmp [^ ]+ ([^ ]*,)?noexec[, ]' /proc/mounts; then
echo "/tmp found as noexec, remounting..."
mount -o remount,size=10G /tmp
mount -o remount,exec /tmp
else
echo "/tmp not found as noexec, skipping..."
fi
Pre-Ambari-start recipe to grow the root volume for the CDSW worker:
#!/usr/bin/env bash
# WARNING: This script is only for RHEL7 on Azure
# growing the /dev/sda2 partition
sed -e 's/\s*\([\+0-9a-zA-Z]*\).*/\1/' << EOF | fdisk /dev/sda
d # delete
2 # delete partition 2
n # new
p # partition
2 # partition 2
# default
# default
w # write the partition table
q # and we're done
EOF
reboot
Post-cluster-install recipe to set up CDSW:
#!/usr/bin/env bash
# WARNING: This script is only for RHEL7 on Azure
# growing the /dev/sda2 partition
xfs_growfs /dev/sda2
# Some of these installs may be unnecessary but are included for completeness against the documentation
yum -y install nfs-utils libseccomp lvm2 bridge-utils libtool-ltdl ebtables rsync policycoreutils-python ntp bind-utils nmap-ncat openssl e2fsprogs redhat-lsb-core socat selinux-policy-base selinux-policy-targeted
# CDSW wants a pristine IPTables setup
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t mangle -F
iptables -F
iptables -X
# set java_home on centos7
#echo 'export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")' >> /etc/profile
#export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")
echo 'export JAVA_HOME=/usr/lib/jvm/java' >> /etc/profile
export JAVA_HOME='/usr/lib/jvm/java'
# Fetch public IP
export MASTER_IP=$(hostname --ip-address)
# Fetch public FQDN for Domain
export DOMAIN=$(curl https://ipv4.icanhazip.com)
cd /hadoopfs/
mkdir cdsw
# Install CDSW
#wget -q --no-check-certificate https://s3.eu-west-2.amazonaws.com/whoville/v2/temp.blob
#mv temp.blob cloudera-data-science-workbench-1.5.0.818361-1.el7.centos.x86_64.rpm
wget -q https://archive.cloudera.com/cdsw1/1.5.0/redhat7/yum/RPMS/x86_64/cloudera-data-science-workbench-1.5.0.849870-1.el7.centos.x86_64.rpm
yum install -y cloudera-data-science-workbench-1.5.0.849870-1.el7.centos.x86_64.rpm
# Install Anaconda
curl -Ok https://repo.anaconda.com/archive/Anaconda2-5.2.0-Linux-x86_64.sh
chmod +x ./Anaconda2-5.2.0-Linux-x86_64.sh
./Anaconda2-5.2.0-Linux-x86_64.sh -b -p /anaconda
# create unix user
useradd tutorial
echo "tutorial-password" | passwd --stdin tutorial
su - hdfs -c 'hdfs dfs -mkdir /user/tutorial'
su - hdfs -c 'hdfs dfs -chown tutorial:hdfs /user/tutorial'
# CDSW Setup
sed -i "s@MASTER_IP=\"\"@MASTER_IP=\"${MASTER_IP}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@JAVA_HOME=\"/usr/java/default\"@JAVA_HOME=\"$(echo ${JAVA_HOME})\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DOMAIN=\"cdsw.company.com\"@DOMAIN=\"${DOMAIN}.xip.io\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DOCKER_BLOCK_DEVICES=\"\"@DOCKER_BLOCK_DEVICES=\"${DOCKER_BLOCK}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@APPLICATION_BLOCK_DEVICE=\"\"@APPLICATION_BLOCK_DEVICE=\"${APP_BLOCK}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DISTRO=\"\"@DISTRO=\"HDP\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@ANACONDA_DIR=\"\"@ANACONDA_DIR=\"/anaconda/bin\"@g" /etc/cdsw/config/cdsw.conf
# CDSW will break default Amazon DNS on 127.0.0.1:53, so we use a different IP
sed -i "s@nameserver 127.0.0.1@nameserver 169.254.169.253@g" /etc/dhcp/dhclient-enter-hooks
cdsw init
echo "CDSW will shortly be available on ${DOMAIN}"
# after the init, we wait until we are able to create the tutorial user
export respCode=404
while (( $respCode != 201 ))
do
sleep 10
export respCode=$(curl -iX POST http://${DOMAIN}.xip.io/api/v1/users/ -H 'Content-Type: application/json' -d '{"email":"tutorial@tutorial.com","name":"tutorial","username":"tutorial","password":"tutorial-password","type":"user","admin":true}' | grep HTTP | awk '{print $2}')
done
exit 0
Note: this script uses xip.io and hacks into Unix to create the user and Hadoop folders; not a recommendation for production!
Management packs
You will need two management packs for this setup, using the URLs detailed below:
HDF mpack: http://s3.amazonaws.com/dev.hortonworks.com/HDF/centos7/3.x/BUILDS/3.3.1.0-10/tars/hdf_ambari_mp/hdf-ambari-mpack-3.3.1.0-10.tar.gz
Search mpack: http://public-repo-1.hortonworks.com/HDP-SOLR/hdp-solr-ambari-mp/solr-service-mpack-4.0.0.tar.gz
Step 3: Create cluster
This step uses Cloudbreak's Create Cluster wizard and is pretty self-explanatory when following the screenshots, but I will add specific parameters in text form for convenience.
Note: Do not forget to toggle the advanced mode when running the wizard (top of the screen).
General Configuration
Image Settings
Hardware and Storage
Note: Make sure to use 100 GB as the root volume size for CDSW.
Network and Availability
Cloud Storage
Cluster Extensions
External Sources
Gateway Configuration
Network Security Groups
Security
Result
After the cluster is created, you should have access to the following screen in Cloudbreak:
You can now access Ambari via the link provided, and CDSW using http://[CDSW_WORKER_PUBLIC_IP].xip.io
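Before opening the browser, a quick sanity check on the CDSW worker tells you whether the post-install recipe actually brought CDSW up. Below is a minimal sketch, assuming the cdsw CLI installed by the RPM above and the same xip.io domain that the recipe writes into cdsw.conf:
#!/usr/bin/env bash
# Minimal post-install check that CDSW is up
cdsw status
# Poll the web UI until it answers; DOMAIN is derived the same way as in the recipe
DOMAIN=$(curl -s https://ipv4.icanhazip.com).xip.io
until curl -s -o /dev/null -w "%{http_code}" "http://${DOMAIN}" | grep -qE "200|302"; do
  echo "CDSW not answering yet on http://${DOMAIN}, retrying in 30s..."
  sleep 30
done
echo "CDSW is up on http://${DOMAIN}"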
06-13-2019
09:41 AM
Where can I get both of the imported files?
02-13-2019
09:24 PM
4 Kudos
I have been playing quite a bit with CDSW lately. Here is a quick article on how to set up a CDSW project in Scala that connects to an external RDBMS.
Step 1: Create a new CDSW project
Using the CDSW UI, create a new Scala project.
Step 2: Reference the external jar in your spark-defaults.conf
Open your project and edit your spark-defaults.conf to add an external jar:
spark.jars=http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar
Step 3: Create a simple Scala file to connect to the DB
Create a new file and add this code to it:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://[YOUR_SERVER_IP]:3306/[YOUR_DB]").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "[YOUR_TABLE]").option("user", "[YOUR_USER]").option("password", "[YOUR_PWD]").load()
df.show()
Step 4: Run your application
Launch a session and run your code:
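Before launching the session, it can save a round of debugging to confirm from a CDSW terminal that the driver jar URL resolves and that the MySQL host is reachable. A quick sketch using the same placeholders as the Scala snippet (nc is assumed to be available on the node):
# Check that the JDBC driver jar referenced in spark.jars can be fetched
curl -sIL http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar | head -n 1
# Check that the MySQL server answers on its port
nc -zv [YOUR_SERVER_IP] 3306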
01-31-2019
06:16 PM
4 Kudos
Introduction
Finally, we get to put all this together! This is less of a tutorial and more of a showcase of the capabilities of this solution. Please refer to the previous articles to build this on your own.
The meat of this article is showcasing the React/NodeJS application you can build to communicate with Nifi, Cloudbreak and your ephemeral cluster running Zeppelin and Spark.
Architecture
Below is a high-level architecture of the solution:
You can find all the code for this here: https://github.com/paulvid/bmq-app
The code includes:
Latest Zeppelin Prediction Notebook
All Nifi templates
All Blueprints/Recipes for Cloudbreak
The source code of the app
Agenda
This article is divided into the following sections:
Section 1: Monitoring your current data with Nifi
Section 2: Launching an ephemeral cluster with Cloudbreak
Section 3: Running a prediction model with Spark Zeppelin
Section 1: Monitoring your current data with Nifi
Section 2: Launching an ephemeral cluster with Cloudbreak
Section 3: Running a prediction model with Spark Zeppelin
01-31-2019
05:13 PM
3 Kudos
Introduction
It's been a while, but I'm finally finishing this series on Beast Mode Quotient.
This article is fairly straightforward, as we look at how to build the infrastructure for model training using Zeppelin and Spark.
Pre-Requisites
In order to run this install, you will have to have deployed a BMQ ephemeral cluster as detailed in this article and this repository.
Moreover, you will have to have gathered data on your fitness as explained in part 1 of this series.
Based on this data, I created 3 indexes:
Intensity Index: How intense was the workout I did that day (based on distance, pace and elevation)
Fatigue Index: How much rest I had on that day (based on hours of sleep and rest heart rate)
BMQ Index: How my BMQ fares compared to the max BMQ (5)
You can then agglomerate these 3 indexes, having BMQ and Fatigue on the same day correlating to the Intensity Index of the previous day. All data will be shared in the last article of the series.
You will also notice that the model uses parameterized sleep and rest HR. The whole flow is to be revealed in part 4 🙂
Agenda
This tutorial is divided into the following sections:
Section 1: Create a mysql interpreter for JDBC in Zeppelin
Section 2: Create training set for BMQ prediction
Section 3: Create a prediction model
Section 4: Save the results in a table Section 1: Create a mysql interpreter for JDBC in Zeppelin
Log in to Zeppelin, then go to the top-right corner > Admin > Interpreter and edit the jdbc interpreter. Add the following parameters:
mysql.driver: com.mysql.jdbc.Driver
mysql.url: jdbc:mysql://localhost:3306/beast_mode_db
mysql.user: bmq_user
mysql.password: Be@stM0de
Add artifact: mysql:mysql-connector-java:5.1.38
Restart the interpreter and you should be good to go.
Section 2: Create a training set for BMQ prediction
Create a new note called BMQ Predictions and add the following code, using the jdbc interpreter you just built.
Delete existing training tables, if any:
%jdbc(mysql)
drop table if exists training_set
Create the training set based on the fatigue and intensity indexes:
%jdbc(mysql)
create table training_set as (
select @rowid:=@rowid+1 as rowid, bmq_index.date, bmq_index, fatigue_index, intensity_index
from bmq_index, fatigue_index, intensity_index, (select @rowid:=0) as init
where bmq_index.date = fatigue_index.date
and date_sub(bmq_index.date, INTERVAL 1 DAY) = intensity_index.date
order by bmq_index.date asc)
View Data
%jdbc(mysql)
select * from training_set
Delete existing prediction tables, if any:
%jdbc(mysql)
drop table if exists prediction
Create a table we want to apply the algo against:
%jdbc(mysql)
create table prediction as (
select date(training_date) as date, estimated_intensity_index,
round((
(1-((select (sleep_hours*60) from PREDICTION_PARAMETERS)/(select max(TOTAL_MINUTES_ASLEEP) from SLEEP_HISTORY)))*0.6 +
(1-((select min(REST_HR) from HEALTH_HISTORY)/(select rest_hr from PREDICTION_PARAMETERS)))*0.4
) *100,2) as estimated_fatigue_index,
0.0 as predicted_bmq
from training_plan)
View Data
%jdbc(mysql)
select * from prediction
Section 3: Create a prediction model
DISCLAIMER: This model needs work; the purpose of this article is to establish the overall architecture, not to give the final, most tuned model, as I plan on improving it.
This part uses the Spark interpreter of Zeppelin to vectorize, normalize and train a model.
Create dataframes from the MySQL tables:
%spark2
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/beast_mode_db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "training_set").option("user", "bmq_user").option("password", "Be@stM0de").load()
df.show()
%spark2
val target_df = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/beast_mode_db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "prediction").option("user", "bmq_user").option("password", "Be@stM0de").load()
target_df.show()
Vectorize Dataframes
%spark2
val assembler1 = new VectorAssembler().
setInputCols(Array( "fatigue_index","intensity_index")).
setOutputCol("features").
transform(df)
assembler1.show()
%spark2
val assembler2 = new VectorAssembler().
setInputCols(Array( "estimated_fatigue_index","estimated_intensity_index")).
setOutputCol("features").
transform(target_df)
assembler2.show()
Normalize Dataframes
%spark2
val normalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(2.0)
.transform(assembler1)
normalizer.show()
%spark2
val targetNormalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(2.0)
.transform(assembler2)
targetNormalizer.show()
Train and evaluate the Model
%spark2
val Array(trainingData, testData) = normalizer.randomSplit(Array(0.7, 0.3))
%spark2
val lr = new LinearRegression()
.setLabelCol("bmq_index")
.setFeaturesCol("normFeatures")
.setMaxIter(10)
.setRegParam(1.0)
.setElasticNetParam(1.0)
val lrModel = lr.fit(trainingData)
lrModel.transform(testData).select("features","normFeatures", "bmq_index", "prediction").show()
%spark2
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}") %spark2
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}") %spark2
val targetTable = lrModel.transform(targetNormalizer).select("date", "estimated_intensity_index", "estimated_fatigue_index", "prediction")
targetTable.show()
Section 4: Save the results in a table
Finally, we will take the results of the prediction and put them in a table called BMQ_PREDICTIONS for later use.
Rename the dataframe to match the target column names:
%spark2
val newNames = Seq("date", "estimated_intensity_index", "estimated_fatigue_index", "predicted_bmq")
val targetTableRenamed = targetTable.toDF(newNames: _*)
Delete the target table if it exists:
%jdbc(mysql)
drop table if exists BMQ_PREDICTIONS
Write data
%spark2
val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.jdbc.Driver")
prop.setProperty("user", "bmq_user")
prop.setProperty("password", "Be@stM0de")
targetTableRenamed.write.mode("append").jdbc("jdbc:mysql://localhost:3306/beast_mode_db", "BMQ_PREDICTIONS", prop)
View data
%jdbc(mysql)
select * from BMQ_PREDICTIONS
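If you want to spot-check the persisted predictions outside of Zeppelin, the same query can be run with the mysql client on the database host. A quick sketch, reusing the bmq_user credentials created by the cluster recipe:
# Inspect the predictions written by the Spark job
mysql -u bmq_user -p'Be@stM0de' beast_mode_db --execute="SELECT * FROM BMQ_PREDICTIONS LIMIT 10"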
01-17-2019
06:10 PM
1 Kudo
Here is a self-explanatory tutorial. The official Cloudbreak documentation explains how to retrieve your Cloudbreak token via curl or Python (see here). I have been playing with Node.js on my side, so here is how to do it in Node.js using request:
process.env["NODE_TLS_REJECT_UNAUTHORIZED"] = 0;
var request = require("request");
var options = { method: 'POST',
url: 'https://[YOUR_CB_URL]/identity/oauth/authorize',
qs:
{ response_type: 'token',
client_id: 'cloudbreak_shell',
'scope.0': 'openid',
source: 'login',
redirect_uri: 'http://cloudbreak.shell',
accept: 'application/x-www-form-urlencoded' },
headers:
{ 'Content-Type': 'application/x-www-form-urlencoded',
accept: 'application/x-www-form-urlencoded' },
body: 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PASSWORD]"}' };
request(options, function (error, response, body) {
if (error) throw new Error(error);
const querystring = require('querystring');
console.log(querystring.parse(response.headers['location'])['access_token']);
});
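For reference, the exact same token retrieval can be done in one line with curl; this is the call described in the official documentation and used elsewhere in these articles:
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PASSWORD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
echo $TOKEN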
01-04-2019
03:18 PM
3 Kudos
DISCLAIMER: This is a method that is not recommended for any production use, only for development purposes. Only Cloudbreak official versions are supported, and upgrading to a non-supported version could cause unforeseen issues, and loss of official Hortonworks support. Moreover, this method only works for upgrades, not downgrades; downgrades would result in data loss.
Recently, I have been playing with Cloudbreak and have been needing to upgrade my local version. While it follows the principles detailed in Hortonworks documentation, I thought I'd share a quick step by step guide on how to upgrade to any version of Cloudbreak.
1. Stop Cloudbreak on your machine:
cbd kill
2. Go to https://mvnrepository.com/artifact/com.sequenceiq/cloudbreak and find the version you need (e.g. 2.8.1-rc.48)
3. On the VM where Cloudbreak is running, navigate to the directory where your Profile file is located. For example:
cd /var/lib/cloudbreak-deployment/
4. Run the following commands to download the binary:
export CBD_VERSION=2.8.1-rc.48
curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_${CBD_VERSION}_$(uname)_x86_64.tgz | tar -xz -C /bin cbd
5. Verify the version:
cbd version
6. Regenerate assets:
cbd regenerate
7. Restart Cloudbreak:
cbd start
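Put together, the whole upgrade fits in a short script. A sketch below, assuming the same Profile directory used above (development clusters only):
#!/usr/bin/env bash
# Sketch: upgrade a local Cloudbreak deployer to a chosen version
export CBD_VERSION=2.8.1-rc.48
cd /var/lib/cloudbreak-deployment/
cbd kill
curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_${CBD_VERSION}_$(uname)_x86_64.tgz | tar -xz -C /bin cbd
cbd version
cbd regenerate
cbd start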
12-21-2018
03:07 PM
3 Kudos
Introduction
Continuing my series on Beast Mode Quotient, let's automate the creation and termination of data-science-ready clusters. As always, since this is step 2 of a series of articles, this tutorial depends on my
previous article.
Pre-Requisites
In order to run this tutorial you will need to have a CB instance available. There are plenty of good tutorials out there; I recommend
this one
Agenda
This tutorial is divided into the following sections:
Section 1: Create a blueprint & recipes to run a minimal data science ephemeral cluster
Section 2: Add blueprint and recipes via Cloudbreak interface
Section 3: Automate cluster launch and terminate clusters
Section 1: Create a blueprint & recipes to run a minimal data science ephemeral cluster
A Cloudbreak blueprint has 3 parts:
Part 1: Blueprint details
Part 2: Services Configuration
Part 3: Host Components configuration
Here are the details of each part for our blueprint.
Blueprint details
This part is fairly simple; we want to run an HDP 3.1 cluster, and I'm naming it bmq-data-science.
"Blueprints": {
"blueprint_name": "bmq-data-science",
"stack_name": "HDP",
"stack_version": "3.1"
}
Host Components configuration
Similarly, I configured one Host with all the components needed for my services
"host_groups": [
{
"name": "master",
"cardinality": "1",
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "NAMENODE"
},
{
"name": "SECONDARY_NAMENODE"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "HISTORYSERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "ZEPPELIN_MASTER"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "DATANODE"
},
{
"name": "HIVE_SERVER"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "NODEMANAGER"
},
{
"name": "MAPREDUCE2_CLIENT"
}
]
}
]
Services Configuration
For our purposes, I want to create a minimum cluster that will run YARN, HDFS, HIVE, SPARK and ZEPPELIN (plus all the necessary compute engines behind it). I therefore configured these services according to the Cloudbreak examples that are available in the default Cloudbreak blueprints:
"configurations": [
{
"yarn-site": {
"properties": {
"yarn.nodemanager.resource.cpu-vcores": "6",
"yarn.nodemanager.resource.memory-mb": "23296",
"yarn.scheduler.maximum-allocation-mb": "23296"
}
}
},
{
"core-site": {
"properties_attributes": {},
"properties": {
"fs.s3a.threads.max": "1000",
"fs.s3a.threads.core": "500",
"fs.s3a.max.total.tasks": "1000",
"fs.s3a.connection.maximum": "1500"
}
}
},
{
"capacity-scheduler": {
"properties": {
"yarn.scheduler.capacity.root.queues": "default",
"yarn.scheduler.capacity.root.capacity": "100",
"yarn.scheduler.capacity.root.maximum-capacity": "100",
"yarn.scheduler.capacity.root.default.capacity": "100",
"yarn.scheduler.capacity.root.default.maximum-capacity": "100"
}
}
},
{
"spark2-defaults": {
"properties_attributes": {},
"properties": {
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://%HOSTGROUP::master%:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EC2.INTERNAL",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://%HOSTGROUP::master%:9083",
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.hadoop.hive.zookeeper.quorum": "%HOSTGROUP::master%:2181"
}
}
},
{
"hive-site": {
"hive.metastore.warehouse.dir": "/apps/hive/warehouse",
"hive.exec.compress.output": "true",
"hive.merge.mapfiles": "true",
"hive.server2.tez.initialize.default.sessions": "true",
"hive.server2.transport.mode": "http",
"hive.metastore.dlm.events": "true",
"hive.metastore.transactional.event.listeners": "org.apache.hive.hcatalog.listener.DbNotificationListener",
"hive.repl.cm.enabled": "true",
"hive.repl.cmrootdir": "/apps/hive/cmroot",
"hive.repl.rootdir": "/apps/hive/repl"
}
},
{
"hdfs-site": {
"properties_attributes": {},
"properties": {
}
}
}
]
You can find the complete blueprint and other recipes on my github, here
Creating a recipe
Recipes in Cloudbreak are very useful and allow you to run scripts before a cluster is launched or after. In this example, I created a PRE-AMBARI-START recipe that creates the appropriate Postgres and MySQL services on my master box, as well as recreating the DB for my BMQ analysis:
#!/bin/bash
# Cloudbreak-2.7.2 / Ambari-2.7.0 - something installs pgsql95
yum remove -y postgresql95*
# Install pgsql96
yum install -y https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
yum install -y postgresql96-server
yum install -y postgresql96-contrib
/usr/pgsql-9.6/bin/postgresql96-setup initdb
sed -i 's,#port = 5432,port = 5433,g' /var/lib/pgsql/9.6/data/postgresql.conf
systemctl enable postgresql-9.6.service
systemctl start postgresql-9.6.service
yum remove -y mysql57-community*
yum remove -y mysql56-server*
yum remove -y mysql-community*
rm -Rvf /var/lib/mysql
yum install -y epel-release
yum install -y libffi-devel.x86_64
ln -s /usr/lib64/libffi.so.6 /usr/lib64/libffi.so.5
yum install -y mysql-connector-java*
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
if [ $(cat /etc/system-release|grep -Po Amazon) == Amazon ]; then
yum install -y mysql56-server
service mysqld start
else
yum localinstall -y https://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
systemctl start mysqld.service
fi
chkconfig --add mysqld
chkconfig mysqld on
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-client/lib/mysql-connector-java.jar
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-server2-hive2/lib/mysql-connector-java.jar
mysql --execute="CREATE DATABASE beast_mode_db DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'bmq_user'@'localhost' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="CREATE USER 'bmq_user'@'%' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%' WITH GRANT OPTION"
mysql --execute="FLUSH PRIVILEGES"
mysql --execute="COMMIT"
Section 2: Add blueprint and recipes via Cloudbreak interface
This part is super simple. Follow the User interface to load the files you just created, as depicted below
Adding a blueprint
Adding a recipe
Section 3: Automate cluster launch and terminate clusters
This is where the fun begins. For this part I created two scripts for launching and terminating the ephemeral cluster (that will then be called by the BMQ app). Both scripts rely on the cb cli that you can download from your CB instance:
Launch Cluster Script
As you will see below, the script is divided into the following parts:
Part 1: Reference the location of the cb-cli for ease of use
Part 2: Dump the content of my long-lasting cluster to a recipe that will load it
Part 3: Use the CB API to add the recipe to my CB instance
Part 4: Launch the cluster via cb-cli
#!/bin/bash
###############################
# 0. Initializing environment #
###############################
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
###################################################
# 1. Dumping current data and adding it to recipe #
###################################################
rm -rf poci-bmq-data-science.sh >/dev/null 2>&1
echo "mysql -u bmq_user -pBe@stM0de beast_mode_db --execute=\"""$(mysqldump -u bm_user -pHWseftw33# beast_mode_db 2> /dev/null)""\"" >> poci-bmq-data-science.sh
##################################
# 2. Adding recipe to cloudbreak #
##################################
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"pvidal@hortonworks.com","password":"HWseftw33#"}' "https://192.168.56.100/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
echo $TOKEN
ENCODED_RECIPE=$(base64 poci-bmq-data-science.sh)
curl -X DELETE https://192.168.56.100/cb/api/v1/recipes/user/poci-bmq-data-science -H "Authorization: Bearer $TOKEN" -k
curl -X POST https://192.168.56.100/cb/api/v1/recipes/user -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d " {
\"name\": \"poci-bmq-data-science\",
\"description\": \"Recipe loading BMQ data post BMQ cluster launch\",
\"recipeType\": \"POST_CLUSTER_INSTALL\",
\"content\": \"$POST_CLUSTER_INSTALL\"
}" -k
########################
# 3. Launching cluster #
########################
cb cluster create --cli-input-json tp-bmq-data-science.json --name bmq-data-science-$(date +%s)
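The create call returns before the cluster is actually usable, so if the BMQ app needs to wait for it, a simple polling loop over the same CLI does the trick. A rough sketch below; it just greps the cb cluster list output, so adapt the pattern to whatever status strings your cb-cli version prints:
#!/bin/bash
# Sketch: block until the freshly launched bmq-data-science cluster reports AVAILABLE
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
until cb cluster list | grep -A 3 "bmq-data-science" | grep -q "AVAILABLE"; do
  echo "Cluster not available yet, checking again in 60s..."
  sleep 60
done
echo "Cluster is available."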
Terminate Cluster Script
This script is much simpler; it uses cb cli to list the clusters running and terminate them:
#!/bin/bash
###############################
# 0. Initializing environment #
###############################
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
################################################
# 1. Get a list of clusters and terminate them #
################################################
cb cluster list | grep Name | awk -F \" '{print $4}' | while read cluster; do
echo "Terminating ""$cluster""..."
cb cluster delete --name $cluster
done
Conclusion
With this framework, I'm able to launch and terminate clusters in a matter of minutes, as depicted below. The next step will be to model a way to calculate accurate predictions for BMQ!
12-10-2018
03:07 PM
11 Kudos
Introduction
Cloudbreak documentation is fairly thorough and can be found on Hortonworks' documentation portal (e.g.
here). However, if you are as lazy as I am, you'll appreciate the few quick examples I'm listing in this article.
First and foremost, you can access the documentation of available APIs on your CB instance by using the following URL
https://[YOUR_CB_URL]/cb/apidocs ; you should see something like this:
Authenticating
Authentication in Cloudbreak is OAuth 2, and getting a token is detailed in the Cloudbreak API documentation.
Here is a simple curl call to get a token:
curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1
GET Example: Listing all blueprints
The biggest trick to the CB API is to know the URL to use and to remember to disable SSL Certificate Verification (using the -k in curl).
Here is an example call that lists all blueprints:
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
curl -X GET https://[YOUR_CB_URL]/cb/api/v1/blueprints/account -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -k
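The response comes back as a single line of JSON, so piping it through a pretty-printer makes it much easier to scan. A small sketch assuming jq is installed (python -m json.tool works just as well):
# Pretty-print the blueprint list; add a filter such as '.[].name' if you only want the names
curl -s -X GET https://[YOUR_CB_URL]/cb/api/v1/blueprints/account -H "Authorization: Bearer $TOKEN" -k | jq .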
POST Example: Adding an mpack
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
curl -X POST https://[YOUR_CB_URL]/cb/api/v1/mpacks/account -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d '{
"name": "hdf-3-2-aws",
"description": "HDF 3.2 Mangement Pack for AWS",
"mpackUrl": "http://public-repo-1.hortonworks.com/HDF/amazonlinux2/3.x/updates/3.2.0.0/tars/hdf_ambari_mp/hdf-ambari-mpack-3.2.0.0-520.tar.gz",
"purge": false,
"force": false
}' -k
POST Example: Uploading a user blueprint
To upload your own blueprints, remember to encode your blueprint in base64
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
ENCODED_BLUEPRINT=$(base64 [YOUR_BP_JSON_LOCATION])
curl -X POST https://[YOUR_CB_URL]/cb/api/v1/blueprints/user -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d " {
\"ambariBlueprint\": \"$ENCODED_BLUEPRINT\",
\"description\": \"Blueprint for HWX Data Science Uploaded\",
\"inputs\": [],
\"tags\": {},
\"name\": \"data-science-workshop-upload\",
\"hostGroupCount\": 1,
\"status\": \"USER_MANAGED\",
\"public\": true
}" -k
11-07-2018
02:58 PM
2 Kudos
Introduction
Before being able to predict one's Beast Mode Quotient, we must find an easy way to ingest data from multiple different sources, in this case flat files and APIs. Luckily for everyone, HDF, and specifically Nifi, is designed to handle multiple data inputs and outputs extremely efficiently, as I will detail in this tutorial.
A few notes about the choices of technologies selected for this tutorial and series of article:
Target - MySQL: I selected MySQL because this is what my BMQ app is going to use, but it could be any target depending on your use-case.
Sources - Fitbit, Strava, MyFitnessPal: There are a bunch of options out there to track health. Fitbit and Strava are the only ones offering a comprehensive, public API (looking at you, Garmin and Apple Health; both have no public APIs and make it difficult to extract data via files, either requiring you to parse .fit files or only exporting the full set of data); as for MyFitnessPal, I could have used the Fitbit API (after syncing with MFP) to get calories data, but I figured it would be nice to demonstrate file ingestion.
Notions you will learn in this tutorial
Here are some high level concepts highlighted in this article, that you can re-apply to any implementation:
Running MySQL queries including create, upsert and selection of date series
Using Nifi processors maintaining state
Configuring schema registry for file format conversion in Nifi (e.g CSV > JSON)
Using Nifi to call OAuth 2 APIs (using refresh token, bearer tokens, etc.)
Implementing and using a distributed map cache in Nifi
Parsing and splitting JSONs in Nifi
Implementing simple math operations using Nifi language Agenda
This tutorial is divided into the following sections:
Section 1: Setting up a MySQL DB to host health and fitness data
Section 2: Creating a Nifi flow to consume MyFitnessPal CSV exports
Section 3: Creating a Nifi flow to consume Fitbit health & sleep data
Section 4: Creating a Nifi flow to consume Strava activity data
Section 1: Setting up a MySQL DB to host health and fitness data
Step 1: Create the MySQL DB and users
Before being able to do anything, you will have to create a MySQL instance. There are a bunch of tutorials out there that explain how to do it; here is one for
centos 7 for instance.
Once you have this set up, connect as
root and run the following database creation script:
DROP DATABASE IF EXISTS `BEAST_MODE_DB`;
CREATE DATABASE `BEAST_MODE_DB`;
CREATE USER 'bm_user'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
GRANT ALL ON BEAST_MODE_DB.* TO 'bm_user'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
Step 2: Create the DB tables
Connect as bm_user and run this script to create the appropriate tables:
-- MySQL dump 10.14 Distrib 5.5.60-MariaDB, for Linux (x86_64)
--
-- Host: localhost Database: BEAST_MODE_DB
-- ------------------------------------------------------
-- Server version 5.5.60-MariaDB
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
USE BEAST_MODE_DB;
--
-- Table structure for table `ACTIVITY_HISTORY`
--
DROP TABLE IF EXISTS `ACTIVITY_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ACTIVITY_HISTORY` (
`ID` bigint(20) NOT NULL,
`START_TIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`DURATION` double DEFAULT NULL,
`DISTANCE` double DEFAULT NULL,
`ELEVATION_GAIN` double DEFAULT NULL,
`AVG_HR` double DEFAULT NULL,
`AVG_PACE` double DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `BMQ_HISTORY`
--
DROP TABLE IF EXISTS `BMQ_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `BMQ_HISTORY` (
`ID` mediumint(9) NOT NULL AUTO_INCREMENT,
`BMQ` int(11) NOT NULL,
`TYPE` char(30) DEFAULT NULL,
`TIME_ENTERED` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=30 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `GENERATOR`
--
DROP TABLE IF EXISTS `GENERATOR`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `GENERATOR` (
`ID` mediumint(9) NOT NULL AUTO_INCREMENT,
`VALUE` varchar(50) DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=161 DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `GENERATOR`
--
LOCK TABLES `GENERATOR` WRITE;
/*!40000 ALTER TABLE `GENERATOR` DISABLE KEYS */;
INSERT INTO `GENERATOR` VALUES (1,'dummy'),(2,'dummy'),(3,'dummy'),(4,'dummy'),(5,'dummy'),(6,'dummy'),(7,'dummy'),(8,'dummy'),(9,'dummy'),(10,'dummy'),(11,'dummy'),(12,'dummy'),(13,'dummy'),(14,'dummy'),(15,'dummy'),(16,'dummy'),(17,'dummy'),(18,'dummy'),(19,'dummy'),(20,'dummy'),(21,'dummy'),(22,'dummy'),(23,'dummy'),(24,'dummy'),(25,'dummy'),(26,'dummy'),(27,'dummy'),(28,'dummy'),(29,'dummy'),(30,'dummy'),(31,'dummy'),(32,'dummy'),(33,'dummy'),(34,'dummy'),(35,'dummy'),(36,'dummy'),(37,'dummy'),(38,'dummy'),(39,'dummy'),(40,'dummy'),(41,'dummy'),(42,'dummy'),(43,'dummy'),(44,'dummy'),(45,'dummy'),(46,'dummy'),(47,'dummy'),(48,'dummy'),(49,'dummy'),(50,'dummy'),(51,'dummy'),(52,'dummy'),(53,'dummy'),(54,'dummy'),(55,'dummy'),(56,'dummy'),(57,'dummy'),(58,'dummy'),(59,'dummy'),(60,'dummy'),(61,'dummy'),(62,'dummy'),(63,'dummy'),(64,'dummy'),(65,'dummy'),(66,'dummy'),(67,'dummy'),(68,'dummy'),(69,'dummy'),(70,'dummy'),(71,'dummy'),(72,'dummy'),(73,'dummy'),(74,'dummy'),(75,'dummy'),(76,'dummy'),(77,'dummy'),(78,'dummy'),(79,'dummy'),(80,'dummy'),(81,'dummy'),(82,'dummy'),(83,'dummy'),(84,'dummy'),(85,'dummy'),(86,'dummy'),(87,'dummy'),(88,'dummy'),(89,'dummy'),(90,'dummy'),(91,'dummy'),(92,'dummy'),(93,'dummy'),(94,'dummy'),(95,'dummy'),(96,'dummy'),(97,'dummy'),(98,'dummy'),(99,'dummy'),(100,'dummy'),(101,'dummy'),(102,'dummy'),(103,'dummy'),(104,'dummy'),(105,'dummy'),(106,'dummy'),(107,'dummy'),(108,'dummy'),(109,'dummy'),(110,'dummy'),(111,'dummy'),(112,'dummy'),(113,'dummy'),(114,'dummy'),(115,'dummy'),(116,'dummy'),(117,'dummy'),(118,'dummy'),(119,'dummy'),(120,'dummy'),(121,'dummy'),(122,'dummy'),(123,'dummy'),(124,'dummy'),(125,'dummy'),(126,'dummy'),(127,'dummy'),(128,'dummy'),(129,'dummy'),(130,'dummy'),(131,'dummy'),(132,'dummy'),(133,'dummy'),(134,'dummy'),(135,'dummy'),(136,'dummy'),(137,'dummy'),(138,'dummy'),(139,'dummy'),(140,'dummy'),(141,'dummy'),(142,'dummy'),(143,'dummy'),(144,'dummy'),(145,'dummy'),(146,'dummy'),(147,'dummy'),(148,'dummy'),(149,'dummy'),(150,'dummy'),(151,'dummy'),(152,'dummy'),(153,'dummy'),(154,'dummy'),(155,'dummy'),(156,'dummy'),(157,'dummy'),(158,'dummy'),(159,'dummy'),(160,'dummy');
/*!40000 ALTER TABLE `GENERATOR` ENABLE KEYS */;
UNLOCK TABLES;
--
-- Table structure for table `HEALTH_HISTORY`
--
DROP TABLE IF EXISTS `HEALTH_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `HEALTH_HISTORY` (
`DIARY_DAY` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`TOTAL_CALORIES_OUT` double NOT NULL,
`TOTAL_MINUTES_RECORDED` double NOT NULL,
`STEPS` double DEFAULT NULL,
`ELEVATION` double DEFAULT NULL,
`AVG_REST_HR` double DEFAULT NULL,
`REST_MINUTES` double DEFAULT NULL,
`REST_CAL_OUT` double DEFAULT NULL,
`AVG_FAT_BURN_HR` double DEFAULT NULL,
`FAT_BURN_MINUTES` double DEFAULT NULL,
`FAT_BURN_CAL_OUT` double DEFAULT NULL,
`AVG_CARDIO_HR` double DEFAULT NULL,
`CARDIO_MINUTES` double DEFAULT NULL,
`CARDIO_CAL_OUT` double DEFAULT NULL,
`AVG_PEAK_HR` double DEFAULT NULL,
`PEAK_MINUTES` double DEFAULT NULL,
`PEAK_CAL_OUT` double DEFAULT NULL,
PRIMARY KEY (`DIARY_DAY`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `NUTRITION_HISTORY`
--
DROP TABLE IF EXISTS `NUTRITION_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `NUTRITION_HISTORY` (
`DIARY_DAY` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`MEAL` varchar(50) NOT NULL,
`TOTAL_CALORIES_IN` double NOT NULL,
`CARB` double NOT NULL,
`PROT` double NOT NULL,
`FAT` double NOT NULL,
PRIMARY KEY (`DIARY_DAY`,`MEAL`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `SLEEP_HISTORY`
--
DROP TABLE IF EXISTS `SLEEP_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `SLEEP_HISTORY` (
`DIARY_DAY` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`TOTAL_MINUTES_ASLEEP` double NOT NULL,
`TOTAL_MINUTES_IN_BED` double NOT NULL,
`REM_MINUTES` double NOT NULL,
`LIGHT_MINUTES` double NOT NULL,
`DEEP_MINUTES` double NOT NULL,
`WAKE_MINUTES` double NOT NULL,
PRIMARY KEY (`DIARY_DAY`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
-- Dump completed on 2018-11-07 0:14:54
You will note that a GENERATOR table has been created. It is needed to build date-series queries, as detailed later; the sketch below shows the idea.
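To illustrate with a minimal sketch (assuming the GENERATOR table above and an arbitrary anchor date of '2018-11-07'), cross-joining GENERATOR with a counter variable yields one row per consecutive day:
-- Minimal sketch of the date-series trick used later in this tutorial:
-- each GENERATOR row advances the @num counter by one, producing consecutive dates.
SELECT date_format(adddate('2018-11-07', @num := @num + 1), '%Y-%m-%d') AS list_date
FROM GENERATOR, (SELECT @num := -1) num
LIMIT 7;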
Step 3: Update your BMQ every day
To get your perceived BMQ, you will have to enter it manually. Before we develop the BMQ app to do that, you can enter it using the following example queries every day:
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('3', 'morning', '2018-11-06 7:00:00');
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('3', 'pre-workout', '2018-11-06 7:50:00');
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('4', 'post-workout', '2018-11-06 8:50:00');
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('3', 'evening', now());
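If you want to double-check what you have entered so far, a quick verification query against the BMQ_HISTORY table created above could look like this:
-- Optional check: list the most recent BMQ entries
SELECT ID, BMQ, TYPE, TIME_ENTERED
FROM BMQ_HISTORY
ORDER BY TIME_ENTERED DESC
LIMIT 10;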
Section 2: Creating a Nifi flow to consume MyFitnessPal CSV exports
The goal is to setup the following flow:
Step 1: Survey folder for new CSV Files
List Files in MFP Folder: ListFile Processor
This processor will monitor a folder for new files, based on timestamp, then maintain the last file read in its state, as seen below after right-clicking on the processor and selecting view state:
Here is a list of properties to configure that are not default:
Input directory: [YOUR_LOCAL_PATH]
File filter: .*
Run Schedule: 12 h
Get Latest File: FetchFile Processor
This processor gets the file from the list generated previously and sends it in a flow file. It is configured using default properties.
Step 2: Convert CSV to JSON
Update Schema Name: UpdateAttribute Processor
This processor updates the schema name attribute (later used by next processors and controller services).
HortonworksSchemaRegistry Controller Service
This controller service is pointing to your HDF Schema Registry. The only non-default property to configure in the controller service before enabling is:
Schema Registry URL: http://[YOUR_SCHEMA_REGISTRY_HOST]:7788/api/v1/
CSVReader Controller Service
This controller service is going to read your CSV and use your Hortonworks Schema Registry to parse its schema. The following non-default properties should be configured in the controller service before enabling it:
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: HortonworksSchemaRegistry
Treat First Line as Header: true
Ignore CSV Header Column Names: true
JsonRecordSetWriter Controller Service
This controller service is going to write your JSON output based on your Hortonworks Schema Registry. The following non-default properties should be configured in the controller service before enabling it:
Schema Write Strategy: Set 'schema.name' Attribute
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: HortonworksSchemaRegistry
Convert CSV to JSON: ConvertRecord Processor
This processor does the record conversion. Configure it with your enabled controllers as such:
Record Reader: CSVReader
Record Writer: JsonRecordSetWriter
Step 3: Get fields needed for insert
SplitJson Processor
This processor splits each meal in your export in a flow file, by using the following non-default property:
JsonPath Expression: $.[*]
Extract Relevant Attributes: EvaluateJsonPath Processor
This processor extracts the relevant elements of your split JSON into attributes. It is configured using the following:
Destination: flowfile-attribute
carb: $.carbohydrates_g
diary_day: $.date
fat: $.fat_g
meal: $.meal
prot: $.protein_g
total_calories_in: $.calories
Step 4: Insert Data to MySQL
Create Upsert Query: ReplaceText Processor
This processor creates the query to be executed on your MySQL server to upsert the different lines. Here is the ReplacementValue you should use (a resolved example follows the template):
INSERT INTO NUTRITION_HISTORY
(DIARY_DAY, MEAL, TOTAL_CALORIES_IN, CARB, PROT, FAT)
VALUES
('${diary_day}', '${meal}', ${total_calories_in}, ${carb}, ${prot}, ${fat}) ON DUPLICATE KEY UPDATE
TOTAL_CALORIES_IN = ${total_calories_in},
CARB = ${carb},
PROT = ${prot},
FAT = ${fat}
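For reference, here is what a resolved statement could look like once Nifi has substituted the attributes (the values below are purely hypothetical):
-- Hypothetical resolved upsert, as the PutSQL processor would receive it
INSERT INTO NUTRITION_HISTORY
(DIARY_DAY, MEAL, TOTAL_CALORIES_IN, CARB, PROT, FAT)
VALUES
('2018-11-06', 'breakfast', 450, 60, 25, 12) ON DUPLICATE KEY UPDATE
TOTAL_CALORIES_IN = 450,
CARB = 60,
PROT = 25,
FAT = 12;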
DBCPConnectionPool Controller Service
This controller service is enabling connection to your MySQL server, configured by the following non-default properties:
Database Connection URL: jdbc:mysql://[YOUR_MYSQL_SERVER_ADDRESS]:3306/BEAST_MODE_DB?useLegacyDatetimeCode=false&serverTimezone=America/New_York
Database Driver Class Name: com.mysql.jdbc.Driver
Database Driver Location(s): file:///home/nifi/mysql_jdbc_drivers/mysql-connector-java-8.0.11.jar (or wherever you put your mysql jdbc driver)
Upsert into MySQL: PutSQL Processor
This processor executes the query generated in the ReplaceText processor, and relies on your DBCPConnectionPool controller service, configured as such:
JDBC Connection Pool: DBCPConnectionPool
Section 3: Creating a Nifi flow to consume Fitbit health & sleep data
The goal is to set up the following flow (separated into two screenshots, so you know it's going to be fun). It assumes that you registered an application with the Fitbit API and have a refresh/bearer token ready with the right permissions (see details over at the Fitbit documentation).
Step 1: Get Delta parameters
Get Latest Date: ExecuteSQL Processor
This processor runs a query to get the series of dates not yet in the BMQ DB, in order to call the Fitbit API only for those days. It relies on the DBCPConnectionPool configuration detailed above (set in the Database Connection Pooling Service property) and uses a 12 h Run Schedule. Here is the query it should run:
select list_date from (select
date_format(
adddate((select IFNULL(max(hist.diary_day),'2018-11-07') from HEALTH_HISTORY hist), @num:=@num+1),
'%Y-%m-%d'
) list_date
from
GENERATOR,
(select @num:=-1) num
limit
150) dates
where list_date < CURRENT_DATE()
I realize this is not the only method to generate such a list (GenerateTableFetch, etc. could also work), but it makes sure that I only fetch the dates the DB actually needs, even if something else changes the DB.
ConvertAvroToJSON Processor
This processor converts the Avro response of the previous processor. Make sure you configure the following property to avoid a splitting error when only one record is returned:
Wrap Single Record: true
SplitJson Processor
This processor splits each record in the query in a flow file, by using the following non-default property:
JsonPath Expression: $.*
Extract Relevant Attributes: EvaluateJsonPath Processor
This processor extracts the relevant elements of your split JSON into attributes. It is configured using the following:
Destination: flowfile-attribute
DIARY_DAY: $.list_date
Step 2: Handle Token Refresh
DistributedMapCacheServer Controller Service
This controller service is the server that will run the distributed cache. It should be configured using all default properties before being enabled.
DistributedMapCacheClientService Controller Service
This controller service is connecting to the DistributedMapCacheServer configured previously. Before enabling it, just configure the following:
Server Hostname: [YOUR_HOSTNAME]
Get Token From Cache: FetchDistributedMapCache Processor
This processor will try and fetch a refresh token from the DistributedMapCacheServer. If not found, we will send it to an UpdateAttribute to set a default value (e.g. for the first run), otherwise we will send it to the InvokeHTTP to get a new token. Here are the non-default properties to be configured:
Cache Entry Identifier: refresh-token-fitbit
Distributed Cache Service: DistributedMapCacheClientService
Put Cache Value In Attribute: CurrentRefreshToken
A few notes: this configuration requests a refresh token every run, which works with my 12 h run schedule. Ideally, you would store the token expiration in the distributed cache as well and only request a refresh token when needed, but I didn't want to overcomplicate things. Moreover, in a production environment, I would probably recommend persisting these tokens on disk and not only in memory, in case of failure.
Use Default Value if Not Found: UpdateAttribute Processor
This processor updates the CurrentRefreshToken if FetchDistributedMapCache returns not-found. Configure it using this property:
CurrentRefreshToken: [A_VALID_REFRESH_TOKEN]
Refresh Fitbit Token: InvokeHTTP Processor
This calls the Fitbit API to get a new valid token and a refresh token. Configure the following non-default properties to run it:
HTTP Method: POST
Remote URL: https://api.fitbit.com/oauth2/token?grant_type=refresh_token&refresh_token=${CurrentRefreshToken}
Content-Type: application/x-www-form-urlencoded
Authorization: Basic [YOUR_AUTHORIZATION_CODE] (see the Fitbit documentation)
Extract Token: EvaluateJsonPath Processor
This processor extracts the bearer and refresh tokens from the HTTP response, then continues to two routes: one storing the refresh token back in the cache, the other using the bearer token to call the Fitbit API. It is configured using the following:
Destination: flowfile-attribute
BEARER_TOKEN: $.access_token
REFRESH_TOKEN: $.refresh_token
Put Refresh Token in Flowfile: ReplaceText Processor
This processor puts the refresh token in the flow file content so it can be consumed downstream. Set the ReplacementValue to:
${REFRESH_TOKEN}
PutDistributedMapCache Processor
This processor takes the refresh token in the flowfile and stores it under refresh-token-fitbit for next time the flow is executed:
Cache Entry Identifier: refresh-token-fitbit
Distributed Cache Service: DistributedMapCacheClientService
Step 3: Call Fitbit APIs (Health and Sleep)
For this step, I will only detail one processor. First, because, as you can see on the screenshot above, both Get FitBit Daily Summary processors are very similar (they are just calling different endpoints of the Fitbit API). Secondly, because the flow Extract Relevant Attributes > Create SQL Query > Upsert to MySQL has been described above. Finally, because I will keep the Calculate Averages and Sums part for the Strava flow in the next section. All detailed processors can be found here:
feed-fitbit-anonymized.xml
Get FitBit Daily Summary: InvokeHTTP Processor
This calls the Fitbit API to get a daily health summary:
HTTP Method: GET
Remote URL: https://api.fitbit.com/1/user/-/activities/date/${DIARY_DAY}.json
Authorization: Bearer ${BEARER_TOKEN}
Section 4: Creating a Nifi flow to consume Strava activity data
As for the previous section, the goal is to set up a flow calling Strava's activity API. It assumes that you registered an application with the Strava API and have a refresh/bearer token ready with the right permissions (see details over at the Strava documentation). As opposed to Fitbit, the refresh token remains the same, so there is no need for a distributed cache. I'm only going to detail the Convert Units processor, to talk about the Nifi expression language. The rest of the flow can be found here: feed-strava-anonymized.xml
Convert Units: UpdateAttribute Processor
This processor uses some of the extracted data to convert units before putting them into attributes later used in your upsert query. Here are the calculations executed (a worked numeric example follows the list):
AVG_HR: ${AVG_HR:isEmpty():ifElse(0.0, ${AVG_HR})} (makes sure that we have a value if the API does not return HR, i.e. when no HR data is present in the activity)
AVG_PACE: ${MILE_IN_METERS:divide(${AVG_PACE:toDecimal():multiply(60.0)})} (converts speed from m/s to min/mile)
DISTANCE: ${DISTANCE:divide(${MILE_IN_METERS})} (converts distance from meters to miles)
DURATION: ${DURATION:toDecimal():divide(60.0)} (converts duration from seconds to mins)
START_TIME: ${START_TIME:replaceAll("T"," "):replaceAll("Z","")} (converts timestamp format to MySQL format)
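To make the arithmetic concrete, here is the same math worked out in SQL with hypothetical inputs (a speed of 3.0 m/s, a distance of 5000 m, a duration of 3600 s, and MILE_IN_METERS assumed to be 1609.34):
-- Hypothetical worked example of the conversions above (1 mile = 1609.34 m)
SELECT
1609.34 / (3.0 * 60) AS avg_pace_min_per_mile, -- 3.0 m/s is roughly 8.94 min/mile
5000 / 1609.34 AS distance_miles,              -- 5000 m is roughly 3.11 miles
3600 / 60.0 AS duration_minutes;               -- 3600 s is 60 minutes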