04-23-2019
06:59 PM
1 Kudo
Introduction
Cloudera/Hortonworks today offers one of the most comprehensive data management platforms, with components covering everything from data flow management to governed, distributed data science workloads. With so many toys to play with, I thought I'd share an easy way to set up a simple cluster that deploys, using Cloudbreak, the following main components on the Azure cloud:
Hortonworks Data Platform 3.1
Hortonworks Data Flow 3.3
Data Platform Search 4.0
Cloudera Data Science Workbench 1.5
Note: This is not a production-ready setup, but merely a first step toward customizing your deployment using the Cloudera toolkit.
Pre-Requisites
An Azure account with permission to assign roles
Cloudbreak 2.9 (you can always set one up on your machine using: https://community.hortonworks.com/articles/194076/using-vagrant-and-virtualbox-to-create-a-local-ins.html)
Tutorial steps
Step 1: Set up Azure credentials
Step 2: Set up blueprint and cluster extensions
Step 3: Create the cluster
Step 1: Set up Azure credentials
Find your Azure subscription and tenant ID
To find your subscription ID, go to the search box and look for "subscription"; you should find it as such:
For the tenant ID, use the Azure AD Directory ID:
Set up your credentials in Cloudbreak
This part is extremely well documented in Cloudbreak's documentation portal: https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-credential-azure/content/cb_create-app-based-credential.html. Note: Because of IT restrictions on my side, I chose to use an app-based credential setup, but if you have enough privileges, Cloudbreak creates the app and assigns roles automagically for you.
Step 2: Set up blueprint and cluster extensions
Blueprint
First, upload the blueprint below:
{
"Blueprints": {
"blueprint_name": "edge-to-ai-3.1",
"stack_name": "HDP",
"stack_version": "3.1"
},
"configurations": [
{
"yarn-site": {
"properties": {
"yarn.nodemanager.resource.cpu-vcores": "6",
"yarn.nodemanager.resource.memory-mb": "60000",
"yarn.scheduler.maximum-allocation-mb": "14"
}
}
},
{
"hdfs-site": {
"properties": {
"dfs.cluster.administrators": "hdfs"
}
}
},
{
"capacity-scheduler": {
"properties": {
"yarn.scheduler.capacity.maximum-am-resource-percent": "0.4",
"yarn.scheduler.capacity.root.capacity": "67",
"yarn.scheduler.capacity.root.default.capacity": "67",
"yarn.scheduler.capacity.root.default.maximum-capacity": "67",
"yarn.scheduler.capacity.root.llap.capacity": "33",
"yarn.scheduler.capacity.root.llap.maximum-capacity": "33",
"yarn.scheduler.capacity.root.queues": "default,llap"
}
}
},
{
"ranger-hive-audit": {
"properties": {
"xasecure.audit.destination.hdfs.file.rollover.sec": "300"
},
"properties_attributes": {}
}
},
{
"hive-site": {
"hive.exec.compress.output": "true",
"hive.merge.mapfiles": "true",
"hive.metastore.dlm.events": "true",
"hive.metastore.transactional.event.listeners": "org.apache.hive.hcatalog.listener.DbNotificationListener",
"hive.repl.cm.enabled": "true",
"hive.repl.cmrootdir": "/apps/hive/cmroot",
"hive.repl.rootdir": "/apps/hive/repl",
"hive.server2.tez.initialize.default.sessions": "true",
"hive.server2.transport.mode": "http"
}
},
{
"hive-interactive-env": {
"enable_hive_interactive": "true",
"hive_security_authorization": "Ranger",
"num_llap_nodes": "1",
"num_llap_nodes_for_llap_daemons": "1",
"num_retries_for_checking_llap_status": "50"
}
},
{
"hive-interactive-site": {
"hive.exec.orc.split.strategy": "HYBRID",
"hive.llap.daemon.num.executors": "5",
"hive.metastore.rawstore.impl": "org.apache.hadoop.hive.metastore.cache.CachedStore",
"hive.stats.fetch.bitvector": "true"
}
},
{
"spark2-defaults": {
"properties": {
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://%HOSTGROUP::master1%:9083",
"spark.hadoop.hive.zookeeper.quorum": "{{zookeeper_quorum_hosts}}",
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://{{zookeeper_quorum_hosts}}:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EC2.INTERNAL"
},
"properties_attributes": {}
}
},
{
"gateway-site": {
"properties": {
"gateway.path": "{{cluster_name}}"
},
"properties_attributes": {}
}
},
{
"admin-topology": {
"properties": {
"content": "\n \n\n \n\n \n authentication\n ShiroProvider\n true\n \n sessionTimeout\n 30\n \n \n main.ldapRealm\n org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm\n \n \n main.ldapRealm.userDnTemplate\n uid={0},ou=people,dc=hadoop,dc=apache,dc=org\n \n \n main.ldapRealm.contextFactory.url\n ldap://54.219.163.9:33389\n \n \n main.ldapRealm.contextFactory.authenticationMechanism\n simple\n \n \n urls./**\n authcBasic\n \n \n\n \n authorization\n AclsAuthz\n true\n \n \n\n \n KNOX\n \n\n "
},
"properties_attributes": {}
}
},
{
"ranger-admin-site": {
"properties": {
"ranger.jpa.jdbc.url": "jdbc:postgresql://localhost:5432/ranger"
},
"properties_attributes": {}
}
},
{
"ranger-env": {
"properties": {
"is_solrCloud_enabled": "true",
"keyadmin_user_password": "{{{ general.password }}}",
"ranger-atlas-plugin-enabled": "Yes",
"ranger-hdfs-plugin-enabled": "Yes",
"ranger-hive-plugin-enabled": "Yes",
"ranger-knox-plugin-enabled": "Yes",
"ranger_admin_password": "{{{ general.password }}}",
"rangertagsync_user_password": "{{{ general.password }}}",
"rangerusersync_user_password": "{{{ general.password }}}"
},
"properties_attributes": {}
}
},
{
"ams-hbase-site": {
"properties": {
"hbase.cluster.distributed": "true",
"hbase.rootdir": "file:///hadoopfs/fs1/metrics/hbase/data"
}
}
},
{
"atlas-env": {
"properties": {
"atlas.admin.password": "admin",
"atlas_solr_shards": "2",
"content": "\n # The java implementation to use. If JAVA_HOME is not found we expect java and jar to be in path\n export JAVA_HOME={{java64_home}}\n\n # any additional java opts you want to set. This will apply to both client and server operations\n {% if security_enabled %}\n export ATLAS_OPTS=\"{{metadata_opts}} -Djava.security.auth.login.config={{atlas_jaas_file}}\"\n {% else %}\n export ATLAS_OPTS=\"{{metadata_opts}}\"\n {% endif %}\n\n # metadata configuration directory\n export ATLAS_CONF={{conf_dir}}\n\n # Where log files are stored. Defatult is logs directory under the base install location\n export ATLAS_LOG_DIR={{log_dir}}\n\n # additional classpath entries\n export ATLASCPPATH={{metadata_classpath}}\n\n # data dir\n export ATLAS_DATA_DIR={{data_dir}}\n\n # pid dir\n export ATLAS_PID_DIR={{pid_dir}}\n\n # hbase conf dir\n export HBASE_CONF_DIR=\"/etc/ams-hbase/conf\"\n\n # Where do you want to expand the war file. By Default it is in /server/webapp dir under the base install dir.\n export ATLAS_EXPANDED_WEBAPP_DIR={{expanded_war_dir}}\n export ATLAS_SERVER_OPTS=\"-server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$ATLAS_LOG_DIR/atlas_server.hprof -Xloggc:$ATLAS_LOG_DIRgc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps\"\n {% if java_version == 8 %}\n export ATLAS_SERVER_HEAP=\"-Xms{{atlas_server_xmx}}m -Xmx{{atlas_server_xmx}}m -XX:MaxNewSize={{atlas_server_max_new_size}}m -XX:MetaspaceSize=100m -XX:MaxMetaspaceSize=512m\"\n {% else %}\n export ATLAS_SERVER_HEAP=\"-Xms{{atlas_server_xmx}}m -Xmx{{atlas_server_xmx}}m -XX:MaxNewSize={{atlas_server_max_new_size}}m -XX:MaxPermSize=512m\"\n {% endif %}\n",
"hbase_conf_dir": "/etc/ams-hbase/conf"
}
}
},
{
"kafka-broker": {
"properties": {
"default.replication.factor": "1",
"offsets.topic.replication.factor": "1"
},
"properties_attributes": {}
}
},
{
"hbase-env": {
"properties": {
"phoenix_sql_enabled": "true"
},
"properties_attributes": {}
}
},
{
"druid-common": {
"properties": {
"druid.extensions.loadList": "[\"postgresql-metadata-storage\", \"druid-datasketches\", \"druid-hdfs-storage\", \"druid-kafka-indexing-service\", \"ambari-metrics-emitter\"]",
"druid.indexer.logs.directory": "/user/druid/logs",
"druid.indexer.logs.type": "hdfs",
"druid.metadata.storage.connector.connectURI": "jdbc:postgresql://%HOSTGROUP::master1%:5432/druid",
"druid.metadata.storage.connector.password": "druid",
"druid.metadata.storage.connector.user": "druid",
"druid.metadata.storage.type": "postgresql",
"druid.selectors.indexing.serviceName": "druid/overlord",
"druid.storage.storageDirectory": "/user/druid/data",
"druid.storage.type": "hdfs"
},
"properties_attributes": {}
}
},
{
"druid-overlord": {
"properties": {
"druid.indexer.runner.type": "remote",
"druid.indexer.storage.type": "metadata",
"druid.port": "8090",
"druid.service": "druid/overlord"
},
"properties_attributes": {}
}
},
{
"druid-middlemanager": {
"properties": {
"druid.indexer.runner.javaOpts": "-server -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -Dhdp.version={{stack_version}} -Dhadoop.mapreduce.job.classloader=true",
"druid.port": "8091",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.service": "druid/middlemanager",
"druid.worker.capacity": "3"
},
"properties_attributes": {}
}
},
{
"druid-coordinator": {
"properties": {
"druid.coordinator.merge.on": "false",
"druid.port": "8081"
},
"properties_attributes": {}
}
},
{
"druid-historical": {
"properties": {
"druid.port": "8083",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.server.maxSize": "300000000000",
"druid.service": "druid/historical"
},
"properties_attributes": {}
}
},
{
"druid-broker": {
"properties": {
"druid.broker.http.numConnections": "5",
"druid.cache.type": "local",
"druid.port": "8082",
"druid.processing.numThreads": "2",
"druid.server.http.numThreads": "50",
"druid.service": "druid/broker"
},
"properties_attributes": {}
}
},
{
"druid-router": {
"properties": {},
"properties_attributes": {}
}
},
{
"superset": {
"properties": {
"SECRET_KEY": "{{{ general.password }}}",
"SUPERSET_DATABASE_TYPE": "sqlite"
},
"properties_attributes": {}
}
},
{
"nifi-ambari-config": {
"nifi.max_mem": "4g",
"nifi.security.encrypt.configuration.password": "{{{ general.password }}}",
"nifi.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"nifi-properties": {
"nifi.security.user.login.identity.provider": "",
"nifi.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"nifi-registry-ambari-config": {
"nifi.registry.security.encrypt.configuration.password": "{{{ general.password }}}"
}
},
{
"nifi-registry-properties": {
"nifi.registry.db.password": "{{{ general.password }}}",
"nifi.registry.sensitive.props.key": "{{{ general.password }}}"
}
},
{
"registry-common": {
"properties": {
"adminPort": "7789",
"database_name": "registry",
"jar.storage": "/hdf/registry",
"jar.storage.hdfs.url": "hdfs://localhost:9090",
"jar.storage.type": "local",
"port": "7788",
"registry.schema.cache.expiry.interval": "3600",
"registry.schema.cache.size": "10000",
"registry.storage.connector.connectURI": "jdbc:mysql://localhost:3306/registry",
"registry.storage.connector.password": "registry",
"registry.storage.connector.user": "registry",
"registry.storage.query.timeout": "30",
"registry.storage.type": "mysql"
},
"properties_attributes": {}
}
},
{
"hbase-site": {
"properties": {
"hbase.bucketcache.combinedcache.enabled": "true",
"hbase.bucketcache.ioengine": "file:/hbase/cache",
"hbase.bucketcache.size": "24000",
"hbase.defaults.for.version.skip": "true",
"hbase.hregion.max.filesize": "21474836480",
"hbase.hregion.memstore.flush.size": "536870912",
"hbase.region.server.rpc.scheduler.factory.class": "org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory",
"hbase.regionserver.global.memstore.size": "0.4",
"hbase.regionserver.handler.count": "60",
"hbase.regionserver.wal.codec": "org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec",
"hbase.rootdir": "/apps/hbase",
"hbase.rpc.controllerfactory.class": "org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory",
"hbase.rs.cacheblocksonwrite": "true",
"hfile.block.bloom.cacheonwrite": "true",
"hfile.block.cache.size": "0.4",
"hfile.block.index.cacheonwrite": "true",
"phoenix.functions.allowUserDefinedFunctions": "true",
"phoenix.query.timeoutMs": "60000"
},
"properties_attributes": {}
}
},
{
"hbase-env": {
"properties": {
"hbase_java_io_tmpdir": "/tmp",
"hbase_log_dir": "/var/log/hbase",
"hbase_master_heapsize": "1024m",
"hbase_pid_dir": "/var/run/hbase",
"hbase_regionserver_heapsize": "16384m",
"hbase_regionserver_shutdown_timeout": "30",
"hbase_regionserver_xmn_max": "16384",
"hbase_regionserver_xmn_ratio": "0.2",
"hbase_user": "hbase",
"hbase_user_nofile_limit": "32000",
"hbase_user_nproc_limit": "16000",
"phoenix_sql_enabled": "true"
},
"properties_attributes": {}
}
}
],
"host_groups": [
{
"cardinality": "1",
"components": [
{
"name": "RANGER_TAGSYNC"
},
{
"name": "RANGER_USERSYNC"
},
{
"name": "RANGER_ADMIN"
},
{
"name": "KNOX_GATEWAY"
},
{
"name": "HIVE_SERVER"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "DRUID_OVERLORD"
},
{
"name": "DRUID_COORDINATOR"
},
{
"name": "DRUID_ROUTER"
},
{
"name": "DRUID_BROKER"
},
{
"name": "SECONDARY_NAMENODE"
},
{
"name": "HISTORYSERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "REGISTRY_SERVER"
},
{
"name": "NIFI_REGISTRY_MASTER"
},
{
"name": "DATANODE"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "INFRA_SOLR_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "master1"
},
{
"cardinality": "1+",
"components": [
{
"name": "NODEMANAGER"
},
{
"name": "DATANODE"
},
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "NIFI_MASTER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "worker"
},
{
"cardinality": "1",
"components": [
{
"name": "ATLAS_SERVER"
},
{
"name": "HBASE_MASTER"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "DRUID_HISTORICAL"
},
{
"name": "DRUID_MIDDLEMANAGER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "DATANODE"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "ATLAS_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "LOGSEARCH_LOGFEEDER"
},
{
"name": "LOGSEARCH_SERVER"
},
{
"name": "NAMENODE"
},
{
"name": "SUPERSET"
},
{
"name": "NIFI_CA"
},
{
"name": "INFRA_SOLR"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "HBASE_MASTER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "SPARK2_CLIENT"
}
],
"name": "master2"
},
{
"name": "cdsw_worker",
"cardinality": "1+",
"components": [
{
"name": "SPARK2_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "NODEMANAGER"
},
{
"name": "DATANODE"
},
{
"name": "KAFKA_BROKER"
},
{
"name": "NIFI_MASTER"
},
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "TEZ_CLIENT"
}
]
}
],
"settings": [
{
"recovery_settings": [
{
"recovery_enabled": "false"
}
]
}
]
}
Recipes
Pre-Ambari-start recipe to set up the metastores:
#!/usr/bin/env bash
# Initialize MetaStores
yum install -y https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
yum install -y postgresql96-server
yum install -y postgresql96-contrib
/usr/pgsql-9.6/bin/postgresql96-setup initdb
sed -i 's,#port = 5432,port = 5433,g' /var/lib/pgsql/9.6/data/postgresql.conf
echo '' > /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'local all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid 0.0.0.0/0 trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all das,streamsmsgmgr,cloudbreak,registry,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid ::/0 trust ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/9.6/data/pg_hba.conf
systemctl enable postgresql-9.6.service
systemctl start postgresql-9.6.service
echo "CREATE DATABASE streamsmsgmgr;" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "CREATE USER streamsmsgmgr WITH PASSWORD 'streamsmsgmgr';" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "GRANT ALL PRIVILEGES ON DATABASE streamsmsgmgr TO streamsmsgmgr;" | sudo -u postgres psql -U postgres -h localhost -p 5433
echo "CREATE DATABASE druid;" | sudo -u postgres psql -U postgres
echo "CREATE DATABASE ranger;" | sudo -u postgres psql -U postgres
echo "CREATE DATABASE registry;" | sudo -u postgres psql -U postgres
echo "CREATE USER druid WITH PASSWORD 'druid';" | sudo -u postgres psql -U postgres
echo "CREATE USER registry WITH PASSWORD 'registry';" | sudo -u postgres psql -U postgres
echo "CREATE USER rangerdba WITH PASSWORD 'rangerdba';" | sudo -u postgres psql -U postgres
echo "CREATE USER rangeradmin WITH PASSWORD 'ranger';" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE druid TO druid;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE registry TO registry;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE ranger TO rangerdba;" | sudo -u postgres psql -U postgres
echo "GRANT ALL PRIVILEGES ON DATABASE ranger TO rangeradmin;" | sudo -u postgres psql -U postgres
#ambari-server setup --jdbc-db=postgres --jdbc-driver=/usr/share/java/postgresql-jdbc.jar
if [[ $(cat /etc/system-release|grep -Po Amazon) == "Amazon" ]]; then
echo '' > /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'local all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry 0.0.0.0/0 trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry ::/0 trust ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/9.5/data/pg_hba.conf
sudo -u postgres /usr/pgsql-9.5/bin/pg_ctl -D /var/lib/pgsql/9.5/data/ reload
else
echo '' > /var/lib/pgsql/data/pg_hba.conf
echo 'local all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry 0.0.0.0/0 trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all cloudbreak,ambari,postgres,hive,ranger,rangerdba,rangeradmin,rangerlogger,druid,registry ::/0 trust ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'local all all peer ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all all 127.0.0.1/32 ident ' >> /var/lib/pgsql/data/pg_hba.conf
echo 'host all all ::1/128 ident ' >> /var/lib/pgsql/data/pg_hba.conf
sudo -u postgres pg_ctl -D /var/lib/pgsql/data/ reload
fi
yum remove -y mysql57-community*
yum remove -y mysql56-server*
yum remove -y mysql-community*
rm -Rvf /var/lib/mysql
yum install -y epel-release
yum install -y libffi-devel.x86_64
ln -s /usr/lib64/libffi.so.6 /usr/lib64/libffi.so.5
yum install -y mysql-connector-java*
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
if [ $(cat /etc/system-release|grep -Po Amazon) == Amazon ]; then
yum install -y mysql56-server
service mysqld start
else
yum localinstall -y https://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
systemctl start mysqld.service
fi
chkconfig --add mysqld
chkconfig mysqld on
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-client/lib/mysql-connector-java.jar
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-server2-hive2/lib/mysql-connector-java.jar
mysql --execute="CREATE DATABASE druid DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE registry DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE streamline DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE DATABASE streamsmsgmgr DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'das'@'localhost' IDENTIFIED BY 'dasuser'"
mysql --execute="CREATE USER 'das'@'%' IDENTIFIED BY 'dasuser'"
mysql --execute="CREATE USER 'ranger'@'localhost' IDENTIFIED BY 'ranger'"
mysql --execute="CREATE USER 'ranger'@'%' IDENTIFIED BY 'ranger'"
mysql --execute="CREATE USER 'rangerdba'@'localhost' IDENTIFIED BY 'rangerdba'"
mysql --execute="CREATE USER 'rangerdba'@'%' IDENTIFIED BY 'rangerdba'"
mysql --execute="CREATE USER 'registry'@'localhost' IDENTIFIED BY 'registry'"
mysql --execute="CREATE USER 'registry'@'%' IDENTIFIED BY 'registry'"
mysql --execute="CREATE USER 'streamsmsgmgr'@'localhost' IDENTIFIED BY 'streamsmsgmgr'"
mysql --execute="CREATE USER 'streamsmsgmgr'@'%' IDENTIFIED BY 'streamsmsgmgr'"
mysql --execute="CREATE USER 'druid'@'%' IDENTIFIED BY 'druid'"
mysql --execute="CREATE USER 'streamline'@'%' IDENTIFIED BY 'streamline'"
mysql --execute="CREATE USER 'streamline'@'localhost' IDENTIFIED BY 'streamline'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'das'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'ranger'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'rangerdba'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON druid.* TO 'druid'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'registry'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON *.* TO 'streamsmsgmgr'@'%' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON streamline.* TO 'streamline'@'%' WITH GRANT OPTION"
mysql --execute="CREATE DATABASE beast_mode_db DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'bmq_user'@'localhost' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="CREATE USER 'bmq_user'@'%' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%' WITH GRANT OPTION"
mysql --execute="FLUSH PRIVILEGES"
mysql --execute="COMMIT"
#remount tmpfs to ensure NOEXEC is disabled
if grep -Eq '^[^ ]+ /tmp [^ ]+ ([^ ]*,)?noexec[, ]' /proc/mounts; then
echo "/tmp found as noexec, remounting..."
mount -o remount,size=10G /tmp
mount -o remount,exec /tmp
else
echo "/tmp not found as noexec, skipping..."
fi
Pre-Ambari-start recipe to grow the root volume for the CDSW worker:
#!/usr/bin/env bash
# WARNING: This script is only for RHEL7 on Azure
# growing the /dev/sda2 partition
sed -e 's/\s*\([\+0-9a-zA-Z]*\).*/\1/' << EOF | fdisk /dev/sda
d # delete
2 # delete partition 2
n # new
p # partition
2 # partition 2
# default
# default
w # write the partition table
q # and we're done
EOF
reboot
Post-cluster-install recipe to set up CDSW:
#!/usr/bin/env bash
# WARNING: This script is only for RHEL7 on Azure
# growing the /dev/sda2 partition
xfs_growfs /dev/sda2
# Some of these installs may be unnecessary but are included for completeness against the documentation
yum -y install nfs-utils libseccomp lvm2 bridge-utils libtool-ltdl ebtables rsync policycoreutils-python ntp bind-utils nmap-ncat openssl e2fsprogs redhat-lsb-core socat selinux-policy-base selinux-policy-targeted
# CDSW wants a pristine IPTables setup
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -t nat -F
iptables -t mangle -F
iptables -F
iptables -X
# set java_home on centos7
#echo 'export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")' >> /etc/profile
#export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")
echo 'export JAVA_HOME=/usr/lib/jvm/java' >> /etc/profile
export JAVA_HOME='/usr/lib/jvm/java'
# Fetch public IP
export MASTER_IP=$(hostname --ip-address)
# Fetch public FQDN for Domain
export DOMAIN=$(curl https://ipv4.icanhazip.com)
cd /hadoopfs/
mkdir cdsw
# Install CDSW
#wget -q --no-check-certificate https://s3.eu-west-2.amazonaws.com/whoville/v2/temp.blob
#mv temp.blob cloudera-data-science-workbench-1.5.0.818361-1.el7.centos.x86_64.rpm
wget -q https://archive.cloudera.com/cdsw1/1.5.0/redhat7/yum/RPMS/x86_64/cloudera-data-science-workbench-1.5.0.849870-1.el7.centos.x86_64.rpm
yum install -y cloudera-data-science-workbench-1.5.0.849870-1.el7.centos.x86_64.rpm
# Install Anaconda
curl -Ok https://repo.anaconda.com/archive/Anaconda2-5.2.0-Linux-x86_64.sh
chmod +x ./Anaconda2-5.2.0-Linux-x86_64.sh
./Anaconda2-5.2.0-Linux-x86_64.sh -b -p /anaconda
# create unix user
useradd tutorial
echo "tutorial-password" | passwd --stdin tutorial
su - hdfs -c 'hdfs dfs -mkdir /user/tutorial'
su - hdfs -c 'hdfs dfs -chown tutorial:hdfs /user/tutorial'
# CDSW Setup
sed -i "s@MASTER_IP=\"\"@MASTER_IP=\"${MASTER_IP}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@JAVA_HOME=\"/usr/java/default\"@JAVA_HOME=\"$(echo ${JAVA_HOME})\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DOMAIN=\"cdsw.company.com\"@DOMAIN=\"${DOMAIN}.xip.io\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DOCKER_BLOCK_DEVICES=\"\"@DOCKER_BLOCK_DEVICES=\"${DOCKER_BLOCK}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@APPLICATION_BLOCK_DEVICE=\"\"@APPLICATION_BLOCK_DEVICE=\"${APP_BLOCK}\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@DISTRO=\"\"@DISTRO=\"HDP\"@g" /etc/cdsw/config/cdsw.conf
sed -i "s@ANACONDA_DIR=\"\"@ANACONDA_DIR=\"/anaconda/bin\"@g" /etc/cdsw/config/cdsw.conf
# CDSW will break default Amazon DNS on 127.0.0.1:53, so we use a different IP
sed -i "s@nameserver 127.0.0.1@nameserver 169.254.169.253@g" /etc/dhcp/dhclient-enter-hooks
cdsw init
echo "CDSW will shortly be available on ${DOMAIN}"
# after the init, we wait until we are able to create the tutorial user
export respCode=404
while (( $respCode != 201 ))
do
sleep 10
export respCode=$(curl -iX POST http://${DOMAIN}.xip.io/api/v1/users/ -H 'Content-Type: application/json' -d '{"email":"tutorial@tutorial.com","name":"tutorial","username":"tutorial","password":"tutorial-password","type":"user","admin":true}' | grep HTTP | awk '{print $2}')
done
exit 0
Note: this script uses xip.io and hacks into Unix to create the user and Hadoop folders; not a recommendation for production!
Management packs
You will need two management packs for this setup, using the URLs detailed below:
HDF mpack: http://s3.amazonaws.com/dev.hortonworks.com/HDF/centos7/3.x/BUILDS/3.3.1.0-10/tars/hdf_ambari_mp/hdf-ambari-mpack-3.3.1.0-10.tar.gz
Search mpack: http://public-repo-1.hortonworks.com/HDP-SOLR/hdp-solr-ambari-mp/solr-service-mpack-4.0.0.tar.gz
Step 3: Create cluster
This step uses Cloudbreak's Create Cluster wizard and is pretty self-explanatory when following the screenshots, but I will add specific parameters in text form for convenience.
Note: Do not forget to toggle the advanced mode when running the wizard (top of the screen).
General Configuration
Image Settings
Hardware and Storage
Note: Make sure to use 100 GB as the root volume size for CDSW.
Network and Availability
Cloud Storage
Cluster Extensions
External Sources
Gateway Configuration
Network Security Groups
Security
Result
After the cluster is created, you should have access to the following screen in Cloudbreak:
You can now access Ambari via the link provided, and CDSW using http://[CDSW_WORKER_PUBLIC_IP].xip.io
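Before opening the browser, a quick sanity check on the CDSW worker tells you whether the post-install recipe actually brought CDSW up. Below is a minimal sketch, assuming the cdsw CLI installed by the RPM above and the same xip.io domain that the recipe writes into cdsw.conf:
#!/usr/bin/env bash
# Minimal post-install check that CDSW is up
cdsw status
# Poll the web UI until it answers; DOMAIN is derived the same way as in the recipe
DOMAIN=$(curl -s https://ipv4.icanhazip.com).xip.io
until curl -s -o /dev/null -w "%{http_code}" "http://${DOMAIN}" | grep -qE "200|302"; do
  echo "CDSW not answering yet on http://${DOMAIN}, retrying in 30s..."
  sleep 30
done
echo "CDSW is up on http://${DOMAIN}"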
06-13-2019
09:41 AM
Where can I get both of the imported files?
02-13-2019
09:24 PM
4 Kudos
I have been playing quite a bit with CDSW lately. Here is a quick article on how to set up a CDSW project in Scala that connects to an external RDBMS.
Step 1: Create a new CDSW project
Using the CDSW UI, create a new Scala project.
Step 2: Reference the external jar in your spark-defaults.conf
Open your project and edit your spark-defaults.conf to add an external jar:
spark.jars=http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar
Step 3: Create a simple Scala file to connect to the DB
Create a new file and add this code to it:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://[YOUR_SERVER_IP]:3306/[YOUR_DB]").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "[YOUR_TABLE]").option("user", "[YOUR_USER]").option("password", "[YOUR_PWD]").load()
df.show()
Step 4: Run your application
Launch a session and run your code:
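Before launching the session, it can save a round of debugging to confirm from a CDSW terminal that the driver jar URL resolves and that the MySQL host is reachable. A quick sketch using the same placeholders as the Scala snippet (nc is assumed to be available on the node):
# Check that the JDBC driver jar referenced in spark.jars can be fetched
curl -sIL http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.6/mysql-connector-java-5.1.6.jar | head -n 1
# Check that the MySQL server answers on its port
nc -zv [YOUR_SERVER_IP] 3306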
01-31-2019
06:16 PM
4 Kudos
Introduction
Finally, we get to put all this together! This is less of a tutorial and more of a showcase of the capabilities of this solution. Please refer to the previous articles to build this on your own.
The meat of this article is showcasing the React/NodeJS application you can build to communicate with Nifi, Cloudbreak and your ephemeral cluster running Zeppelin and Spark.
Architecture
Below is a high-level architecture of the solution:
You can find all the code for this here: https://github.com/paulvid/bmq-app
The code includes:
Latest Zeppelin Prediction Notebook
All Nifi templates
All Blueprints/Recipes for Cloudbreak
The source code of the app
Agenda
This article is divided into the following sections:
Section 1: Monitoring your current data with Nifi
Section 2: Launching an ephemeral cluster with Cloudbreak
Section 3: Running a prediction model with Spark Zeppelin
Section 1: Monitoring your current data with Nifi
Section 2: Launching an ephemeral cluster with Cloudbreak
Section 3: Running a prediction model with Spark Zeppelin
01-31-2019
05:13 PM
3 Kudos
Introduction
It's been a while, but I'm finally finishing this series on Beast Mode Quotient.
This article is fairly straightforward, as we look at how to build the infrastructure for model training using Zeppelin and Spark.
Pre-Requisites
In order to run this install, you will have to have deployed a BMQ ephemeral cluster as detailed in this article and this repository.
Moreover, you will have to have gathered data on your fitness as explained in part 1 of this series.
Based on this data, I created 3 indexes:
Intensity Index: How intense was the workout I did that day (based on distance, pace and elevation)
Fatigue Index: How much rest I had on that day (based on hours of sleep and rest heart rate)
BMQ Index: How my BMQ fares compared to the max BMQ (5)
You can then agglomerate these 3 indexes, having BMQ and Fatigue on the same day correlating to the Intensity Index of the previous day. All data will be shared in the last article of the series.
You will also notice that the model uses parameterized sleep and rest HR. The whole flow is to be revealed in part 4 🙂
Agenda
This tutorial is divided into the following sections:
Section 1: Create a mysql interpreter for JDBC in Zeppelin
Section 2: Create training set for BMQ prediction
Section 3: Create a prediction model
Section 4: Save the results in a table Section 1: Create a mysql interpreter for JDBC in Zeppelin
Log in to Zeppelin, then go to the top-right corner > Admin > Interpreter and edit the jdbc interpreter. Add the following parameters:
mysql.driver: com.mysql.jdbc.Driver
mysql.url: jdbc:mysql://localhost:3306/beast_mode_db
mysql.user: bmq_user
mysql.password: Be@stM0de
Add artifact: mysql:mysql-connector-java:5.1.38
Restart the interpreter and you should be good to go.
Section 2: Create a training set for BMQ prediction
Create a new note called BMQ Predictions and add the following code, using the jdbc interpreter you just built.
Delete existing training tables, if any:
%jdbc(mysql)
drop table if exists training_set
Create the training set based on the fatigue and intensity indexes:
%jdbc(mysql)
create table training_set as (
select @rowid:=@rowid+1 as rowid, bmq_index.date, bmq_index, fatigue_index, intensity_index
from bmq_index, fatigue_index, intensity_index, (select @rowid:=0) as init
where bmq_index.date = fatigue_index.date
and date_sub(bmq_index.date, INTERVAL 1 DAY) = intensity_index.date
order by bmq_index.date asc)
View Data
%jdbc(mysql)
select * from training_set
Delete existing prediction tables, if any:
%jdbc(mysql)
drop table if exists prediction
Create a table we want to apply the algo against:
%jdbc(mysql)
create table prediction as (
select date(training_date) as date, estimated_intensity_index,
round((
(1-((select (sleep_hours*60) from PREDICTION_PARAMETERS)/(select max(TOTAL_MINUTES_ASLEEP) from SLEEP_HISTORY)))*0.6 +
(1-((select min(REST_HR) from HEALTH_HISTORY)/(select rest_hr from PREDICTION_PARAMETERS)))*0.4
) *100,2) as estimated_fatigue_index,
0.0 as predicted_bmq
from training_plan)
View Data
%jdbc(mysql)
select * from prediction
Section 3: Create a prediction model
DISCLAIMER: This model needs work; the purpose of this article is to establish the overall architecture, not to give the final, most tuned model, as I plan on improving it.
This part uses the Spark interpreter of Zeppelin to vectorize, normalize and train a model.
Create dataframes from the MySQL tables:
%spark2
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/beast_mode_db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "training_set").option("user", "bmq_user").option("password", "Be@stM0de").load()
df.show()
%spark2
val target_df = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://localhost:3306/beast_mode_db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "prediction").option("user", "bmq_user").option("password", "Be@stM0de").load()
target_df.show()
Vectorize Dataframes
%spark2
val assembler1 = new VectorAssembler().
setInputCols(Array( "fatigue_index","intensity_index")).
setOutputCol("features").
transform(df)
assembler1.show()
%spark2
val assembler2 = new VectorAssembler().
setInputCols(Array( "estimated_fatigue_index","estimated_intensity_index")).
setOutputCol("features").
transform(target_df)
assembler2.show()
Normalize Dataframes
%spark2
val normalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(2.0)
.transform(assembler1)
normalizer.show()
%spark2
val targetNormalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normFeatures")
.setP(2.0)
.transform(assembler2)
targetNormalizer.show()
Train and evaluate the Model
%spark2
val Array(trainingData, testData) = normalizer.randomSplit(Array(0.7, 0.3))
%spark2
val lr = new LinearRegression()
.setLabelCol("bmq_index")
.setFeaturesCol("normFeatures")
.setMaxIter(10)
.setRegParam(1.0)
.setElasticNetParam(1.0)
val lrModel = lr.fit(trainingData)
lrModel.transform(testData).select("features","normFeatures", "bmq_index", "prediction").show()
%spark2
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}") %spark2
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}") %spark2
val targetTable = lrModel.transform(targetNormalizer).select("date", "estimated_intensity_index", "estimated_fatigue_index", "prediction")
targetTable.show()
Section 4: Save the results in a table
Finally, we will take the results of the prediction and put them in a table called BMQ_PREDICTIONS for later use.
Rename the dataframe to match the target column names:
%spark2
val newNames = Seq("date", "estimated_intensity_index", "estimated_fatigue_index", "predicted_bmq")
val targetTableRenamed = targetTable.toDF(newNames: _*)
Delete the target table if it exists:
%jdbc(mysql)
drop table if exists BMQ_PREDICTIONS
Write data
%spark2
val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.jdbc.Driver")
prop.setProperty("user", "bmq_user")
prop.setProperty("password", "Be@stM0de")
targetTableRenamed.write.mode("append").jdbc("jdbc:mysql://localhost:3306/beast_mode_db", "BMQ_PREDICTIONS", prop)
View data
%jdbc(mysql)
select * from BMQ_PREDICTIONS
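If you want to spot-check the persisted predictions outside of Zeppelin, the same query can be run with the mysql client on the database host. A quick sketch, reusing the bmq_user credentials created by the cluster recipe:
# Inspect the predictions written by the Spark job
mysql -u bmq_user -p'Be@stM0de' beast_mode_db --execute="SELECT * FROM BMQ_PREDICTIONS LIMIT 10"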
01-17-2019
06:10 PM
1 Kudo
Here is a self-explanatory tutorial. The official Cloudbreak documentation explains how to retrieve your Cloudbreak token via curl or Python (see here). I have been playing with Node.js on my side, so here is how to do it in Node.js using request:
process.env["NODE_TLS_REJECT_UNAUTHORIZED"] = 0;
var request = require("request");
var options = { method: 'POST',
url: 'https://[YOUR_CB_URL]/identity/oauth/authorize',
qs:
{ response_type: 'token',
client_id: 'cloudbreak_shell',
'scope.0': 'openid',
source: 'login',
redirect_uri: 'http://cloudbreak.shell',
accept: 'application/x-www-form-urlencoded' },
headers:
{ 'Content-Type': 'application/x-www-form-urlencoded',
accept: 'application/x-www-form-urlencoded' },
body: 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PASSWORD]"}' };
request(options, function (error, response, body) {
if (error) throw new Error(error);
const querystring = require('querystring');
console.log(querystring.parse(response.headers['location'])['access_token']);
});
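For reference, the exact same token retrieval can be done in one line with curl; this is the call described in the official documentation and used elsewhere in these articles:
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PASSWORD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
echo $TOKEN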
01-04-2019
03:18 PM
3 Kudos
DISCLAIMER: This is a method that is not recommended for any production use, only for development purposes. Only Cloudbreak official versions are supported, and upgrading to a non-supported version could cause unforeseen issues, and loss of official Hortonworks support. Moreover, this method only works for upgrades, not downgrades; downgrades would result in data loss.
Recently, I have been playing with Cloudbreak and have been needing to upgrade my local version. While it follows the principles detailed in Hortonworks documentation, I thought I'd share a quick step by step guide on how to upgrade to any version of Cloudbreak.
1. Stop Cloudbreak on your machine:
cbd kill
2. Go to https://mvnrepository.com/artifact/com.sequenceiq/cloudbreak and find the version you need (e.g. 2.8.1-rc.48)
3. On the VM where Cloudbreak is running, navigate to the directory where your Profile file is located. For example:
cd /var/lib/cloudbreak-deployment/
4. Run the following commands to download the binary:
export CBD_VERSION=2.8.1-rc.48
curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_${CBD_VERSION}_$(uname)_x86_64.tgz | tar -xz -C /bin cbd
5. Verify the version:
cbd version
6. Regenerate assets:
cbd regenerate
7. Restart Cloudbreak:
cbd start
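Put together, the whole upgrade fits in a short script. A sketch below, assuming the same Profile directory used above (development clusters only):
#!/usr/bin/env bash
# Sketch: upgrade a local Cloudbreak deployer to a chosen version
export CBD_VERSION=2.8.1-rc.48
cd /var/lib/cloudbreak-deployment/
cbd kill
curl -Ls public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_${CBD_VERSION}_$(uname)_x86_64.tgz | tar -xz -C /bin cbd
cbd version
cbd regenerate
cbd start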
12-21-2018
03:07 PM
3 Kudos
Introduction
Continuing my series on Beast Mode Quotient, let's automate the creation and termination of data-science-ready clusters. As always, since this is step 2 of a series of articles, this tutorial depends on my
previous article.
Pre-Requisites
In order to run this tutorial you will need to have a CB instance available. There are plenty of good tutorials out there; I recommend
this one
Agenda
This tutorial is divided into the following sections:
Section 1: Create a blueprint & recipes to run a minimal data science ephemeral cluster
Section 2: Add blueprint and recipes via Cloudbreak interface
Section 3: Automate cluster launch and terminate clusters
Section 1: Create a blueprint & recipes to run a minimal data science ephemeral cluster
A Cloudbreak blueprint has 3 parts:
Part 1: Blueprint details
Part 2: Services Configuration
Part 3: Host Components configuration
Here are the details of each part for our blueprint.
Blueprint details
This part is fairly simple; we want to run an HDP 3.1 cluster, and I'm naming it bmq-data-science.
"Blueprints": {
"blueprint_name": "bmq-data-science",
"stack_name": "HDP",
"stack_version": "3.1"
}
Host Components configuration
Similarly, I configured one Host with all the components needed for my services
"host_groups": [
{
"name": "master",
"cardinality": "1",
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "NAMENODE"
},
{
"name": "SECONDARY_NAMENODE"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "HISTORYSERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "ZEPPELIN_MASTER"
},
{
"name": "METRICS_GRAFANA"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "DATANODE"
},
{
"name": "HIVE_SERVER"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "NODEMANAGER"
},
{
"name": "MAPREDUCE2_CLIENT"
}
]
}
]
Services Configuration
For our purposes, I want to create a minimum cluster that will run YARN, HDFS, HIVE, SPARK and ZEPPELIN (plus all the necessary compute engines behind it). I therefore configured these services according to the Cloudbreak examples that are available in the default Cloudbreak blueprints:
"configurations": [
{
"yarn-site": {
"properties": {
"yarn.nodemanager.resource.cpu-vcores": "6",
"yarn.nodemanager.resource.memory-mb": "23296",
"yarn.scheduler.maximum-allocation-mb": "23296"
}
}
},
{
"core-site": {
"properties_attributes": {},
"properties": {
"fs.s3a.threads.max": "1000",
"fs.s3a.threads.core": "500",
"fs.s3a.max.total.tasks": "1000",
"fs.s3a.connection.maximum": "1500"
}
}
},
{
"capacity-scheduler": {
"properties": {
"yarn.scheduler.capacity.root.queues": "default",
"yarn.scheduler.capacity.root.capacity": "100",
"yarn.scheduler.capacity.root.maximum-capacity": "100",
"yarn.scheduler.capacity.root.default.capacity": "100",
"yarn.scheduler.capacity.root.default.maximum-capacity": "100"
}
}
},
{
"spark2-defaults": {
"properties_attributes": {},
"properties": {
"spark.sql.hive.hiveserver2.jdbc.url": "jdbc:hive2://%HOSTGROUP::master%:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2",
"spark.sql.hive.hiveserver2.jdbc.url.principal": "hive/_HOST@EC2.INTERNAL",
"spark.datasource.hive.warehouse.metastoreUri": "thrift://%HOSTGROUP::master%:9083",
"spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
"spark.hadoop.hive.zookeeper.quorum": "%HOSTGROUP::master%:2181"
}
}
},
{
"hive-site": {
"hive.metastore.warehouse.dir": "/apps/hive/warehouse",
"hive.exec.compress.output": "true",
"hive.merge.mapfiles": "true",
"hive.server2.tez.initialize.default.sessions": "true",
"hive.server2.transport.mode": "http",
"hive.metastore.dlm.events": "true",
"hive.metastore.transactional.event.listeners": "org.apache.hive.hcatalog.listener.DbNotificationListener",
"hive.repl.cm.enabled": "true",
"hive.repl.cmrootdir": "/apps/hive/cmroot",
"hive.repl.rootdir": "/apps/hive/repl"
}
},
{
"hdfs-site": {
"properties_attributes": {},
"properties": {
}
}
}
]
You can find the complete blueprint and other recipes on my github, here
Creating a recipe
Recipes in Cloudbreak are very useful and allow you to run scripts before a cluster is launched or after. In this example, I created a PRE-AMBARI-START recipe that creates the appropriate Postgres and MySQL services on my master box, as well as recreating the DB for my BMQ analysis:
#!/bin/bash
# Cloudbreak-2.7.2 / Ambari-2.7.0 - something installs pgsql95
yum remove -y postgresql95*
# Install pgsql96
yum install -y https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
yum install -y postgresql96-server
yum install -y postgresql96-contrib
/usr/pgsql-9.6/bin/postgresql96-setup initdb
sed -i 's,#port = 5432,port = 5433,g' /var/lib/pgsql/9.6/data/postgresql.conf
systemctl enable postgresql-9.6.service
systemctl start postgresql-9.6.service
yum remove -y mysql57-community*
yum remove -y mysql56-server*
yum remove -y mysql-community*
rm -Rvf /var/lib/mysql
yum install -y epel-release
yum install -y libffi-devel.x86_64
ln -s /usr/lib64/libffi.so.6 /usr/lib64/libffi.so.5
yum install -y mysql-connector-java*
ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
if [ $(cat /etc/system-release|grep -Po Amazon) == Amazon ]; then
yum install -y mysql56-server
service mysqld start
else
yum localinstall -y https://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
yum install -y mysql-community-server
systemctl start mysqld.service
fi
chkconfig --add mysqld
chkconfig mysqld on
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-client/lib/mysql-connector-java.jar
ln -s /usr/share/java/mysql-connector-java.jar /usr/hdp/current/hive-server2-hive2/lib/mysql-connector-java.jar
mysql --execute="CREATE DATABASE beast_mode_db DEFAULT CHARACTER SET utf8"
mysql --execute="CREATE USER 'bmq_user'@'localhost' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="CREATE USER 'bmq_user'@'%' IDENTIFIED BY 'Be@stM0de'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%'"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'localhost' WITH GRANT OPTION"
mysql --execute="GRANT ALL PRIVILEGES ON beast_mode_db.* TO 'bmq_user'@'%' WITH GRANT OPTION"
mysql --execute="FLUSH PRIVILEGES"
mysql --execute="COMMIT"
Section 2: Add blueprint and recipes via Cloudbreak interface
This part is super simple. Follow the User interface to load the files you just created, as depicted below
Adding a blueprint
Adding a recipe
Section 3: Automate cluster launch and terminate clusters
This is where the fun begins. For this part I created two scripts for launching and terminating the ephemeral cluster (that will then be called by the BMQ app). Both scripts rely on the cb cli that you can download from your CB instance:
Launch Cluster Script
As you will see below, the script is divided into the following parts:
Part 1: Reference the location of the cb-cli for ease of use
Part 2: Dump the content of my long-lasting cluster to a recipe that will load it
Part 3: Use the CB API to add the recipe to my CB instance
Part 4: Launch the cluster via cb-cli
#!/bin/bash
###############################
# 0. Initializing environment #
###############################
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
###################################################
# 1. Dumping current data and adding it to recipe #
###################################################
rm -rf poci-bmq-data-science.sh >/dev/null 2>&1
echo "mysql -u bmq_user -pBe@stM0de beast_mode_db --execute=\"""$(mysqldump -u bm_user -pHWseftw33# beast_mode_db 2> /dev/null)""\"" >> poci-bmq-data-science.sh
##################################
# 2. Adding recipe to cloudbreak #
##################################
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"pvidal@hortonworks.com","password":"HWseftw33#"}' "https://192.168.56.100/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
echo $TOKEN
ENCODED_RECIPE=$(base64 poci-bmq-data-science.sh)
curl -X DELETE https://192.168.56.100/cb/api/v1/recipes/user/poci-bmq-data-science -H "Authorization: Bearer $TOKEN" -k
curl -X POST https://192.168.56.100/cb/api/v1/recipes/user -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d " {
\"name\": \"poci-bmq-data-science\",
\"description\": \"Recipe loading BMQ data post BMQ cluster launch\",
\"recipeType\": \"POST_CLUSTER_INSTALL\",
\"content\": \"$POST_CLUSTER_INSTALL\"
}" -k
########################
# 3. Launching cluster #
########################
cb cluster create --cli-input-json tp-bmq-data-science.json --name bmq-data-science-$(date +%s)
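The create call returns before the cluster is actually usable, so if the BMQ app needs to wait for it, a simple polling loop over the same CLI does the trick. A rough sketch below; it just greps the cb cluster list output, so adapt the pattern to whatever status strings your cb-cli version prints:
#!/bin/bash
# Sketch: block until the freshly launched bmq-data-science cluster reports AVAILABLE
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
until cb cluster list | grep -A 3 "bmq-data-science" | grep -q "AVAILABLE"; do
  echo "Cluster not available yet, checking again in 60s..."
  sleep 60
done
echo "Cluster is available."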
Terminate Cluster Script
This script is much simpler; it uses cb cli to list the clusters running and terminate them:
#!/bin/bash
###############################
# 0. Initializing environment #
###############################
export PATH=$PATH:/Users/pvidal/Documents/Playground/cb-cli/
################################################
# 1. Get a list of clusters and terminate them #
################################################
cb cluster list | grep Name | awk -F \" '{print $4}' | while read cluster; do
echo "Terminating ""$cluster""..."
cb cluster delete --name $cluster
done
Conclusion
With this framework, I'm able to launch and terminate clusters in a matter of minutes, as depicted below. The next step will be to model a way to calculate accurate predictions for BMQ!
12-10-2018
03:07 PM
11 Kudos
Introduction
Cloudbreak documentation is fairly thorough and can be found on Hortonworks' documentation portal (e.g.
here). However, if you are as lazy as I am, you'll appreciate the few quick examples I'm listing in this article.
First and foremost, you can access the documentation of available APIs on your CB instance by using the following URL
https://[YOUR_CB_URL]/cb/apidocs ; you should see something like this:
Authenticating
Authentication in Cloudbreak is OAuth 2, and getting a token is detailed in the Cloudbreak API documentation.
Here is a simple curl call to get a token:
curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1
GET Example: Listing all blueprints
The biggest trick to the CB API is to know the URL to use and to remember to disable SSL Certificate Verification (using the -k in curl).
Here is an example call that lists all blueprints:
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
curl -X GET https://[YOUR_CB_URL]/cb/api/v1/blueprints/account -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -k
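The response comes back as a single line of JSON, so piping it through a pretty-printer makes it much easier to scan. A small sketch assuming jq is installed (python -m json.tool works just as well):
# Pretty-print the blueprint list; add a filter such as '.[].name' if you only want the names
curl -s -X GET https://[YOUR_CB_URL]/cb/api/v1/blueprints/account -H "Authorization: Bearer $TOKEN" -k | jq .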
POST Example: Adding an mpack
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
curl -X POST https://[YOUR_CB_URL]/cb/api/v1/mpacks/account -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d '{
"name": "hdf-3-2-aws",
"description": "HDF 3.2 Mangement Pack for AWS",
"mpackUrl": "http://public-repo-1.hortonworks.com/HDF/amazonlinux2/3.x/updates/3.2.0.0/tars/hdf_ambari_mp/hdf-ambari-mpack-3.2.0.0-520.tar.gz",
"purge": false,
"force": false
}' -k
POST Example: Uploading a user blueprint
To upload your own blueprints, remember to encode your blueprint in base64
TOKEN=$(curl -k -iX POST -H "accept: application/x-www-form-urlencoded" -d 'credentials={"username":"[YOUR_USER]","password":"[YOUR_PWD]"}' "https://[YOUR_CB_URL]/identity/oauth/authorize?response_type=token&client_id=cloudbreak_shell&scope.0=openid&source=login&redirect_uri=http://cloudbreak.shell" | grep location | cut -d'=' -f 3 | cut -d'&' -f 1)
ENCODED_BLUEPRINT=$(base64 [YOUR_BP_JSON_LOCATION])
curl -X POST https://[YOUR_CB_URL]/cb/api/v1/blueprints/user -H "Authorization: Bearer $TOKEN" -H 'Content-Type: application/json' -H 'cache-control: no-cache' -d " {
\"ambariBlueprint\": \"$ENCODED_BLUEPRINT\",
\"description\": \"Blueprint for HWX Data Science Uploaded\",
\"inputs\": [],
\"tags\": {},
\"name\": \"data-science-workshop-upload\",
\"hostGroupCount\": 1,
\"status\": \"USER_MANAGED\",
\"public\": true
}" -k
11-07-2018
02:58 PM
2 Kudos
Introduction
Before being able to predict one's Beast Mode Quotient, we must find an easy way to ingest data from multiple different sources, in this case flat files and APIs. Luckily for everyone, HDF, and specifically Nifi, is designed to handle multiple data inputs and outputs extremely efficiently, as I will detail in this tutorial.
A few notes about the choices of technologies selected for this tutorial and series of article:
Target - MySQL: I selected MySQL because this is what my BMQ app is going to use, but it could be any target depending on your use-case.
Sources - Fitbit, Strava, MyFitnessPal: There are a bunch of options out there to track health. Fitbit and Strava are the only ones offering a comprehensive, public API (looking at you, Garmin and Apple Health; both have no public APIs and make it difficult to extract data via files, either requiring you to parse .fit files or only exporting the full set of data); as for MyFitnessPal, I could have used the Fitbit API (after syncing with MFP) to get calories data, but I figured it would be nice to demonstrate file ingestion.
Notions you will learn in this tutorial
Here are some high level concepts highlighted in this article, that you can re-apply to any implementation:
Running MySQL queries including create, upsert and selection of date series
Using Nifi processors maintaining state
Configuring schema registry for file format conversion in Nifi (e.g CSV > JSON)
Using Nifi to call OAuth 2 APIs (using refresh token, bearer tokens, etc.)
Implementing and using a distributed map cache in Nifi
Parsing and splitting JSONs in Nifi
Implementing simple math operations using Nifi language Agenda
This tutorial is divided into the following sections:
Section 1: Setting up a MySQL DB to host health and fitness data
Section 2: Creating a Nifi flow to consume MyFitnessPal CSV exports
Section 3: Creating a Nifi flow to consume Fitbit health & sleep data
Section 4: Creating a Nifi flow to consume Strava activity data
Section 1: Setting up a MySQL DB to host health and fitness data
Step 1: Create the MySQL DB and users
Before being able to do anything, you will have to create a MySQL instance. There are a bunch of tutorials out there that explain how to do it; here is one for
centos 7 for instance.
Once you have this set up, connect as
root and run the following database creation script:
DROP DATABASE IF EXISTS `BEAST_MODE_DB`;
CREATE DATABASE `BEAST_MODE_DB`;
CREATE USER 'bm_user'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
GRANT ALL ON BEAST_MODE_DB.* TO 'bm_user'@'%' IDENTIFIED BY '[YOUR_PASSWORD]';
Step 2: Create the DB tables
Connect as bm_user and run this script to create the appropriate tables:
-- MySQL dump 10.14 Distrib 5.5.60-MariaDB, for Linux (x86_64)
--
-- Host: localhost Database: BEAST_MODE_DB
-- ------------------------------------------------------
-- Server version 5.5.60-MariaDB
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40103 SET @OLD_TIME_ZONE=@@TIME_ZONE */;
/*!40103 SET TIME_ZONE='+00:00' */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
USE BEAST_MODE_DB;
--
-- Table structure for table `ACTIVITY_HISTORY`
--
DROP TABLE IF EXISTS `ACTIVITY_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `ACTIVITY_HISTORY` (
`ID` bigint(20) NOT NULL,
`START_TIME` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`DURATION` double DEFAULT NULL,
`DISTANCE` double DEFAULT NULL,
`ELEVATION_GAIN` double DEFAULT NULL,
`AVG_HR` double DEFAULT NULL,
`AVG_PACE` double DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `BMQ_HISTORY`
--
DROP TABLE IF EXISTS `BMQ_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `BMQ_HISTORY` (
`ID` mediumint(9) NOT NULL AUTO_INCREMENT,
`BMQ` int(11) NOT NULL,
`TYPE` char(30) DEFAULT NULL,
`TIME_ENTERED` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=30 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `GENERATOR`
--
DROP TABLE IF EXISTS `GENERATOR`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `GENERATOR` (
`ID` mediumint(9) NOT NULL AUTO_INCREMENT,
`VALUE` varchar(50) DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=161 DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Dumping data for table `GENERATOR`
--
LOCK TABLES `GENERATOR` WRITE;
/*!40000 ALTER TABLE `GENERATOR` DISABLE KEYS */;
INSERT INTO `GENERATOR` VALUES (1,'dummy'),(2,'dummy'),(3,'dummy'),(4,'dummy'),(5,'dummy'),(6,'dummy'),(7,'dummy'),(8,'dummy'),(9,'dummy'),(10,'dummy'),(11,'dummy'),(12,'dummy'),(13,'dummy'),(14,'dummy'),(15,'dummy'),(16,'dummy'),(17,'dummy'),(18,'dummy'),(19,'dummy'),(20,'dummy'),(21,'dummy'),(22,'dummy'),(23,'dummy'),(24,'dummy'),(25,'dummy'),(26,'dummy'),(27,'dummy'),(28,'dummy'),(29,'dummy'),(30,'dummy'),(31,'dummy'),(32,'dummy'),(33,'dummy'),(34,'dummy'),(35,'dummy'),(36,'dummy'),(37,'dummy'),(38,'dummy'),(39,'dummy'),(40,'dummy'),(41,'dummy'),(42,'dummy'),(43,'dummy'),(44,'dummy'),(45,'dummy'),(46,'dummy'),(47,'dummy'),(48,'dummy'),(49,'dummy'),(50,'dummy'),(51,'dummy'),(52,'dummy'),(53,'dummy'),(54,'dummy'),(55,'dummy'),(56,'dummy'),(57,'dummy'),(58,'dummy'),(59,'dummy'),(60,'dummy'),(61,'dummy'),(62,'dummy'),(63,'dummy'),(64,'dummy'),(65,'dummy'),(66,'dummy'),(67,'dummy'),(68,'dummy'),(69,'dummy'),(70,'dummy'),(71,'dummy'),(72,'dummy'),(73,'dummy'),(74,'dummy'),(75,'dummy'),(76,'dummy'),(77,'dummy'),(78,'dummy'),(79,'dummy'),(80,'dummy'),(81,'dummy'),(82,'dummy'),(83,'dummy'),(84,'dummy'),(85,'dummy'),(86,'dummy'),(87,'dummy'),(88,'dummy'),(89,'dummy'),(90,'dummy'),(91,'dummy'),(92,'dummy'),(93,'dummy'),(94,'dummy'),(95,'dummy'),(96,'dummy'),(97,'dummy'),(98,'dummy'),(99,'dummy'),(100,'dummy'),(101,'dummy'),(102,'dummy'),(103,'dummy'),(104,'dummy'),(105,'dummy'),(106,'dummy'),(107,'dummy'),(108,'dummy'),(109,'dummy'),(110,'dummy'),(111,'dummy'),(112,'dummy'),(113,'dummy'),(114,'dummy'),(115,'dummy'),(116,'dummy'),(117,'dummy'),(118,'dummy'),(119,'dummy'),(120,'dummy'),(121,'dummy'),(122,'dummy'),(123,'dummy'),(124,'dummy'),(125,'dummy'),(126,'dummy'),(127,'dummy'),(128,'dummy'),(129,'dummy'),(130,'dummy'),(131,'dummy'),(132,'dummy'),(133,'dummy'),(134,'dummy'),(135,'dummy'),(136,'dummy'),(137,'dummy'),(138,'dummy'),(139,'dummy'),(140,'dummy'),(141,'dummy'),(142,'dummy'),(143,'dummy'),(144,'dummy'),(145,'dummy'),(146,'dummy'),(147,'dummy'),(148,'dummy'),(149,'dummy'),(150,'dummy'),(151,'dummy'),(152,'dummy'),(153,'dummy'),(154,'dummy'),(155,'dummy'),(156,'dummy'),(157,'dummy'),(158,'dummy'),(159,'dummy'),(160,'dummy');
/*!40000 ALTER TABLE `GENERATOR` ENABLE KEYS */;
UNLOCK TABLES;
--
-- Table structure for table `HEALTH_HISTORY`
--
DROP TABLE IF EXISTS `HEALTH_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `HEALTH_HISTORY` (
`DIARY_DAY` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`TOTAL_CALORIES_OUT` double NOT NULL,
`TOTAL_MINUTES_RECORDED` double NOT NULL,
`STEPS` double DEFAULT NULL,
`ELEVATION` double DEFAULT NULL,
`AVG_REST_HR` double DEFAULT NULL,
`REST_MINUTES` double DEFAULT NULL,
`REST_CAL_OUT` double DEFAULT NULL,
`AVG_FAT_BURN_HR` double DEFAULT NULL,
`FAT_BURN_MINUTES` double DEFAULT NULL,
`FAT_BURN_CAL_OUT` double DEFAULT NULL,
`AVG_CARDIO_HR` double DEFAULT NULL,
`CARDIO_MINUTES` double DEFAULT NULL,
`CARDIO_CAL_OUT` double DEFAULT NULL,
`AVG_PEAK_HR` double DEFAULT NULL,
`PEAK_MINUTES` double DEFAULT NULL,
`PEAK_CAL_OUT` double DEFAULT NULL,
PRIMARY KEY (`DIARY_DAY`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `NUTRITION_HISTORY`
--
DROP TABLE IF EXISTS `NUTRITION_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `NUTRITION_HISTORY` (
`DIARY_DAY` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`MEAL` varchar(50) NOT NULL,
`TOTAL_CALORIES_IN` double NOT NULL,
`CARB` double NOT NULL,
`PROT` double NOT NULL,
`FAT` double NOT NULL,
PRIMARY KEY (`DIARY_DAY`,`MEAL`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;
--
-- Table structure for table `SLEEP_HISTORY`
--
DROP TABLE IF EXISTS `SLEEP_HISTORY`;
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `SLEEP_HISTORY` (
`DIARY_DAY` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`TOTAL_MINUTES_ASLEEP` double NOT NULL,
`TOTAL_MINUTES_IN_BED` double NOT NULL,
`REM_MINUTES` double NOT NULL,
`LIGHT_MINUTES` double NOT NULL,
`DEEP_MINUTES` double NOT NULL,
`WAKE_MINUTES` double NOT NULL,
PRIMARY KEY (`DIARY_DAY`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
/*!40101 SET character_set_client = @saved_cs_client */;
/*!40103 SET TIME_ZONE=@OLD_TIME_ZONE */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
-- Dump completed on 2018-11-07 0:14:54
You will note that a GENERATOR table has been created. It is needed to build date-series queries, as detailed later; the sketch below shows the idea.
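To illustrate with a minimal sketch (assuming the GENERATOR table above and an arbitrary anchor date of '2018-11-07'), cross-joining GENERATOR with a counter variable yields one row per consecutive day:
-- Minimal sketch of the date-series trick used later in this tutorial:
-- each GENERATOR row advances the @num counter by one, producing consecutive dates.
SELECT date_format(adddate('2018-11-07', @num := @num + 1), '%Y-%m-%d') AS list_date
FROM GENERATOR, (SELECT @num := -1) num
LIMIT 7;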
Step 3: Update your BMQ every day
To get your perceived BMQ, you will have to enter it manually. Before we develop the BMQ app to do that, you can enter it using the following example queries every day:
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('3', 'morning', '2018-11-06 7:00:00');
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('3', 'pre-workout', '2018-11-06 7:50:00');
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('4', 'post-workout', '2018-11-06 8:50:00');
insert into BMQ_HISTORY (BMQ, TYPE, TIME_ENTERED) values
('3', 'evening', now());
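If you want to double-check what you have entered so far, a quick verification query against the BMQ_HISTORY table created above could look like this:
-- Optional check: list the most recent BMQ entries
SELECT ID, BMQ, TYPE, TIME_ENTERED
FROM BMQ_HISTORY
ORDER BY TIME_ENTERED DESC
LIMIT 10;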
Section 2: Creating a Nifi flow to consume MyFitnessPal CSV exports
The goal is to setup the following flow:
Step 1: Survey folder for new CSV Files
List Files in MFP Folder: ListFile Processor
This processor will monitor a folder for new files, based on timestamp, then maintain the last file read in its state, as seen below after right-clicking on the processor and selecting view state:
Here is a list of properties to configure that are not default:
Input directory: [YOUR_LOCAL_PATH]
File filter: .*
Run Schedule: 12 h
Get Latest File: FetchFile Processor
This processor gets the file from the list generated previously and sends it in a flow file. It is configured using default properties.
Step 2: Convert CSV to JSON
Update Schema Name: UpdateAttribute Processor
This processor updates the schema name attribute (later used by next processors and controller services).
HortonworksSchemaRegistry Controller Service
This controller service is pointing to your HDF Schema Registry. The only non-default property to configure in the controller service before enabling is:
Schema Registry URL: http://[YOUR_SCHEMA_REGISTRY_HOST]:7788/api/v1/
CSVReader Controller Service
This controller service is going to read your CSV and use your Hortonworks Schema Registry to parse its schema. The following non-default properties should be configured in the controller service before enabling it:
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: HortonworksSchemaRegistry
Treat First Line as Header: true
Ignore CSV Header Column Names: true
JsonRecordSetWriter Controller Service
This controller service is going to write your JSON output based on your Hortonworks Schema Registry. The following non-default properties should be configured in the controller service before enabling it:
Schema Write Strategy: Set 'schema.name' Attribute
Schema Access Strategy: Use 'Schema Name' Property
Schema Registry: HortonworksSchemaRegistry
Convert CSV to JSON: ConvertRecord Processor
This processor does the record conversion. Configure it with your enabled controllers as such:
Record Reader: CSVReader
Record Writer: JsonRecordSetWriter
Step 3: Get fields needed for insert
SplitJson Processor
This processor splits each meal in your export in a flow file, by using the following non-default property:
JsonPath Expression: $.[*]
Extract Relevant Attributes: EvaluateJsonPath Processor
This processor extracts the relevant elements of your split JSON into attributes. It is configured using the following:
Destination: flowfile-attribute
carb: $.carbohydrates_g
diary_day: $.date
fat: $.fat_g
meal: $.meal
prot: $.protein_g
total_calories_in: $.calories
Step 4: Insert Data to MySQL
Create Upsert Query: ReplaceText Processor
This processor creates the query to be executed on your MySQL server to upsert the different lines. Here is the ReplacementValue you should use (a resolved example follows the template):
INSERT INTO NUTRITION_HISTORY
(DIARY_DAY, MEAL, TOTAL_CALORIES_IN, CARB, PROT, FAT)
VALUES
('${diary_day}', '${meal}', ${total_calories_in}, ${carb}, ${prot}, ${fat}) ON DUPLICATE KEY UPDATE
TOTAL_CALORIES_IN = ${total_calories_in},
CARB = ${carb},
PROT = ${prot},
FAT = ${fat}
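For reference, here is what a resolved statement could look like once Nifi has substituted the attributes (the values below are purely hypothetical):
-- Hypothetical resolved upsert, as the PutSQL processor would receive it
INSERT INTO NUTRITION_HISTORY
(DIARY_DAY, MEAL, TOTAL_CALORIES_IN, CARB, PROT, FAT)
VALUES
('2018-11-06', 'breakfast', 450, 60, 25, 12) ON DUPLICATE KEY UPDATE
TOTAL_CALORIES_IN = 450,
CARB = 60,
PROT = 25,
FAT = 12;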
DBCPConnectionPool Controller Service
This controller service is enabling connection to your MySQL server, configured by the following non-default properties:
Database Connection URL: jdbc:mysql://[YOUR_MYSQL_SERVER_ADDRESS]:3306/BEAST_MODE_DB?useLegacyDatetimeCode=false&serverTimezone=America/New_York
Database Driver Class Name: com.mysql.jdbc.Driver
Database Driver Location(s): file:///home/nifi/mysql_jdbc_drivers/mysql-connector-java-8.0.11.jar (or wherever you put your mysql jdbc driver)
Upsert into MySQL: PutSQL Processor
This processor executes the query generated in the ReplaceText processor, and relies on your DBCPConnectionPool controller service, configured as such:
JDBC Connection Pool: DBCPConnectionPool
Section 3: Creating a Nifi flow to consume Fitbit health & sleep data
The goal is to set up the following flow (separated into two screenshots, so you know it's going to be fun). It assumes that you registered an application with the Fitbit API and have a refresh/bearer token ready with the right permissions (see details over at the Fitbit documentation).
Step 1: Get Delta parameters
Get Latest Date: ExecuteSQL Processor
This processor runs a query to get the series of dates not yet in the BMQ DB, in order to call the Fitbit API only for those days. It relies on the DBCPConnectionPool configuration detailed above (set in the Database Connection Pooling Service property) and uses a 12 h Run Schedule. Here is the query it should run:
select list_date from (select
date_format(
adddate((select IFNULL(max(hist.diary_day),'2018-11-07') from HEALTH_HISTORY hist), @num:=@num+1),
'%Y-%m-%d'
) list_date
from
GENERATOR,
(select @num:=-1) num
limit
150) dates
where list_date < CURRENT_DATE()
I realize this is not the only method to generate such a list (GenerateTableFetch, etc. could also work), but it makes sure that I only fetch the dates the DB actually needs, even if something else changes the DB.
ConvertAvroToJSON Processor
This processor converts the Avro response of the previous processor. Make sure you configure the following property to avoid a splitting error when only one record is returned:
Wrap Single Record: true
SplitJson Processor
This processor splits each record in the query in a flow file, by using the following non-default property:
JsonPath Expression: $.*
Extract Relevant Attributes: EvaluateJsonPath Processor
This processor extracts the relevant elements of your split JSON into attributes. It is configured using the following:
Destination: flowfile-attribute
DIARY_DAY: $.list_date
Step 2: Handle Token Refresh
DistributedMapCacheServer Controller Service
This controller service is the server that will run the distributed cache. It should be configured using all default properties before being enabled.
DistributedMapCacheClientService Controller Service
This controller service is connecting to the DistributedMapCacheServer configured previously. Before enabling it, just configure the following:
Server Hostname: [YOUR_HOSTNAME]
Get Token From Cache: FetchDistributedMapCache Processor
This processor will try and fetch a refresh token from the DistributedMapCacheServer. If not found, we will send it to an UpdateAttribute to set a default value (e.g. for the first run), otherwise we will send it to the InvokeHTTP to get a new token. Here are the non-default properties to be configured:
Cache Entry Identifier: refresh-token-fitbit
Distributed Cache Service: DistributedMapCacheClientService
Put Cache Value In Attribute: CurrentRefreshToken
A few notes: this configuration requests a refresh token every run, which works with my 12 h run schedule. Ideally, you would store the token expiration in the distributed cache as well and only request a refresh token when needed, but I didn't want to overcomplicate things. Moreover, in a production environment, I would probably recommend persisting these tokens on disk and not only in memory, in case of failure.
Use Default Value if Not Found: UpdateAttribute Processor
This processor updates the CurrentRefreshToken if FetchDistributedMapCache returns not-found. Configure it using this property:
CurrentRefreshToken: [A_VALID_REFRESH_TOKEN]
Refresh Fitbit Token: InvokeHTTP Processor
This calls the Fitbit API to get a new valid token and a refresh token. Configure the following non-default properties to run it:
HTTP Method: POST
Remote URL: https://api.fitbit.com/oauth2/token?grant_type=refresh_token&refresh_token=${CurrentRefreshToken}
Content-Type: application/x-www-form-urlencoded
Authorization: Basic [YOUR_AUTHORIZATION_CODE] (see the Fitbit documentation)
Extract Token: EvaluateJsonPath Processor
This processor extracts the bearer and refresh tokens from the HTTP response, then continues to two routes: one storing the refresh token back in the cache, the other using the bearer token to call the Fitbit API. It is configured using the following:
Destination: flowfile-attribute
BEARER_TOKEN: $.access_token
REFRESH_TOKEN: $.refresh_token
Put Refresh Token in Flowfile: ReplaceText Processor
This processor puts the refresh token in the flow file content so it can be consumed downstream. Set the ReplacementValue to:
${REFRESH_TOKEN}
PutDistributedMapCache Processor
This processor takes the refresh token in the flowfile and stores it under refresh-token-fitbit for next time the flow is executed:
Cache Entry Identifier: refresh-token-fitbit
Distributed Cache Service: DistributedMapCacheClientService
Step 3: Call Fitbit APIs (Health and Sleep)
For this step, I will only detail one processor. First, because, as you can see on the screenshot above, both Get FitBit Daily Summary processors are very similar (they are just calling different endpoints of the Fitbit API). Secondly, because the flow Extract Relevant Attributes > Create SQL Query > Upsert to MySQL has been described above. Finally, because I will keep the Calculate Averages and Sums part for the Strava flow in the next section. All detailed processors can be found here:
feed-fitbit-anonymized.xml
Get FitBit Daily Summary: InvokeHTTP Processor
This calls the Fitbit API to get a daily health summary:
HTTP Method: GET
Remote URL: https://api.fitbit.com/1/user/-/activities/date/${DIARY_DAY}.json
Authorization: Bearer ${BEARER_TOKEN}
Section 4: Creating a Nifi flow to consume Strava activity data
As for the previous section, the goal is to set up a flow calling Strava's activity API. It assumes that you registered an application with the Strava API and have a refresh/bearer token ready with the right permissions (see details over at the Strava documentation). As opposed to Fitbit, the refresh token remains the same, so there is no need for a distributed cache. I'm only going to detail the Convert Units processor, to talk about the Nifi expression language. The rest of the flow can be found here: feed-strava-anonymized.xml
Convert Units: UpdateAttribute Processor
This processor uses some of the extracted data to convert units before putting them into attributes later used in your upsert query. Here are the calculations executed (a worked numeric example follows the list):
AVG_HR: ${AVG_HR:isEmpty():ifElse(0.0, ${AVG_HR})} (makes sure that we have a value if the API does not return HR, i.e. when no HR data is present in the activity)
AVG_PACE: ${MILE_IN_METERS:divide(${AVG_PACE:toDecimal():multiply(60.0)})} (converts speed from m/s to min/mile)
DISTANCE: ${DISTANCE:divide(${MILE_IN_METERS})} (converts distance from meters to miles)
DURATION: ${DURATION:toDecimal():divide(60.0)} (converts duration from seconds to mins)
START_TIME: ${START_TIME:replaceAll("T"," "):replaceAll("Z","")} (converts timestamp format to MySQL format)
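To make the arithmetic concrete, here is the same math worked out in SQL with hypothetical inputs (a speed of 3.0 m/s, a distance of 5000 m, a duration of 3600 s, and MILE_IN_METERS assumed to be 1609.34):
-- Hypothetical worked example of the conversions above (1 mile = 1609.34 m)
SELECT
1609.34 / (3.0 * 60) AS avg_pace_min_per_mile, -- 3.0 m/s is roughly 8.94 min/mile
5000 / 1609.34 AS distance_miles,              -- 5000 m is roughly 3.11 miles
3600 / 60.0 AS duration_minutes;               -- 3600 s is 60 minutes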