About akanto

akanto · ‎05-09-2017

Hi @Marius Dieckmann, if you found the answer useful, please consider accepting it :). Thanks.

akanto · ‎05-04-2017

Hi, this does not seem to be related to the version of OpenStack. It looks like the creation of the OS::Neutron::Router resource failed. The resource is defined here: https://github.com/sequenceiq/cloudbreak/blob/master/cloud-openstack/src/main/resources/templates/openstack-heat.ftl#L66 . The floating IP pool is not created by Cloudbreak; we just use the ID that you specify during network creation: https://github.com/sequenceiq/cloudbreak/blob/master/cloud-openstack/src/main/resources/templates/openstack-heat.ftl#L158 . Could you try deploying into an existing subnet? Create a network and a subnet manually, then specify the network, subnet and floating pool IDs when you create the network in Cloudbreak. Please see the attached screenshot. Thanks, Attila

akanto · ‎04-09-2017

The intention of this article is to show how simple it is to set up an HDP cluster that is suitable for training neural networks on GPUs in a distributed fashion. An explanation of how distributed training works is beyond the scope of this article, but if you are interested you can read about that here.

Training neural networks is computationally intensive and involves lots of matrix manipulations, such as multiplication, addition and element-wise operations. These calculations are highly parallelizable and can be significantly sped up by using GPUs instead of general-purpose CPUs.

Neural networks are often trained on very large datasets, where the training data is too large to fit on one machine and is therefore partitioned across several machines. If the data is already on different machines, it might be desirable not to transfer it across the network during training, but instead to move the processing next to the data and train the neural network in a distributed fashion.

This distributed machine learning tutorial uses the following key components:

- Deeplearning4j: open-source distributed ML library
- Apache Spark: open-source cluster-computing framework
- HDCloud: tool for provisioning and managing HDP clusters on AWS

Preparations

HDCloud 1.14.1 has recently been released and is required to launch a cluster for this tutorial. If you are not familiar with HDCloud, you can get started with the HDCloud documentation, which guides you through setting up your HDCloud controller. Once your HDCloud controller is up, you can launch a cluster with the CLI tool:

    wget https://raw.githubusercontent.com/akanto/ml/hcc-cuda-blog/scripts/test-dl4j.json
    hdc create-cluster --cli-input-json test-dl4j.json

If you are not familiar with the CLI, you can find a detailed description here and here.
The first command above downloads the test-dl4j.json file, a template that describes your cluster and looks like this:

    {
      "ClusterName": "test-dl4j",
      "HDPVersion": "2.6",
      "ClusterType": "Data Science: Apache Spark 1.6, Apache Zeppelin 0.7.0",
      "Master": {
        "InstanceType": "g2.2xlarge",
        "VolumeType": "ephemeral",
        "VolumeSize": 60,
        "VolumeCount": 1,
        "RecoveryMode": "MANUAL",
        "Recipes": [
          {
            "URI": "https://raw.githubusercontent.com/akanto/ml/hcc-cuda-blog/scripts/install-nvidia-driver.sh",
            "Phase": "post"
          },
          {
            "URI": "https://raw.githubusercontent.com/akanto/ml/hcc-cuda-blog/scripts/ka-mnist.sh",
            "Phase": "post"
          }
        ]
      },
      "Worker": {
        "InstanceType": "g2.2xlarge",
        "VolumeType": "ephemeral",
        "VolumeSize": 60,
        "VolumeCount": 1,
        "InstanceCount": 3,
        "RecoveryMode": "AUTO",
        "Recipes": [
          {
            "URI": "https://raw.githubusercontent.com/akanto/ml/hcc-cuda-blog/scripts/install-nvidia-driver.sh",
            "Phase": "post"
          }
        ]
      },
      "Compute": {
        "InstanceType": "g2.2xlarge",
        "VolumeType": "ephemeral",
        "VolumeSize": 60,
        "VolumeCount": 1,
        "InstanceCount": 0,
        "RecoveryMode": "AUTO",
        "Recipes": [
          {
            "URI": "https://raw.githubusercontent.com/akanto/ml/hcc-cuda-blog/scripts/install-nvidia-driver.sh",
            "Phase": "post"
          }
        ]
      },
      "SSHKeyName": "REPLACE-WITH-YOUR-KEY",
      "RemoteAccess": "0.0.0.0/0",
      "WebAccess": true,
      "HiveJDBCAccess": true,
      "ClusterComponentAccess": true,
      "ClusterAndAmbariUser": "admin",
      "ClusterAndAmbariPassword": "admin",
      "InstanceRole": "CREATE"
    }

There are a few things worth noticing in this cluster template file:

- g2.2xlarge instances are used, since these machines ship with a high-performance Nvidia GPU with 1536 CUDA cores and 4 GB of video memory
- multiple custom scripts are executed as part of the cluster setup: install-nvidia-driver.sh is responsible for installing the required Nvidia drivers on each node, and ka-mnist.sh checks out the git repository that holds the example source code for training and evaluating the neural network
- SSHKeyName is just a placeholder; you need to add your own SSH key there

After the cluster has been created successfully, you need to restart all of its nodes in order to load the Nvidia kernel modules and libraries. All nodes can be restarted simultaneously by executing the following command on the master node:

    sudo salt -G 'hostgroup:worker' cmd.run 'reboot' && sudo reboot

If you are wondering how the Salt command above works, you can take a look at this article.

After the restart has finished, SSH to the master node again and verify that the Nvidia drivers and libraries are properly loaded:

    sudo salt '*' cmd.run 'nvidia-smi -q | head'
    sudo salt '*' cmd.run 'ldconfig -p | grep cublas'

The output of the above commands should look like this:

    ip-10-0-91-181.eu-west-1.compute.internal:
        ==============NVSMI LOG==============
        Timestamp : Sun Apr 9 10:58:52 2017
        Driver Version : 352.99
        Attached GPUs : 1
        GPU 0000:00:03.0
        Product Name : GRID K520
        Product Brand : Grid
    ip-10-0-91-181.eu-west-1.compute.internal:
        libcublas.so.7.5 (libc6,x86-64) => /opt/nvidia/cuda/lib64/libcublas.so.7.5
        libcublas.so (libc6,x86-64) => /opt/nvidia/cuda/lib64/libcublas.so

Before continuing, please also ensure that all of the HDP services are running after the restart; you can track the status of the HDP services in the Ambari UI.

Execution

The neural network in this tutorial is trained on the MNIST dataset, a simple computer vision dataset that consists of images of handwritten digits. Each image of a handwritten digit is 28 * 28 pixels in size.

MnistSpark is the main class for both the training and the evaluation of the neural network. It creates a neural network with two hidden layers. The network has 784 (28 * 28) inputs, and the output layer has 10 neurons, one for each class (digit).
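To get a feel for how much data such a network carries, the following minimal sketch counts the trainable parameters of a fully connected network. The 784 inputs and 10 outputs come from the description above; the two hidden layer sizes (500 and 100) are hypothetical, since the article does not state them, and `count_parameters` is an illustrative helper, not part of Deeplearning4j.

```python
from typing import List


def count_parameters(layer_sizes: List[int]) -> int:
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    # Each pair of consecutive layers contributes one weight matrix and one bias vector.
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights
        total += n_out         # biases
    return total


# 784 inputs and 10 outputs are taken from the article;
# the hidden sizes 500 and 100 are assumptions for illustration only.
LAYER_SIZES = [28 * 28, 500, 100, 10]

if __name__ == "__main__":
    print(count_parameters(LAYER_SIZES))
```

The count matters for distributed training because the full parameter set is exchanged between the workers and the master in every averaging round.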
The implementation is based on the so-called synchronous parameter averaging method provided by the TrainingMaster class, which from a bird's-eye view does the following in a loop:

- distributes the global model parameters to each worker
- executes the training on each worker in parallel, but only on a subset of the data
- fetches the changed parameters from each worker
- sets the global parameters to the average of the parameters from each worker

To see it in action, just submit the Spark job as the spark user:

    sudo su spark
    cd /home/spark/ml && ./submit.sh

submit.sh executes the training and the evaluation, and finally prints the evaluation result of the network on the test data set:

    ==========================Scores========================================
    Accuracy: 0.9499
    Precision: 0.9505
    Recall: 0.9489
    F1 Score: 0.9497
    ========================================================================

While the Spark job is running, you can check the GPU utilisation on the worker nodes:

    sudo nvidia-smi -l 1

    +------------------------------------------------------+
    | NVIDIA-SMI 352.99     Driver Version: 352.99         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GRID K520           On   | 0000:00:03.0     Off |                  N/A |
    | N/A   45C    P0    50W / 125W |    646MiB /  4095MiB |     31%      Default |
    +-------------------------------+----------------------+----------------------+

If you are interested, you can find more information about Deeplearning4j running on Spark here.
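The synchronous parameter averaging loop described in this article can be sketched in a few lines of plain Python. This is only an illustration of the idea under simplifying assumptions, not the Deeplearning4j TrainingMaster implementation: the "model" is a toy linear function y = w * x + b, the "workers" run sequentially in one process, and all names (`train_on_shard`, `average`, `parameter_averaging`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Params = List[float]                    # [w, b] of the toy model y = w * x + b
Shard = Sequence[Tuple[float, float]]   # (x, y) pairs held by one worker


def train_on_shard(params: Params, shard: Shard, lr: float) -> Params:
    """One local pass of SGD over a worker's own subset of the data."""
    w, b = params
    for x, y in shard:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def average(param_sets: List[Params]) -> Params:
    """Element-wise mean of the parameters returned by the workers."""
    n = len(param_sets)
    return [sum(p[i] for p in param_sets) / n for i in range(len(param_sets[0]))]


def parameter_averaging(shards: List[Shard], rounds: int = 200, lr: float = 0.1) -> Params:
    global_params: Params = [0.0, 0.0]
    for _ in range(rounds):
        # 1. distribute the global parameters, 2. train locally on each shard
        local = [train_on_shard(list(global_params), shard, lr) for shard in shards]
        # 3. collect the changed parameters, 4. average them into the global model
        global_params = average(local)
    return global_params


# Synthetic data for y = 2x + 1, split round-robin across three "workers".
DATA = [(i / 5, 2 * (i / 5) + 1) for i in range(-5, 7)]
SHARDS = [DATA[i::3] for i in range(3)]

if __name__ == "__main__":
    w, b = parameter_averaging(SHARDS)
    print(f"w={w:.3f} b={b:.3f}")
```

In a real cluster the local training steps run in parallel on the workers, and only the parameters travel over the network, not the training data.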

akanto · ‎02-14-2017

Cloudbreak launches new VMs. It cannot install HDP on existing VMs. If you don't need those old VMs, then you can delete them. Attila

akanto · ‎02-03-2017

That blueprint is invalid, because:

- it must not contain an input{...} section
- you should not add hosts to it, since Cloudbreak fills in the host section automatically and posts it not in the blueprint, but in a separate cluster creation template

Please find the updated blueprint attached. I tested it and it worked for me: fixed-hcc-blueprintjson.txt
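As a rough illustration of the two points above, here is a minimal Python sketch that checks a blueprint before it is handed to Cloudbreak. It assumes the offending section is a top-level key named `input` (or `inputs`) and that hosts were added as a `hosts` list inside each entry of `host_groups`; both key names are assumptions based on the description, and the helper functions are hypothetical, not part of Cloudbreak or Ambari.

```python
from typing import Any, Dict, List

# Assumed names of the section that must not appear in the blueprint.
FORBIDDEN_TOP_LEVEL_KEYS = ("input", "inputs")


def find_blueprint_problems(blueprint: Dict[str, Any]) -> List[str]:
    """Return human-readable problems that make a blueprint unsuitable for Cloudbreak."""
    problems = []
    for key in FORBIDDEN_TOP_LEVEL_KEYS:
        if key in blueprint:
            problems.append(f"top-level '{key}' section must be removed")
    for group in blueprint.get("host_groups", []):
        if "hosts" in group:
            problems.append(f"host group '{group.get('name')}' must not list hosts")
    return problems


def strip_cluster_specific_parts(blueprint: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy without the input section and without hosts in the host groups."""
    cleaned = {k: v for k, v in blueprint.items() if k not in FORBIDDEN_TOP_LEVEL_KEYS}
    if "host_groups" in blueprint:
        cleaned["host_groups"] = [
            {k: v for k, v in group.items() if k != "hosts"}
            for group in blueprint["host_groups"]
        ]
    return cleaned
```

The original dictionary is left untouched, so the problems can be reported first and the cleaned copy saved separately.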

akanto · ‎02-03-2017

Thanks, is it somehow possible to influence the default behaviour of the standard auto-created Hive views, e.g. by adding a special property to the blueprint to force the auto-instantiated views to connect to 10501? The only thing I can think of is changing the ports in the blueprint config, e.g. HIVE_SERVER to port 10501 and HIVE_SERVER_INTERACTIVE to port 10001. But that does not seem like the most elegant solution.

akanto · ‎02-03-2017

I have set up an HDP cluster which contains a HiveServer2 (listening on port 10001, using http transport mode) and a HiveServer2 Interactive (LLAP, listening on port 10501). I am using the Hive view in Ambari to execute my queries, but I am not able to figure out which HiveServer2 the Hive view is connected to. The configuration and log of the Hive view do not contain this information; the view just displays that it connects to "Local Cluster" named "hiveclust" (see attached screenshot). Is there a way to figure out which HiveServer2 is used by the Hive view? Thanks.

akanto · ‎02-02-2017

Hi @rahul gulati, at first glance it looks quite OK, although one thing I noticed is that MYSQL_SERVER, which is needed for the Hive metastore, is missing from host group 2.). By default Ambari can install MySQL as the default database for the Hive metastore. For your reference I have attached the list of available services for HDP 2.5. If you have a running Ambari, you can also get it from api/v1/stacks/HDP/versions/2.5/services?fields=components/StackServiceComponents Attila
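As a small sketch of how the output of that API call could be used, the Python below builds the request URL and groups component names by service. The response layout (an `items` list whose entries carry `components`, each with a `StackServiceComponents` object holding `service_name` and `component_name`) is an assumption about Ambari's JSON and should be checked against your own server; the base URL in the usage note is hypothetical.

```python
from typing import Any, Dict, List

# Path taken from the post above.
STACK_SERVICES_PATH = (
    "api/v1/stacks/HDP/versions/2.5/services"
    "?fields=components/StackServiceComponents"
)


def services_url(ambari_base_url: str) -> str:
    """Join the Ambari base URL with the stack services path."""
    return ambari_base_url.rstrip("/") + "/" + STACK_SERVICES_PATH


def components_by_service(payload: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group component names by service name from the (assumed) response layout."""
    result: Dict[str, List[str]] = {}
    for service in payload.get("items", []):
        for component in service.get("components", []):
            info = component.get("StackServiceComponents", {})
            service_name = info.get("service_name")
            component_name = info.get("component_name")
            if service_name and component_name:
                result.setdefault(service_name, []).append(component_name)
    return result


def missing_components(payload: Dict[str, Any], wanted: List[str]) -> List[str]:
    """Return the wanted component names that the stack does not define."""
    known = {name for names in components_by_service(payload).values() for name in names}
    return [name for name in wanted if name not in known]
```

For example, `services_url("http://ambari.example.com:8080")` gives the URL to request, and the parsed JSON response can then be passed to `components_by_service` to confirm that names such as MYSQL_SERVER exist before they are added to a host group.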

akanto · ‎01-24-2017

If you do not set the password in Cloudbreak directly, then the default password for every component is admin. If you wish, you can change the default password in the Advanced Options menu: http://sequenceiq.com/cloudbreak-docs/latest/aws/#advanced-options Attila

akanto · ‎01-16-2017

That would work. You can install Cloudbreak on that edge node by following the instructions in the documentation; of course, that node must meet the prerequisites defined there. Attila

Last Visited	‎02-11-2019 08:25 AM

Member Since	‎10-14-2015 03:52 PM
Posts	65
Kudos received	57

Cloudera Community

Re: Cannot access AmbariUI after Cloudbreak instal...

Re: Cloudbreak 1.16 - Adding security breaks Ambar...

Re: Cloubreak installation fails

Re: Hadoop Cluster deployment in azure using cloud...

Re: Setting up 4 Node Cluster(1 Edge Node, 1 Maste...

Re: Cloubreak installation fails

Re: Cloubreak installation fails

Distributed training of neural networks on GPUs

Re: Hadoop Cluster deployment in azure using cloud...

Re: Setting up 4 Node Cluster(1 Edge Node, 1 Maste...

Re: Which HiveServer2 is used by Hive Views?

Which HiveServer2 is used by Hive Views?

Re: Setting up 4 Node Cluster(1 Edge Node, 1 Maste...

Re: what is the default password for hive user on ...

Re: Can Cloudbreak custom template be deployed on ...