Introduction

 

Cloudera Data Platform (CDP) Base does not ship a Quickstart/Sandbox VM like the ones available for the CDH and HDP releases. Those VMs helped a lot of people (including me) learn the open-source components, and a similar environment makes it easy to see the community improvements in CDP Runtime.

The objective of this tutorial is to create a VM from scratch with some automation (a shell script and a Cloudera Manager cluster template), so that anyone who wants to use or learn CDP can run a Sandbox/Quickstart-like environment on their own machine.

 

Pre-Requisites

 

This exercise was performed on macOS, but you can also install Vagrant and VirtualBox on Windows or Linux machines (https://www.vagrantup.com/docs/installation).

The versions below are the ones tested at the time of writing and may change in the future.

 

The machine needs to have at least:

  • 80 GB of free disk space
  • 12 GB of free RAM
  • 8 free vCPUs
  • A good internet connection
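
The minimums above can be checked from a terminal before you start. The following is a minimal sketch and not part of the original automation: all function names (`meets_minimum`, `free_disk_gb`, `total_ram_gb`, `cpu_count`, `check_host`) are made up for illustration, and it reports total RAM rather than free RAM, so treat the memory result as an upper bound.

```shell
#!/bin/sh
# Sketch: compare host resources against the minimums listed above.
# All function names here are illustrative, not part of Vagrant or CDP.

# Succeeds when the available amount is at least the required minimum.
meets_minimum() {
  label="$1"; have="$2"; need="$3"
  if [ "$have" -ge "$need" ]; then
    echo "OK: $label $have (minimum $need)"
  else
    echo "LOW: $label $have (minimum $need)"
    return 1
  fi
}

# Free disk space, in whole GB, on the filesystem holding the current directory.
free_disk_gb() {
  df -Pk . | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Total (not free) RAM in whole GB: /proc/meminfo on Linux, sysctl on macOS.
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
  else
    sysctl -n hw.memsize | awk '{ print int($1 / 1024 / 1024 / 1024) }'
  fi
}

# Number of CPUs currently online.
cpu_count() {
  getconf _NPROCESSORS_ONLN
}

# Runs all three checks and fails if any of them is below the minimum.
check_host() {
  rc=0
  meets_minimum "free disk (GB)" "$(free_disk_gb)" 80 || rc=1
  meets_minimum "total RAM (GB)" "$(total_ram_gb)" 12 || rc=1
  meets_minimum "CPUs" "$(cpu_count)" 8 || rc=1
  return "$rc"
}

check_host || echo "This host is below the suggested minimums."
```

Run it from the folder where you plan to keep the VM files, since the disk check looks at the filesystem of the current directory.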

Install VirtualBox and Vagrant

VirtualBox and Vagrant are the tools we'll use to run the virtualized environment. To download and install them, run the following commands on your host machine:

 

For Mac

$ brew install --cask virtualbox

$ brew install --cask vagrant

$ brew install --cask vagrant-manager

 

Vagrant Manager is optional; it lets you manage your virtual machines from the menu bar.

 

For Windows

Download the VirtualBox and Vagrant installers from their official sites and run them. Also review the Vagrant installation notes about running alongside another hypervisor.

 

For Linux

Follow the VirtualBox and Vagrant installation instructions for your Linux distribution.

 

Step 1: Vagrant CentOS 7 Virtual Machine Setup with CDP

  1. Download the CentOS box and the setup script into an empty folder. In this example, I'll use the "~/cdpvm/" folder. On your host machine, run the following commands:

 

$ cd ~
$ mkdir cdpvm
$ cd cdpvm
$ wget https://cloud.centos.org/centos/7/vagrant/x86_64/images/CentOS-7-x86_64-Vagrant-2004_01.VirtualBox.box 
$ wget https://raw.githubusercontent.com/carrossoni/CDPDCTrial/master/scripts/VMSetup.sh
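
macOS does not include wget by default, so the commands above may fail with "command not found" on a fresh machine. A minimal sketch of a fallback follows; `fetch` is a made-up helper name, and it assumes curl is available (it ships with macOS).

```shell
# Sketch: download a URL to a file with wget if present, otherwise with curl.
# "fetch" is an illustrative helper name, not an existing command.
fetch() {
  url="$1"; out="$2"
  if command -v wget >/dev/null 2>&1; then
    wget -q -O "$out" "$url"
  elif command -v curl >/dev/null 2>&1; then
    # -f: fail on HTTP errors, -L: follow redirects
    curl -fsSL -o "$out" "$url"
  else
    echo "neither wget nor curl is installed" >&2
    return 127
  fi
}

# Example (run inside ~/cdpvm):
# fetch https://raw.githubusercontent.com/carrossoni/CDPDCTrial/master/scripts/VMSetup.sh VMSetup.sh
```

The function returns a non-zero status when the download fails, so it can be chained with `&&` before the next step.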

 

  • In the folder where you downloaded the box file (cd ~/cdpvm), register the box and initialize the virtual machine with the following commands:

 

$ vagrant box add CentOS-7-x86_64-Vagrant-2004_01.VirtualBox.box --name centos7
$ vagrant plugin install vagrant-disksize
$ vagrant init centos7

 

  • After this step, you should have a file called "Vagrantfile" in the same directory. Open it with an editor (vim, for example) and add the following below the line config.vm.box = "centos7":

 

  config.vm.network "public_network"
  config.vm.network :forwarded_port, guest: 7180, host: 7180    # Cloudera Manager
  config.vm.network :forwarded_port, guest: 8889, host: 8889    # HUE
  config.vm.network :forwarded_port, guest: 9870, host: 9870    # HDFS NameNode web UI
  config.vm.network :forwarded_port, guest: 6080, host: 6080    # Ranger Admin
  config.vm.network :forwarded_port, guest: 21050, host: 21050  # Impala (JDBC/ODBC)
  config.vm.hostname = "localhost"
  config.disksize.size = "80GB"
  config.vm.provision "shell", path: "VMSetup.sh"

  config.vm.provider "virtualbox" do |vb|
    # Display the VirtualBox GUI when booting the machine
    vb.gui = true
    # Customize the amount of memory (in MB) and the number of CPUs on the VM:
    vb.memory = "12024"
    vb.cpus = "8"
  end
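
Because these lines are pasted by hand, it is easy to miss one. The sketch below is one possible sanity check and not part of the original automation: `check_vagrantfile` is a made-up helper that only looks for a few of the settings from this step with grep; it does not parse Ruby.

```shell
# Sketch: report which of the expected settings are missing from a Vagrantfile.
# "check_vagrantfile" is an illustrative helper name.
check_vagrantfile() {
  file="${1:-Vagrantfile}"
  missing=0
  for setting in \
    'config.vm.box = "centos7"' \
    'guest: 7180' \
    'config.disksize.size' \
    'path: "VMSetup.sh"'
  do
    # -F matches the setting as a fixed string, not as a regular expression
    if ! grep -qF -- "$setting" "$file"; then
      echo "missing: $setting"
      missing=1
    fi
  done
  return "$missing"
}

# Example (run inside ~/cdpvm):
# check_vagrantfile Vagrantfile && vagrant validate
```

If the check passes, `vagrant validate` can then confirm that Vagrant itself accepts the file.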

 

  • Save the file. Now we can bring up the VM:

 

$ vagrant up

 

  • Vagrant will ask which interface to bridge to the public network (first run only). Normally this is the interface you use to connect to the internet; in my case it is en0.
  • After this, the VM will be provisioned and the automated CDP installation will start. This can take up to one hour depending on your connection, since the process configures the VM and installs Cloudera Manager and all the services. The automation is located at https://github.com/carrossoni/CDPDCTrial/
  • The cluster created from the template will contain the following services:

 

HUE
HDFS
Hive Metastore
Impala
Ranger
ZooKeeper

 

  • After the install you can add more services such as NiFi or Kafka, depending on the resources you've reserved for the VM.
  • When the execution finishes you should see the script exit as in the screenshot (this takes about 30 minutes to one hour, depending on your connection, since all the packages and parcels needed to provision CDP Runtime are downloaded):
    [Screenshot: final output of the provisioning script]
  • After this, the VM reboots for a fresh start. Wait around 5 minutes for the services to spin up, then go to the next step.
  • Troubleshooting:
    • If the install process fails, the likely cause is a problem during the VM configuration. If CM was installed, you can go to http://localhost:7180 directly and finish the install process manually in the Cloudera Manager UI.
    • There are two options to SSH into the VM. The easy one is to go to the directory where the Vagrantfile is located (the one you used to set up the VM) and type:

 

$ vagrant ssh

 

 

    • The other option is to configure your VM in the VirtualBox UI to attach a USB device and copy the clouderakey.pem file that was created during the automation process. You can then SSH into the machine with "ssh -i clouderakey.pem vagrant@cloudera".

After connecting with either option, you can sudo on the box and start inspecting the machine. Check that the hostname and IP in /etc/hosts are configured properly (this is the most common issue, since it depends on your machine's network).
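
One way to script that /etc/hosts check is sketched below. It is an illustration only: `hosts_ip_for` and `check_hosts_entry` are made-up helper names, and it assumes the VM's hostname is `cloudera`, as in the ssh command above. The idea is that the hostname should resolve to the VM's own address rather than to a loopback address.

```shell
# Sketch: look up a hostname in a hosts file and flag loopback mappings.
# Both function names are illustrative helpers.

# Prints the first IP address that the hosts file maps to the given hostname.
hosts_ip_for() {
  name="$1"; file="${2:-/etc/hosts}"
  awk -v name="$name" '
    /^[[:space:]]*#/ { next }   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
  ' "$file"
}

# Succeeds only when the hostname maps to a non-loopback address.
check_hosts_entry() {
  ip="$(hosts_ip_for "$1" "${2:-/etc/hosts}")"
  case "$ip" in
    "")        echo "no entry for $1"; return 1 ;;
    127.*|::1) echo "$1 maps to loopback ($ip)"; return 1 ;;
    *)         echo "$1 maps to $ip" ;;
  esac
}

# Example (run inside the VM):
# check_hosts_entry cloudera
```

If the check fails, compare the entry with the address reported by `hostname -I` inside the VM and correct /etc/hosts accordingly.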

If you get an error message after the template import, Cloudera Manager can show what happened: fix the error, then resume the import cluster template process from the running commands tab. At this stage it is normally a matter of viewing the logs and/or checking whether resources have run out. As a last step, you can restart the cluster in case something was stuck. This is normal since we are working in a constrained environment.
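
Since the services need a few minutes to start after the reboot, it can help to poll Cloudera Manager on the forwarded port before moving on. The following is a minimal sketch under stated assumptions: `wait_for_url` is a made-up helper, curl is assumed to be installed on the host, and the attempt count and delay are arbitrary defaults.

```shell
# Sketch: poll a URL until it answers or the attempts run out.
# "wait_for_url" is an illustrative helper name.
wait_for_url() {
  url="$1"; attempts="${2:-30}"; delay="${3:-10}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP status 400 and above; --max-time bounds each try
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      echo "up: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for $url"
  return 1
}

# Example (run on the host): 30 attempts, 10 seconds apart
# wait_for_url http://localhost:7180 30 10
```

A successful answer only means the Cloudera Manager web server is reachable; the individual services may still be starting.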

 

Step 2: Cloudera Data Platform Access

 

  1. After the automated process, our CDP Runtime is ready (we've actually provisioned it in a single step)! In your browser, connect to CM at the following URL:

 

http://localhost:7180

 



 

  • The username/password is admin/admin. After the first login, choose the 60-day trial option and click "Continue".
  • The Welcome page appears. Click the Cloudera logo at the top left, since the automated process has already added a cluster.
  • At this point all the services are started. Some errors may appear since we are working in a resource-constrained environment; the logs in Cloudera Manager usually make it easy to see what's happening. You can also suppress warning messages that are not critical.

Our environment is now ready to work with and learn more about CDP!

 

HUE and Data Access

 

  1. Log in to HUE at http://localhost:8889/hue. For the first login we will use admin/admin; this becomes the admin user for HUE.
  2. As an example, I'll upload the California COVID-19 Testing data that I've downloaded to my machine.
  3. In HUE, go to the left panel and choose "Importer" → Type = File, select the /user/admin directory, click "Upload a file", choose your file (statewide_testing.csv), and then click "Open". Now click the file that you've uploaded to move to the next step.
  4. Click Next and HUE will infer the table name, field types, etc. You can change them or leave them as is, then click "Submit".
  5. At the end you should see that the job succeeded. Close the job status window and click the Query button.
  6. Now that we have our data, we can query it with Impala SQL!

 

 

(Optional) Ranger Security Masking with Impala Example

 

  1. To query the environment with the system user/password that we've created (cloudera/cloudera), we first need to allow access for this user in Ranger. Click the Ranger service and then Ranger Admin WebUI.
  2. Now we have the initial Ranger screen. Log in with the user/password admin/cloudera123.
  3. In the HADOOP SQL section, click the Hadoop SQL link. We will create a new policy that allows access to the new table but shows the tested column masked, with null results. To do that, click the Masking tab and then Add New Policy, using the values shown in the screenshot:
    [Screenshot: masking policy values]
  4. Click the Add button, then go back to the Access tab and click Add New Policy, using the parameters shown in the screenshot:
    [Screenshot: access policy parameters]
  5. Click the Add button. Our user should now be able to select data from this table, seeing only the masked values. First we'll configure the user in HUE: in the left panel, click the user-initial button and then "Manage Users".
  6. Click "Add User" and enter the username cloudera with the password cloudera. You can skip steps 2 and 3 by clicking "Add user" directly.
  7. Log out of HUE and log in with the newly created user, go to the query editor, and select the data again.

You should see the masking policy in action!

 

Summary

In this blog we've learned:

  • How to set up a Vagrant CentOS 7 machine with VirtualBox and the CDP packages
  • How to configure CDP-DC for the first run
  • How to configure data access
  • How to set up simple security policies with the masking feature

You can play with the services, install other parcels like Kafka/NiFi/Kudu to create a streaming ingestion pipeline, and query in real time with Spark/Impala. Of course, you'll need more resources for that; these can be set at the beginning, during the VM configuration.

Comments

Hi @duhizjame 

 

As you may have seen, network issues are the most common problem, since they depend on other variables.

 

After reading your description, I understand that you're stuck in the download/distribute phase. Normally this happens because of insufficient disk space, since the process needs to download all the parcels and then use them for the install. Since the parcels are already in /opt/cloudera/parcel-repo, the download itself is OK.

 

Do the logs in /var/log/cloudera-scm-server show anything?

 

Regards,

Luiz


I was also thinking that the network is the problem, not disk space, since the host health is unknown all the time.

The server logs don't show anything unusual. The agent logs show that it is heartbeating on host:

 

 

[root@cloudera ~]# netstat -an | grep -e 9000 -e 9001
tcp        0      0 10.0.2.15:9000          0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:9001          0.0.0.0:*               LISTEN     

 

Could it be a problem that port 9001 is open on localhost and not on the cloudera host (10.0.2.15)?
I do not know where the setting for port 9001 is in the agent's config file.

 

When I try to install the host manually, this is the health inspector log:

Inspect Hosts for Correctness 
Validations
Inspector ran on all 1 hosts.
Individual hosts resolved their own hostnames correctly.
No errors were found while looking for conflicting init scripts.
The following errors were found while checking /etc/hosts...
 View Details
In /etc/hosts on cloudera, the hostname cloudera is mapped to cloudera, whereas it should be mapped to 10.0.2.15.
All hosts resolved localhost to 127.0.0.1.
All hosts checked resolved each other's hostnames correctly and in a timely manner.
Host clocks are approximately in sync (within ten minutes).
Host time zones are consistent across the cluster.
The user hdfs is missing on the following hosts:
 View Details
cloudera
The user mapred is missing on the following hosts:
 View Details
The user zookeeper is missing on the following hosts:
 View Details
The user oozie is missing on the following hosts:
 View Details
The user hbase is missing on the following hosts:
 View Details
The user hue is missing on the following hosts:
 View Details
The user sqoop is missing on the following hosts:
 View Details
The user impala is missing on the following hosts:
 View Details
The user sentry is missing on the following hosts:
 View Details
The group hdfs is missing on the following hosts:
 View Details
The group mapred is missing on the following hosts:
 View Details
The group zookeeper is missing on the following hosts:
 View Details
The group oozie is missing on the following hosts:
 View Details
The group hbase is missing on the following hosts:
 View Details
The group hue is missing on the following hosts:
 View Details
The group hadoop is missing on the following hosts:
 View Details
The group hive is missing on the following hosts:
 View Details
The group sqoop is missing on the following hosts:
 View Details
The group impala is missing on the following hosts:
 View Details
The group sentry is missing on the following hosts:
 View Details
No conflicts detected between packages and parcels.
No kernel versions that are known to be bad are running.
No problems were found with /proc/sys/vm/swappiness on any of the hosts.
Transparent Huge Page Compaction is enabled and can cause significant performance problems. Run "echo never > /sys/kernel/mm/transparent_hugepage/defrag" and "echo never > /sys/kernel/mm/transparent_hugepage/enabled" to disable this, and then add the same command to an init script such as /etc/rc.local so it will be set on system reboot. The following hosts are affected:
 View Details
cloudera
Hue Python version dependency is satisfied.
Hue Psycopg2 version for PostgreSQL is satisfied for both CDH 5 and CDH 6.
1 hosts are reporting with NONE version
All checked hosts in each cluster are running the same version of components.
All managed hosts have consistent versions of Java.
All checked Cloudera Management Daemons versions are consistent with the server.
All checked Cloudera Management Agents versions are consistent with the server.
Version Summary
Hosts that do not belong to any cluster
All Hosts
cloudera
Component	Version	Hosts	Release	Version
Supervisord	3.4.0	cloudera	Unavailable	Not applicable
Cloudera Manager Agent	7.1.4	cloudera	6363010.el7	Not applicable
Cloudera Manager Management Daemons	7.1.4	cloudera	6363010.el7	Not applicable
Crunch (CDH 5 only)	Unavailable	cloudera	Unavailable	Not installed or path is incorrect
flume	Unavailable	cloudera	Unavailable	Not installed or path is incorrect

Nice article, @carrossoni !

I know it's been a while, but just saw this for the first time 🙂

Nice article, but it is not working any more.

It fails to find the MariaDB repo, as MariaDB 10 was archived.