05-17-2022
09:56 AM
What is Apache Knox?
In summary, Apache Knox was designed to provide access to the Big Data environment through a reverse proxy gateway, enabling perimeter protection when combined with a firewall. The Cloudera Data Platform supports Apache Knox and makes it simpler to install and administer by integrating it with the other components of the platform.
Please check the Cloudera Security Overview for additional information about how to increase security in CDP.
The figure below shows a high-level architecture of Apache Knox:
When using the Apache Knox Gateway, we benefit from a number of advantages, such as:
Single sign-on and enterprise authentication
Perimeter security
Central access management
Granular access control to cluster services
Proxied JDBC connections and streaming
Extensible API
Please check the Apache Knox official site for more information.
However, because all connections go through Knox, it becomes a critical component for access to the environment. So, how can we answer the following questions:
How to monitor the health of this gateway?
What services are most used?
How to know the number of requests per service?
How to measure or even monitor response times?
Knox Architecture Overview
Before answering these questions, let's take a step back and look at the Apache Knox architecture.
Apache Knox Gateway is built on top of the Jetty web server and designed to be extensible. In other words, we can choose which extensions to enable in order to customize the service to meet our needs. In addition, it is possible to create new extensions for specific needs.
The example below shows how Apache Knox enables a user to connect to Hive, HBase, etc.
While services make it possible to integrate solutions and expose endpoints to users, providers make it possible to extend existing functionality, enabling its use by all services.
Below are some examples of both component types:
Services                 Providers
gateway-service-hbase    gateway-provider-identity-assertion-regex
gateway-service-health   gateway-provider-rewrite
gateway-service-hive     gateway-provider-security-authz-acls
gateway-service-oozie    gateway-provider-security-jwt
gateway-service-webhdfs  gateway-provider-security-shiro
In order to customize the Apache Knox services and providers, we need to create a topology file. This file defines the services and their respective endpoints so that Knox can expose them to users. Below is an example of a topology definition:
<topology>
  <gateway>
    <provider>
      <!-- provider definition here -->
    </provider>
    :
  </gateway>
  <service>
    <!-- service definition here -->
  </service>
  :
</topology>
The screenshot below shows the Apache Knox login page:
The next one is the main page, including all configured topologies. Note that in this example the cdp-proxy topology has been configured to provide access to Atlas, Cloudera Manager, HBase, NameNode, Ranger and Solr. To get access to this page, navigate to /gateway/homepage/home
We can also access the admin page at /gateway/manager/admin-ui:
Enabling Metrics
Now that we know how Knox works and what a topology is, let's configure the metrics service.
The first step to accessing the Apache Knox metrics endpoint is enabling the metrics service. To do this, create a new topology, following the example below:
Create a file at /var/lib/knox/gateway/conf/topologies/health.xml, adjusting it to your needs.
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.pamRealm</name>
        <value>org.apache.knox.gateway.shirorealm.KnoxPamRealm</value>
      </param>
      <param>
        <name>main.pamRealm.service</name>
        <value>login</value>
      </param>
      <param>
        <name>sessionTimeout</name>
        <value>30</value>
      </param>
      <param>
        <name>urls./**</name>
        <value>authcBasic</value>
      </param>
    </provider>
    <provider>
      <role>authorization</role>
      <name>AclsAuthz</name>
      <enabled>false</enabled>
      <param>
        <name>knox.acl</name>
        <value>admin;*;*</value>
      </param>
    </provider>
    <provider>
      <role>identity-assertion</role>
      <name>HadoopGroupProvider</name>
      <enabled>true</enabled>
      <param>
        <name>CENTRAL_GROUP_CONFIG_PREFIX</name>
        <value>gateway.group.config.</value>
      </param>
    </provider>
  </gateway>
  <service>
    <role>HEALTH</role>
  </service>
</topology>
You can also duplicate a topology from the admin user interface and change the cloned topology as you need.
Now we can test the endpoint using the following command. Notice that the metrics are still empty.
$ curl -ku user:password "https://knox-server:8443/gateway/health/v1/metrics?pretty=true"
{
"version" : "4.0.0",
"gauges" : { },
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : { }
}
Enabling Metrics for Services
Now that we've enabled the endpoint that exposes the metrics, it's time to produce them. To do this, you need to enable the following properties in Cloudera Manager > Knox > Configuration:
Collecting the Metrics
Before collecting the metrics, we need to generate some traffic; otherwise, no metrics will be produced on the endpoints. Briefly browse through the Knox endpoints so that some metrics can be generated.
Finally, let's collect the metrics:
$ curl -ku knoxui:knoxui "https://knox-server:8443/gateway/health/v1/metrics?pretty=true"
{
"version" : "4.0.0",
"gauges" : {
"PS-MarkSweep.count" : {
"value" : 3
},
"PS-MarkSweep.time" : {
"value" : 341
},
"PS-Scavenge.count" : {
"value" : 34
},
"PS-Scavenge.time" : {
"value" : 543
},
"blocked.count" : {
"value" : 0
},
"count" : {
"value" : 49
},
"daemon.count" : {
"value" : 27
},
"deadlock.count" : {
"value" : 0
},
"deadlocks" : {
"value" : [ ]
},
"direct.capacity" : {
"value" : 229778
},
"direct.count" : {
"value" : 25
},
"direct.used" : {
"value" : 229778
},
"heap.committed" : {
"value" : 1008205824
},
"heap.init" : {
"value" : 1073741824
},
"heap.max" : {
"value" : 1008205824
},
"heap.usage" : {
"value" : 0.1354310764227444
},
"heap.used" : {
"value" : 136542400
},
"loaded" : {
"value" : 14253
},
"mapped.capacity" : {
"value" : 0
},
"mapped.count" : {
"value" : 0
},
"mapped.used" : {
"value" : 0
},
"name" : {
"value" : "210554@nightly-71x-nu-1.nightly-71x-nu.root.hwx.site"
},
"new.count" : {
"value" : 0
},
"non-heap.committed" : {
"value" : 140599296
},
"non-heap.init" : {
"value" : 2555904
},
"non-heap.max" : {
"value" : -1
},
"non-heap.usage" : {
"value" : -1.36887128E8
},
"non-heap.used" : {
"value" : 136887128
},
"pools.Code-Cache.committed" : {
"value" : 37879808
},
"pools.Code-Cache.init" : {
"value" : 2555904
},
"pools.Code-Cache.max" : {
"value" : 251658240
},
"pools.Code-Cache.usage" : {
"value" : 0.1494166056315104
},
"pools.Code-Cache.used" : {
"value" : 37601920
},
"pools.Compressed-Class-Space.committed" : {
"value" : 10616832
},
"pools.Compressed-Class-Space.init" : {
"value" : 0
},
"pools.Compressed-Class-Space.max" : {
"value" : 1073741824
},
"pools.Compressed-Class-Space.usage" : {
"value" : 0.009245157241821289
},
"pools.Compressed-Class-Space.used" : {
"value" : 9926912
},
"pools.Metaspace.committed" : {
"value" : 92102656
},
"pools.Metaspace.init" : {
"value" : 0
},
"pools.Metaspace.max" : {
"value" : -1
},
"pools.Metaspace.usage" : {
"value" : 0.9702731699724273
},
"pools.Metaspace.used" : {
"value" : 89364736
},
"pools.PS-Eden-Space.committed" : {
"value" : 230686720
},
"pools.PS-Eden-Space.init" : {
"value" : 268435456
},
"pools.PS-Eden-Space.max" : {
"value" : 232783872
},
"pools.PS-Eden-Space.usage" : {
"value" : 0.05936479998064471
},
"pools.PS-Eden-Space.used" : {
"value" : 13819168
},
"pools.PS-Eden-Space.used-after-gc" : {
"value" : 0
},
"pools.PS-Old-Gen.committed" : {
"value" : 716177408
},
"pools.PS-Old-Gen.init" : {
"value" : 716177408
},
"pools.PS-Old-Gen.max" : {
"value" : 716177408
},
"pools.PS-Old-Gen.usage" : {
"value" : 0.11489939654728679
},
"pools.PS-Old-Gen.used" : {
"value" : 82288352
},
"pools.PS-Old-Gen.used-after-gc" : {
"value" : 69268496
},
"pools.PS-Survivor-Space.committed" : {
"value" : 61341696
},
"pools.PS-Survivor-Space.init" : {
"value" : 44564480
},
"pools.PS-Survivor-Space.max" : {
"value" : 61341696
},
"pools.PS-Survivor-Space.usage" : {
"value" : 0.66009521484375
},
"pools.PS-Survivor-Space.used" : {
"value" : 40491360
},
"pools.PS-Survivor-Space.used-after-gc" : {
"value" : 40491360
},
"runnable.count" : {
"value" : 8
},
"terminated.count" : {
"value" : 0
},
"timed_waiting.count" : {
"value" : 30
},
"total.committed" : {
"value" : 1148805120
},
"total.init" : {
"value" : 1076297728
},
"total.max" : {
"value" : 1008205823
},
"total.used" : {
"value" : 273550024
},
"unloaded" : {
"value" : 20
},
"uptime" : {
"value" : 7181810
},
"vendor" : {
"value" : "AdoptOpenJDK OpenJDK 64-Bit Server VM 25.232-b09 (1.8)"
},
"waiting.count" : {
"value" : 11
}
},
"counters" : { },
"histograms" : { },
"meters" : { },
"timers" : {
"client./gateway/cdp-proxy-api/webhdfs/v1.GET-requests" : {
"count" : 2,
"max" : 2.345871303,
"mean" : 1.2936517390995506,
"min" : 0.27252996300000004,
"p50" : 0.27252996300000004,
"p75" : 2.345871303,
"p95" : 2.345871303,
"p98" : 2.345871303,
"p99" : 2.345871303,
"p999" : 2.345871303,
"stddev" : 1.0365540554822608,
"m15_rate" : 1.4747582595749215E-4,
"m1_rate" : 1.2646567866717584E-52,
"m5_rate" : 2.0046683275245812E-11,
"mean_rate" : 2.8082304509532665E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"client./gateway/health/v1/metrics.GET-requests" : {
"count" : 3,
"max" : 1.6867495060000002,
"mean" : 0.31597803,
"min" : 0.31597803,
"p50" : 0.31597803,
"p75" : 0.31597803,
"p95" : 0.31597803,
"p98" : 0.31597803,
"p99" : 0.31597803,
"p999" : 0.31597803,
"stddev" : 5.5409548260855284E-21,
"m15_rate" : 4.943702006689124E-4,
"m1_rate" : 8.06508251818969E-9,
"m5_rate" : 1.8189077642291002E-4,
"mean_rate" : 4.20218023620169E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
},
"service./gateway/cdp-proxy-api/webhdfs/v1/.get-requests" : {
"count" : 3,
"max" : 0.009574565,
"mean" : 0.007894472334077354,
"min" : 0.005590579,
"p50" : 0.008588437,
"p75" : 0.009574565,
"p95" : 0.009574565,
"p98" : 0.009574565,
"p99" : 0.009574565,
"p999" : 0.009574565,
"stddev" : 0.0017015385392143616,
"m15_rate" : 2.2244612427544609E-4,
"m1_rate" : 2.061840874032061E-52,
"m5_rate" : 3.0575391686277505E-11,
"mean_rate" : 4.213687720803925E-4,
"duration_units" : "seconds",
"rate_units" : "calls/second"
}
}
}
Conclusion
In this article, we saw an overview of Apache Knox and how to enable its metrics service. Once this service is active, it is possible to monitor Apache Knox access statistics and use them to anticipate possible problems or bottlenecks in access to the services.
To learn more about Apache Knox, please see the following links:
Apache Knox Gateway
Apache Knox Home Page
Apache Knox Overview
Apache Knox User's Guide
Apache Knox Developer's Guide
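Returning to the questions posed at the start of this article (which services are most used, how many requests each one receives, and what the response times look like), the timers section of the metrics output can be condensed with a short script. The following is only a minimal sketch: the function name summarize_timers is hypothetical, and the sample data is a trimmed copy of the output shown earlier rather than a live call to the endpoint.

```python
import json


def summarize_timers(metrics):
    """Return one row per timer: (name, request count, mean seconds, p95 seconds)."""
    rows = []
    for name, timer in metrics.get("timers", {}).items():
        rows.append((name, timer["count"], timer["mean"], timer["p95"]))
    # Busiest endpoints first.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Trimmed sample copied from the metrics output shown in this article.
sample = json.loads("""
{
  "timers": {
    "client./gateway/cdp-proxy-api/webhdfs/v1.GET-requests": {
      "count": 2, "mean": 1.2936517390995506, "p95": 2.345871303
    },
    "client./gateway/health/v1/metrics.GET-requests": {
      "count": 3, "mean": 0.31597803, "p95": 0.31597803
    }
  }
}
""")

for name, count, mean, p95 in summarize_timers(sample):
    print(f"{name}: {count} requests, mean {mean:.3f}s, p95 {p95:.3f}s")
```

To use it against a real gateway, one option is to save the curl output to a file and load that file with json.load instead of the embedded sample.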
04-07-2022
01:23 AM
When we develop applications for the Cloudera Data Platform, we often need third-party libraries such as NumPy, SciPy, and Pandas, or even different versions of existing components, such as Python.
On the other hand, installing and maintaining Python environments can be a complex and time-consuming task for a system administrator. Extra caution is needed when installing other versions of Python and its modules at the operating system level, due to the requirements of Cloudera Manager and Cloudera Runtime.
You could also choose to install Python virtual environments, but that would still require effort to keep all cluster nodes up to date. Running Python in virtual environments in YARN mode requires extra development effort and significantly increases the total application size due to dependencies.
Another excellent option is to distribute Anaconda as a parcel, but be aware that generating custom parcels requires the Anaconda Enterprise version.
What's the best option for me?
During Cloudera Professional Services engagements, many development teams and CDP administrators ask me what the best way to solve this is.
There is no easy answer to this question. As clusters are often shared across multiple teams and often managed by yet another, it becomes very difficult to achieve a common solution that everyone likes.
Although there is no simple solution for all types of scenarios, I was able to extract some important requirements, common to all cases:
It should be simple to build and maintain
It should be easy to install and update
The install and update process should be automated
It must have low or no dependency on other OS libraries
Most popular modules must be pre-installed
Solution proposal
Note: The proposal below is intended to serve as a reference only. When using it, be sure to test it properly in a non-production environment.
To achieve our goal, the proposal is to use Parcels as a means of controlling versioning and distribution on all cluster nodes.
What are parcels?
According to official Cloudera documentation:
Parcels are self-contained and installed in a versioned directory, which means that multiple versions of a given parcel can be installed side-by-side. You can then designate one of these installed versions as the active one.
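The idea of side-by-side versioned directories with one active version can be illustrated with a small sketch. This is not how Cloudera Manager is implemented; it only mimics the concept in a temporary directory using a symbolic link. The activate function and the second version, 3.6.10-1, are hypothetical.

```python
import os
import tempfile


def activate(parcels_root, name, version):
    """Point <parcels_root>/<name> at the versioned directory <name>-<version>."""
    target = f"{name}-{version}"
    if not os.path.isdir(os.path.join(parcels_root, target)):
        raise FileNotFoundError(target)
    link = os.path.join(parcels_root, name)
    tmp_link = link + ".tmp"
    # Create the new link under a temporary name, then rename it over the
    # old link so the switch happens in a single step.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, link)
    return link


# Two versions installed side by side in a throwaway directory.
root = tempfile.mkdtemp()
for version in ("3.6.10-0", "3.6.10-1"):
    os.mkdir(os.path.join(root, f"MY_CONDA-{version}"))

link = activate(root, "MY_CONDA", "3.6.10-0")
print(os.readlink(link))  # MY_CONDA-3.6.10-0
link = activate(root, "MY_CONDA", "3.6.10-1")
print(os.readlink(link))  # MY_CONDA-3.6.10-1
```

Both versioned directories remain on disk after the switch; only the unversioned name changes its target.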
Solution Lifecycle
The diagram below explains the lifecycle of the proposed solution:
As you may have noticed, the dotted line indicates that a change was required and, therefore, that a new version should be built.
In the next steps, for illustrative purposes only, we will create a customized version of a Parcel containing Python 3.6 and the Pandas library.
Creating a new Parcel
For the following steps, I assume you have a Linux server running RedHat or a compatible distribution, Internet access, and basic Unix knowledge.
STEP 1: Prepare your environment
For the following steps, it is necessary to download and compile the Cloudera Manager Extensions:
Install git: yum install -y git
Install Java JDK: yum install -y java-1.8.0-openjdk
Install Maven 3: yum install -y maven
Clone the cm_ext project: git clone https://github.com/cloudera/cm_ext.git
Go to the Validator Project directory: cd cm_ext/validator
Build Validator: mvn package
Look at the target directory and make sure the validator.jar exists as we will use it later: ls target/validator.jar
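Before running the build, it can help to confirm that the tools installed above are actually on the PATH. The check below is a minimal sketch using only the Python standard library; the function name missing_tools is hypothetical.

```python
import shutil


def missing_tools(tools=("git", "java", "mvn")):
    """Return the names of required commands that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All build tools found")
```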
STEP 2: Start a new Parcel
Create a directory for your Parcel: mkdir -p /usr/local/parcels/MY_CONDA-3.6.10-0
Notice that the version could be any version you want, as long as you follow the PACKAGENAME-VERSION format.
Go to the Parcel directory: cd /usr/local/parcels/MY_CONDA-3.6.10-0
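The PACKAGENAME-VERSION convention can be checked programmatically. The sketch below assumes that the package name itself contains no hyphen, so everything after the first hyphen is treated as the version; both helper names are hypothetical, and the el7 suffix matches the parcel file name used later in this article.

```python
import re

# Assumption: the name has no hyphen, so the first hyphen separates
# the package name from the version.
PARCEL_DIR = re.compile(r"^(?P<name>[A-Za-z0-9_]+)-(?P<version>.+)$")


def split_parcel_dir(dirname):
    """Split 'NAME-VERSION' into (name, version), or raise ValueError."""
    match = PARCEL_DIR.match(dirname)
    if match is None:
        raise ValueError(f"not in PACKAGENAME-VERSION format: {dirname!r}")
    return match.group("name"), match.group("version")


def parcel_filename(dirname, distro="el7"):
    """Build the parcel file name for a parcel directory and distro suffix."""
    name, version = split_parcel_dir(dirname)
    return f"{name}-{version}-{distro}.parcel"


print(split_parcel_dir("MY_CONDA-3.6.10-0"))  # ('MY_CONDA', '3.6.10-0')
print(parcel_filename("MY_CONDA-3.6.10-0"))   # MY_CONDA-3.6.10-0-el7.parcel
```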
STEP 3: Download and install Miniconda
Download Miniconda from https://docs.conda.io/en/latest/miniconda.html
Install Miniconda in /usr/local/parcels/MY_CONDA-3.6.10-0/miniconda3 (enter this path when the installer asks for the install location), and don't forget to read and agree with the licensing terms: bash /path/to/Miniconda3-latest-Linux-x86_64.sh
Go to the Parcel directory: cd /usr/local/parcels/MY_CONDA-3.6.10-0
Install Python 3.6.10: miniconda3/bin/conda install python=3.6.10
Check Python version: miniconda3/bin/python --version
Install pandas: miniconda3/bin/conda install pandas
At this point, you can install other required Python libraries.
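A quick way to confirm which libraries ended up in the new environment is to ask the interpreter itself. The script below is a sketch; the module list is only an example, and module_version is a hypothetical helper name.

```python
import importlib
import sys


def module_version(name):
    """Return the module's __version__, 'unknown' if it has none,
    or None if the module cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


print("python", sys.version.split()[0])
for name in ("pandas", "numpy"):
    version = module_version(name)
    print(name, version if version is not None else "NOT INSTALLED")
```

Saved under a name of your choice (for example check_modules.py, a hypothetical file name), it would be run with the parcel's interpreter: miniconda3/bin/python check_modules.py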
STEP 4: Setup your Parcel
Go to the Parcel directory: cd /usr/local/parcels/MY_CONDA-3.6.10-0
Create a meta directory: mkdir meta
Create a meta/parcel.json file:
{
  "schema_version": 1,
  "name": "MY_CONDA",
  "version": "3.6.10-0",
  "setActiveSymlink": true,
  "depends": "",
  "replaces": "",
  "conflicts": "",
  "provides": [ ],
  "scripts": {
    "defines": "my_conda_env.sh"
  },
  "packages": [ ],
  "components": [
    {
      "name": "miniconda3",
      "version": "4.10.3",
      "pkg_version": "4.10.3",
      "pkg_release": "4.10.3"
    },
    {
      "name": "python",
      "version": "3.6.10",
      "pkg_version": "3.6.10",
      "pkg_release": "3.6.10"
    }
  ],
  "users": {
    "spark": {
      "longname": "Spark",
      "home": "/var/lib/spark",
      "shell": "/usr/sbin/nologin",
      "extra_groups": [ ]
    }
  },
  "groups": [ ]
}
Create an empty meta/my_conda_env.sh file:
#!/bin/sh
# EOF
Validate the parcel.json file: java -jar /path/to/validator.jar -p /usr/local/parcels/MY_CONDA-3.6.10-0/meta/parcel.json
Validate the parcel's directory: java -jar /path/to/validator.jar -d /usr/local/parcels/MY_CONDA-3.6.10-0/
Move to the parent directory: cd ..
Package the parcel as a gzipped tar archive targeting RedHat EL 7: tar zcf MY_CONDA-3.6.10-0-el7.parcel MY_CONDA-3.6.10-0/ --owner=root --group=root
Validate the newly generated parcel: java -jar /path/to/validator.jar -f /usr/local/parcels/MY_CONDA-3.6.10-0-el7.parcel
Generate the SHA-1 checksum file for the parcel: sha1sum < MY_CONDA-3.6.10-0-el7.parcel | cut -d ' ' -f 1 > MY_CONDA-3.6.10-0-el7.parcel.sha
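For teams that want to automate the packaging and checksum steps, the same result can be approximated with the Python standard library. This is a sketch under assumptions, not a replacement for the validator: package_parcel is a hypothetical helper, and the demo runs against a throwaway directory rather than the real parcel.

```python
import hashlib
import os
import tarfile
import tempfile


def _as_root(tarinfo):
    """Record every entry as owned by root, like --owner=root --group=root."""
    tarinfo.uid = tarinfo.gid = 0
    tarinfo.uname = tarinfo.gname = "root"
    return tarinfo


def package_parcel(parcel_dir, distro="el7"):
    """Create <dir>-<distro>.parcel next to parcel_dir, plus its .sha file."""
    parcel_dir = os.path.abspath(parcel_dir)
    parent, dirname = os.path.split(parcel_dir)
    parcel_path = os.path.join(parent, f"{dirname}-{distro}.parcel")
    # Gzipped tar archive whose top-level entry is the parcel directory.
    with tarfile.open(parcel_path, "w:gz") as tar:
        tar.add(parcel_dir, arcname=dirname, filter=_as_root)
    # SHA-1 of the archive, written as a bare hex digest like the
    # sha1sum | cut pipeline produces.
    sha1 = hashlib.sha1()
    with open(parcel_path, "rb") as parcel:
        for chunk in iter(lambda: parcel.read(1024 * 1024), b""):
            sha1.update(chunk)
    with open(parcel_path + ".sha", "w") as sha_file:
        sha_file.write(sha1.hexdigest() + "\n")
    return parcel_path


# Demo on a throwaway directory with a placeholder parcel.json.
work = tempfile.mkdtemp()
demo = os.path.join(work, "MY_CONDA-3.6.10-0")
os.makedirs(os.path.join(demo, "meta"))
with open(os.path.join(demo, "meta", "parcel.json"), "w") as handle:
    handle.write("{}\n")
print(package_parcel(demo))
```

The resulting file should still be checked with the validator, exactly as in the manual steps.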
STEP 5: Install and distribute the Parcel
Copy the parcel and the sha files to the /opt/cloudera/parcel-repo directory on the Cloudera Manager node.
Change the ownership: sudo chown cloudera-scm: /opt/cloudera/parcel-repo/MY_CONDA-3.6.10-0-el7.parcel*
Go to Cloudera Manager > Parcels, and click Check for New Parcels
After the parcels are detected, click Distribute
Click Activate to activate the Parcel
Make sure everything is working in all nodes: /opt/cloudera/parcels/MY_CONDA/miniconda3/bin/python --version
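The final check can also be scripted, which is convenient when it has to be repeated on every node. The sketch below only wraps the command above; interpreter_version is a hypothetical helper, and how the script is run on each node is left to your own tooling.

```python
import subprocess
import sys


def interpreter_version(python_path):
    """Run '<python_path> --version' and return its output, or None on failure."""
    try:
        result = subprocess.run(
            [python_path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


PARCEL_PYTHON = "/opt/cloudera/parcels/MY_CONDA/miniconda3/bin/python"
print(interpreter_version(PARCEL_PYTHON) or "parcel interpreter not found")
# Sanity check against the interpreter that is running this script.
print(interpreter_version(sys.executable))
```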
Summary
While this article is not intended to be a definitive guide on this subject, as each company has its own requirements, consider it a simple introduction to how Parcels work in the CDP environment and how to leverage them to be more productive in Cloudera environments.
A Parcel is a binary distribution format that allows us to easily install, update or even remove a set of files in a simple, uniform, versioned, consistent, and distributed way in a Cloudera environment.
We can leverage this to distribute any set of files, such as Java or Python dependencies, different versions of Python, Hive UDFs, HBase coprocessors, scripts, third-party tools, etc. In addition, it is also possible to integrate with existing components, such as adding a library to the classpath of Hive, HBase, Spark, etc.
To learn more about Parcels, including advanced usage and integration with existing Cloudera components, see the following links:
Overview of Parcels
Running Spark Python applications
Installing Anaconda in Cloudera CDH
Conda Project
Cloudera Manager Extensions
Parcels: What and Why?
Why you should use Parcels
The Parcel Format
Building a Parcel
The Parcel Repository Format