Created on 04-07-2022 01:23 AM
When we develop applications for the Cloudera Data Platform, we often need third-party libraries such as NumPy, SciPy, and Pandas, or even different versions of existing components, such as Python.
On the other hand, installing and maintaining Python environments can be a complex and time-consuming task for a system administrator. Extra caution is needed when installing other versions of Python and its modules at the operating system level, because of the requirements of Cloudera Manager and Cloudera Runtime.
You could also choose to install Python virtual environments, but that would still require effort to keep all cluster nodes up to date. Running Python in virtual environments in YARN mode takes extra development effort and significantly increases the total application size due to its dependencies.
Another excellent option is to distribute Anaconda as a parcel, but be aware that generating custom parcels requires the Anaconda Enterprise version.
During Cloudera's Professional Services engagement sessions, many development teams and CDP administrators ask me what the best way to solve this is.
There is no easy answer to this question. As clusters are often shared across multiple teams and often managed by yet another, it becomes very difficult to achieve a common solution that everyone likes.
Although there is no simple solution for all types of scenarios, I was able to extract some important requirements, common to all cases:
Note: The proposal below is intended to serve as a reference only. When using it, be sure to test it properly in a non-production environment.
To achieve our goal, the proposal is to use Parcels as a means of controlling versioning and distribution on all cluster nodes.
According to official Cloudera documentation:
Parcels are self-contained and installed in a versioned directory, which means that
multiple versions of a given parcel can be installed side-by-side. You can then
designate one of these installed versions as the active one.
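To make the quoted behaviour concrete, here is a minimal sketch that mimics the layout in a throwaway temporary directory. The second version, `3.6.10-1`, is an illustrative assumption, and the sketch models only the visible effect of activation (an unversioned symlink); on a real cluster Cloudera Manager manages the directories and the link itself.

```shell
#!/bin/sh
# Illustration only: mimic side-by-side parcel versions in a temp directory.
# Cloudera Manager does this for you; nothing here touches /opt/cloudera.
set -e
PARCELS_ROOT=$(mktemp -d)

# Two versions of the same parcel live in separate versioned directories.
mkdir -p "$PARCELS_ROOT/MY_CONDA-3.6.10-0" "$PARCELS_ROOT/MY_CONDA-3.6.10-1"

# Model "activation" as pointing an unversioned symlink at one version.
ln -sfn "$PARCELS_ROOT/MY_CONDA-3.6.10-0" "$PARCELS_ROOT/MY_CONDA"
first_active=$(readlink "$PARCELS_ROOT/MY_CONDA")

# Switching versions only moves the symlink; both directories stay in place.
ln -sfn "$PARCELS_ROOT/MY_CONDA-3.6.10-1" "$PARCELS_ROOT/MY_CONDA"
active=$(readlink "$PARCELS_ROOT/MY_CONDA")

echo "active version: $(basename "$active")"
```

Because applications refer to the unversioned path, they do not need to change when a new version is activated.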
The diagram below explains the lifecycle of the proposed solution:
The dotted line in the diagram indicates that a change was required and, therefore, that a new version should be built.
In the next steps, for illustrative purposes only, we will create a customized version of a Parcel containing Python 3.6 and the Pandas library.
For the following steps, I assume you have a Linux server with a RedHat or compatible version, Internet access, and basic Unix knowledge.
To begin, download and compile the Cloudera Manager Extensions, which include the parcel validator:
yum install -y git
yum install -y java-1.8.0-openjdk
yum install -y maven
git clone https://github.com/cloudera/cm_ext.git
cd cm_ext/validator
mvn package
ls target/validator.jar
mkdir -p /usr/local/parcels/MY_CONDA-3.6.10-0
cd /usr/local/parcels/MY_CONDA-3.6.10-0
bash /path/to/Miniconda3-latest-Linux-x86_64.sh
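The installer command above runs interactively and asks for an install location. As an alternative, the following is a minimal sketch of an unattended install that uses the installer's batch (`-b`) and prefix (`-p`) options, so that Miniconda lands in the `miniconda3` subdirectory the later commands expect. The `install_miniconda` helper name is hypothetical, and `/path/to/` remains a placeholder as in the original command.

```shell
#!/bin/sh
# Sketch: unattended Miniconda install into the parcel directory.
# install_miniconda is a hypothetical helper, not part of any Cloudera tool.
PARCEL_DIR=/usr/local/parcels/MY_CONDA-3.6.10-0

install_miniconda() {
    installer=$1
    prefix=$2
    if [ ! -f "$installer" ]; then
        echo "installer not found: $installer" >&2
        return 1
    fi
    # -b: batch mode (no prompts, license terms are taken as accepted)
    # -p: install prefix
    bash "$installer" -b -p "$prefix"
}

# /path/to/ is a placeholder; point it at the downloaded installer.
install_miniconda /path/to/Miniconda3-latest-Linux-x86_64.sh "$PARCEL_DIR/miniconda3" \
    || echo "Miniconda was not installed"
```

Installing directly under the versioned parcel directory keeps everything the parcel needs inside one tree, which is what gets archived later.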
cd /usr/local/parcels/MY_CONDA-3.6.10-0
miniconda3/bin/conda install python=3.6.10
miniconda3/bin/python --version
miniconda3/bin/conda install pandas
cd /usr/local/parcels/MY_CONDA-3.6.10-0
mkdir meta
{
  "schema_version": 1,
  "name": "MY_CONDA",
  "version": "3.6.10-0",
  "setActiveSymlink": true,
  "depends": "",
  "replaces": "",
  "conflicts": "",
  "provides": [ ],
  "scripts": {
    "defines": "my_conda_env.sh"
  },
  "packages": [ ],
  "components": [
    {
      "name": "miniconda3",
      "version": "4.10.3",
      "pkg_version": "4.10.3",
      "pkg_release": "4.10.3"
    },
    {
      "name": "python",
      "version": "3.6.10",
      "pkg_version": "3.6.10",
      "pkg_release": "3.6.10"
    }
  ],
  "users": {
    "spark": {
      "longname": "Spark",
      "home": "/var/lib/spark",
      "shell": "/usr/sbin/nologin",
      "extra_groups": [ ]
    }
  },
  "groups": [ ]
}
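In this article the parcel directory name (`MY_CONDA-3.6.10-0`) is the `name` and `version` from parcel.json joined by a hyphen. The sketch below is a quick sanity check of that convention before running the real validator; `json_field` and `check_parcel_dir` are hypothetical helpers, and the `sed` expression only handles simple one-per-line `"key": "value"` entries, taking the first match in the file.

```shell
#!/bin/sh
# Sketch: check that parcel.json agrees with the parcel directory name.
# json_field is a hypothetical helper; it is not a general JSON parser.
json_field() {
    # $1 = file, $2 = key; prints the first string value found for that key
    # on a line that starts with the quoted key.
    sed -n "s/^[[:space:]]*\"$2\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$1" | head -n 1
}

check_parcel_dir() {
    dir=${1%/}
    name=$(json_field "$dir/meta/parcel.json" name)
    version=$(json_field "$dir/meta/parcel.json" version)
    expected="$name-$version"
    if [ "$(basename "$dir")" = "$expected" ]; then
        echo "OK: $expected"
    else
        echo "MISMATCH: directory is $(basename "$dir"), parcel.json says $expected" >&2
        return 1
    fi
}

# Demo against a throwaway stand-in for the layout used in this article.
demo=$(mktemp -d)/MY_CONDA-3.6.10-0
mkdir -p "$demo/meta"
printf '{\n"name": "MY_CONDA",\n"version": "3.6.10-0"\n}\n' > "$demo/meta/parcel.json"
check_parcel_dir "$demo"
```

On the build host the same function could be pointed at `/usr/local/parcels/MY_CONDA-3.6.10-0`.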
#!/bin/sh
# EOF
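The defines script above is intentionally empty. As a hedged illustration of what such a script might contain, the sketch below exports the parcel's location. It assumes that the Cloudera Manager agent sets `PARCELS_ROOT` and `PARCEL_DIRNAME` before sourcing the script; the `MY_CONDA_*` variable names are hypothetical, and whether any service actually consumes them depends on configuration that this article does not cover.

```shell
#!/bin/sh
# Sketch of a possible my_conda_env.sh body; MY_CONDA_* names are hypothetical.
# PARCELS_ROOT and PARCEL_DIRNAME are assumed to be provided by the Cloudera
# Manager agent; the defaults below only help when testing the script by hand.
PARCELS_ROOT=${PARCELS_ROOT:-/opt/cloudera/parcels}
MY_CONDA_DIRNAME=${PARCEL_DIRNAME:-MY_CONDA-3.6.10-0}

# Expose the parcel's Miniconda location and interpreter.
export MY_CONDA_HOME="$PARCELS_ROOT/$MY_CONDA_DIRNAME/miniconda3"
export MY_CONDA_PYTHON="$MY_CONDA_HOME/bin/python"
```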
java -jar /path/to/validator.jar -p /usr/local/parcels/MY_CONDA-3.6.10-0/meta/parcel.json
java -jar /path/to/validator.jar -d /usr/local/parcels/MY_CONDA-3.6.10-0/
cd ..
tar zcf MY_CONDA-3.6.10-0-el7.parcel MY_CONDA-3.6.10-0/ --owner=root --group=root
java -jar /path/to/validator.jar -f /usr/local/parcels/MY_CONDA-3.6.10-0-el7.parcel
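The archive is created from the parent directory so that the versioned directory itself is the top-level entry of the parcel file. The sketch below builds a tiny stand-in parcel in a temporary directory and inspects the result; the stand-in contents are placeholders, and the ownership flags from the command above are left out to keep the example short.

```shell
#!/bin/sh
# Sketch: build a tiny stand-in parcel and confirm the archive's top-level entry.
set -e
work=$(mktemp -d)
mkdir -p "$work/MY_CONDA-3.6.10-0/meta"
echo '{}' > "$work/MY_CONDA-3.6.10-0/meta/parcel.json"   # placeholder metadata

# Same packaging step as in the article, run from the parent directory.
(cd "$work" && tar zcf MY_CONDA-3.6.10-0-el7.parcel MY_CONDA-3.6.10-0/)

# The first path component of every entry should be the versioned directory.
top_level=$(tar tzf "$work/MY_CONDA-3.6.10-0-el7.parcel" | cut -d/ -f1 | sort -u)
echo "top-level entry: $top_level"
```

If more than one line is printed, the archive was probably created from the wrong directory.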
sha1sum < MY_CONDA-3.6.10-0-el7.parcel | cut -d ' ' -f 1 > MY_CONDA-3.6.10-0-el7.parcel.sha
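The `.sha` file written above contains only the hex digest of the parcel. A minimal sketch of generating it and checking it again later (for example, after copying the files to another host) follows; `verify_sha` is a hypothetical helper, and the file it operates on is a stand-in, not a real parcel.

```shell
#!/bin/sh
# Sketch: write a .sha file the same way as above, then verify it later.
set -e
work=$(mktemp -d)
parcel="$work/MY_CONDA-3.6.10-0-el7.parcel"
printf 'stand-in parcel contents\n' > "$parcel"   # placeholder, not a real parcel

# Generate: keep only the digest, dropping the file name column.
sha1sum < "$parcel" | cut -d ' ' -f 1 > "$parcel.sha"

# Verify: recompute the digest and compare it with the stored one.
verify_sha() {
    expected=$(cat "$1.sha")
    actual=$(sha1sum < "$1" | cut -d ' ' -f 1)
    [ "$expected" = "$actual" ]
}

if verify_sha "$parcel"; then
    echo "checksum matches"
else
    echo "checksum MISMATCH"
fi
```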
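# Copy the parcel and its checksum to the local parcel repository
# (assumes this host is the Cloudera Manager server), then fix ownership.
sudo cp MY_CONDA-3.6.10-0-el7.parcel MY_CONDA-3.6.10-0-el7.parcel.sha /opt/cloudera/parcel-repo/
sudo chown cloudera-scm: /opt/cloudera/parcel-repo/MY_CONDA-3.6.10-0-el7.parcel*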
/opt/cloudera/parcels/MY_CONDA/miniconda3/bin/python --version
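Beyond printing the interpreter version, it can be useful to confirm that the Pandas library installed earlier is importable from the activated parcel. The following sketch does that; `check_parcel_python` is a hypothetical helper name, and the check simply fails with a message on hosts where the parcel is not present.

```shell
#!/bin/sh
# Sketch: confirm the activated parcel's interpreter can import pandas.
# check_parcel_python is a hypothetical helper name.
check_parcel_python() {
    python_bin=$1
    if [ ! -x "$python_bin" ]; then
        echo "not found or not executable: $python_bin" >&2
        return 1
    fi
    # Prints the pandas version, or fails if the import does not work.
    "$python_bin" -c 'import pandas; print("pandas", pandas.__version__)'
}

check_parcel_python /opt/cloudera/parcels/MY_CONDA/miniconda3/bin/python \
    || echo "parcel Python is not usable on this host"
```

Running the same check on every node is a simple way to confirm that the parcel was distributed and activated everywhere.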
While this article is not intended to be a definitive guide on this subject, as each company has its own requirements, consider this a simple introduction to how Parcels work in the CDP environment and how to leverage them to get more productivity in Cloudera environments.
A Parcel is a binary distribution format that allows us to easily install, update, or even remove a set of files in a simple, uniform, versioned, consistent, and distributed way in a Cloudera environment.
We can leverage this to distribute any set of files, such as Java or Python dependencies, different versions of Python, Hive UDFs, HBase coprocessors, scripts, third-party tools, etc. In addition, it is also possible to integrate with existing components, such as adding a library to the classpath of Hive, HBase, Spark, etc.
To learn more about Parcels, including advanced usage and integration with existing Cloudera components, see the following links:
Created on 12-12-2022 04:56 PM
What about the manifest file? How do you create it?
Created on 03-22-2024 10:25 AM
If you are facing issues with the "mvn package" command, please uninstall the maven package and install Maven 3.6.x.
Also, do NOT change your directory with "cd cm_ext/validator"; instead, stay in "cd cm_ext" and then execute the "mvn package" command.