05-31-2021 01:04 AM
Cloudera Data Engineering (CDE) is Cloudera's new Spark-as-a-Service offering on Public Cloud. It features Kubernetes-based autoscaling of Spark workers for efficient cost optimization, a simple UI for job management, and an integrated Airflow scheduler for managing your production-grade workflows.
In this article, we will not focus on service deployment. For that, refer to Enabling Cloudera Data Engineering.
Instead, we will cover how to adjust your applications to fit CDE's paradigm, and how to deploy and schedule them, by working through some example applications in Python and Scala.
Developing Apps for Cloudera Data Engineering
In CDE, an application is called a job. This takes the form of a main jar file or Python script. Depending on which language you wish to use, the development setup will be slightly different. We will go through those at a high level later.
For creating and managing jobs, you can interact with CDE through the UI or the CLI. We will mainly use the UI here. For instructions on setting up the CLI, refer to Using the Cloudera Data Engineering command line interface.
The CLI will be needed for CI/CD integration and also access to some of the deeper functionality within CDE, but is not needed for most day-to-day tasks. It can also be used by external applications to monitor the status of CDE jobs.
Structure of this repo
To support this article, we have developed some example applications; refer to https://github.com/Data-drone/cde_data_ingest. The application code sits under 'src/main/python' and 'src/main/scala' respectively. Code for the example Airflow DAG sits under 'src/main/airflow'.
Writing Apps for CDE - Common Concepts
With CDE, the `executor` and `driver` settings for the `SparkSession` should be set through CDE options in the UI or CLI rather than in the application itself. Including them in the code as well can result in abnormal behavior.
######## DON'T
### Standalone Spark / CDH Spark
spark = SparkSession \
.builder \
.appName("Load Data") \
.config("spark.executor.cores", "6") \
.config("spark.executor.memory", "10g") \
.config("spark.num.executors", "2") \
.enableHiveSupport() \
.getOrCreate()
######## DO
### CDE Spark
spark = SparkSession \
.builder \
.appName("Load Data") \
.enableHiveSupport() \
.getOrCreate()
With the CDE CLI, this comes as part of the cde job create command.
# cde cli job for the above settings
cde job create --application-file <path-to-file> --name <job_name> --num-executors 2 --executor-cores 6 --executor-memory "10g" --type spark
## Note: the job name is a CDE property and doesn't correspond to appName
Configuration flags should be added under the Configurations section of the UI, or passed with the '--conf' option in the CDE CLI:
cde job create --application-file <path-to-file> --name <job_name> --conf spark.kerberos.access.hadoopFileSystems=s3a://nyc-tlc,s3a://blaw-sandbox2-cdp-bucket --type spark
Example Jobs
In this example, we will read in some of the nyc taxi dataset. Refer to Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance for a full write-up on the NYC Taxi dataset and cool things that can be done with it.
For the purposes of this exercise, we will read in the pre-2015 NYC Taxi data and the first half of the 2015 data from the AWS open data registry.
For this particular exercise, we need to access the AWS open data `s3a://nyc-tlc` bucket. By default, CDP Security settings only allow access to the CDP data bucket setup during the initialization of the CDP Environment. Refer to External AWS Bucket Access in CDP Public Cloud for a full write-up on how to access external buckets for CDP Public Cloud.
For the purposes of this exercise, I added a policy to my `DataLake Admin Role` to access `nyc-tlc`. To ensure that the Spark Applications in CDE can take advantage of this, I added in the Spark configuration 'spark.kerberos.access.hadoopFileSystems' with the value set to 's3a://nyc-tlc,s3a://blaw-sandbox2-cdp-bucket'. In my case, my default CDP data bucket is 'blaw-sandbox2-cdp-bucket' but yours will vary.
Building Jobs for CDE - Python
For Python developers, explore the folder 'src/main/python'. In there are two jobs, load_data.py and etl_job.py. Both are quite simple Spark applications with no external dependencies; we will get to dependency management later.
Exploring load_data.py, you can see that the `SparkSession` segment is quite concise. Also of note is that the Cloudera SDX layer sets the Kerberos settings and ensures that AWS IAM roles are handled automatically; there is no need to assign these in code through Spark configuration.
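As an illustration, a minimal load_data-style job can look like the following sketch (the file path and table name are illustrative placeholders, not the exact contents of the repo's script):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Load Data") \
    .enableHiveSupport() \
    .getOrCreate()

# SDX supplies the credentials for the external bucket; no Kerberos or IAM code is needed here
taxi_df = spark.read.csv("s3a://nyc-tlc/trip data/yellow_tripdata_2015-01.csv",
                         header=True, inferSchema=True)

# Write the data out as a table backed by the CDP data lake bucket
taxi_df.write.mode("overwrite").saveAsTable("nyc_taxi_raw")

spark.stop()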
For deploying the `load_data` script, we can simply define it as a new job in the CDE UI.
For this particular example, we set up the load_data script with the following settings:
For the `etl_job` script, we set it up as below:
Once they are running, we can see them appear under Job Runs:
Python Dependency Management
For many advanced jobs, pure PySpark may not suffice and it may be necessary to import in Python packages. For in-house libraries and extra scripts, the approach is straightforward. The CDE UI or the `--py-file`, `--jar` and `--file` flags in the CLI allow for attaching extra packages and files into the execution environment as per a standard `spark-submit`.
It is quite common, however, to leverage external repositories like `PyPI`, and these cannot simply be attached. Common solutions include using tools such as `conda-pack`, `venv`, or `PEX` to package Python libraries into a bundle that is unzipped by the executors as part of running the job.
With CDE, we have a built-in option akin to the `venv` solution. Note that at the time of writing, setting this up can only be done via the CLI. All we need is a standard `requirements.txt` file, as is standard practice for Python projects, and a working CDE CLI setup.
CDE has the concept of resources: collections of files, environments, etc. that can be mounted into the instances running a Spark job.
To create a new resource, run:
cde resource create --name my_custom_env --type python-env
To upload a `requirements.txt`, run:
cde resource upload --name my_custom_env --local-path requirements.txt
In this case, `--local-path` refers to the local instance where you are running the CDE CLI.
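The requirements.txt itself is just the standard pip format; for example (the packages and versions here are purely illustrative):
requests==2.25.1
python-dateutil==2.8.1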
There are currently limitations with this process. Many advanced Data Science and Analytics libraries are not pure Python code and require extra OS-level libraries or compiling C++ libraries to ensure good performance.
For these use-cases, a fully customized image is required for CDE to start the drivers and executors with. Note that this is currently in Tech Preview and will require the assistance of your account team to activate. Building a full custom environment requires the use of Docker and proper Cloudera Paywall credentials to pull the base image from the Cloudera docker repo.
See `cde-docker-image` in the repo for an example image. The main thing to note is the base image:
FROM container.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.1.1:1.7.0-b129
Depending on which version of Spark and which version of CDE is being run, a different base image will be required. The base image, though it is an Ubuntu image, is set up to use `dnf` for package management, not `apt` as would normally be the case.
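As a rough sketch, a custom image can be built on top of that base image along these lines (the OS and Python packages are illustrative placeholders; check the CDE documentation for the correct base image tag and Python tooling for your version):
FROM container.repository.cloudera.com/cloudera/dex/dex-spark-runtime-3.1.1:1.7.0-b129
USER root
# OS-level dependencies go through dnf, not apt
RUN dnf install -y gcc gcc-c++ && dnf clean all
# Python packages that need native compilation can now be installed
RUN pip3 install --no-cache-dir <your-python-package>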
Once you have your image built and uploaded to a docker repo that CDE can reach, it is possible to create a CDE job that leverages this. Note, customizing the base image can also only be done in the CLI at this stage.
First, create a resource as before:
cde resource create --type="custom-runtime-image" --image-engine="<spark2 or 3>" --name="<runtime_name>" --image="<path_to_your_repo_and_image>"
Then, create your job on top of this image:
cde job create --type spark --name <your_job_name> --runtime-image-resource-name <runtime_name> --application-file <your_job_file>
For more information, including using password-protected docker repos, refer to Managing Python dependencies for Spark workloads in Cloudera Data Engineering.
Building Jobs for CDE - Scala
To work with Scala in CDE, you need an environment for building jar files. In this particular repo, we will leverage sbt, though Maven is an option too. Refer to Installing sbt for more information on setting up an sbt environment for compiling jars. It may be useful to set this up in a Docker container for easy reproducibility.
There are two key parts of a Scala project. The build.sbt file sits in the root path where we will run `sbt` and lists the required libraries. Of note in the sample is the `provided` flag, which indicates to sbt that those dependencies are already present in the execution environment: the CDE base image already includes Spark, Hadoop, and other base dependencies, and including them again could cause issues.
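As a rough sketch, a build.sbt for such a job might look like the following (the library versions are illustrative and should be matched to your Spark and Scala versions):
name := "cde-data-ingest"
version := "0.1"
scalaVersion := "2.12.10"

// Spark ships with the CDE image, so it is marked "provided" and not bundled
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.1.1" % "provided",
  "com.amazon.deequ" % "deequ" % "2.0.0-spark-3.1"
)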
For novice Scala developers, it is also important to note that Scala versions are not cross-compatible. Spark 3.x only supports Scala 2.12.x, for example. Once we have correctly setup our Scala code under `src/main/scala` with an accompanying build.sbt in the base directory, we can run sbt clean package.
For first-time Scala developers, it is worth noting that there are two main approaches to building jar files for Spark. Jars can be built with all external dependencies included (commonly called fat jars), which makes life easier but also makes the jar files larger. Alternatively, they can be built with no dependencies included (slim jars), in which case the external dependencies must be attached to the job separately. In our case, we included the Deequ library for data validation, so we need to attach the Deequ jar for Scala 2.12 as one of the extra jars to our job. This can get messy fast, though. For example, 'geomesa-spark-jts', which adds GIS functionality to Spark, in turn depends on other external libraries; we would have to identify these and upload them as well.
To build jars with dependencies, add a `project` folder with a file `plugins.sbt`. Inside this file, add:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
The build command then becomes 'sbt clean assembly'. For more information on using sbt for Spark jobs, refer to Introduction to SBT for Spark Programmers. For extra details on the assembly plugin refer to https://github.com/sbt/sbt-assembly
For small Spark jobs, CDE does support uploading a script directly through the CLI and the system will compile it on the fly. To execute your job in this manner, run:
cde spark submit <scala_job>.scala --job-name <your_job_name>
Scheduling Jobs
By default, CDE supports cron for scheduling jobs. Cron expressions have a specific syntax, for example `0 12 * * ?`. To understand what these expressions mean and how to write one, refer to A Guide To Cron Expressions. Alternatively, try this tool to easily test out cron expressions and what they mean: https://crontab.guru/#*/30_*_*_*_*
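For instance, a few illustrative expressions (the fields are minute, hour, day-of-month, month, and day-of-week):
0 12 * * ?     # every day at 12:00
*/30 * * * ?   # every 30 minutes
0 6 * * 1      # every Monday at 06:00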
Whilst powerful, cron doesn't help with scheduling interdependencies between jobs. For this, we need to leverage the built-in Airflow scheduler. Airflow is a powerful scheduling tool originally developed by Airbnb that allows for coding up advanced interdependencies between jobs.
The core unit in Airflow is a Directed Acyclic Graph (DAG). The DAG determines the execution order of the tasks and sets the parameters for each task to be run. Tasks in Airflow are called `Operators`, since each one "operates" to execute a different task. For executing CDE jobs, we have a `CDEJobRunOperator`, which is defined as follows:
my_airflow_task = CDEJobRunOperator(
task_id='loader', # a name to identify the task
dag=cde_process_dag, # the DAG that this operator will be attached to
job_name='load_data_job' # the job that we are running.
)
Note: The `job_name` must match how the job was set up in CDE.
For example, with the above setup, we could sequence the Spark jobs `etl_job_expanded`, `load_data_expanded`, `etl_job` and `load_data_job`. In Airflow, individual DAGs and their associated operators are defined in Python files. For example, see `src/main/airflow/airflow_job.py`.
Note that there can only be one DAG per Airflow job. For most workflows, this will suffice. Airflow offers support for branching and merging dependencies too. The only time when we might need to coordinate multiple DAGs is if we need `Operators` running at different time intervals that have interdependencies. That is a pretty advanced topic and we will not cover this here.
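As a minimal sketch, a DAG file along those lines can look like the following (the import path for CDEJobRunOperator is an assumption and may differ between CDE versions, so verify it against the example DAG shipped with your virtual cluster):
from datetime import datetime
from airflow import DAG
# Import path is an assumption; check your CDE virtual cluster's documentation
from cloudera.cdp.airflow.operators.cde_operator import CDEJobRunOperator

cde_process_dag = DAG(
    'cde_process_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval='@daily',
    catchup=False,
    default_args={'owner': 'cde_demo'}
)

load_data_job = CDEJobRunOperator(
    task_id='loader',
    dag=cde_process_dag,
    job_name='load_data_job'  # must match the job name defined in CDE
)

etl_job = CDEJobRunOperator(
    task_id='etl',
    dag=cde_process_dag,
    job_name='etl_job'
)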
With the individual operators and the DAG defined, we can set the execution order of the jobs. For example:
start >> load_data_job >> etl_job >> end
This means that the execution order of the operators is `start`, then `load_data_job`, then `etl_job`, and finally `end`. We can also define the dependencies across multiple lines. For example, to do branching, we could define something as follows:
start >> load_data_job
load_data_job >> etl_job_1
load_data_job >> etl_job_2
Troubleshooting and Debugging
So now, you can set up and schedule Python and Scala jobs in CDE. But what happens when things go wrong? First, we need to have some understanding of how CDE fits together. In CDP-Public Cloud, everything starts with the environment.
An environment consists of a CDP DataLake, which also contains the core security services. Attached to this is a CDE Service. There can be multiple CDE Services per DataLake. The way that a CDE Service is started determines what compute instances jobs can run on and the maximum number of compute instances that can be active.
For each CDE service, we can have multiple virtual clusters. The Spark version and max autoscale capacity of jobs are set on the virtual cluster level. In short:
CDP Environment (DataLake)
- CDE Service 1 (node-level scaling set here)
- Virtual Cluster 1 (job-level scaling set here)
- Job 1
- Job 2..x
- Virtual Cluster 2...x
- CDE Service ... x
Depending on the type of issues that we are troubleshooting, we will need to examine what is going on on different levels.
Understanding capacity issues
A common issue is running out of capacity. Remember, we set this on two levels: a hard cap on the number of nodes allowed at the Service level, and a hard cap on the resources for a specific virtual cluster.
Above, we can see the menu screen for the CDE Service. Note that currently 0/75 nodes are active, with 0/1200 CPUs and 0/4800 GB of memory in use. We can monitor the status of the cluster with the Grafana charts and also the Resource Scheduler buttons.
The Grafana Charts show the status of the cluster and a general number of jobs running.
To understand a bit more about what our cluster is up to, we need to check the Resource Scheduler as well. Clicking on that tab will take us to:
Under the Queue tab in the YuniKorn menu, we can see the current status of the queues in the system. Now, this is the first confusing bit: why is the queue showing so little memory, and why is CPU 51% utilized? This is because YuniKorn only knows about the nodes that are currently active, not how many we can potentially spin up. As jobs run and the cluster scales, we will see these numbers increase.
On the Virtual Cluster level, we have this cluster overview. Unfortunately, we cannot see how much of the 70 CPUs and 422 Gi of memory is being consumed.
But, we can go into the Grafana screen for the virtual cluster and see how many resources we are using and also potentially requesting.
Through the analysis of those screens, we can understand the state of resource consumption in the CDE Service. Issues with jobs starting will commonly be due to problems at this level, with either the YuniKorn scheduler being unable to schedule jobs or the cluster autoscaler being unable to add nodes. It is worth pointing out at this juncture that, for AWS at least, there can be limits on the number of certain instance types that an account can spin up; in heavily used environments, it can be easy to hit those caps.
To understand cluster level stats and issues, check the Diagnostics tab in the menu at the CDE Service level. This provides the ability to download a Diagnostics Bundle and also a Summary status. Our support teams will also require these in order to be able to troubleshoot issues in depth.
Diagnosing Jobs and Spark Code
So, our job started fine, but it failed to complete successfully. In this case, we will have to look into the job run itself. Here is an example of a failed job:
The first place to check is the logs:
As we can see above, there are the `stderr` logs, which come from the OS tier, and the `stdout` logs, which contain the errors from Spark itself.
In this case, we can see that the job failed because the directory that the job was trying to write to already existed. Once we correct that, the job runs fine, as we can see from the successful second run above.
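As an aside, if overwriting the output is acceptable for a job, one simple way to avoid this class of failure is to set the save mode explicitly when writing (df and output_path are placeholders here):
# Overwrite any existing output rather than failing when the path already exists
df.write.mode("overwrite").parquet(output_path)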
Conclusion
That was a quick whirlwind tour of Cloudera Data Engineering, Cloudera's cloud-native autoscaling Spark Service. Happy Data Engineering!
Further Reading
Walmart Global Tech - Airflow Beginners Guide
Understanding Cron timers
Understanding Airflow DAGs
Compiling JARs for Spark
Airflow Scheduling for CDE and CDW
01-27-2021 09:11 PM
Spark 3 with GPU support is fully supported on CDP Private Cloud Base. RAPIDS.ai, however, is an Nvidia product and not a Cloudera product offering.
11-15-2020 05:19 PM
RAPIDS.ai is a set of Nvidia libraries designed to help data scientists and engineers leverage Nvidia GPUs to accelerate all the steps in the data science process, from data exploration through to model development. Recently, with the inclusion of GPU processing support in Spark 3, RAPIDS has released a set of plugins designed to accelerate Spark DataFrame operations with the aid of GPU accelerators.
This article gives an introduction to setting up RAPIDS.ai on Cloudera's CDP platform.
Before starting, ensure that you have access to the following requirements:
Prerequisites
Cloudera CDP Private Cloud Base 7.1.3+
Cloudera CDS 3
Centos 7
Epel Repo already added and installed
Kernel 3.10 (deployment script was just tested on this version)
Check by running `uname -r`
Full administrator and sudo rights
Nvidia Drivers
Firstly, in order to support GPUs at the OS layer, we need to install the Nvidia hardware drivers. See cdp_rapids for Ansible helper scripts for installing Nvidia drivers. The ansible_cdp_pvc_base folder contains Ansible playbooks to install Nvidia drivers on CentOS 7. They have been tested successfully on AWS and should in theory work for any CentOS 7 install. Specific instructions for RHEL/CentOS 8 and Ubuntu flavours are not included here; see Nvidia's official notes or one of the many other articles on the web for those.
Later Nvidia drivers are available, but for this example I have selected the `440.33.01` drivers with CUDA 10.2. This is because, as of writing, most data science and ML libraries are still compiled and distributed for CUDA 10.2, with CUDA 11 support being in the minority. Recompiling is also a painstaking process; as an example, compiling a working TensorFlow 2.x on CUDA 11 takes about 40 minutes on an AWS `g4dn.16xlarge` instance.
Note
A lot of the tutorials for installing Nvidia drivers on the web assume that you are running a local desktop setup. Xorg, GNOME Desktop, etc. are not needed to get a server working. `nouveau` drivers may also not be active for VGA output in a server setup.
Common Issues
Incorrect kernel headers or kernel dev packages: Be very careful with these. Sometimes the specific version for your Linux distro may take a fair bit of searching to find. A default kernel-devel / kernel-headers install with yum or apt-get may also end up updating your kernel to a newer version, after which you will have to ensure that Linux boots with the correct kernel.
Not explicitly setting your driver and CUDA versions: This can result in a yum update or apt-get update automatically updating everything to the latest and greatest and breaking all the libraries in your code which have only been compiled for an older CUDA version.
Validating Nvidia Driver Install
Before proceeding to enable GPU on YARN, check that your Nvidia drivers are installed correctly:
Run nvidia-smi; it should list the GPUs on your system. If nothing shows up, or the command errors out, the drivers were not properly installed and there are a few things to check.
Run dmesg | grep nvidia to see if Nvidia drivers are being started on boot:
Setting up RAPIDs on Spark on CDP
Now that we have the drivers installed, we can work on getting Spark 3 on YARN with GPU scheduling enabled.
Note: CDS 3 should be installed on your cluster first. Refer to the CDS 3 docs.
Enabling GPU in YARN
With Spark enabled, we need to enable GPUs on YARN. For those who have skimmed the Apache docs for enabling GPUs on YARN, these instructions will differ as Cloudera Manager will manage changes to configuration files like resource-types.xml for you.
To schedule GPUs, we need GPU isolation and for that we need to leverage Linux Cgroups. Cgroups are Linux controls to help limit the hardware resources that different applications will have access to. This includes CPU, RAM, GPU, etc, and is the only way to truly guarantee the amount of resources allocated to a running application.
In Cloudera Manager, enabling Cgroups is a host-based setting that can be set from the Hosts >> Hosts Configuration option in the left side toolbar.
Search for Cgroups to find the tickbox.
Selecting the big tickbox will enable it for all nodes, but it can also be set on a host-by-host basis through the Add Host Overrides option. For the finer points of the Cgroup settings, refer to the CDP-PvC Base 7.1.4 cgroups documentation.
With Cgroups enabled, we now need to turn on Cgroup Scheduling under the YARN service.
Go to the YARN service in Cloudera Manager
Go to Resource Management, and then search for cgroup
Tick the setting Use CGroups for Resource Management
Turn on Always Use Linux Container Executor
Now we can enable GPU on YARN through the Enable GPU Usage tickbox. You can find it under YARN >> Configuration, then the GPU Management category.
Before you do that, however, it is worth noting that this will by default enable GPUs on all nodes, which means all your YARN NodeManagers will have to have GPUs. If only some of your nodes have GPUs, read on!
YARN Role Groups
As discussed, Enable GPU Usage would enable it for the whole cluster. That means all the YARN nodes would have to have GPUs, and for a large cluster that could get very expensive. There are two options if you have an expense limit on the company card.
We can add a small dedicated compute cluster with all-GPU nodes on the side, but then users would have to connect to two different clusters depending on whether they want to use GPUs or not. That can in itself be a good way to separate users who really need GPUs and can be trusted with them from users who just want to try GPUs because it sounds cool, but that is a different matter. If you wish to pursue that option, follow the instructions in Setting Up Compute Clusters. Once your dedicated compute cluster is up and running, you will have to follow the previous steps to enable Cgroups and Cgroup scheduling, and then turn on Enable GPU Usage in YARN.
The other option, if you want to stick with one cluster whilst adding a GPU node or two, is to create Role Groups. These allow for different YARN configs on a host-by-host basis, so some hosts can have Enable GPU Usage turned on whilst others do not.
Role groups are configured on the service level. To setup a GPU role group for YARN, do the following:
In Cloudera Manager navigate to Yarn >> Instances. You will see the Role Groups button just above the table listing all the hosts.
Click Create a role group. In my case, I entered the Group Name gpu nodes of the Role Type NodeManager and I set the Copy From field to NodeManager Default Group.
With our role group created, it is time to assign the GPU hosts to the gpu nodes group. Select the nodes in the NodeManager Default Group that have GPUs and move them to the new GPU Node group.
With the GPU hosts assigned, we can Enable GPU Usage just for the GPU nodes. For more details on role groups, refer to Configuring Role Groups. Note that you may need to restart the cluster first.
Adding Rapids Libraries
To leverage Nvidia RAPIDS, YARN and the Spark executors have to be able to access the Spark RAPIDS libraries. The required jars are listed in the Nvidia RAPIDS Getting Started guide.
In my sample Ansible code, I have created a /opt/rapids folder on all the YARN nodes. Spark also requires a GPU resource discovery script; refer to Spark 3 GPU Scheduling. I have included the getGpusResources.sh script under ansible_cdp_pvc_base/files/getGpusResources.sh in the cdp_rapids repo.
Verifying YARN GPU
Now that we have everything set up, we should be able to see yarn.io/gpu listed as one of the usage meters on the Yarn2 cluster overview screen. If you do not see this dial, it means that something in the previous config has gone wrong.
Launching Spark shell
We finally have everything configured, so now we can launch a GPU-enabled Spark shell.
Run the following on an edge node. Note that the commands below assume that /opt/rapids/cudf-0.15-cuda10-2.jar, /opt/rapids/rapids-4-spark_2.12-0.2.0.jar, and /opt/rapids/getGpusResources.sh exist; you will need them on each GPU node as well. Also, you will need different jar files depending on the CUDA version that you have installed.
On a Kerberized cluster, you will first need to kinit as a valid Kerberos user for the following to work. You can use klist to verify that you have an active Kerberos ticket.
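For example (the principal here is illustrative):
kinit alice@EXAMPLE.COM
klist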
To start your Spark shell, run the following.
Note: Depending on your executor setup and the amount of CPU and RAM you have available, you will need to set different values for driver-cores, driver-memory, executor-cores, and executor-memory.
export SPARK_RAPIDS_DIR=/opt/rapids
export SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.15-cuda10-2.jar
export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.2.0.jar
spark3-shell \
--master yarn \
--deploy-mode client \
--driver-cores 6 \
--driver-memory 15G \
--executor-cores 8 \
--conf spark.executor.memory=15G \
--conf spark.rapids.sql.concurrentGpuTasks=4 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.rapids.sql.enabled=true \
--conf spark.rapids.sql.explain=ALL \
--conf spark.rapids.memory.pinnedPool.size=2G \
--conf spark.kryo.registrator=com.nvidia.spark.rapids.GpuKryoRegistrator \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.shims-provider-override=com.nvidia.spark.rapids.shims.spark301.SparkShimServiceProvider \
--conf spark.executor.resource.gpu.discoveryScript=${SPARK_RAPIDS_DIR}/getGpusResources.sh \
--jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}
Breaking down the Spark Submit command
Here are some brief tips to help you understand the spark-submit command.
Note: This is not definitive; consult the RAPIDS.ai and Spark 3 documentation for further details.
The GPU on the G4 instance I used has 16 GB of VRAM, so I increased the executor memory to match so that there is some data-buffering capability in the executor. Since GPUs are typically RAM-constrained, it is important to minimise the amount of time spent bottlenecked waiting for data. There is no science around how much to increase the executor memory; it depends on your application and how much IO you have going to and from the GPU. Also, non-GPU-friendly operations will be completed by the executor on the CPU, so ensure that the executors are beefy enough to handle those. The same goes for executor cores: I increased them to a recommended level for pulling and pushing data from HDFS.
spark.rapids.sql.explain=ALL : This helps to highlight which parts of a Spark operation can and cannot be done on GPU. Examples of operations that cannot currently be done on GPU include regex splitting of strings, datetime logic, and some statistical operations. Overall, the fewer unsupported operations you have, the better the performance.
spark.rapids.shims-provider-override=com.nvidia.spark.rapids.shims.spark301.SparkShimServiceProvider : The reality of the Spark ecosystem today is that not all Spark is equal. Each commercial distribution can be slightly different, and as such the "shims" layer provides a kind of mapping to make sure that RAPIDS performs seamlessly. Currently, RAPIDS doesn't have a Cloudera shim out of the box, but Cloudera's distribution of Spark 3 closely mirrors open source, so for now we need to explicitly specify this mapping of the open-source Spark 3.0.1 shim to the Cloudera CDS parcel. And for the observant: yes, this is version-specific, and as Spark 3 matures we will need to ensure the correct shim layer is loaded. This will be rectified in a later RAPIDS version.
spark.rapids.memory.pinnedPool.size : In order to optimise GPU performance, RAPIDS allows "pinning" memory specifically to assist in transferring data to the GPU. Memory that is "pinned" for this purpose won't be available for other operations.
The other flags are pretty self-explanatory. It is worth noting that Spark RAPIDS SQL needs to be explicitly enabled. Refer to the official RAPIDS tuning guide for more details: Nvidia RAPIDS tuning guide
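As a quick sanity check once the shell is up (a hedged sketch: the exact operator names in the plan depend on your RAPIDS version), you can run a simple aggregation and look for Gpu-prefixed operators in the physical plan:
// Inside spark3-shell: a simple aggregation that RAPIDS can place on the GPU
val df = spark.range(0, 1000000L).selectExpr("id", "id % 10 as key")
df.groupBy("key").count().explain()
// With the plugin active, the plan should contain operators such as GpuHashAggregate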
Conclusion
Spark RAPIDS is a very promising technology that can greatly accelerate big data operations. Good data analysis, like other intellectual tasks, requires getting into the "flow", or as some call it, "the zone". Interruptions, like waiting 10 minutes for a query to run, are very disruptive to this, and RAPIDS is a good acceleration package to help reduce those wait times.
In its current iteration, however, it still has quite a few limitations. There is no native support for datetime or more complicated statistical operations. GPUs, outside of extremely high-end models, are typically RAM-bound, so users need to be aware of this and manage memory well. Managing memory on GPUs, however, is more an art than a science and requires a good understanding of GPU technology and access to specialist tools. Support for data formats is also not as extensive as in native Spark. There is a lot of promise, however, and it will be good to see where this technology goes. Happy coding!
03-24-2020 02:46 AM
Running custom applications in CDSW/CML
With the recent addition of Analytical Applications, data scientists and data engineers can now deploy their own custom applications and share them with other users.
Whilst simple applications may have all the necessary code baked into one script file, more complicated applications may require custom launchers, run flags that are set on execution, and parameters that may be instance-specific. To run these applications, we can leverage the Python subprocess module to execute the commands that we would normally have entered manually in the terminal.
Getting Started
For this demonstration, I will show how to run Tensorboard and Hiplot, both of which allow for the visualization of parameters from multiple runs of deep learning models.
Both applications rely on a custom command to trigger them as standalone applications: tensorboard for Tensorboard and hiplot for Hiplot.
Requirements
CDSW 1.7.1 or later versions
Python 3 engines with web access available (for installing libraries)
Setup
Click Create Project (Git clone the running-custom-applications repo from github.com):
Click Open Workbench:
Launch a new Python Session:
First, ensure that all the required libraries are installed. From the CML/CDSW IDE, run:
!pip3 install -r requirements.txt
Packages that are installed in a session will be preserved for use by the project across all sessions.
Here, I have created two run scripts to start the apps.
For Hiplot (saved as run-hiplot.py):
# Hiplot
import os
import subprocess

subprocess.call(['hiplot', '--host', '127.0.0.1', '--port', os.environ['CDSW_APP_PORT']])
The os.environ['CDSW_APP_PORT'] lookup reads the CDSW_APP_PORT environment variable, which specifies the port the application must use in order to run successfully.
For Tensorboard (saved as run-tensorboard.py):
# Tensorboard
import os
import subprocess

subprocess.call(["tensorboard", "--logdir=logs/fit", "--host", "127.0.0.1", "--port", os.environ["CDSW_APP_PORT"]])
Notice that adding a flag is as simple as adding the flag and its setting as part of the comma-separated list within the subprocess.call([ ... ]) command.
For this demonstration, I will generate some data to populate tensorboard first.
Run test_runs_tensorflow.py in the CDSW/CML session by opening the .py file and then clicking the play arrow in the coding window.
Running applications that require flags
Now that we have the script, we can go ahead and trigger it as an application:
Go to the Applications screen:
Click New Application:
Fill in the form. Note the option to Set Environment Variables just before the Create Application button: by leveraging os.environ[...] and the ability to set environment variables from the New Application screen, it is possible to change run flags without editing the run script.
Click Create:
To access the application, click the box with the arrow:
Conclusion
The new Analytical Applications feature, rolled out in CDSW 1.7.x and available in CML on Public Cloud, enables the deployment of third-party and custom applications on Cloudera Machine Learning infrastructure.
Through the use of the Python subprocess module, it is possible to execute arbitrary code and set runtime flags for applications as well.
Happy Coding!