Community Articles

rhryniewicz · ‎07-21-2020

This article contains Questions & Answers on Cloudera Machine Learning (CML).

Is it possible to run CML on premises?

Yes. Cloudera Machine Learning is available on both CDP Public Cloud and CDP Private Cloud, and CDP Private Cloud runs on premises.

How is the deployment of models managed by CML Private Cloud vs CML Public Cloud?

Model deployment functions similarly across both form factors of CML; the models are built into containerized images, and then deployed on top of Kubernetes pods for production-level serving.
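
As a rough illustration of what gets packaged into such an image, a served model can be thought of as a plain function that receives the decoded JSON request and returns a JSON-serializable result. The sketch below assumes that shape. The function name predict, the petal_length field, and the threshold rule are all hypothetical and stand in for a real trained model.

```python
import json


def predict(args):
    """Hypothetical model entry point: takes a dict, returns a dict."""
    # Assumed input field; a real model would validate its own schema.
    petal_length = float(args["petal_length"])
    # Stand-in for a trained model: a fixed threshold rule.
    label = "setosa" if petal_length < 2.5 else "other"
    return {"label": label, "petal_length": petal_length}


# Simulate the JSON round trip a REST request would go through.
request_body = json.dumps({"petal_length": 1.4})
response_body = json.dumps(predict(json.loads(request_body)))
print(response_body)
```

Keeping the entry point a pure function of its input makes it easy to test locally before it is built and deployed.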

Is there a level of programming required for a data scientist to use this platform? What languages can developers use?

CML enables data scientists to write code in Python, R, or Scala in their editor of choice. Beginner data scientists can easily run sample code in the workbench, while more experienced data scientists can leverage open source libraries for more complex workloads.

Can you run SQL-like queries? E.g. with Spark SQL?

Yes, Spark SQL can be run from CML.

Do pre-built models come out of the box?

While CML does not have built-in libraries of pre-built models, it will soon come with reusable and customizable Applied Machine Learning Prototypes. These prototypes are fully built ML projects with all the code, models, and applications needed to apply best practices and novel algorithms. Additionally, CML is a platform on which you can use the ML libraries and approaches of your choice to build your own models.

Can I do AutoML with CML?

CML is designed to be the platform to which data scientists bring the Python, R, and Scala libraries and packages they need to run their workloads. This includes open source AutoML technologies, which can be used within Projects in CML. In addition, Cloudera is working with partners such as H2O to provide specific AutoML distributions for data scientists, as well as for citizen data scientists who are looking for a more interactive ML experience.

What is your MLOps support?

CML’s MLOps capabilities and integration with SDX for models bring prediction and accuracy monitoring, production environment ground truthing, model cataloging, and full lifecycle lineage tracking.

Can the result/output of the ML model be available in CSV or Excel file for the business user to use it in a different platform?

Yes. Model output can be written to the external format of your choice, including CSV or Excel files.
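
One way this could look from inside a project is sketched below, using only the Python standard library to turn prediction records into CSV text. The helper name predictions_to_csv and the customer_id and churn_probability columns are hypothetical examples, not part of CML.

```python
import csv
import io


def predictions_to_csv(rows, fieldnames):
    """Serialize a list of prediction dicts to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical model output; real rows would come from your model.
predictions = [
    {"customer_id": 101, "churn_probability": 0.82},
    {"customer_id": 102, "churn_probability": 0.07},
]
csv_text = predictions_to_csv(predictions, ["customer_id", "churn_probability"])
print(csv_text)
```

The resulting text can then be written to a file in the project or to shared storage, where a business user can open it in a spreadsheet tool.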

What about multiple file model projects?

CML lets you deploy multiple models in a project and allows for complex dependency management through the analytical job scheduling functionality.

What about model access monitoring? Does CML log directly all REST access?

Yes, all access to the models is logged. CML’s MLOps also enables fine-grained tracking of model telemetry for monitoring drift and quality. We have also implemented security mechanisms on top of models so that each request can be comprehensively audited.

Does CML support automated model tuning?

Yes. CML supports AutoML, TPOT, and other automation frameworks. CML also has a comprehensive API for managing experiments, models, jobs, and applications. MLOps adds tracking and monitoring of metrics during model build and deployment so that model performance and drift can be managed. CML Jobs can then be used to retrain models if their performance falls outside the desired range.
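
The retraining decision itself can be a very small piece of logic. The following is a minimal sketch, assuming the monitored metric is accuracy and that a scheduled job has already collected recent values. The function name needs_retraining, the 0.85 lower bound, and the sample values are hypothetical.

```python
from statistics import mean


def needs_retraining(recent_scores, lower_bound, min_samples=3):
    """Return True when the average recent score drops below lower_bound."""
    if len(recent_scores) < min_samples:
        # Too few observations to make a decision either way.
        return False
    return mean(recent_scores) < lower_bound


# Hypothetical accuracy values gathered from model monitoring.
recent_accuracy = [0.91, 0.84, 0.78, 0.74]
if needs_retraining(recent_accuracy, lower_bound=0.85):
    print("Accuracy below target: trigger the retraining job")
```

Averaging over several recent measurements, rather than reacting to a single value, avoids retraining on one noisy data point.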

What technologies does CDP CML use? Mahout, TensorFlow, others?

CML takes a bring-your-own-framework approach. We support Scala, Python, and R frameworks by default, so libraries such as TensorFlow, Dask, and sparklyr simply need to be installed to be usable.

Do Jupyter notebooks come with CML?

Yes, data scientists can use Jupyter notebooks on top of CML. In addition, CML has the flexibility to enable and use other editors, e.g. RStudio or PyCharm.

Do R programs for CML also run in parallel on the CDP?

Yes. CML supports R, and R programs can be run in parallel on CDP using the sparklyr library.

How are the Python packages handled in CML?

You can install your own Python libraries and packages into your session, either from the Jupyter terminal or from the built-in editor, using pip or Conda.

How easy is it to spin up and down different environments/workspaces?

From the CDP Management Console, it takes only a few clicks and a few minutes to spin different workspaces up and down.

When a session ends, do packages have to be re-installed?

No. The packages are saved with the project and shared between sessions. However, different projects will not share the same packages, thus keeping the environments separate.

Is Spark being used as part of the platform or part of CML?

Cloudera Machine Learning (CML) leverages Spark-on-K8s, enabling data scientists to directly manage the resources and dependencies necessary for the Spark cluster. Once the workload is completed, the Spark executors are spun down to free up the resources for other uses.

Can data scientists bring their own versions of Spark, Python, or R?

The core engine will have the latest versions of Spark, Python, and R, but you can further customize these engines and make those available to your data scientists.

Should the engine profile be set by admins and disabled for data scientists?

Admins manage the engines available across the ML Workspace and data scientists choose which engine they need to use for each Project.

Can data scientists create their own workspaces?

Generally, it’s the data science admins who would create and manage these workspaces. However, it is possible to enable data scientists to do so as well through permissions.

How is data access handled in CML?

Data access is centralized and managed through CML’s integration with Cloudera’s SDX. For example, if a Spark job accesses data from a Data Warehouse or a Kafka topic, the SDX services determine the user’s permissions to do so, apply the masking and filtering policies, and then fully audit and record the lineage of the access.

Do ID broker rules apply to the Machine Learning experience as well?

Yes, they do.

How is CML different in CDP vs CDH (i.e. CDSW)?

CML expands the end-to-end workflow of Cloudera Data Science Workbench (CDSW) with cloud-native benefits like rapid provisioning, elastic autoscaling, distributed dependency isolation, and distributed GPU training. In addition, CML operates on top of CDP Public Cloud and CDP Private Cloud, while CDSW operates on top of CDH and HDP. More details can be found in our documentation.

Cloudera Machine Learning (CML) - Questions & Answers