Member since: 07-09-2015
Posts: 70
Kudos Received: 29
Solutions: 12
09-30-2022
06:25 AM
Cloudera Machine Learning now provides a built-in dashboard for monitoring technical metrics relating to deployed CML Models, such as request throughput, latency, and resource consumption. When machine learning models are deployed in production, it’s essential to know whether the model is successfully responding to all queries within the required timeframe, and to be able to investigate and find the root cause of any failed responses or other downstream issues. Further, it can be challenging to know ahead of time the resource requirements for the model, such as the number of replicas and the amount of memory and CPU allocated to each replica. This can make it difficult to balance the risk of underprovisioning resources, which leads to slow responses or timed-out requests, against the risk of wasting resources that could be used for other workloads. The new monitoring features for CML Models provide observability that makes these challenges much easier to manage, allowing ML Engineers to be confident that their Model is right-sized and performing within SLAs. The dashboard is available to any end-user with access to the Model, and allows users to view these technical metrics over custom time windows, either aggregated or per-replica, making it easier for developers to understand the resource needs of their Models and monitor the health of production deployments. To view the dashboard, select the Monitoring tab of the deployed Model.
08-01-2022
12:29 AM
Over the last few quarters, more and more of our customers have deployed their production workloads on Cloudera Machine Learning. Some rely on CML for predictive maintenance; others predict churn or detect fraudulent transactions. What they have in common is that the ML workloads running on CML are critical to their business’s success. The new CML Backup and Restore capability adds an extra layer of protection by letting you resume operations in a timely manner following an outage or crisis. Administrators can now take on-demand backups of CML workspaces before cluster operations such as an upgrade, and run periodic backups during off-peak hours. Backed-up CML workspaces can be restored into a new CML workspace in the same or a new CDP environment, and all project artifacts, such as deployed models and applications, will be recovered. The Backup and Restore capabilities are available on AWS today, and we are planning to roll out the same capabilities on Azure in the future. To learn more about these new capabilities, visit our documentation.
07-22-2022
08:51 AM
Data Scientists have access to a wide range of ML Runtimes: they can use different versions of Python, R, and Scala kernels with the Workbench or JupyterLab editors, and can benefit from extended capabilities like GPU acceleration. Cloudera continues to release new versions of ML Runtimes, and customers can also register custom ones that they build for their specific business use cases. With these continuous additions, the list of available options can grow large, and some Runtimes will become irrelevant or outdated. CML now lets administrators disable Runtime variants or specific versions. For example, they can disable all of the Python 3.6 Runtimes, since Python 3.6 is officially EOL and won't receive any further security or bugfix patches. Using such Runtimes can be considered a security risk, which administrators can now address. Once administrators disable an ML Runtime, data scientists can no longer use it for development, and existing workloads configured with it will fail to start.
07-15-2022
09:52 AM
1 Kudo
This article explains how to configure Spark Connections in Cloudera Machine Learning. CML enables easy Spark data connections to the data stored in the Data Lake by abstracting the SparkSession connection details. CML users don't need to set complicated endpoint and configuration parameters or load HWC or Iceberg libraries. They can use the cml.data library to get preconfigured connections.
CML Data Connection Snippets
The Spark SparkSession object has all the necessary options set to make connections work. Users have two options to set custom or job-specific configurations like executor CPU and memory parameters.
1. Specify the configuration inline
With SparkContext's setSystemProperty method, you can set Spark properties that will be picked up while building the SparkSession object. You can set any Spark property like below before calling cmldata's get_spark_session method.
Python script
import cml.data_v1 as cmldata
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.cores', '4')
SparkContext.setSystemProperty('spark.executor.memory', '8g')
CONNECTION_NAME = "go01-aw-dl"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()
2. Use spark-defaults.conf
By placing a file called spark-defaults.conf in your project root (/home/cdsw/), you can set Spark properties for your SparkSession. The properties configured in this file will be automatically appended to the global Spark defaults.
spark-defaults.conf
spark.executor.cores=4
spark.executor.memory=8g
Python script
import cml.data_v1 as cmldata
CONNECTION_NAME = "go01-aw-dl"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()
Conclusion
CML supports flexibility by offering two ways to configure its Spark connection snippets. The inline option helps users who want to fine-tune their Spark applications and want to configure different properties per Spark job. The spark-defaults.conf option helps users who want to set Spark properties that are applied for the whole Project. To read more about the Data Connections feature, read the following blog post: https://blog.cloudera.com/one-line-away-from-your-data/
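The two options can also be combined by keeping per-job overrides in a small properties text and applying them inline. The following is a minimal sketch, not part of cml.data or PySpark: parse_spark_properties and apply_spark_properties are hypothetical helper names, and the parser only handles the simple "key=value" and "key value" forms (no escapes or line continuations). A plain dict stands in for SparkContext.setSystemProperty so the example runs without Spark.

```python
import re
from typing import Callable, Dict

# Hypothetical helpers for illustration; they are not part of cml.data or PySpark.


def parse_spark_properties(text: str) -> Dict[str, str]:
    """Parse simple spark-defaults.conf style lines into a dict.

    Handles "key=value" and whitespace-separated "key value" lines and
    skips blank lines and "#" comments. Escapes and line continuations
    are deliberately not handled in this sketch.
    """
    props: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split once, on the first "=", ":" or whitespace that follows the key.
        parts = re.split(r"\s*[=:\s]\s*", line, maxsplit=1)
        if len(parts) == 2 and parts[0]:
            props[parts[0]] = parts[1]
    return props


def apply_spark_properties(props: Dict[str, str],
                           setter: Callable[[str, str], None]) -> None:
    """Pass every property to `setter`, e.g. SparkContext.setSystemProperty."""
    for key, value in props.items():
        setter(key, value)


if __name__ == "__main__":
    sample = """
    # per-job overrides
    spark.executor.cores=4
    spark.executor.memory 8g
    """
    applied: Dict[str, str] = {}
    # The dict's __setitem__ stands in for SparkContext.setSystemProperty here.
    apply_spark_properties(parse_spark_properties(sample), applied.__setitem__)
    print(applied)
```

In a CML session, the setter would be SparkContext.setSystemProperty, and the call would need to happen before conn.get_spark_session().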
06-17-2022
03:47 AM
Data scientists on CML Workspaces have access to GPUs to accelerate their machine learning projects and reduce the time it takes to build and train predictive models. NVIDIA GPU nodes are available for administrators to configure for CML Workspaces in both AWS and Azure. CML now supports adding new GPU nodes to existing CML Workspaces created without GPUs, so data scientists can access GPU acceleration without having to recreate CML Workspaces. Administrators can also replace GPU nodes in CML Workspaces to switch to the latest generation GPUs. With these new capabilities, it's easier for administrators to manage GPU nodes in CML Workspaces and enable data scientists to use the newest generation of GPUs.
06-03-2022
06:47 AM
The Data Discovery and Visualization experience ships with preconfigured Data Connections, a database browser, an interactive SQL editor, drag-and-drop Visual Dashboarding, and Connection Snippets. These new capabilities speed up the development process by cutting down the time spent finding, exploring, understanding, and accessing data. Data Scientists need to fully understand their data in order to analyze it properly, build models, and power ML use cases. To reduce the time to insight, CML ships all of the required tools and integrates them, reducing the friction between the different steps and speeding up development for data science teams. These new capabilities are built on top of Cloudera Data Visualization, putting state-of-the-art visual capabilities in the hands of Data Scientists. To get started, open any Project in a CML Workspace from the May release or newer and select the Data tab. You can read more about the new capabilities in the documentation here.
05-25-2022
07:03 AM
The ML Runtimes 2022.04-1 Release includes a technical preview version of the new workbench architecture, the PBJ (Powered by Jupyter) Workbench. In the previous Workbench editor, the code and output shown in the console (the right-hand pane in the image below) were passed to and from Python, R, or Scala via a Cloudera-specific, custom messaging protocol. In the PBJ Workbench, on the other hand, the code and output are now passed to and from the target language runtime via the Jupyter messaging protocol. They are handled inside the runtime container by a Jupyter kernel and rendered in your browser by JupyterLab’s client code. This may seem like a subtle change, but it will provide CML users with some major benefits. First, the behavior of user code and third-party libraries on CML will be more consistent with its behavior in Jupyter-based environments. That means that a wider variety of rich visualization libraries will work out of the box, and in cases where rich visualization libraries do not work, error messages in the CML console and the browser console will be easier to google. Likewise, dependency conflicts between kernel code and user code will be rarer, and when they do occur they will be easier for customers to diagnose and fix. To give you a taste of what this higher degree of consistency is like, note that Python 3’s input() function now works. Go ahead and try it out! Second, customers will no longer need to build runtime images starting from Cloudera base images and will no longer need to restrict themselves to languages and versions that Cloudera has packaged. Any combination of base image, target language, and language version can be used with the PBJ Workbench as long as a Jupyter kernel is available for that combination. You can try it out by running a PBJ Workbench Python session using a CML April or newer Workspace. The look and feel of the workbench will be more or less unchanged. 
Under the hood, however, the way that code and outputs are rendered and passed between the web app and the Python interpreter has been re-engineered to better align with the Jupyter ecosystem. The Technical Preview documentation is available here.
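For readers unfamiliar with the Jupyter messaging protocol mentioned above, here is a rough sketch of the general shape of an execute_request message, following the public Jupyter messaging documentation. How the PBJ Workbench builds its messages internally is not described in this post; the helper name build_execute_request, the "cdsw" username, and the protocol version string are assumptions for illustration only. The allow_stdin flag is what permits a kernel to ask the front end for input, which is the mechanism that input() relies on.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Dict


def build_execute_request(code: str, session_id: str,
                          username: str = "cdsw") -> Dict[str, Any]:
    """Build the dict form of a Jupyter `execute_request` message.

    Hypothetical helper: field names follow the Jupyter messaging
    documentation; the version string and username are assumptions.
    """
    header = {
        "msg_id": uuid.uuid4().hex,          # unique id for this message
        "session": session_id,               # id shared by all messages of a client session
        "username": username,
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.3",                    # assumed protocol version
    }
    content = {
        "code": code,                        # source code the kernel should run
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": True,                 # lets the kernel request input from the client
        "stop_on_error": True,
    }
    return {
        "header": header,
        "parent_header": {},
        "metadata": {},
        "content": content,
    }


if __name__ == "__main__":
    msg = build_execute_request("name = input('Your name: ')", uuid.uuid4().hex)
    print(msg["header"]["msg_type"], msg["content"]["allow_stdin"])
```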
04-08-2022
02:29 AM
2 Kudos
This article explains how to use the Snowflake Connector for Spark in Cloudera Machine Learning.
Save your Snowflake password for your account. Go to Account Settings > Environment Variables and create a new entry with Name as "SNOW_PASSWORD" and Value as your <password>.
Create a new session in your CML project. Use any of the editors, and the Python 3.7 kernel. You also need to enable Spark for this session and select the Spark 2.4.7 version.
Download required dependencies. To initiate the connection, you need to have a Snowflake Connector for Spark and a Snowflake JDBC Driver that's compatible with CML's Spark version. You can get these from the Maven Central Repository: spark-connector and jdbc-driver. Place them in a ./jars folder in your CML Project.
cdsw@ysezv3fm94edq4pb:~$ ls -l ./jars/
total 28512
-rw-r--r-- 1 cdsw cdsw 28167364 Mar 17 21:03 snowflake-jdbc-3.13.16.jar
-rw-r--r-- 1 cdsw cdsw 1027154 Jan 27 01:18 spark-snowflake_2.11-2.9.3-spark_2.4.jar
Initiate the Snowflake connection. You need to set your Snowflake Account ID and your username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in the Account Settings. I'm using the default Snowflake Warehouse and a sample database/schema.
import os
from pyspark.sql import SparkSession
sfOptions = {
    "sfURL" : "<Account ID>.snowflakecomputing.com",
    "sfUser" : '<Username>',
    "sfPassword" : os.environ['SNOW_PASSWORD'],
    "sfDatabase" : 'SNOWFLAKE_SAMPLE_DATA',
    "sfSchema" : 'TPCH_SF1',
    "sfWarehouse" : 'COMPUTE_WH'
}
snowJars = [os.path.join('/home/cdsw/jars', x) for x in os.listdir('/home/cdsw/jars')]
spark = SparkSession.builder \
    .appName("cml-spark-snowflake-test") \
    .config("spark.jars", ",".join(snowJars)) \
    .getOrCreate()
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
query = '''
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
'''
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", query) \
    .load()
df.show()
Execute the code; you can see the results as follows:
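The snowJars list in the code above picks up every file in the jars folder, whatever its type. A slightly more defensive sketch, assuming only that the drivers are .jar files sitting directly in one directory, filters by extension and fails early when nothing is found. collect_jars and spark_jars_value are hypothetical helper names, not part of PySpark or the Snowflake connector.

```python
import os
from typing import List


def collect_jars(jar_dir: str) -> List[str]:
    """Return sorted absolute paths of the .jar files directly inside jar_dir."""
    base = os.path.abspath(jar_dir)
    jars = sorted(
        os.path.join(base, name)
        for name in os.listdir(base)
        if name.endswith(".jar")
    )
    if not jars:
        # Failing here gives a clearer message than a missing-class error later.
        raise FileNotFoundError(f"no .jar files found in {jar_dir!r}")
    return jars


def spark_jars_value(jar_dir: str) -> str:
    """Build the comma-separated string expected by the spark.jars property."""
    return ",".join(collect_jars(jar_dir))
```

With these helpers, the builder line would read .config("spark.jars", spark_jars_value("/home/cdsw/jars")).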
04-07-2022
07:17 AM
This article explains how to connect to Snowflake from Cloudera Machine Learning.
Save your Snowflake password for your account. Go to Account Settings > Environment Variables and create a new entry with Name: "SNOW_PASSWORD" and Value: <your password>
Create a new session in your CML Project. You can use any of the editors, and a Python kernel:
Install required Python packages:
pip install pandas snowflake-connector-python snowflake-connector-python[pandas]
Initiate the Snowflake connection. You need to set your Snowflake Account ID and your username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in the Account Settings. I'm using the default Snowflake warehouse and a sample database/schema.
import os
import pandas as pd
import snowflake.connector
conn = snowflake.connector.connect(
    account='<Account ID>',
    user='<Username>',
    password=os.environ['SNOW_PASSWORD'],
    warehouse='COMPUTE_WH',
    database='SNOWFLAKE_SAMPLE_DATA',
    schema='TPCH_SF1'
)
query = '''
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
'''
pd.read_sql(query, conn)
Once you execute the code, you can see the results:
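If the SNOW_PASSWORD variable was never saved, os.environ['SNOW_PASSWORD'] raises a bare KeyError. The following is a minimal sketch of a fail-fast check that builds the same keyword arguments as the code above. require_env and snowflake_connect_kwargs are hypothetical helper names, and the warehouse, database, and schema values are simply the sample values used in this article.

```python
import os
from typing import Dict


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it under "
            "Account Settings > Environment Variables."
        )
    return value


def snowflake_connect_kwargs(account: str, user: str,
                             password_var: str = "SNOW_PASSWORD") -> Dict[str, str]:
    """Assemble keyword arguments for snowflake.connector.connect.

    The warehouse, database and schema are the sample values from this
    article and would normally be replaced with your own.
    """
    return {
        "account": account,
        "user": user,
        "password": require_env(password_var),
        "warehouse": "COMPUTE_WH",
        "database": "SNOWFLAKE_SAMPLE_DATA",
        "schema": "TPCH_SF1",
    }
```

The connection would then be opened with snowflake.connector.connect(**snowflake_connect_kwargs('<Account ID>', '<Username>')).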
12-03-2021
12:09 AM
With CML's multi-version Spark support, CML users can now run different versions of Spark side by side, even within a single project. Users can choose Spark 3 in the most recent CML version to take advantage of the performance and stability improvements in the latest version of Spark. Data Scientists can run workloads in both Spark 2 and Spark 3 within the same CML Workspace, maintaining backward compatibility with existing workloads while developing new applications on the latest version of Spark. Users select the Spark version they want for each workload, which makes it easy to migrate older Spark 2 jobs to Spark 3.