Member since 07-09-2015
68 Posts
24 Kudos Received
12 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 8925 | 11-23-2018 03:38 AM |
 | 2297 | 10-07-2018 11:44 PM |
 | 2940 | 09-24-2018 12:09 AM |
 | 4638 | 09-13-2018 02:27 AM |
 | 2970 | 09-12-2018 02:27 AM |
07-22-2022
08:51 AM
Data Scientists have access to a wide range of ML Runtimes: they can use different versions of the Python, R, and Scala kernels with the Workbench or JupyterLab editors, and can benefit from extended capabilities such as GPU acceleration. Cloudera continues to release new versions of ML Runtimes, and customers can also register custom Runtimes that they build for their specific business use cases. With these continuous additions, the list of available options can grow large, and some entries will become irrelevant or outdated. CML now lets administrators disable Runtime variants or specific versions. For example, they can disable all Python 3.6 Runtimes, because Python 3.6 is officially end-of-life (EOL) and won't receive any further security or bug-fix patches. Using these Runtimes can be considered a security risk, which administrators can now address. Once administrators disable an ML Runtime, data scientists can no longer use it for development, and existing workloads configured with it will fail to start.
07-15-2022
09:52 AM
1 Kudo
This article explains how to configure Spark Connections in Cloudera Machine Learning.

CML enables easy Spark data connections to the data stored in the Data Lake by abstracting the SparkSession connection details. CML users don't need to set complicated endpoint and configuration parameters or load HWC or Iceberg libraries. They can use the cml.data library to get preconfigured connections.

CML Data Connection Snippets

The spark SparkSession object has all the necessary options set to make connections work. Users have two options to set custom or job-specific configurations like executor CPU and memory parameters.

1. Specify the configuration inline

With SparkContext's setSystemProperty method, you can set Spark properties that will be picked up while building the SparkSession object. You can set any Spark property like below before calling cmldata's get_spark_session method.

Python script

import cml.data_v1 as cmldata
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.cores', '4')
SparkContext.setSystemProperty('spark.executor.memory', '8g')
CONNECTION_NAME = "go01-aw-dl"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()

2. Use spark-defaults.conf

By placing a file called spark-defaults.conf in your project root (/home/cdsw/), you can set Spark properties for your SparkSession. The properties configured in this file will be automatically appended to the global Spark defaults.

spark-defaults.conf

spark.executor.cores=4
spark.executor.memory=8g

Python script

import cml.data_v1 as cmldata
CONNECTION_NAME = "go01-aw-dl"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()

Conclusion

CML supports flexibility by offering two ways to configure its Spark connection snippets. The inline option helps users who want to fine-tune their Spark applications and want to configure different properties per Spark job. The spark-defaults.conf option helps users who want to set Spark properties that are applied for the whole Project. To read more about the Data Connections feature, read the following blog post: https://blog.cloudera.com/one-line-away-from-your-data/
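For the spark-defaults.conf option, the file can also be generated from a Python dictionary so that project-wide properties live in one place. The following is a minimal sketch using only the standard library; the helper names `render_spark_defaults` and `write_spark_defaults` are hypothetical and not part of cml.data, and the `spark.` prefix check is a convenience of this sketch for catching typos, not a CML requirement.

```python
from pathlib import Path


def render_spark_defaults(props):
    """Render a dict of Spark properties as key=value lines (hypothetical helper)."""
    lines = []
    for key, value in props.items():
        # Guard against typos; this check is a choice made for this sketch.
        if not key.startswith("spark."):
            raise ValueError(f"not a Spark property: {key!r}")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def write_spark_defaults(props, project_root="/home/cdsw"):
    """Write spark-defaults.conf into the project root and return its path."""
    target = Path(project_root) / "spark-defaults.conf"
    target.write_text(render_spark_defaults(props))
    return target


# Example (run inside a CML project):
# write_spark_defaults({"spark.executor.cores": 4, "spark.executor.memory": "8g"})
```

The resulting file has the same two lines as the spark-defaults.conf example in the article, so the Python script that follows it can stay unchanged.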
06-17-2022
03:47 AM
Data scientists on CML Workspaces have access to GPUs to accelerate their machine learning projects and reduce the time it takes to build and train predictive models. NVIDIA GPU nodes are available for administrators to configure for CML Workspaces in both AWS and Azure. CML now supports adding new GPU nodes to existing CML Workspaces created without GPUs, so data scientists can access GPU acceleration without having to recreate CML Workspaces. Administrators can also replace GPU nodes in CML Workspaces to switch to the latest generation GPUs. With these new capabilities, it's easier for administrators to manage GPU nodes in CML Workspaces and enable data scientists to use the newest generation of GPUs.
06-03-2022
06:47 AM
The Data Discovery and Visualization experience ships with preconfigured Data Connections, a database browser, an interactive SQL editor, drag-and-drop Visual Dashboarding, and Connection Snippets. These new capabilities speed up the development process by cutting down the time spent finding, exploring, understanding, and accessing data. Data Scientists need to fully understand their data in order to analyze it properly, build models, and power ML use cases. To reduce the time to insight, CML ships and integrates all the required tools, reducing the friction between the different steps and speeding up development for data science teams. These new capabilities are built on top of Cloudera Data Visualization, putting state-of-the-art visual capabilities in the hands of Data Scientists. To get started, open any Project in a CML May or newer Workspace and select the Data tab. You can read more about the new capabilities in the documentation.
05-25-2022
07:03 AM
The ML Runtimes 2022.04-1 Release includes a technical preview version of the new workbench architecture, the PBJ (Powered by Jupyter) Workbench.

In the previous Workbench editor, the code and output shown in the console (the right-hand pane in the image below) were passed to and from Python, R, or Scala via a Cloudera-specific, custom messaging protocol. In the PBJ Workbench, on the other hand, the code and output are passed to and from the target language runtime via the Jupyter messaging protocol. They are handled inside the runtime container by a Jupyter kernel and rendered in your browser by JupyterLab’s client code.

This may seem like a subtle change, but it will provide CML users with some major benefits.

First, the behavior of user code and third-party libraries on CML will be more consistent with its behavior in Jupyter-based environments. That means that a wider variety of rich visualization libraries will work out of the box, and in cases where rich visualization libraries do not work, error messages in the CML console and the browser console will be easier to google. Likewise, dependency conflicts between kernel code and user code will be rarer, and when they do occur they will be easier for customers to diagnose and fix. To give you a taste of what this higher degree of consistency is like, note that Python 3’s input() function now works. Go ahead and try it out!

Second, customers will no longer need to build runtime images starting from Cloudera base images and will no longer need to restrict themselves to languages and versions that Cloudera has packaged. Any combination of base image, target language, and language version can be used with the PBJ Workbench as long as a Jupyter kernel is available for that combination.

You can try it out by running a PBJ Workbench Python session using a CML April or newer Workspace. The look and feel of the workbench will be more or less unchanged. Under the hood, however, the way that code and outputs are rendered and passed between the web app and the Python interpreter has been re-engineered to better align with the Jupyter ecosystem. The Technical Preview documentation is available here.
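To try the input() behavior mentioned above, a small interactive snippet is enough. The sketch below is illustrative only: the function name `greet` is hypothetical, and the `read` parameter exists purely so the function can also be exercised without a live console.

```python
def greet(read=input):
    """Ask for a name and return a greeting.

    `read` defaults to the built-in input(); passing another callable makes
    the function usable where no interactive console is attached.
    """
    name = read("What is your name? ").strip()
    return f"Hello, {name or 'stranger'}!"


# In a PBJ Workbench Python session:
# print(greet())
```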
04-08-2022
02:29 AM
2 Kudos
This article explains how to use the Snowflake Connector for Spark in Cloudera Machine Learning.
Save your Snowflake password for your account. Go to Account Settings > Environment Variables and create a new entry with Name as "SNOW_PASSWORD" and Value as your <password>.
Create a new session in your CML project. Use any of the editors, and the Python 3.7 kernel. You also need to enable Spark for this session and select the Spark 2.4.7 version.
Download required dependencies. To initiate the connection, you need to have a Snowflake Connector for Spark and a Snowflake JDBC Driver that's compatible with CML's Spark version. You can get these from the Maven Central Repository: spark-connector and jdbc-driver. Place them in a ./jars folder in your CML Project.
cdsw@ysezv3fm94edq4pb:~$ ls -l ./jars/
total 28512
-rw-r--r-- 1 cdsw cdsw 28167364 Mar 17 21:03 snowflake-jdbc-3.13.16.jar
-rw-r--r-- 1 cdsw cdsw 1027154 Jan 27 01:18 spark-snowflake_2.11-2.9.3-spark_2.4.jar
Initiate the Snowflake connection. You need to set your Snowflake Account ID and your username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in the Account Settings. I'm using the default Snowflake Warehouse and a sample database/schema.
import os
from pyspark.sql import SparkSession

# Connection options; replace <Account ID> and <Username> with your own values.
sfOptions = {
    "sfURL": "<Account ID>.snowflakecomputing.com",
    "sfUser": "<Username>",
    "sfPassword": os.environ["SNOW_PASSWORD"],
    "sfDatabase": "SNOWFLAKE_SAMPLE_DATA",
    "sfSchema": "TPCH_SF1",
    "sfWarehouse": "COMPUTE_WH",
}

# Pass every file in the jars folder (connector and JDBC driver) to Spark.
snowJars = [os.path.join('/home/cdsw/jars', x) for x in os.listdir('/home/cdsw/jars')]

spark = SparkSession.builder \
    .appName("cml-spark-snowflake-test") \
    .config("spark.jars", ",".join(snowJars)) \
    .getOrCreate()

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"

query = '''
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
'''

# Run the query in Snowflake and load the result as a Spark DataFrame.
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", query) \
    .load()
df.show()
Execute the code; df.show() prints the first rows of the query result.
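The dependency step above places both JARs in the ./jars folder, and the script passes every file in that folder to spark.jars. A small guard can make a missing or misplaced JAR fail early with a clear message. This is a minimal sketch, not part of the Snowflake connector or CML: the helper name `collect_jars` is hypothetical, and the required file-name prefixes are assumptions taken from the file names in the listing above.

```python
import os


def collect_jars(jar_dir, required_prefixes=("snowflake-jdbc", "spark-snowflake")):
    """Return a comma-separated spark.jars value for the .jar files in jar_dir.

    Raises FileNotFoundError if no JAR starts with one of the required
    prefixes (assumed names for the JDBC driver and the Spark connector).
    """
    # Ignore anything that is not a JAR, such as notes or checksum files.
    jars = sorted(name for name in os.listdir(jar_dir) if name.endswith(".jar"))
    missing = [
        prefix
        for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in jars)
    ]
    if missing:
        raise FileNotFoundError(
            f"missing JAR(s) in {jar_dir}: {', '.join(missing)}"
        )
    return ",".join(os.path.join(jar_dir, name) for name in jars)


# Example: .config("spark.jars", collect_jars("/home/cdsw/jars"))
```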
04-07-2022
07:17 AM
This article explains how to connect to Snowflake from Cloudera Machine Learning.
Save your Snowflake password for your account. Go to Account Settings > Environment Variables and create a new entry with Name: "SNOW_PASSWORD" and Value: <your password>
Create a new session in your CML Project. You can use any of the editors, and a Python kernel:
Install required Python packages
pip install pandas snowflake-connector-python "snowflake-connector-python[pandas]"
Initiate the Snowflake connection. You need to set your Snowflake Account ID and your username for the connection. Your Snowflake password is retrieved from the environment variable that you configured in the Account Settings. I'm using the default Snowflake warehouse and a sample database/schema.
import os
import pandas as pd
import snowflake.connector

# Replace <Account ID> and <Username> with your own values.
conn = snowflake.connector.connect(
    account='<Account ID>',
    user='<Username>',
    password=os.environ['SNOW_PASSWORD'],
    warehouse='COMPUTE_WH',
    database='SNOWFLAKE_SAMPLE_DATA',
    schema='TPCH_SF1'
)

query = '''
select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER limit 100
'''

# Run the query and load the result into a pandas DataFrame.
pd.read_sql(query, conn)
Once you execute the code, pd.read_sql returns the query results as a pandas DataFrame.
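The script reads the password with os.environ['SNOW_PASSWORD'], which raises a bare KeyError if the variable was never configured. A small wrapper can turn that into a more helpful message. This is a minimal sketch; the helper name `require_env` is hypothetical and not part of the Snowflake connector.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail with a hint."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it under "
            "Account Settings > Environment Variables."
        )
    return value


# Example: password=require_env("SNOW_PASSWORD")
```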
12-03-2021
12:09 AM
With CML's multi-version Spark support, CML users can now run different versions of Spark side by side, even within a single project. Users can select to use Spark 3 in the most recent CML version and take advantage of performance and stability improvements in the latest version of Spark. Data Scientists can run workloads in both Spark 2 and Spark 3 within the same CML Workspace, thus maintaining backward compatibility with existing workloads while developing new applications on the latest version of Spark. Users can select the Spark version they want to use for each workload, making it easy to migrate older jobs using Spark 2 to Spark 3.
11-03-2021
01:22 AM
At Cloudera, we believe that data can make what is impossible today, possible tomorrow. There are many good uses of data. With data, we can monitor our business, the overall business, or specific business units. We can segment based on the customer verticals or whether they run in the public or private cloud. We can understand customers better, see usage patterns and main consumption drivers. We can find customer pain points, see where they get stuck, and understand how different bugs affect them. With data, we can discover new market opportunities, and review where we stand compared to the global market. We can track feature adoption, see how new features are picked up and what usage/consumption they generate. With data, we can set better goals, know where we are and where we want to go. And in the end, we can make better decisions.

At Cloudera, we practice what we preach. As the Cloudera Data Platform (CDP) gains popularity and more and more customers make it a critical piece of their infrastructure, we set out to create the best data platform in the enterprise. Today, we will highlight a new feature that showcases one great example of using data in the service of our customers. I’m excited to share this feature because this is a success story!

Late last year, we saw our customers struggling to get CML Workspaces up. The elevated escalation count put a strain on our engineering team, trials were slowed down, and even worse, our customers had a very bad experience with our product. We needed to figure this out. We tried to understand “is this a systemic issue?” or “how widespread is this problem?”, and the results were alarming. Customers experienced issues more than half of the time (57%).

There are two phases of CML Workspace creation: first, we create a K8s Cluster via the liftie APIs (this is the ‘Provision’ step); second, we install the CML service.
The above chart shows the workspace provisioning results broken down for CML releases between June ‘20 and Jan ‘21. Once we saw the results, we dug in and analyzed dozens of failure modes. We discovered that actual product bugs caused only a small portion of the failures. The most common failures we found were instance types requested in unsupported regions, failures due to conflicts between the admin-provided CIDR address ranges, and environments where CML Workspaces were failing due to an unhealthy DataLake. Okay, we identified the problem: we attempt to create CML Workspaces when we know they will surely fail.

Preflight checks to the rescue. Liftie and CML engineers teamed up to solve this problem. They built a framework and released a series of checks over the course of the last few quarters to catch issues early. The results are astonishing. For the most recent - Aug ‘21 - release, customers experienced issues with the workspace creation just 7% of the time. For 39% of the attempts, we caught issues early and showed a meaningful error message, which saved hours and hours of work for support, engineering, and our customers.

This was a data-driven project. We used data to qualify the problem, to understand the issues, and to measure our progress and the outcome. The result is a significantly more stable platform and a new framework that all other CDP Data Services will benefit from.

Get started with Cloudera Machine Learning in CDP now.
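One of the failure modes described above, conflicting admin-provided CIDR address ranges, lends itself to a simple check that can run before any provisioning starts. The sketch below only illustrates the idea with Python's standard ipaddress module; it is not the Liftie or CML preflight framework, and the function name `find_cidr_conflicts` and the sample ranges are hypothetical.

```python
import ipaddress
from itertools import combinations


def find_cidr_conflicts(cidrs):
    """Return the pairs of CIDR ranges that overlap each other."""
    # ip_network raises ValueError for malformed ranges, which is itself
    # a useful early failure for a preflight check.
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        # overlaps() is only defined between networks of the same IP version.
        if a.version == b.version and a.overlaps(b)
    ]


# Example with made-up ranges:
# conflicts = find_cidr_conflicts(["10.0.0.0/16", "10.0.1.0/24", "192.168.0.0/24"])
# if conflicts:
#     raise SystemExit(f"Conflicting CIDR ranges: {conflicts}")
```

Reporting every conflicting pair at once, rather than stopping at the first, matches the goal stated in the post of showing a meaningful error message early.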
07-15-2021
03:40 AM
Administrators can customize the Cloudera-provided ML Runtimes to support Data Scientists’ specific use cases. They can install additional OS packages, Python and R libraries, third-party drivers for connecting to external data stores, or even a new editor. CML now enables registering these custom Runtimes and making them available for Data Scientists to use in their projects. Data Scientists have specific requirements for their working environments: they need a set of R or Python libraries and ready-made connections to fetch data from third-party data stores. With the new feature, administrators can create custom Runtimes that data scientists can use in CML. To learn more, visit the documentation about Customized ML Runtimes.