04-08-2020
12:33 PM
Extreme Gradient Boosting, or XGBoost, has received a lot of attention recently and has been the star of the show across many data science workflows and Kaggle competitions. It is a decision-tree-based model that builds hundreds of “weak learners,” i.e. small trees that individually do not overfit the data. The final predictions or estimates are made by combining this group of trees, which reduces overfitting and improves generalization on unseen data.
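To make the "many weak learners" idea concrete, here is a minimal sketch of plain gradient boosting on a toy regression problem, using one-split trees (stumps) as the weak learners. This is an illustration only, not XGBoost's actual algorithm, which adds second-order gradients, regularization, and many optimizations. The function names, the toy data, and the hyperparameter values are all assumptions made for this example.

```python
# Toy gradient boosting with decision stumps (standard library only).
# Illustrative sketch; not how XGBoost is implemented internally.

def fit_stump(xs, residuals):
    """Fit a one-split tree: pick the threshold that minimizes squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical toy data: y = x^2 on 20 evenly spaced points.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

baseline = (sum(ys) / len(ys), 0.3, [])   # mean-only model, no trees
model = boost(xs, ys)
print("baseline MSE:", mse(baseline, xs, ys))
print("boosted MSE: ", mse(model, xs, ys))
```

Each stump on its own is a poor model, but because every new stump is fitted to what the ensemble still gets wrong, the combined training error shrinks as rounds are added.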
For many machine learning use cases, especially in enterprises with very large volumes of data, XGBoost is a strong candidate for distributed ML model building, because its training has been shown to scale close to linearly with the number of parallel instances. To achieve this, we can use Dask, an open-source parallel computing framework, inside CML. While you can use other frameworks such as Spark, we chose Dask because it is written natively in Python (Spark runs on the JVM), so it avoids the overhead of running JVMs and switching context between Python and the JVM. It is also easier to debug, because errors surface as ordinary Python stack traces from Dask.
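As a rough sketch of what distributed training with Dask can look like, the snippet below uses the `xgboost.dask` interface with a `LocalCluster` as a stand-in for a real cluster. It is not the project code referenced in this post: the synthetic data, the parameter values, and the helper names `build_params` and `train_distributed` are assumptions for illustration, and in CML the Dask scheduler and workers would be started on CML infrastructure rather than locally. It assumes `dask`, `distributed`, and `xgboost` are installed.

```python
def build_params(max_depth=4, learning_rate=0.1):
    """Illustrative XGBoost parameters; the values are assumptions, not tuned."""
    return {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "max_depth": max_depth,
        "learning_rate": learning_rate,
    }


def train_distributed(n_workers=2, num_boost_round=50):
    """Train XGBoost on a Dask cluster and return booster, history, predictions."""
    # Imported here so the module loads even where the optional packages are absent.
    import dask.array as da
    from dask.distributed import Client, LocalCluster
    from xgboost import dask as dxgb

    # LocalCluster stands in for a real multi-node cluster.
    with LocalCluster(n_workers=n_workers, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            # URL of the Dask dashboard for monitoring the workers.
            print("Dashboard:", client.dashboard_link)

            # Synthetic, chunked data; each chunk can live on a different worker.
            X = da.random.random((100_000, 20), chunks=(10_000, 20))
            y = (X[:, 0] + X[:, 1] > 1.0).astype("int32")

            dtrain = dxgb.DaskDMatrix(client, X, y)
            output = dxgb.train(
                client,
                build_params(),
                dtrain,
                num_boost_round=num_boost_round,
                evals=[(dtrain, "train")],
            )
            preds = dxgb.predict(client, output["booster"], X)
            return output["booster"], output["history"], preds.compute()


if __name__ == "__main__":
    booster, history, preds = train_distributed()
    print("Predictions computed:", preds.shape[0])
```

Because the data is a chunked Dask array, each worker trains on the chunks it holds, and any failure is reported as a regular Python traceback.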
The following video, Using Distributed XGBoost with Dask in Cloudera Machine Learning, shows how to use distributed XGBoost inside CML to supercharge your ML models and get results faster.
Follow along with the project code here!
You can also manage your distributed XGBoost model training from the Dask dashboard directly in CML. Learn more in the following video:
Looking for more? Check out the full blog post and walkthrough and share your projects with the community!
References
Extreme Gradient Boosting
DASK
03-19-2020
07:27 AM
Just Released: Cloudera Data Science Workbench (CDSW) With Support for CDP Data Center
CDSW 1.7.2 is now available, bringing support for Cloudera Data Platform Data Center Edition 7.0 (CDP-DC) and improved usability for experiments and model deployment. CDSW on CDP-DC gives your data science teams an improved, best-of-breed data science platform experience on our latest enterprise data platform.
Whether you’ve already begun to migrate to CDP Data Center or are just starting your journey to the cloud, CDSW on CDP-DC provides continuity in your data science experience and a smoother path to Cloudera Machine Learning (CML), our enterprise cloud-native data science and machine learning platform for CDP Private Cloud and Public Cloud.
Support for virtual clusters in CDP Data Center
Customers using CDP-DC as a stepping stone in their journey to Private Cloud can now reduce the impact on other workflows, and avoid noisy neighbors, by running CDSW in isolated virtual clusters with assigned resources.
Also in this release:
- Ability to select environment variables in model and experiment builds, giving data science teams more control when testing and deploying models into production
- Other minor fixes: web interface fixes and minor usability improvements. Read the release announcement for the full list of bug fixes.
Links:
Download it here. Upgrade with Cloudera Manager
CDSW Overview
Getting started with CDSW
As always, we welcome your feedback. Please send your comments and suggestions on our community forums.