
Extreme Gradient Boosting, or XGBoost, has received a lot of attention recently and has been the star of the show across many data science workflows and Kaggle competitions. It is a decision-tree-based model that builds hundreds of "weak learners", i.e. shallow trees that individually do not overfit the data. The final predictions are made by combining this group of trees, which reduces overfitting and improves generalization on unseen data.
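As a minimal sketch of that idea (the synthetic dataset and hyperparameter values below are illustrative and not part of this article's project), a few hundred shallow trees can be trained and evaluated with the scikit-learn-style API:

```python
# Illustrative only: many shallow ("weak") trees combined into one boosted model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic data, just to make the snippet self-contained.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hundreds of shallow trees; boosting combines them into the final predictor.
model = XGBClassifier(n_estimators=300, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```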

 

For many machine learning use cases, especially in enterprises with rapidly growing volumes of data, XGBoost is a great candidate for distributed ML model building, as it has been shown to scale roughly linearly with the number of parallel instances. To achieve this, we can use DASK, an open-source parallel computing framework, inside of CML. While you can use other frameworks such as Spark, we chose DASK because it is written natively in Python (Spark is written in Scala and runs on the JVM), so there is no overhead from running JVMs or context switching between Python and the JVM. It is also easier to debug, because DASK surfaces ordinary Python stack traces. A rough sketch of this setup is shown below.
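The snippet below is a minimal sketch using xgboost's native Dask API. The LocalCluster, random data, and hyperparameters are placeholders so the example is self-contained; in CML, the cluster would instead be launched across CML worker pods, as covered in the video and project code.

```python
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

# Start a small local cluster for illustration; in CML this would be a DASK
# cluster spread across worker pods.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)

# Illustrative random data, partitioned into chunks that Dask distributes across workers.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (da.random.random(100_000, chunks=10_000) > 0.5).astype(int)

# DaskDMatrix references the distributed partitions instead of copying them to one node.
dtrain = xgb.dask.DaskDMatrix(client, X, y)

# Each boosting round is coordinated by the scheduler and run across all workers.
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]                          # the trained ensemble
predictions = xgb.dask.predict(client, booster, X)   # lazy Dask array of scores
```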

 

In the following video, Using distributed Xgboost with DASK in Cloudera Machine Learning, we will learn how to use distributed XGBoost inside CML to supercharge your ML models and drive results faster.

 

 

Follow along with the project code here!

 

You can also monitor and manage your distributed XGBoost model training from the DASK Dashboard, directly from CML. Learn more in the following video:
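The dashboard URL itself is exposed by the DASK scheduler. As a minimal sketch, assuming you already have a dask.distributed Client connected to your cluster in a CML session, you can retrieve it like this:

```python
from dask.distributed import Client

# Connect to (or, with no arguments, start) a DASK cluster.
client = Client()

# URL of the diagnostics dashboard: task stream, worker memory, progress, etc.
print(client.dashboard_link)
```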

Looking for more? Check out the full blog post and walkthrough, and share your projects with the community!

