
Cluster sizing guidance for MLlib

New Contributor

Hi, I'm trying to plan out a cluster for a production environment that will be doing machine learning with MLlib. Our data scientists will be using CDSW, initially for topic modeling with LDA. We have an LDA job running on roughly 200 GB of data that is taking over a week to complete. It runs on 7 EC2 instances with 16 cores and 64 GB of memory each, on HDFS storage. The job uses 99 cores, and their utilization bounces between 50% and 80%.


Does this sound like a severely undersized cluster for this type of workload? Our data scientist is asking for ~2000 cores to be able to run this in under 1 day. I'm not finding much online about proper MLlib sizing.
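
For a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes the job's cost is a fixed number of core-days, takes 7 days as the current runtime (the real figure is "over a week", so this is a lower bound), and uses a hypothetical `efficiency` parameter for how much of the ideal linear speedup survives when scaling out. The function name and that parameter are my own illustration, not anything from Spark or MLlib, and LDA may well scale worse than this simple model suggests.

```python
import math


def cores_for_target(current_cores, current_days, target_days, efficiency=1.0):
    """Estimate cores needed to finish in target_days.

    Assumes total work is a fixed number of core-days. `efficiency` is a
    hypothetical fraction (0 < efficiency <= 1) of ideal linear speedup
    retained when scaling out; 1.0 means perfect scaling.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    core_days = current_cores * current_days
    return math.ceil(core_days / (target_days * efficiency))


# Figures from the post: 99 cores, at least 7 days, target of 1 day.
ideal_cores = cores_for_target(99, 7, 1)                       # perfect scaling
half_eff_cores = cores_for_target(99, 7, 1, efficiency=0.5)    # 50% scaling efficiency

# Scaling efficiency at which 2000 cores would just reach the 1-day target.
break_even_efficiency = (99 * 7) / (2000 * 1)

print(ideal_cores, half_eff_cores, round(break_even_efficiency, 2))
```

Under these assumptions, perfect scaling would need about 693 cores and 50% scaling efficiency about 1386, so a request for ~2000 cores corresponds to assuming roughly 35% efficiency. Since the current run takes longer than 7 days, the real requirement would be somewhat higher.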


Thanks for any input/help.


New Contributor
I meant to post this in the Spark forum, sorry for cross-posting. Still, if anyone has insight I'd appreciate it.