02-20-2018 12:10 PM
Hi, I'm trying to plan out a cluster for a production environment that will be doing machine learning with MLlib. Our data scientists will be using CDSW, initially for topic modeling with LDA. We have an LDA job running on roughly 200 GB of data that is taking over a week to complete. It runs on 7 EC2 instances with 16 cores and 64 GB each, with HDFS storage. The job uses 99 cores, and their utilization bounces between 50% and 80%.
Does this sound like a severely undersized cluster for this type of workload? Our data scientist is asking for ~2000 cores to be able to run this in under 1 day. I'm not finding much online about proper MLlib sizing.
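As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not established above: it treats the current runtime as exactly 7 days (the job actually takes "over a week", so this is a lower bound), and it assumes the work scales linearly with core count, discounted by a hypothetical `scaling_efficiency` factor. LDA may well scale worse than that because of shuffle and communication overhead, so treat the result as an optimistic floor, not a sizing recommendation.

```python
import math


def cores_for_target(current_cores, current_days, target_days, scaling_efficiency=1.0):
    """Estimate the cores needed to hit a target runtime.

    Assumes total work (core-days) is constant and divides evenly across
    cores; scaling_efficiency is a hypothetical fudge factor (1.0 = ideal).
    """
    core_days = current_cores * current_days
    return math.ceil(core_days / (target_days * scaling_efficiency))


# Figures from the post: 99 cores, assumed 7 days, target of 1 day.
ideal = cores_for_target(99, 7, 1)
half_efficiency = cores_for_target(99, 7, 1, scaling_efficiency=0.5)

# Scaling efficiency that a 2000-core request would correspond to
# under this simple model.
implied_efficiency = (99 * 7) / 2000

print(ideal, half_efficiency, round(implied_efficiency, 2))
```

Under this simple model, ideal scaling would need about 693 cores, and a request for ~2000 cores corresponds to assuming roughly 35% scaling efficiency. Since the existing cores are only 50% to 80% utilized, it may also be worth checking whether the job is bottlenecked on something other than CPU before adding cores.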
Thanks for any input/help.