Created 11-18-2015 02:32 PM
Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times. Often this is followed by some kind of summarizing process. Clearly a custom MR job can be written for this, but are there any standard frameworks that HDP recommends, or a published set of best practices?
Created 11-18-2015 02:33 PM
@Peter Coates This was brought up by a couple of DS guys. We discussed using Spark: link
Created 11-18-2015 05:40 PM
To add to this: as a rule of thumb, Spark is the best choice for executing iterative algorithms. It helps that MLlib is built in. I haven't seen anyone write MR by hand anymore (except for one of our competitors' customers I met recently, who had been misled into believing 'Hive is slow').
Created 12-04-2015 06:19 PM
Iterative computations are best in Spark when the data sets are large, not when the process is CPU-bound and reuses a small data set repeatedly.
Created 12-04-2015 06:16 PM
Why do you need Spark if the data is very small and can fit on a single node? There are other excellent Monte Carlo simulation packages, open source or otherwise, that can do this efficiently. Even Excel has an add-in for this.
Edit: If you need more horsepower for Monte Carlo simulations than one node can provide, you can look at MPI. MPICH is pretty good: https://www.mpich.org/ There's even a YARN adapter for MPICH: https://github.com/alibaba/mpich2-yarn
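For the single-node case, here is a minimal sketch of the pattern from the question (tiny parameter sets, a CPU-heavy function, then a summarizing step), using only the Python standard library. The pi estimate, the function names `run_trial_batch` and `estimate_pi`, and the use of consecutive integers as seeds are all illustrative assumptions, not anything from a particular package.

```python
import random
from multiprocessing import Pool


def run_trial_batch(params):
    """Run one batch of trials for a single (hypothetical) parameter set.

    params is (seed, n_samples): a tiny input that drives a CPU-heavy task.
    Returns how many random points fell inside the unit quarter circle.
    """
    seed, n_samples = params
    rng = random.Random(seed)  # per-task generator, so each batch is reproducible
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks=8, n_samples=100_000, processes=None):
    """Fan the parameter sets out to worker processes, then summarize."""
    # The "data" is just a list of small tuples, one per task.
    param_sets = [(seed, n_samples) for seed in range(n_tasks)]
    with Pool(processes=processes) as pool:
        hits = pool.map(run_trial_batch, param_sets)
    # Summarizing step: combine the per-task counts into one estimate.
    return 4.0 * sum(hits) / (n_tasks * n_samples)


if __name__ == "__main__":
    print(estimate_pi())
```

The same shape (a list of parameter tuples, a pure function per tuple, a final reduction) is what you would hand to MPI or Spark if one node stops being enough.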
Created 02-02-2016 01:48 AM
@Peter Coates can you accept the best answer to close this thread?
Created 06-03-2016 04:11 PM