
What's the best way to do Monte Carlo simulation on Hadoop

Rising Star

Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times, often followed by some kind of summarizing step. Clearly a custom MR job could be written for this, but is there any standard framework that HDP recommends, or a published set of best practices?
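For concreteness, the shape of the workload looks like this in plain Python — a cheap function evaluated many times, then a summarizing step. The pi estimate is just a stand-in for whatever the real simulated function would be:

```python
import random

def trial(rng: random.Random) -> bool:
    # One repetitive task: almost no data in, one bit out.
    x, y = rng.random(), rng.random()
    return x * x + y * y <= 1.0

def estimate_pi(n: int, seed: int = 42) -> float:
    rng = random.Random(seed)
    hits = sum(trial(rng) for _ in range(n))
    # Summarizing step: aggregate the per-trial results.
    return 4.0 * hits / n

print(estimate_pi(100_000))
```

The question is really about how to fan out the `trial` calls and collect the summary at cluster scale.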

1 ACCEPTED SOLUTION

Master Mentor

@Peter Coates This was brought up by a couple of DS guys. We discussed using Spark (link).


6 REPLIES

Master Mentor

@Peter Coates This was brought up by a couple of DS guys. We discussed using Spark (link).


To add to this: as a rule of thumb, Spark is the best choice when it comes to executing iterative algorithms, and it helps that MLlib is built in. I haven't seen anyone writing MR by hand anymore (though I recently met a customer of one of our competitors who had been misled into believing 'Hive is slow').

Contributor
@bsaini

Iterative computations in Spark are best for large data sets, not for CPU-bound processes that use a small data set repeatedly.
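To illustrate the distinction: in a CPU-bound simulation, the "data" shipped to each task is just a seed and a trial count, and a single node's cores can be saturated with the standard library alone. A sketch (the function names are mine, purely illustrative):

```python
import random
from concurrent.futures import ProcessPoolExecutor

def simulate_chunk(args):
    """CPU-bound worker: the only 'data' is a (seed, n_trials) pair."""
    seed, n_trials = args
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return hits

def estimate_pi_parallel(n_chunks: int = 8, chunk_size: int = 50_000) -> float:
    # Fan out tiny task descriptions, not a large data set.
    tasks = [(seed, chunk_size) for seed in range(n_chunks)]
    with ProcessPoolExecutor() as pool:
        hits = sum(pool.map(simulate_chunk, tasks))
    # Summarize: combine the per-chunk counts.
    return 4.0 * hits / (n_chunks * chunk_size)

if __name__ == "__main__":
    print(estimate_pi_parallel())
```

Spark's RDD API expresses the same fan-out/summarize pattern, but its strengths (partitioned data, caching between iterations) aren't exercised when the inputs are this small.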

Contributor

@Peter Coates

Why do you need Spark if the data is very small and can fit on a single node? There are other excellent Monte Carlo simulation packages, open source and otherwise, that can do this efficiently. Even Excel has an add-in for it.

Edit: if you need more horsepower for Monte Carlo simulations than one node can provide, you can look at MPI. MPICH is pretty good: https://www.mpich.org/ There's even a YARN adapter for MPICH: https://github.com/alibaba/mpich2-yarn
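To back up the single-node point: a vectorized NumPy version churns through millions of trials in memory on one machine. A sketch (not tied to any particular package mentioned above):

```python
import numpy as np

def estimate_pi_vectorized(n: int, seed: int = 0) -> float:
    # All n trials as one (n, 2) array: the data set is tiny relative
    # to the compute, which is exactly the CPU-bound Monte Carlo shape.
    rng = np.random.default_rng(seed)
    xy = rng.random((n, 2))
    hits = np.count_nonzero((xy ** 2).sum(axis=1) <= 1.0)
    return 4.0 * hits / n

print(estimate_pi_vectorized(1_000_000))
```

A run like this finishes in well under a second on commodity hardware, long before cluster scheduling overhead would pay for itself.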

Master Mentor

@Peter Coates can you accept the best answer to close this thread?
