What's the best way to do Monte Carlo simulation on Hadoop

pcoates — Wed, 18 Nov 2015 22:32:08 GMT

Monte Carlo is one of many simulation types that execute a huge number of repetitive tasks using relatively little data. The "data" is usually little more than sets of parameters to a function that must be executed a zillion times. Often this is followed by some kind of summarizing process. Clearly a custom MR job can be written for this, but are there any standard frameworks that HDP recommends, or a published set of best practices?
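To make the shape of the workload concrete, here is a minimal single-process sketch in plain Python. The model (`simulate_once`, a Gaussian random walk) and its parameters `drift`, `volatility`, and `steps` are made up for illustration; the point is only the pattern of a small parameter set, a function run many times, and a summary at the end.

```python
import random
import statistics


def simulate_once(drift, volatility, steps, rng):
    """One trial: terminal value of a simple Gaussian random walk.

    Hypothetical stand-in for whatever function the real simulation runs.
    """
    value = 0.0
    for _ in range(steps):
        value += rng.gauss(drift, volatility)
    return value


def run_trials(params, n_trials, seed=0):
    """Run the same function n_trials times, then summarize the outcomes."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    outcomes = [simulate_once(rng=rng, **params) for _ in range(n_trials)]
    return {
        "trials": n_trials,
        "mean": statistics.fmean(outcomes),
        "stdev": statistics.stdev(outcomes),
    }


if __name__ == "__main__":
    # The entire "input data" is this small dictionary of parameters.
    params = {"drift": 0.01, "volatility": 0.2, "steps": 50}
    print(run_trials(params, n_trials=10_000))
```

Nothing here is specific to Hadoop; any framework chosen would mainly be spreading the loop inside `run_trials` across more cores or machines.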

Re: What's the best way to do Monte Carlo simulation on Hadoop

nsabharwal — Wed, 18 Nov 2015 22:33:44 GMT

@Peter Coates This was brought up by a couple of DS guys. We discussed using Spark (link)

Re: What's the best way to do Monte Carlo simulation on Hadoop

bsaini — Thu, 19 Nov 2015 01:40:38 GMT

To add to this: as a rule of thumb, Spark is the best choice when it comes to executing iterative algorithms. It helps that it has MLlib built in. I haven't seen anyone writing MR by hand anymore (except that I recently met a customer of one of our competitors who was, because they had been misled into believing 'Hive is slow').

Re: What's the best way to do Monte Carlo simulation on Hadoop

dkumar1 — Sat, 05 Dec 2015 02:16:47 GMT

@Peter Coates

Why do you need Spark if the data is very small and can fit on a single node? There are other excellent Monte Carlo simulation packages which can do this efficiently -- open source or otherwise. Even Excel has an add-in for this.

edit: If you need more horsepower for Monte Carlo simulations than one node can provide, you can look at MPI. MPICH is pretty good: https://www.mpich.org/ There's even a YARN adapter for MPICH: https://github.com/alibaba/mpich2-yarn

Re: What's the best way to do Monte Carlo simulation on Hadoop

dkumar1 — Sat, 05 Dec 2015 02:19:00 GMT

@bsaini

Iterative computations are best in Spark for large data sets, not for CPU-bound processes that use a small data set repeatedly.
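A rough sketch of why such a job is CPU-bound rather than data-bound: each task can be described by a tiny tuple, and each result is a tiny partial summary that merges by addition. The example below uses the textbook estimate of pi as a stand-in workload; the names `run_batch`, `merge`, and `estimate_pi` are hypothetical, and the default `mapper` is Python's serial built-in `map`.

```python
import functools
import random


def run_batch(task):
    """Worker: run one independent batch and return a small partial summary.

    `task` is just (seed, n_trials); no input data set is shipped at all.
    """
    seed, n_trials = task
    rng = random.Random(seed)  # distinct seed per batch, so draws differ
    inside = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point landed inside the quarter circle
            inside += 1
    return (n_trials, inside)


def merge(a, b):
    """Combine two partial summaries; associative, so merge order is free."""
    return (a[0] + b[0], a[1] + b[1])


def estimate_pi(tasks, mapper=map):
    """Map batches to partial summaries, then reduce them to one estimate."""
    total, inside = functools.reduce(merge, mapper(run_batch, tasks), (0, 0))
    return 4.0 * inside / total


if __name__ == "__main__":
    tasks = [(seed, 100_000) for seed in range(8)]
    print(estimate_pi(tasks))
```

Because `run_batch` depends only on its argument and `merge` is associative, `mapper` could in principle be swapped for a parallel map such as `multiprocessing.Pool.map`, and the same map-then-reduce split is what an MR, Spark, or MPI version would have to express. Which of those pays off depends on how much CPU time a trial needs, not on data volume.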

Re: What's the best way to do Monte Carlo simulation on Hadoop

aervits — Tue, 02 Feb 2016 09:48:07 GMT

@Peter Coates Can you accept the best answer to close this thread?

Re: What's the best way to do Monte Carlo simulation on Hadoop

vzlatkin — Fri, 03 Jun 2016 23:11:19 GMT

Here is an example: https://community.hortonworks.com/articles/36321/predicting-stock-portfolio-losses-using-monte-carl.html