Created 10-05-2018 11:07 AM
I have developed a program code that executes an X task in sequential and in MAPREDUCE model. But I see that the running time taken by the mapreduce is bigger than the sequential one( It must be the opposite!!). I'm running mapreduce in a single node, so at least I should get the same running time?
Created 10-05-2018 01:28 PM
If you are running MapReduce on a single node, it will take more time than a sequential application due to the job creation overhead that MapReduce must undertake. There is extra time taken in the case of MapReduce to submit the job, copy the code and dependencies into a YARN container, and start the job. As you scale out to several nodes and more, you will see the performance benefits of MapReduce.
In general however, MapReduce is used less often now on the platform - Hive runs on Tez now rather than MapReduce and I've only seen MapReduce of late being used for things like bulk loading data into HBase/Druid. In-memory processing, the likes of which both Hive/LLAP and Spark provide, can net you a significant performance boost depending on what you're trying to accomplish and the tool best suited for the job.
Created 10-05-2018 01:28 PM
If you are running MapReduce on a single node, it will take more time than a sequential application due to the job creation overhead that MapReduce must undertake. There is extra time taken in the case of MapReduce to submit the job, copy the code and dependencies into a YARN container, and start the job. As you scale out to several nodes and more, you will see the performance benefits of MapReduce.
In general however, MapReduce is used less often now on the platform - Hive runs on Tez now rather than MapReduce and I've only seen MapReduce of late being used for things like bulk loading data into HBase/Druid. In-memory processing, the likes of which both Hive/LLAP and Spark provide, can net you a significant performance boost depending on what you're trying to accomplish and the tool best suited for the job.