I have a small cluster that will hold 1.5 TB of data for 7 months.
There are 4 main jobs in the cluster:
1- The 1st job runs on a few MBs (around 10 MB) and writes the same data. It runs every half hour, and the accumulated output written per day is 0.5 GB.
2- The 2nd job runs on a few KBs (around 10 KB) and writes the same data. It runs every half hour, and the accumulated output written per day is 0.2 GB.
3- The 3rd job reads the 1st job's data and also writes a few KBs.
4- The 4th job compacts the data of the first 2 jobs and runs once a day.
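As a rough cross-check of the volumes listed above, the daily figures can be summed and projected over the 7 months. This is only a sketch: it assumes 30-day months, counts only the first two jobs (the output of the 3rd and 4th jobs is not quantified above), and ignores HDFS replication.

```python
# Rough projection of newly written data, using only the figures quoted above.
DAILY_GB = {"job1": 0.5, "job2": 0.2}  # accumulated output per day, in GB
MONTHS = 7
DAYS_PER_MONTH = 30  # assumption: 30-day months


def projected_gb(daily_gb, months, days_per_month=DAYS_PER_MONTH):
    """Total GB written by the listed jobs over the given number of months."""
    return sum(daily_gb.values()) * months * days_per_month


total = projected_gb(DAILY_GB, MONTHS)
print(f"{sum(DAILY_GB.values()):.1f} GB/day -> roughly {total:.0f} GB in {MONTHS} months")
```

If these figures are right, the two ingest jobs alone account for well under 1.5 TB, so the remainder would have to come from the other outputs, from data already in the cluster, or from replication.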
** I also have 2 administration jobs: a cleaner that deletes the logs and the job files, and a retention job that runs once a day.
The administration jobs are only Oozie launcher jobs, so they need just 2 mappers.
The other jobs average 3-5 mappers and the same number of reducers. The block size is 128 MB.
I want to minimize the cost of the current cluster and move to VMs. The cluster is not managed by Cloudera, and it still runs the old JobTracker (JT) and TaskTrackers (TTs).
I'm fine with the master servers and want to replace only the workers. All the jobs run in less than 2 minutes, and I have a comfortable SLA for them. I'm planning to have 6 VMs, each with 250 GB of disk, 8 GB of memory and 4 vcores.
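To make the storage side of this plan concrete, here is a back-of-the-envelope check. The replication factor of 3 (the HDFS default for dfs.replication) and the 25% of each disk reserved for the OS, logs and intermediate MapReduce output are my assumptions, not measured values. It is also an assumption that the 1.5 TB figure is counted before replication, so the sketch prints lower replication factors as well.

```python
# Back-of-the-envelope HDFS capacity check for the proposed worker VMs.
NUM_VMS = 6
DISK_PER_VM_GB = 250
DATA_GB = 1.5 * 1024     # 1.5 TB of data, expressed in GB
REPLICATION = 3          # assumption: HDFS default dfs.replication
NON_DFS_RESERVE = 0.25   # assumption: share of disk kept for OS, logs, shuffle


def usable_hdfs_gb(num_vms, disk_gb, reserve):
    """Raw disk across all workers minus the non-DFS reserve."""
    return num_vms * disk_gb * (1 - reserve)


def required_raw_gb(data_gb, replication):
    """Raw disk needed to store data_gb at the given replication factor."""
    return data_gb * replication


usable = usable_hdfs_gb(NUM_VMS, DISK_PER_VM_GB, NON_DFS_RESERVE)
for rep in (1, 2, REPLICATION):
    need = required_raw_gb(DATA_GB, rep)
    verdict = "fits" if need <= usable else "does NOT fit"
    print(f"replication={rep}: need {need:.0f} GB, usable {usable:.0f} GB -> {verdict}")
```

Under these assumptions, 6 x 250 GB looks short on disk at every replication factor, so either the disks would need to be larger or the retention job would have to keep less data.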
Do you think I can survive with this change? Overall, the cluster has 150K blocks and 250K files.
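For the masters, the block and file counts translate into a small NameNode footprint. The sketch below uses the commonly quoted rule of thumb of about 150 bytes of heap per namespace object; that constant is an assumption for illustration, not something I measured on this cluster.

```python
# Rough NameNode heap used by namespace objects (files + blocks).
NUM_FILES = 250_000
NUM_BLOCKS = 150_000
BYTES_PER_OBJECT = 150  # assumption: common rule of thumb, not a measured value


def namespace_heap_mb(files, blocks, bytes_per_object=BYTES_PER_OBJECT):
    """Approximate heap in MB needed to hold the given files and blocks."""
    return (files + blocks) * bytes_per_object / (1024 * 1024)


print(f"roughly {namespace_heap_mb(NUM_FILES, NUM_BLOCKS):.0f} MB of NameNode heap")
```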
I know that Hadoop prefers stronger physical machines, but I would rather not invest all of the money in the cluster.
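To put the concern about weaker machines in numbers, here is a rough slot estimate for the proposed workers. The 2 GB reserved per VM for the OS plus the DataNode and TaskTracker daemons, the 1 GB per task slot, and the limit of one slot per vcore are assumptions picked for illustration. The worst case assumes all four main jobs overlap, each running 5 mappers and 5 reducers at once, plus the 2 mappers of the administration jobs.

```python
# Rough MRv1 slot estimate for the proposed worker VMs.
NUM_VMS = 6
MEM_PER_VM_GB = 8
VCORES_PER_VM = 4
RESERVED_GB = 2   # assumption: OS + DataNode + TaskTracker daemons
GB_PER_SLOT = 1   # assumption: heap given to each map/reduce task


def slots_per_vm(mem_gb, vcores, reserved_gb, gb_per_slot):
    """Slots a VM can host, limited by memory or by vcores, whichever is lower."""
    by_memory = (mem_gb - reserved_gb) // gb_per_slot
    return min(by_memory, vcores)  # assumption: at most one slot per vcore


def worst_case_tasks(main_jobs=4, tasks_per_phase=5, admin_mappers=2):
    """Concurrent tasks if every job overlaps and maps and reduces run together."""
    return main_jobs * tasks_per_phase * 2 + admin_mappers


total_slots = NUM_VMS * slots_per_vm(MEM_PER_VM_GB, VCORES_PER_VM, RESERVED_GB, GB_PER_SLOT)
demand = worst_case_tasks()
print(f"slots available: {total_slots}, worst-case concurrent tasks: {demand}")
```

With JT/TT the per-node limits are set through mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. When demand exceeds the available slots, tasks wait for a free slot rather than fail, so the question is mostly whether the jobs still meet the SLA.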