What is the distributed cache in Hadoop, and how can I find out how much distributed cache is allocated for a running job?
After the job completes successfully, what happens to that distributed cache? Is there a property that controls how much distributed cache is allocated on a single data node (single disk)?
Thanks in advance!
The distributed cache is a mechanism of the Hadoop MapReduce framework for broadcasting small or moderately sized read-only files to all the worker nodes where the map/reduce tasks of a given job run.
Each worker node that runs tasks of the job gets one copy of the file(s) sent via the distributed cache. The size of the distributed cache can be controlled with the cache size property in mapred-site.xml (local.cache.size in Hadoop 1; on YARN the NodeManager's equivalent setting lives in yarn-site.xml).
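After the job completes successfully, the distributed cache files (these are temporary local copies) are cleaned up from the worker nodes. On YARN this cleanup can happen lazily, once a node's local cache grows past its configured size, rather than the moment the job ends.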
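To make the size setting above concrete, here is a rough sketch of what the configuration could look like. The property names are the ones I know from the stock Hadoop default files, and the values shown are what I understand the defaults to be (about 10 GB). Treat both as assumptions and check mapred-default.xml / yarn-default.xml for your own version before relying on them.

```xml
<!-- yarn-site.xml (YARN, Hadoop 2.x and later): NodeManager localizer cache -->
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value> <!-- target cache size in MB -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value> <!-- how often the cleanup check runs, in ms -->
</property>

<!-- mapred-site.xml (Hadoop 1 / MRv1 only) -->
<property>
  <name>local.cache.size</name>
  <value>10737418240</value> <!-- cache size in bytes -->
</property>
```

These settings cap the local cache on each worker as a whole, not the amount used by one particular job.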
Hi Anatva,
Thanks for sharing your knowledge!
Is there any way I can see the amount of distributed cache allocated for a running job in the Resource Manager (RM) web UI?
Vijay, the submitter of the job (the programmer) adds a file or a zip file to the distributed cache of an MR job, so the user/developer already knows the size of the distributed cache. In the RM UI, once you expand the job XML for a specific job, you can see the locations of the files that are broadcast via the distributed cache. Please let me know if this answers your question. Otherwise, please post your question with an example.
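If you want a number rather than reading the job XML by eye, a small sketch like the one below could add up the sizes for you. It assumes the job XML carries the properties mapreduce.job.cache.files.filesizes and mapreduce.job.cache.archives.filesizes as comma-separated byte counts, which is the Hadoop 2.x naming as far as I know; confirm the names in your own job.xml. The class name DistCacheSize, the sample XML, and the hdfs://nn/... paths are made up for illustration, and only the JDK's built-in XML parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: sums the distributed cache sizes recorded in a job XML.
class DistCacheSize {

    // Reads every <property> name/value pair of a Hadoop-style configuration XML.
    static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse job XML", e);
        }
    }

    // Sums a comma-separated list of byte counts; a missing or empty list counts as 0.
    static long sumSizes(String csv) {
        long total = 0;
        if (csv == null || csv.trim().isEmpty()) {
            return total;
        }
        for (String s : csv.split(",")) {
            total += Long.parseLong(s.trim());
        }
        return total;
    }

    // Total bytes of cache files plus cache archives listed in the job XML.
    // Property names are assumed (Hadoop 2.x style); verify them in your job.xml.
    static long totalCacheBytes(String jobXml) {
        Map<String, String> props = readProperties(jobXml);
        return sumSizes(props.get("mapreduce.job.cache.files.filesizes"))
             + sumSizes(props.get("mapreduce.job.cache.archives.filesizes"));
    }

    public static void main(String[] args) {
        // Made-up sample; a real job.xml has many more properties.
        String sample = "<configuration>"
            + "<property><name>mapreduce.job.cache.files</name>"
            + "<value>hdfs://nn/tmp/lookup.txt,hdfs://nn/tmp/stopwords.txt</value></property>"
            + "<property><name>mapreduce.job.cache.files.filesizes</name>"
            + "<value>1048576,2048</value></property>"
            + "</configuration>";
        System.out.println("Distributed cache bytes: " + totalCacheBytes(sample));
    }
}
```

With the made-up sample, the total is 1048576 + 2048 = 1050624 bytes. For a real job you would save the job XML from the UI and pass its contents to totalCacheBytes; keep in mind that every worker node running the job's tasks holds its own copy of these files.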