What is Distributed Cache in Hadoop ?


New Contributor

What is the distributed cache in Hadoop, and how can I find out how much distributed cache is allocated to a running job?

What happens to the distributed cache after a job completes successfully? Is there a property that controls how much distributed cache is allocated on a single data node (single disk)?

Thanks in advance !!

3 REPLIES

Re: What is Distributed Cache in Hadoop ?

New Contributor

The distributed cache is a mechanism provided by the Hadoop MapReduce framework for broadcasting small or moderately sized read-only files to all the worker nodes where the map/reduce tasks of a given job are running.

Each worker node that runs tasks for a given job receives one copy of each file sent through the distributed cache. The size of the local cache on a node can be capped with a cache size property: local.cache.size in mapred-site.xml on MRv1, or yarn.nodemanager.localizer.cache.target-size-mb in yarn-site.xml on YARN.

After the job completes successfully, the distributed cache files (which are temporary) are deleted from the worker nodes.
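
To make the broadcast idea concrete, here is a minimal sketch of a Hadoop Streaming mapper that reads a small side file shipped through the distributed cache. It assumes the file was passed with the streaming -files option and is therefore visible in the task's working directory. The file name lookup.txt, the tab-separated layout, and the function names are all hypothetical and chosen only for illustration.

```python
import sys


def load_lookup(path):
    """Read tab-separated key/value pairs from the cached side file into a dict."""
    table = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition("\t")
            table[key] = value
    return table


def map_line(line, table):
    """Join one tab-separated input record against the in-memory lookup table."""
    key, _, rest = line.rstrip("\n").partition("\t")
    return f"{key}\t{table.get(key, 'UNKNOWN')}\t{rest}"


def run_mapper(stdin, stdout, lookup_path="lookup.txt"):
    """Load the side file once, then stream the input records through the join."""
    # Assumption: the cached file sits in the task's working directory,
    # so a relative path is enough.
    table = load_lookup(lookup_path)
    for line in stdin:
        stdout.write(map_line(line, table) + "\n")
```

A real mapper script would end with a call such as run_mapper(sys.stdin, sys.stdout). The side file is loaded once per task rather than once per record, which is why this approach suits only small or moderately sized files.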

Re: What is Distributed Cache in Hadoop ?

New Contributor

Hi Anatva,

Thanks for sharing your knowledge!

Is there any way to see how much distributed cache is allocated to a running job in the Resource Manager (RM) web UI?


Re: What is Distributed Cache in Hadoop ?

New Contributor

Vijay, the person who submits the job (or the programmer) adds a file or a zip archive to the distributed cache of an MR job, so the user/developer already knows the size of the distributed cache. In the RM UI, once you expand the job XML for a specific job, you can see the locations of the files that are broadcast via the distributed cache. Please let me know whether this answers your question; otherwise, please post your question with an example.
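
If you save the job XML to a file, the same information can be pulled out with a short script. The sketch below assumes the configuration lists cached items under mapreduce.job.cache.files and mapreduce.job.cache.archives, with byte sizes in the matching .filesizes properties (the MapReduce 2 names; older releases may use different keys). The function names are illustrative and not part of any Hadoop API.

```python
import xml.etree.ElementTree as ET

# Assumed property names (MapReduce 2): URI list -> matching size list.
CACHE_KEYS = {
    "mapreduce.job.cache.files": "mapreduce.job.cache.files.filesizes",
    "mapreduce.job.cache.archives": "mapreduce.job.cache.archives.filesizes",
}


def read_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def cache_entries(props):
    """Pair each cached URI with its recorded size in bytes (None if missing)."""
    entries = []
    for uri_key, size_key in CACHE_KEYS.items():
        uris = [u for u in props.get(uri_key, "").split(",") if u]
        sizes = [s for s in props.get(size_key, "").split(",") if s]
        for index, uri in enumerate(uris):
            size = int(sizes[index]) if index < len(sizes) else None
            entries.append((uri, size))
    return entries


def total_bytes(entries):
    """Sum the sizes that are known, ignoring entries without a recorded size."""
    return sum(size for _, size in entries if size is not None)
```

Calling cache_entries(read_properties(text)) on the saved XML returns (location, size) pairs, and total_bytes adds them up. This reports what the job asked to have cached, not how much disk each node is currently using for it.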