I am new to Hadoop and I am experimenting with Centralized Cache Management in HDFS.
One thing I would like to understand better is the cache block report that the DataNode (DN) sends to the NameNode (NN) at each heartbeat, and the full cache state report it also sends to the NN, whose frequency is controlled by dfs.cachereport.intervalMsec.
The cache reports are how the NameNode learns where cached blocks are located. Each report is basically a list of the block IDs currently cached by the DataNode.
Delaying these reports affects how quickly cached block locations show up in the information the NameNode serves to its clients after the cache state changes (add/remove/timers/etc.).
Since changes to the block cache are mostly done asynchronously, this should not affect any specific commands, but clients looking for cached locations of recently cached or uncached blocks can see delayed or missed benefits, depending on how far you delay the reports (the default is every 10 seconds).
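To make the effect of the interval concrete, here is a rough sketch of how long a cache change could stay invisible to the NameNode. It is a simplified model, not how HDFS computes anything: it assumes reports fire exactly on a fixed schedule starting at time 0 and ignores network and processing time. The function names are made up for illustration.

```python
def next_report_time_ms(change_time_ms, interval_ms=10_000):
    """Time of the first cache report sent at or after the change.

    Assumption: reports go out at 0, interval, 2*interval, ... exactly.
    """
    # Integer ceiling division, avoids floating point rounding.
    periods = -(-change_time_ms // interval_ms)
    return periods * interval_ms


def staleness_ms(change_time_ms, interval_ms=10_000):
    """How long the NameNode's view lags behind a cache change (model only)."""
    return next_report_time_ms(change_time_ms, interval_ms) - change_time_ms


if __name__ == "__main__":
    # A block gets cached 12.5 s after startup.
    print(staleness_ms(12_500))                      # default 10 s interval
    print(staleness_ms(12_500, interval_ms=60_000))  # interval raised to 60 s
```

In this model the worst-case lag is just under one full interval, so raising dfs.cachereport.intervalMsec trades fewer reports for a longer window in which clients are not told about newly cached replicas.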
The regular DataNode heartbeats only send cache capacity statistics, not the actual block ID information.
The cache report should typically be a small list - an encoded array of block ID integers - and shouldn't affect the NameNode in any significant way unless you have very large caches. Are you observing otherwise?
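For a sense of what "small" means, here is a back-of-the-envelope estimate. The numbers are assumptions, not measurements: roughly 8 bytes per block ID (the real wire encoding may differ), a 128 MB block size, and a cache filled with full-size blocks. The function name is made up for illustration.

```python
def estimate_cache_report(cache_bytes, block_bytes=128 * 1024**2, bytes_per_id=8):
    """Rough size of one full cache report from a single DataNode.

    Assumptions: every cached block is a full block, and each block ID
    costs about bytes_per_id bytes on the wire.
    Returns (number_of_cached_blocks, approximate_payload_bytes).
    """
    blocks = cache_bytes // block_bytes
    return blocks, blocks * bytes_per_id


if __name__ == "__main__":
    # A DataNode with 64 GiB of cache filled with 128 MiB blocks.
    blocks, payload = estimate_cache_report(64 * 1024**3)
    print(f"{blocks} cached blocks, about {payload} bytes per report")
```

Under these assumptions a 64 GiB cache comes to 512 block IDs, or about 4 KiB per report. Caches filled with many small files hold far more blocks per gigabyte, which is where the list could grow.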