Community Articles

nkeshava · ‎03-05-2018

This article helps you estimate the total data change on HDFS over a given time period, and how much of that data will be replicated by Data Lifecycle Manager (DLM).

Step 1)

On each of the data nodes, capture the JMX response by retrieving the JSON document that the node serves over HTTP. For example:

curl "http://<datanodeip/fqdn>:<port>/jmx"

(You can find the port in Ambari under the property dfs.datanode.http.address.)
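The same request can be scripted so that it is easy to repeat for every data node. The following is a minimal Python sketch, assuming the data node's HTTP port is reachable from the machine running the script. The helper names `jmx_url` and `fetch_jmx`, and the host and port in the example call, are illustrative and not part of DLM or Ambari.

```python
import json
import urllib.request


def jmx_url(host: str, port: int) -> str:
    """Build the JMX endpoint URL for one data node."""
    return f"http://{host}:{port}/jmx"


def fetch_jmx(host: str, port: int, timeout: float = 10.0) -> dict:
    """Download and parse the JMX JSON document from one data node."""
    with urllib.request.urlopen(jmx_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call with a hypothetical host. The port is a placeholder:
# use the value of dfs.datanode.http.address from your own cluster.
# metrics = fetch_jmx("datanode1.example.com", 50075)
```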

Step 2)

In the JSON response, look for the value of the field "BytesWritten". It sits in the subsection identified by "name" : "Hadoop:service=DataNode,name=DataNodeActivity-<name of your datanode>-50010". There should be only one such entry.
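The lookup can be sketched in Python as follows. This assumes the JMX response is a JSON object with a top-level "beans" array in which each entry carries its own "name" field. The function name `bytes_written` and the sample response are made up for illustration.

```python
import json

# Prefix of the bean name; the data node name and port that follow it vary.
BEAN_PREFIX = "Hadoop:service=DataNode,name=DataNodeActivity"


def bytes_written(jmx: dict) -> int:
    """Return BytesWritten from the single DataNodeActivity bean."""
    matches = [
        bean for bean in jmx.get("beans", [])
        if bean.get("name", "").startswith(BEAN_PREFIX)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one DataNodeActivity bean, found {len(matches)}"
        )
    return int(matches[0]["BytesWritten"])


# Made-up sample response, trimmed to the relevant fields.
sample = json.loads("""
{"beans": [
  {"name": "Hadoop:service=DataNode,name=JvmMetrics", "MemHeapUsedM": 120.5},
  {"name": "Hadoop:service=DataNode,name=DataNodeActivity-dn1.example.com-50010",
   "BytesWritten": 10485760}
]}
""")
print(bytes_written(sample))  # 10485760
```

Matching on the name prefix avoids hard-coding the data node name and port that are appended to the bean name.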

Step 3)

The above value reflects the actual data change (additions, deletions, modifications). If you add an HDFS file and then drop it, the value does not go down, so it still reflects the actual data change rate.

Step 4)

Summing up these values for all the data nodes will give the overall data change in the entire system.

Step 5)

Finally, to get the real data size that will be transferred by Data Lifecycle Manager (DLM), divide this value by the replication factor of the source cluster, that is, the cluster from which the replication is initiated.

For example, if the "BytesWritten" value for each data node is 10 MB and there are 10 data nodes, the total is 100 MB. Assuming a replication factor of 3 at the source, 100 MB / 3 = 33.3 MB will be transferred by DLM.
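Steps 4 and 5 amount to a sum followed by a division. The sketch below reproduces the example figures in Python; the function name `dlm_transfer_bytes` is illustrative, and the per-node values would normally come from the JMX responses rather than being hard-coded.

```python
def dlm_transfer_bytes(bytes_written_per_node, replication_factor):
    """Estimate the bytes DLM will transfer.

    Sums BytesWritten over all data nodes, then divides by the
    replication factor of the source cluster.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return sum(bytes_written_per_node) / replication_factor


MB = 1024 * 1024

# Ten data nodes, each reporting 10 MB in BytesWritten.
per_node = [10 * MB] * 10
estimate = dlm_transfer_bytes(per_node, replication_factor=3)
print(f"{estimate / MB:.1f} MB")  # 33.3 MB
```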

Note - If you plan to run data replication jobs in Data Lifecycle Manager on a daily basis, you can capture these JMX metrics each day and take the difference between consecutive days to get the data change for a single day. A simple script in your preferred language can do this. Also remember that if a data node is restarted at some point, its JMX value for "BytesWritten" resets to 0.
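One possible shape for such a script is sketched below in Python. It compares two daily snapshots, each a mapping from data node name to "BytesWritten". The handling of restarts is an assumption of this sketch and not something DLM prescribes: when a counter is lower than the day before, the node is assumed to have restarted and the current value is taken as that day's change, which can undercount. A node missing from the earlier snapshot is assumed to start from 0. The function names and node names are hypothetical.

```python
def daily_change(previous: int, current: int) -> int:
    """Bytes written between two snapshots of one node's BytesWritten."""
    if current < previous:
        # The counter went backwards, so the data node was restarted.
        # Only writes since the restart are visible: a lower bound.
        return current
    return current - previous


def cluster_daily_change(yesterday: dict, today: dict) -> int:
    """Sum the per-node changes for every node in today's snapshot."""
    total = 0
    for node, current in today.items():
        # Assumption: a node absent from yesterday's snapshot starts at 0.
        total += daily_change(yesterday.get(node, 0), current)
    return total


yesterday = {"dn1": 100, "dn2": 500}
today = {"dn1": 250, "dn2": 40}  # dn2 restarted during the day
print(cluster_daily_change(yesterday, today))  # 190
```

A restart goes unnoticed if the node writes more after the restart than its previous counter value, so capturing snapshots more often than once a day narrows that gap.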

Cloudera Community

Estimating the data change rates for replication when using DataLifeCycleManager(DLM)

Apache Ambari

Apache Hadoop

Data Lifecycle Manager

HDFS