Created on 08-14-2018 04:42 PM - edited 08-17-2019 06:46 AM
The purpose of this document is to show how to leverage R to
predict HDFS growth, assuming we have access to the latest fsimage of a given
cluster. This lets us forecast how much capacity needs to be added to
the cluster ahead of time. For on-prem clusters, ordering hardware can be a
lengthy process, and for some organizations it can take months before they can
actually add capacity. Giving the teams that manage Hadoop clusters
insight into the HDFS growth rate can help them place hardware orders ahead of time.
As prerequisites for this script, you will need access to a
recent fsimage file, R, and a machine with the Hadoop binaries installed so you can
run the Offline Image Viewer (OIV).
Example: hdfs oiv -i fsimage_xxxxx -p Delimited > fsimage.out
Loading required libraries
library(dplyr)
library(anytime)
library(lubridate)
library(prophet)
library(rmarkdown) # only needed if you are using RStudio and plan to export the script to PDF or Word format
Data Filtering & manipulation
The only three columns we need are FileSize,
Replication, and ModificationTime. Filtering on FileSize > 0 also removes all
directories (which are recorded with size 0 in the image) as well as zero-byte files from our calculation.
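As a sketch, the loading and filtering step could look like the following. The column names (Path, Replication, ModificationTime, FileSize, etc.) are those emitted by the OIV Delimited processor, and the file name fsimage.out matches the example command above:

```r
library(dplyr)

# Read the tab-delimited OIV output; the Delimited processor emits a header row
fsimage <- read.delim("fsimage.out", sep = "\t", stringsAsFactors = FALSE)

# Keep only the three columns we need, and drop directories and
# zero-byte files by requiring FileSize > 0
files <- fsimage %>%
  select(FileSize, Replication, ModificationTime) %>%
  filter(FileSize > 0)
```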
Here I am calculating the actual file size on disk (file
size * replication factor) and converting the size to GB (not strictly required),
then filtering on the day I want to start the calculation from.
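The steps above, plus the prophet fit, could be sketched as follows. The start date ("2018-01-01"), the 90-day horizon, and the cumulative-sum step are illustrative assumptions; the ds/y column names are what prophet requires in its input data frame:

```r
library(dplyr)
library(anytime)
library(prophet)

# Actual on-disk size = logical file size * replication factor, converted to GB
files <- files %>%
  mutate(SizeGB = FileSize * Replication / 1024^3,
         Day    = as.Date(anytime(ModificationTime)))  # parse the timestamp

# Keep files modified on or after the chosen start date (assumed here)
files <- files %>% filter(Day >= as.Date("2018-01-01"))

# Aggregate GB added per day, then take a running total so we model
# cumulative used capacity; prophet expects columns named ds and y
daily <- files %>%
  group_by(ds = Day) %>%
  summarise(y = sum(SizeGB)) %>%
  arrange(ds) %>%
  mutate(y = cumsum(y))

# Fit the model and forecast 90 days ahead
m        <- prophet(daily)
future   <- make_future_dataframe(m, periods = 90)
forecast <- predict(m, future)
plot(m, forecast)
```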