Created on 11-17-201609:27 PM - edited 08-17-201908:02 AM
SmartSense 1.3 includes Activity Explorer, which hosts prebuilt notebooks that visualize cluster utilization data related to user, queue, job duration, and job resource consumption, including an HDFS Dashboard notebook. This dashboard helps operators better understand how HDFS is being used and which users and jobs are consuming the most resources within the file system.
It's important to note that the source data for ACTIVITY.HDFS_USER_FILE_SUMMARY comes from fsimage, which does not contain file- and directory-level information. Many operators are also interested in more fine-grained analytics regarding cluster data use, which can drive decisions such as storage tiering using HDFS heterogeneous storage.
Since these data are not available in fsimage, we will use the Ranger audit data for HDFS which the Ranger plugin writes during authorization events. The best practice is for the plugin to write these data to both Solr (for short-term use, driving performance in the UI) as well as HDFS for long-term storage.
Please note the principal used in the GetHdfs processor will need read access for the HDFS directory storing the Ranger audit data.
The audit data, after some formatting for readability, looks like:
We will create a NiFi dataflow (ranger-audit-analytics.xml) to shred this JSON data into a Hive table, please see the below and the attached template.
We first use GetHDFS to pull the audit data file, and then split the flowfile by line as each line contains a JSON fragment. EvaluateJsonPath is used to pull particular attributes that are valuable for analytics:
We use ReplaceText to create the DDL statements to populate our Hive table:
And finally, we use PutHiveQL to execute these INSERT statements. Once we've loaded these data into Hive, we're ready to use Zeppelin to explore and visualize the data.
For instance, let's take a look at most frequently accessed directories:
As another example, we can see the last time a particular resource was accessed:
These visualizations can be combined with the HDFS Dashboard ones for a more robust picture of HDFS-related activity on a multi-tenant cluster.