SmartSense 1.3 includes Activity Explorer, which hosts prebuilt notebooks that visualize cluster utilization data related to users, queues, job duration, and job resource consumption. Among these is an HDFS Dashboard notebook, which helps operators better understand how HDFS is being used and which users and jobs are consuming the most resources within the file system.

It's important to note that the source data for ACTIVITY.HDFS_USER_FILE_SUMMARY comes from fsimage, which does not capture how individual files and directories are actually being accessed. Many operators are also interested in more fine-grained analytics about cluster data use, which can drive decisions such as storage tiering with HDFS heterogeneous storage.

Since these data are not available in fsimage, we will use the Ranger audit data for HDFS, which the Ranger plugin writes during authorization events. The best practice is for the plugin to write these data both to Solr (for short-term use, driving performance in the Ranger UI) and to HDFS for long-term storage.

Please note that the principal used in the GetHDFS processor will need read access to the HDFS directory storing the Ranger audit data.

The audit data, after some formatting for readability, looks like:

[Screenshot: Ranger HDFS audit data (JSON), formatted for readability]

We will create a NiFi dataflow (ranger-audit-analytics.xml) to shred this JSON data into a Hive table; see the overview below and the attached template.

[Screenshot: NiFi dataflow for shredding Ranger audit data into Hive]

We first use GetHDFS to pull the audit data file and then split the flowfile by line, since each line contains a JSON fragment. EvaluateJsonPath is used to pull out the particular attributes that are valuable for analytics:

[Screenshot: EvaluateJsonPath processor configuration]

We use ReplaceText to build the HiveQL INSERT statements that will populate our Hive table:

[Screenshot: ReplaceText processor configuration]
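
The generated statement for each flowfile looks roughly like the following (a sketch only: the exact text depends on the ReplaceText expression and the attribute names chosen in EvaluateJsonPath, and the values shown here are purely illustrative):

-- purely illustrative values; in the real flow these come from the flowfile attributes
INSERT INTO TABLE audit
VALUES ('hive', '2016-11-17 10:53:18', 'READ', '/apps/hive/warehouse/sample', 'read', '10.0.0.1');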

And finally, we use PutHiveQL to execute these INSERT statements. Once we've loaded these data into Hive, we're ready to use Zeppelin to explore and visualize the data.

For instance, let's take a look at the most frequently accessed directories:

[Screenshot: Zeppelin chart of the most frequently accessed directories]
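
The Zeppelin paragraph behind a chart like this can be a simple aggregation over the audit table (a sketch, assuming the table and column names from the schema at the end of this article):

-- top 10 most frequently accessed resources
select resource, count(*) as accesses
from audit
group by resource
order by accesses desc
limit 10;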

As another example, we can see the last time a particular resource was accessed:

[Screenshot: Zeppelin output showing the last access time for a resource]
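
Again, a small HiveQL query is enough (a sketch; the resource path shown is illustrative):

-- last time a given resource was accessed
select max(evtTime) as last_access
from audit
where resource = '/apps/hive/warehouse/sample_07';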

These visualizations can be combined with the HDFS Dashboard ones for a more robust picture of HDFS-related activity on a multi-tenant cluster.
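
For example, a per-user rollup of audit events (again just a sketch against the audit table defined below) pairs naturally with the HDFS Dashboard's per-user views:

-- audit events per user, most active first
select reqUser, count(*) as events
from audit
group by reqUser
order by events desc;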

Hive Table Schema:

create external table audit (
  reqUser  string,
  evtTime  timestamp,
  access   string,
  resource string,
  action   string,
  cliIP    string
)
STORED AS ORC
LOCATION '/user/nifi/audit';
Comments

Very nice article. If you have a step-by-step procedure with prerequisites, could you please forward it to me (muthukumar.siva@gmail.com)? I would like to implement this in my environment. Thank you in advance.