Community Articles

dvillarreal · ‎05-17-2019

Customers have asked me how to review Ranger audit archive logs stored on HDFS, since the Ranger UI only shows the last 90 days of data, which it reads from Solr infra. I decided to approach the problem with Zeppelin and Spark as a fun example.

1. Prerequisites: Zeppelin and Spark2 are installed on your system, and Ranger is storing its audit logs in HDFS. Create an HDFS policy in Ranger that gives your zeppelin user recursive read and execute access to the /ranger/audit directory.

2. Create a notebook in Zeppelin and add code like the following example:

%spark2.spark

// --Specify service and date if you wish
//val path = "/ranger/audit/hdfs/20190513/*.log"

// --Be brave and map the whole enchilada
val path = "/ranger/audit/*/*/*.log"

// --read in the json and drop any malformed json
val rauditDF = spark.read.option("mode", "DROPMALFORMED").json(path)

// --print the schema to review and show me top 20 lines.
rauditDF.printSchema()
rauditDF.show(20,false)

// --Do some spark sql on the data and look for denials
// when and col come from the Spark SQL functions package
import org.apache.spark.sql.functions.{col, when}

println("sparksql--------------------")
rauditDF.createOrReplaceTempView("audit")
// --result is 0 for denied events, so the WHERE clause keeps only denials
val readAccessDF = spark.sql("SELECT reqUser, repo, access, action, evtTime, policy, resource, reason, enforcer, result FROM audit WHERE result = 0").withColumn("new_result", when(col("result") === 1, "Allowed").otherwise("Denied"))
readAccessDF.show(20,false)

3. The output should look something like this:

path: String = /ranger/audit/*/*/*.log
rauditDF: org.apache.spark.sql.DataFrame = [access: string, action: string ... 23 more fields]
root
 |-- access: string (nullable = true)
 |-- action: string (nullable = true)
 |-- additional_info: string (nullable = true)
 |-- agentHost: string (nullable = true)
 |-- cliIP: string (nullable = true)
 |-- cliType: string (nullable = true)
 |-- cluster_name: string (nullable = true)
 |-- enforcer: string (nullable = true)
 |-- event_count: long (nullable = true)
 |-- event_dur_ms: long (nullable = true)
 |-- evtTime: string (nullable = true)
 |-- id: string (nullable = true)
 |-- logType: string (nullable = true)
 |-- policy: long (nullable = true)
 |-- reason: string (nullable = true)
 |-- repo: string (nullable = true)
 |-- repoType: long (nullable = true)
 |-- reqData: string (nullable = true)
 |-- reqUser: string (nullable = true)
 |-- resType: string (nullable = true)
 |-- resource: string (nullable = true)
 |-- result: long (nullable = true)
 |-- seq_num: long (nullable = true)
 |-- sess: string (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: string (containsNull = true)

sparksql--------------------
readAccessDF: org.apache.spark.sql.DataFrame = [reqUser: string, repo: string ... 9 more fields]
+--------+------------+------------+-------+-----------------------+------+-------------------------------------------------------------------------------------+----------------------------------+----------+------+----------+
|reqUser |repo        |access      |action |evtTime                |policy|resource                                                                             |reason                            |enforcer  |result|new_result|
+--------+------------+------------+-------+-----------------------+------+-------------------------------------------------------------------------------------+----------------------------------+----------+------+----------+
|dav     |c3205_hadoop|READ_EXECUTE|execute|2019-05-13 22:07:23.971|-1    |/ranger/audit/hdfs                                                                   |/ranger/audit/hdfs                |hadoop-acl|0     |Denied    |
|zeppelin|c3205_hadoop|READ_EXECUTE|execute|2019-05-13 22:10:47.288|-1    |/ranger/audit/hdfs                                                                   |/ranger/audit/hdfs                |hadoop-acl|0     |Denied    |
|dav     |c3205_hadoop|EXECUTE     |execute|2019-05-13 23:57:49.410|-1    |/ranger/audit/hiveServer2/20190513/hiveServer2_ranger_audit_c3205-node3.hwx.local.log|/ranger/audit/hiveServer2/20190513|hadoop-acl|0     |Denied    |
|zeppelin|c3205_hive  |USE         |_any   |2019-05-13 23:42:50.643|-1    |null                                                                                 |null                              |ranger-acl|0     |Denied    |
|zeppelin|c3205_hive  |USE         |_any   |2019-05-13 23:43:08.732|-1    |default                                                                              |null                              |ranger-acl|0     |Denied    |
|dav     |c3205_hive  |USE         |_any   |2019-05-13 23:48:37.603|-1    |null                                                                                 |null                              |ranger-acl|0     |Denied    |
+--------+------------+------------+-------+-----------------------+------+-------------------------------------------------------------------------------------+----------------------------------+----------+------+----------+
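If mapping the whole enchilada is too heavy for your cluster, the commented-out path in step 2 can be stretched to cover a few days of one service. Here is a minimal sketch of that idea. The helper name auditGlob is my own invention (it is not part of Ranger or Spark), and it assumes the day folders are named yyyyMMdd, as in the /ranger/audit/hdfs/20190513 example above. The {a,b,c} alternation is standard Hadoop glob syntax.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: builds one glob that covers a single service's
// audit folders for every day from start to end, inclusive.
def auditGlob(service: String, start: LocalDate, end: LocalDate): String = {
  require(!start.isAfter(end), "start must not be after end")
  val fmt = DateTimeFormatter.BASIC_ISO_DATE // yyyyMMdd, e.g. 20190513
  val days = Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(d => !d.isAfter(end))
    .map(_.format(fmt))
    .toList
  val dayGlob = days.mkString("{", ",", "}")
  s"/ranger/audit/$service/$dayGlob/*.log"
}

// Three days of HDFS audit logs
val rangePath = auditGlob("hdfs", LocalDate.of(2019, 5, 13), LocalDate.of(2019, 5, 15))
println(rangePath)
```

The resulting string can be handed to spark.read.json in place of path in step 2.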

4. You can run further SQL queries against the audit view if you so desire.
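As one example of a follow-up query, the sketch below counts denials per user and repo. It assumes the paragraph from step 2 has already run in the same interpreter, so the audit view is registered. The name denialsByUserDF is only illustrative.

```scala
// Zeppelin paragraph; run it under the same %spark2.spark interpreter as step 2.
// Assumes the `audit` temp view from step 2 already exists.
val denialsByUserDF = spark.sql("""
  SELECT reqUser, repo, count(*) AS denials
  FROM audit
  WHERE result = 0
  GROUP BY reqUser, repo
  ORDER BY denials DESC
""")

denialsByUserDF.show(20, false)
```

Because the view is temporary, it lives only as long as the interpreter session. After an interpreter restart, rerun step 2 first.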

5. You may need to tune the Spark interpreter settings in Zeppelin to meet your needs, such as SPARK_DRIVER_MEMORY, spark.executor.cores, spark.executor.instances, and spark.executor.memory. Tailing the Zeppelin interpreter log for Spark helped me see what was happening:

 tail -f zeppelin-interpreter-spark2-spark-zeppelin-cluster1.hwx.log
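To check which values the running interpreter actually picked up, a paragraph like the following sketch can help. It only reads settings and changes nothing. I am assuming spark.driver.memory is the property that corresponds to SPARK_DRIVER_MEMORY. The other keys are the ones named above.

```scala
// Zeppelin paragraph; `spark` is the SparkSession that Zeppelin provides.
val keys = Seq(
  "spark.driver.memory",      // assumed counterpart of SPARK_DRIVER_MEMORY
  "spark.executor.cores",
  "spark.executor.instances",
  "spark.executor.memory"
)

// getOption returns None for a key that was never set explicitly,
// in which case Spark falls back to its built-in default.
keys.foreach { key =>
  val value = spark.conf.getOption(key).getOrElse("(not set)")
  println(s"$key = $value")
}
```

Most of these settings are fixed when the interpreter starts, so change them in the interpreter settings and restart the interpreter rather than setting them from a notebook paragraph.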

Using Zeppelin/Spark to query HDFS Ranger Audit logs