Created on 05-17-2019 06:50 PM - edited 08-17-2019 02:24 PM
Customers have asked me how to review Ranger audit archive logs stored on HDFS, since the Ranger UI only shows the last 90 days of data, which it reads from Solr. I decided to approach the problem with Zeppelin and Spark as a fun example.
1. Prerequisites: Zeppelin and Spark2 are installed on your system, and Ranger is configured to store its audit logs in HDFS. Create an HDFS policy in Ranger that grants your Zeppelin user recursive read and execute access to the /ranger/audit directory.
2. Create a notebook in Zeppelin and add code like the following example:
// --Specify service and date if you wish
//val path = "/ranger/audit/hdfs/20190513/*.log"
// --Be brave and map the whole enchilada
val path = "/ranger/audit/*/*/*.log"
// --read in the json and drop any malformed json
val rauditDF = spark.read.option("mode", "DROPMALFORMED").json(path)
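// --print the schema to review and show me top 20 lines.
rauditDF.printSchema()
rauditDF.show(20, false)
// --register a temp view so the sql below can query the data as "audit"
rauditDF.createOrReplaceTempView("audit")
// --Do some spark sql on the data and look for denials
// --(Zeppelin usually imports these already; the explicit import is harmless)
import org.apache.spark.sql.functions.{col, when}
val readAccessDF = spark.sql("SELECT reqUser, repo, access, action, evtTime, policy, resource, reason, enforcer, result FROM audit where result='0'").withColumn("new_result", when(col("result") === "1","Allowed").otherwise("Denied"))
readAccessDF.show(20, false)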
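Once the denials are visible, a natural follow-up is to see who is being denied most often. The sketch below is one way to do that with the DataFrame API. It assumes the audit DataFrame has the reqUser, repo, and result columns used in the query above and that result 0 means a denial; the function name denialsByUser is my own, not something Ranger or Spark provides.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, desc}

// Hypothetical helper: counts denied requests per user and repo.
// Assumes result == 0 marks a denial, as in the query above.
def denialsByUser(auditDF: DataFrame): DataFrame =
  auditDF
    .filter(col("result") === 0)       // keep denied requests only
    .groupBy("reqUser", "repo")        // one row per user and service repo
    .agg(count("*").as("denials"))     // number of denied requests
    .orderBy(desc("denials"))          // most-denied first
```

In the notebook, denialsByUser(rauditDF).show(20, false) would print the top 20 user and repo combinations.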
4. You can also run SQL queries directly against the audit view if you wish.
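As an illustration of querying the view with SQL, here is a minimal sketch that counts denials per day and repo. It registers the "audit" view itself so it does not depend on earlier paragraphs. It assumes evtTime holds a value Spark's to_date can parse (for example a "yyyy-MM-dd HH:mm:ss.SSS" string) and that result 0 means denied; check both against your own schema output. The name denialsPerDay is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: denied requests per calendar day and repo.
def denialsPerDay(spark: SparkSession, auditDF: DataFrame): DataFrame = {
  // Register (or re-register) the view so the query below can find it.
  auditDF.createOrReplaceTempView("audit")
  spark.sql(
    """SELECT to_date(evtTime) AS day, repo, COUNT(*) AS denials
      |FROM audit
      |WHERE result = 0
      |GROUP BY to_date(evtTime), repo
      |ORDER BY day, denials DESC""".stripMargin)
}
```

Calling denialsPerDay(spark, rauditDF).show(50, false) in a notebook paragraph gives a quick trend of denials over the archived period.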
5. You may need to tune the Spark interpreter in Zeppelin to fit your data volume, in particular SPARK_DRIVER_MEMORY, spark.executor.cores, spark.executor.instances, and spark.executor.memory. Tailing the Zeppelin log for the Spark interpreter helped me see what was happening.
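When tuning, it can help to confirm which values the running interpreter actually picked up. The sketch below reads them from the live SparkConf. The helper name effectiveSettings is my own, and I am assuming SPARK_DRIVER_MEMORY surfaces as spark.driver.memory; depending on how the interpreter was launched, a property may simply show as "(not set)".

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: reports the tuning properties as seen by the running
// Spark application, or "(not set)" when a property was not set explicitly.
def effectiveSettings(spark: SparkSession): Seq[(String, String)] = {
  val keys = Seq(
    "spark.driver.memory",
    "spark.executor.cores",
    "spark.executor.instances",
    "spark.executor.memory"
  )
  val conf = spark.sparkContext.getConf
  keys.map(k => k -> conf.getOption(k).getOrElse("(not set)"))
}
```

In a notebook paragraph, effectiveSettings(spark).foreach { case (k, v) => println(s"$k = $v") } prints one line per property, which makes it easy to check that an interpreter restart applied your changes.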