Ranger audit log behavior (HDFS, database and log4j)

Contributor

I have both HDFS audit and DB audit enabled in the HDFS NameNode (HA) Ranger plugin.

1. The HDFS NameNode process queues audit events in memory for the HDFS destination and flushes them to HDFS only once a day.

  • Is this because I don't have enough events to trigger a more frequent flush to HDFS?
  • The HDFS audit log files are named after the active NameNode host. After a NameNode HA failover, a new log file is created in HDFS. If the active NameNode is killed abnormally, will we lose the audit log for that NameNode? What if both NameNode instances are killed abnormally?

2. The HDFS NameNode process queues audit events in memory for the DB destination if the DB connection is down. From which source does it recover once the DB connection is back up? Is it the HDFS audit logs? If so, auditing to the DB would depend on auditing to HDFS.

3 REPLIES

Re: Ranger audit log behavior (HDFS, database and log4j)

Contributor
There is a slight clarification to your assumption. In the case of auditing to HDFS, we use the streaming API, so the audit events are written to HDFS in near real time. However, we close the file every 24 hours (the default, which is configurable), so on the HDFS side you won't be able to read the file until it is closed. If the process (NameNode) dies, the file is automatically closed, and we create a new file when the NameNode restarts.
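For reference, the 24-hour rollover is set in the plugin's audit configuration (ranger-hdfs-audit.xml for the NameNode). A rough sketch, shown as key = value with illustrative values only:

  xasecure.audit.destination.hdfs = true
  xasecure.audit.destination.hdfs.dir = hdfs://<namenode>:8020/ranger/audit
  # close (roll over) the current audit file every 86400 seconds (24 hours);
  # lower this to have files closed, and therefore readable, more often
  xasecure.audit.destination.hdfs.file.rollover.sec = 86400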
For your second question, the following should answer it.
  1. If the destination is down, it will write to a local file and resume when the destination is available again.
  2. If the destination is slower than the rate at which audits are generated, it will spool to a local file and throttle the writes, but it will eventually send the audits (the local spool size is configurable and depends on available disk space).
  3. If you are using components like HBase, Kafka, or Solr, which generate a very large number of audit records, it will summarize the audits at the source based on unique user + request and send the summarized audits.
  4. It uses separate queues and spool files for each destination. So if you have destinations that run at different speeds (e.g., Solr vs. HDFS), you will not lose audits, and the faster destinations will get the audit records sooner. See the configuration sketch after this list.
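If it helps, here is a rough sketch of the related properties from a typical ranger-hdfs-audit.xml on the NameNode (shown as key = value; the paths are only examples, and the exact set of properties depends on your Ranger version and which destinations are enabled):

  # per-destination local spool directories, used when a destination is down or lagging
  xasecure.audit.destination.hdfs.batch.filespool.dir = /var/log/hadoop/hdfs/audit/hdfs/spool
  xasecure.audit.destination.solr.batch.filespool.dir = /var/log/hadoop/hdfs/audit/solr/spool
  # summarize repeated events (same user + request) at the source before sending
  xasecure.audit.provider.summary.enabled = true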

Re: Ranger audit log behavior (HDFS, database and log4j)

Contributor

Regarding your answer 1: where is the local file located? In the spool directory? I can't find any in the DB spool directory.

Re: Ranger audit log behavior (HDFS, database and log4j)

Super Guru

@wayne2chicago Ranger spooling information can be found here under configuration related to file spooling.
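As a hedged pointer (the property name below assumes the legacy DB audit destination is enabled in your Ranger version, and the path is only an example): look up the filespool directory configured for the DB destination in ranger-hdfs-audit.xml on the NameNode host. Spool files are created there only while the DB destination is actually unreachable or falling behind, so an empty directory is normal when the DB is keeping up.

  # in ranger-hdfs-audit.xml on the NameNode host
  xasecure.audit.destination.db.batch.filespool.dir = /var/log/hadoop/hdfs/audit/db/spool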
