Ranger audit logs are stored in HDFS for long-term storage and can grow to a significant size after a while. It can be useful to turn those logs into Hive tables backed by ORC for compression and query speed. This allows you to connect any external tool to Hive via ODBC and view your audit logs as SQL tables.
Each Ranger plugin that you have enabled logs to a different subdirectory within /ranger/audit on HDFS:
[hdfs@ip-10-0-1-75 ~]$ hdfs dfs -ls /ranger/audit/
Found 2 items
drwx------ - hdfs hdfs 0 2019-10-10 14:19 /ranger/audit/hdfs
drwxr-xr-x - hive hive 0 2019-10-10 14:12 /ranger/audit/hiveServer2
Within each of these, there is a subdirectory of audit logs for each day:
[hdfs@ip-10-0-1-75 ~]$ hdfs dfs -ls /ranger/audit/hdfs
Found 16 items
drwx------ - hdfs hdfs 0 2019-09-25 09:55 /ranger/audit/hdfs/20190925
drwxr-xr-x - hdfs hdfs 0 2019-09-26 00:00 /ranger/audit/hdfs/20190926
drwxr-xr-x - hdfs hdfs 0 2019-09-27 00:00 /ranger/audit/hdfs/20190927
drwxr-xr-x - hdfs hdfs 0 2019-09-28 00:00 /ranger/audit/hdfs/20190928
drwxr-xr-x - hdfs hdfs 0 2019-09-29 00:00 /ranger/audit/hdfs/20190929
drwxr-xr-x - hdfs hdfs 0 2019-09-30 00:00 /ranger/audit/hdfs/20190930
drwxr-xr-x - hdfs hdfs 0 2019-10-01 00:00 /ranger/audit/hdfs/20191001
drwxr-xr-x - hdfs hdfs 0 2019-10-02 00:00 /ranger/audit/hdfs/20191002
drwxr-xr-x - hdfs hdfs 0 2019-10-03 14:43 /ranger/audit/hdfs/20191003
drwxr-xr-x - hdfs hdfs 0 2019-10-04 00:00 /ranger/audit/hdfs/20191004
drwxr-xr-x - hdfs hdfs 0 2019-10-05 00:00 /ranger/audit/hdfs/20191005
drwxr-xr-x - hdfs hdfs 0 2019-10-06 00:00 /ranger/audit/hdfs/20191006
drwxr-xr-x - hdfs hdfs 0 2019-10-07 00:00 /ranger/audit/hdfs/20191007
drwxr-xr-x - hdfs hdfs 0 2019-10-08 00:00 /ranger/audit/hdfs/20191008
drwxr-xr-x - hdfs hdfs 0 2019-10-09 00:00 /ranger/audit/hdfs/20191009
drwxr-xr-x - hdfs hdfs 0 2019-10-10 00:00 /ranger/audit/hdfs/20191010
Looking at one of the logs, we can see that each log entry is a single JSON object per line:
[hdfs@ip-10-0-1-75 ~]$ hdfs dfs -cat /ranger/audit/hdfs/20191002/hdfs_ranger_audit_ip-10-0-1-75.eu-west-3.compute.internal.log | tail -n 1 | python -m json.tool
{
"access": "READ_EXECUTE",
"action": "execute",
"agentHost": "ip-10-0-1-75.eu-west-3.compute.internal",
"cliIP": "10.0.1.75",
"cluster_name": "singlenode",
"enforcer": "hadoop-acl",
"event_count": 1,
"event_dur_ms": 0,
"evtTime": "2019-10-02 23:59:56.356",
"id": "bc54faf1-7afc-49d9-b32c-45e1f9ba8b47-333762",
"logType": "RangerAudit",
"policy": -1,
"reason": "/user/ams/hbase/oldWALs",
"repo": "singlenode_hadoop",
"repoType": 1,
"reqUser": "ams",
"resType": "path",
"resource": "/user/ams/hbase/oldWALs",
"result": 1,
"seq_num": 644489,
"tags": []
}
We can use a JSON SerDe for Hive to create external tables on top of these directories and query the logs as if they were regular Hive tables.
First, download the SerDe and place it somewhere in HDFS:
wget "<a href="<a href="https://github.com/cdamak/Twitter-Hive/raw/master/json-serde-1.3.8-jar-with-dependencies.jar" target="_blank">https://github.com/cdamak/Twitter-Hive/raw/master/json-serde-1.3.8-jar-with-dependencies.jar</a>" target="_blank"><a href="https://github.com/cdamak/Twitter-Hive/raw/master/json-serde-1.3.8-jar-with-dependencies.jar</a" target="_blank">https://github.com/cdamak/Twitter-Hive/raw/master/json-serde-1.3.8-jar-with-dependencies.jar</a</a>>"
And then upload it to HDFS:
hdfs dfs -put json-serde-1.3.8-jar-with-dependencies.jar /tmp/
Next, open Beeline and add the JAR file:
add jar hdfs:///tmp/json-serde-1.3.8-jar-with-dependencies.jar;
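Note that add jar only registers the JAR for the current session. If you want to confirm it is registered before creating the tables, Hive's LIST JARS command should show it (assuming your Hive version supports it):
list jars;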
You are now ready to create an external Hive table for each of the Ranger audit log directories.
Here is the table for the HDFS audit logs:
CREATE EXTERNAL TABLE ext_ranger_audit_hdfs (
access string,
action string,
agentHost string,
cliIP string,
cluster_name string,
enforcer string,
event_count string,
event_dur_ms string,
evtTime string,
id string,
logType string,
policy string,
reason string,
repo string,
repoType string,
reqUser string,
resType string,
resource string,
result string,
seq_num string,
tags string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION '/ranger/audit/hdfs/';
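As a quick sanity check (a minimal example query using the column names defined above), you can pull a few rows from the new external table:
SELECT evtTime, reqUser, access, resource, result
FROM ext_ranger_audit_hdfs
LIMIT 5;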
And one for Hive audit logs:
CREATE EXTERNAL TABLE ext_ranger_audit_hive (
access string,
action string,
additional_info string,
agentHost string,
cliIP string,
cliType string,
cluster_name string,
enforcer string,
event_count string,
event_dur_ms string,
evtTime string,
id string,
logType string,
policy string,
repo string,
repoType string,
reqData string,
reqUser string,
resType string,
resource string,
result string,
seq_num string,
sess string,
tags string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION '/ranger/audit/hiveServer2';
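Similarly, a quick check against the Hive audit table; the reqData column typically holds the SQL statement that was audited, so this is a handy way to eyeball recent activity:
SELECT evtTime, reqUser, action, resource, reqData
FROM ext_ranger_audit_hive
LIMIT 5;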
You can now run your favourite SQL queries on top of your tables, for example:
0: jdbc:hive2://ip-10-0-1-75.eu-west-3.comput> select count(*) from ext_ranger_audit_hdfs;
0: jdbc:hive2://ip-10-0-1-75.eu-west-3.comput> select count(*) from ext_ranger_audit_hive;
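Beyond simple counts, you can run any aggregation you like. Here is a sketch that breaks events down by user and result (in Ranger audit logs, a result of 1 generally means the access was allowed and 0 means it was denied):
SELECT reqUser, result, count(*) AS events
FROM ext_ranger_audit_hdfs
GROUP BY reqUser, result
ORDER BY events DESC;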
The next step is to create internal (managed) Hive tables backed by ORC, which will give you fast query speeds through compression and partitioning.
First, the internal HDFS audit table:
CREATE TABLE ranger_audit_hdfs (
access string,
action string,
agentHost string,
cliIP string,
cluster_name string,
enforcer string,
event_count string,
event_dur_ms string,
evtTime string,
id string,
logType string,
policy string,
reason string,
repo string,
repoType string,
reqUser string,
resType string,
resource string,
result string,
seq_num string,
tags string
) partitioned by (date_year int, date_month int, date_day int)
stored as orc
TBLPROPERTIES ( 'transactional'='true' );
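If you want to confirm the table was created as a transactional ORC table, DESCRIBE FORMATTED shows the storage format, partition columns, and table properties:
DESCRIBE FORMATTED ranger_audit_hdfs;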
And the internal Hive audit log table:
CREATE TABLE ranger_audit_hive (
access string,
action string,
additional_info string,
agentHost string,
cliIP string,
cliType string,
cluster_name string,
enforcer string,
event_count string,
event_dur_ms string,
evtTime string,
id string,
logType string,
policy string,
repo string,
repoType string,
reqData string,
reqUser string,
resType string,
resource string,
result string,
seq_num string,
sess string,
tags string
) partitioned by (date_year int, date_month int, date_day int)
stored as orc
TBLPROPERTIES ( 'transactional'='true' );
At this point, both internal tables are empty, so we can populate them from our externally mapped JSON tables. Depending on your Hive version, you may also need to set hive.exec.dynamic.partition.mode=nonstrict for the fully dynamic partition inserts below.
Load the internal HDFS audit table:
set hive.exec.dynamic.partition=true;
INSERT INTO ranger_audit_hdfs PARTITION (date_year, date_month, date_day)
select *, cast(substr(a.evttime, 0, 4) as int),
cast(substr(a.evttime, 6, 2) as int),
cast(substr(a.evttime, 9, 2) as int)
from ext_ranger_audit_hdfs a where a.seq_num not in (select b.seq_num from ranger_audit_hdfs b);
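To verify the load and see which dynamic partitions were created, you can check the partition list and row count, for example:
SHOW PARTITIONS ranger_audit_hdfs;
SELECT count(*) FROM ranger_audit_hdfs;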
Load the internal Hive audit table:
set hive.exec.dynamic.partition=true;
INSERT INTO ranger_audit_hive PARTITION (date_year, date_month, date_day)
select *, cast(substr(a.evttime, 0, 4) as int),
cast(substr(a.evttime, 6, 2) as int),
cast(substr(a.evttime, 9, 2) as int)
from ext_ranger_audit_hive a where a.seq_num not in (select b.seq_num from ranger_audit_hive b);
At the end of each day, you can re-run the two statements above to load the new day's data from the external tables into the internal tables for fast querying. Once the data is loaded, you can go back to /ranger/audit/<service> on HDFS and remove the old data.
Your tables are now stored in ORC, with compression and partitioned by YYYY/MM/DD:
[hdfs@ip-10-0-1-75 ~]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10
Found 10 items
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=1
drwxrwx---+ - hive hadoop 0 2019-10-10 14:16 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=10
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=2
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=3
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=4
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=5
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=6
drwxrwx---+ - hive hadoop 0 2019-10-08 10:59 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=7
drwxrwx---+ - hive hadoop 0 2019-10-10 14:16 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=8
drwxrwx---+ - hive hadoop 0 2019-10-10 14:16 /warehouse/tablespace/managed/hive/ranger_audit_hdfs/date_year=2019/date_month=10/date_day=9
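Because the data is partitioned by day, queries that filter on the partition columns only read the files for the matching partitions. For example, a query restricted to a single day (using the partition values shown above):
SELECT reqUser, count(*) AS events
FROM ranger_audit_hdfs
WHERE date_year = 2019 AND date_month = 10 AND date_day = 9
GROUP BY reqUser;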
We brought the space consumed down from 493.5 MB to 7.2 MB (roughly 1.5% of the original size):
[hdfs@ip-10-0-1-75 ~]$ hdfs dfs -du -h -s /ranger/audit/hdfs/;
365.5 M 493.5 M /ranger/audit/hdfs
[hdfs@ip-10-0-1-75 ~]$ hdfs dfs -du -h -s /warehouse/tablespace/managed/hive/ranger_audit_hdfs
7.2 M 7.2 M /warehouse/tablespace/managed/hive/ranger_audit_hdfs