Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Audit trail for HDFS data use

Hi,

I'm using CM to manage my cluster and keep 13 months of retention on the HDFS data. Several R&D engineers and data analysts run Hive and Impala on this data. I'm interested in finding out which data is actually being requested or queried. My assumption is that if no one queries HDFS data older than a specific point in time, I can reduce my retention.

Expert Contributor
Posts: 176
Registered: ‎05-16-2016

Re: Audit trail for HDFS data use

You can set up an audit trail using Apache log4j properties.

Each Impala daemon has its own audit log.

You can configure audit event tracking through CM, or manually, to monitor Hive and Impala user activity.

Please refer to the Cloudera documentation below:

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cn_iu_service_audit.html#cn_topic_6__...

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Audit trail for HDFS data use

Is that for Enterprise versions only? I'm using the Express version.

Expert Contributor
Posts: 176
Registered: ‎05-16-2016

Re: Audit trail for HDFS data use

You can configure audit trails in both versions.

Refer to this link:

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_ig_feature_differences.html

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Audit trail for HDFS data use

How can I do this in log4j? Do I need to define rules? Could I have an example of how to do this?

I only need to know which HDFS data partitions the queries run on.

Expert Contributor
Posts: 176
Registered: ‎05-16-2016

Re: Audit trail for HDFS data use

For the Hive audit trail, in /conf/hive-log4j.properties:

log4j.appender.HAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.HAUDIT.File=${hive.log.dir}/hive_audit.log
log4j.appender.HAUDIT.DatePattern=.yyyy-MM-dd
log4j.appender.HAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.HAUDIT.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} (%F:%M(%L)) - %m%n
log4j.logger.org.apache.hadoop.hive.metastore.HiveMetaStore.audit=INFO,HAUDIT

For HDFS:

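# Send audit events to the RFAAUDIT appender defined below. With NullAppender they are
# discarded unless the value is overridden at startup (e.g. -Dhdfs.audit.logger=INFO,RFAAUDIT).
hdfs.audit.logger=INFO,RFAAUDIT
# Example value only; choose a maximum file size that suits your disk budget.
hdfs.audit.log.maxfilesize=256MB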
hdfs.audit.log.maxbackupindex=20
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false 
#log4j.logger.org.apache.hadoop.security=DEBUG,RFAAUDIT 
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.MaxFileSize=${hdfs.audit.log.maxfilesize}
log4j.appender.RFAAUDIT.MaxBackupIndex=${hdfs.audit.log.maxbackupindex}
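
Once hdfs-audit.log is being written, a small script can summarise which partition directories are actually read. The following is only a minimal sketch, not a tested tool: it assumes the NameNode audit lines carry tab-separated `cmd=` and `src=` fields, that reads show up as `cmd=open`, and that partitions follow a hypothetical `.../year=YYYY/month=MM/day=DD/` directory layout. The function name `count_partition_reads` and the sample paths are made up for illustration.

```python
import re
from collections import Counter

# Key=value fields as they appear in NameNode audit lines (assumed tab-separated).
CMD_RE = re.compile(r"\bcmd=(\S+)")
SRC_RE = re.compile(r"\bsrc=(\S+)")
# Hypothetical partition layout: .../year=YYYY/month=MM/day=DD/<file>
PARTITION_RE = re.compile(r"^(.*/year=\d{4}/month=\d{1,2}/day=\d{1,2})(?:/|$)")

# Assumption: only "open" (a file opened for reading) counts as a data read.
READ_CMDS = {"open"}


def count_partition_reads(lines):
    """Return a Counter mapping partition directory -> number of read events."""
    counts = Counter()
    for line in lines:
        cmd = CMD_RE.search(line)
        src = SRC_RE.search(line)
        if not cmd or not src or cmd.group(1) not in READ_CMDS:
            continue
        match = PARTITION_RE.match(src.group(1))
        if match:  # ignore paths outside the partitioned layout
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative lines only; replace with open("hdfs-audit.log") on a real cluster.
    sample = [
        "2017-03-01 10:15:02,123 INFO FSNamesystem.audit: allowed=true\tugi=impala\t"
        "ip=/10.0.0.5\tcmd=open\tsrc=/data/click/year=2017/month=02/day=27/part-0001\t"
        "dst=null\tperm=null",
    ]
    for partition, reads in count_partition_reads(sample).most_common():
        print(reads, partition)
```

Partitions that never appear in the output over a sufficiently long window are candidates for a shorter retention, provided the rolled audit files cover that whole window.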

Hope this helps.

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Audit trail for HDFS data use

Thanks, this was helpful.

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Audit trail for HDFS data use

Can this be defined at the Cloudera Manager level?

Also, I see that editing the maximum size and frequency is limited to the Enterprise license.

Expert Contributor
Posts: 176
Registered: ‎05-16-2016

Re: Audit trail for HDFS data use

Yes, you can certainly define log4j properties using Cloudera Manager.

The link I provided above indicates that audit trails can be enabled in both versions. I am not sure which specific audit trail properties are available in each version, sorry.

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Audit trail for HDFS data use

Thanks,

 

I read the above link, but it seems the edit/change options are only available in Enterprise versions. This is what I see in Cloudera Manager:

Maximum Audit Log File Size and Number of Audit Logs to Retain require an appropriate Cloudera Enterprise license.

So its use is still limited. Is there maybe another way to solve my issue?

 

My HDFS data is partitioned by event type, and each event type is partitioned by year, month, and day. Analytics and other teams run Hive and Impala on specific event types and partitions. My goal is to know which partitions and event types are queried, so that I can define the optimal retention for each event type.
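
One possible way to turn access records into a per-event-type retention estimate is sketched below. It is an illustration under stated assumptions, not a confirmed method: it assumes you already have (access date, path) pairs from some audit source, and that paths follow a hypothetical `/<base>/<event_type>/year=YYYY/month=MM/day=DD` layout. The function name `oldest_age_read` is invented for this example.

```python
import re
from datetime import date

# Hypothetical layout: /<base>/<event_type>/year=YYYY/month=MM/day=DD[/file]
PARTITION_RE = re.compile(
    r"/(?P<event>[^/]+)/year=(?P<y>\d{4})/month=(?P<m>\d{1,2})/day=(?P<d>\d{1,2})"
)


def oldest_age_read(accesses):
    """Map each event type to the age, in days, of the oldest partition read.

    accesses: iterable of (access_date, path) pairs, where access_date is a
    datetime.date and path is the HDFS path that was read.
    """
    oldest = {}
    for accessed_on, path in accesses:
        match = PARTITION_RE.search(path)
        if not match:
            continue  # path is not inside a year/month/day partition
        partition_day = date(
            int(match.group("y")), int(match.group("m")), int(match.group("d"))
        )
        age_days = (accessed_on - partition_day).days
        event = match.group("event")
        oldest[event] = max(oldest.get(event, 0), age_days)
    return oldest


if __name__ == "__main__":
    # Made-up access records for illustration.
    records = [
        (date(2017, 3, 1), "/data/click/year=2017/month=02/day=27/part-0001"),
        (date(2017, 3, 1), "/data/click/year=2016/month=12/day=01/part-0001"),
        (date(2017, 3, 1), "/data/purchase/year=2017/month=03/day=01/part-0000"),
    ]
    for event, age in sorted(oldest_age_read(records).items()):
        print(event, age)
```

The largest age seen per event type is a lower bound on the retention that event type needs. It is only meaningful if the access records span a long enough period, ideally at least the current 13 months.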
