
Audit trail for HDFS data use


Super Collaborator

Hi,

 

I'm using Cloudera Manager to manage my cluster and keep 13 months of retention for the HDFS data. Several R&D and data analyst teams run Hive and Impala queries on this data. I would like to know which of the data is actually requested or queried; my assumption is that if no one has queried specific HDFS data for some time, I can reduce its retention.


Re: Audit trail for HDFS data use

Champion

You can set up the audit trail using Apache log4j properties.

Each Impala daemon has its own audit log.

You can set up the audit event tracker using CM, or manually, to monitor the Hive/Impala user activity.

Please refer to the Cloudera knowledge base below:

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cn_iu_service_audit.html#cn_topic_6__...
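If you enable the Impala audit event log (the audit_event_log_dir setting on each daemon, or the matching field in CM), each line in those files is a JSON record for one query. Here is a rough Python sketch for skimming which tables they touch; the directory and the field names ("catalog_objects", "name") are assumptions on my side, so check what your Impala version actually writes:

# rough sketch: tally which tables show up in the Impala audit event logs
# assumptions: the audit log directory and the "catalog_objects"/"name" fields; verify against your version
import glob
import json
from collections import Counter

tables = Counter()
for path in glob.glob("/var/log/impalad/audit/*"):      # assumption: audit_event_log_dir on each daemon
    with open(path) as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line:
                continue
            try:
                record = json.loads(line)
            except ValueError:
                continue
            if not isinstance(record, dict):
                continue
            # each line holds one query record, keyed by a sequence number
            for event in record.values():
                if not isinstance(event, dict):
                    continue
                for obj in event.get("catalog_objects", []):
                    if isinstance(obj, dict):
                        tables[obj.get("name", "?")] += 1

for name, n in tables.most_common():
    print(n, name)

That should at least show which tables the Impala users actually hit; for partition-level detail you would still want the HDFS audit log.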

Re: Audit trail for HDFS data use

Super Collaborator

Is it for the Enterprise version only? I'm using the Express version.

Re: Audit trail for HDFS data use

Champion

You can configure audit trails in both versions.

Refer to the link below:

 

https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_ig_feature_differences.html

Re: Audit trail for HDFS data use

Super Collaborator

How can I do this in log4j? Do I need to define rules? Can I have an example of how to do this?

 

I only need to know which HDFS data partitions the queries run on.

Re: Audit trail for HDFS data use

Champion

For the Hive audit trail, add the following to /conf/hive-log4j.properties:

 

# send Hive metastore audit events (INFO level) to a daily rolling hive_audit.log
log4j.appender.HAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.HAUDIT.File=${hive.log.dir}/hive_audit.log
log4j.appender.HAUDIT.DatePattern=.yyyy-MM-dd
log4j.appender.HAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.HAUDIT.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} (%F:%M(%L)) - %m%n
log4j.logger.org.apache.hadoop.hive.metastore.HiveMetaStore.audit=INFO,HAUDIT
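With that in place, the metastore audit log records which databases and tables each user hits (lines contain something like cmd=get_table : db=sales tbl=orders, exact wording varies by version). A rough Python sketch to tally them; the log path here is an assumption, point it at wherever ${hive.log.dir} resolves on your node:

# rough sketch: count metastore audit entries per database/table/call
# assumption: audit lines contain "cmd=<call> ... db=<db> tbl=<table>" (wording varies by Hive version)
import re
from collections import Counter

pattern = re.compile(r"cmd=(\w+).*?db=(\S+)\s+tbl=(\S+)")
hits = Counter()
with open("/var/log/hive/hive_audit.log") as f:          # assumption: adjust to your hive.log.dir
    for line in f:
        m = pattern.search(line)
        if m:
            cmd, db, tbl = m.groups()
            hits[(db, tbl, cmd)] += 1

for (db, tbl, cmd), n in hits.most_common(50):
    print(n, db + "." + tbl, cmd)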

For HDFS, add the following to the Hadoop log4j.properties (the audit events are written by the NameNode):

 

# route NameNode audit events to the RFAAUDIT appender defined below
hdfs.audit.logger=INFO,RFAAUDIT
# maximum audit log file size, set as you wish (for example 256MB)
hdfs.audit.log.maxfilesize=256MB
hdfs.audit.log.maxbackupindex=20
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
#log4j.logger.org.apache.hadoop.security=DEBUG,RFAAUDIT
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.MaxFileSize=${hdfs.audit.log.maxfilesize}
log4j.appender.RFAAUDIT.MaxBackupIndex=${hdfs.audit.log.maxbackupindex}
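Once entries start landing in hdfs-audit.log, you can aggregate the cmd=open lines to see which paths are actually being read. A rough Python sketch (the log path is an assumption, adjust it for your NameNode host):

# rough sketch: count HDFS read accesses (cmd=open) per top-level data directory
# assumption: the NameNode audit log path below; adjust for your cluster
import re
from collections import Counter

pattern = re.compile(r"cmd=open\s+src=(\S+)")
counts = Counter()
with open("/var/log/hadoop-hdfs/hdfs-audit.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            # keep only the first few path components, e.g. /data/events/clicks
            prefix = "/".join(m.group(1).split("/")[:4])
            counts[prefix] += 1

for path, n in counts.most_common():
    print(n, path)

Note that this only counts actual reads (cmd=open); directory listings and metadata calls show up under other cmd values.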

Hope this helps.

Re: Audit trail for HDFS data use

Super Collaborator

Thanks, this was helpful.

Re: Audit trail for HDFS data use

Super Collaborator

Can this be defined at the Cloudera Manager level?

 

Also, I see that editing the maximum size and rotation frequency is limited to the Enterprise license.

Re: Audit trail for HDFS data use

Champion

Yes, you can certainly define log4j properties using Cloudera Manager.

The link I provided above shows that the audit trail can be enabled in both versions; I am not sure which specific audit-trail properties depend on the version, sorry.

Re: Audit trail for HDFS data use

Super Collaborator

Thanks,

 

I read the above link, but it seems the edit/change options are only available in the Enterprise versions. This is what I see in Cloudera Manager:

Maximum Audit Log File Size and Number of Audit Logs to Retain require an appropriate Cloudera Enterprise license.

 

So I'm still limited in using it. Is there maybe another way to solve my issue?

 

My HDFS data is partitioned by event type, and each event type is partitioned by year, month, and day. The analytics and other teams run Hive and Impala on specific event types and partitions. My goal is to know which partitions and event types are queried, so I can define the optimal retention for each event type.
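If I get the HDFS audit log in place, I'm thinking of something like this rough sketch to get the last read date per event type (the paths below are just placeholders for my layout):

# rough sketch: last read date per event type, from the NameNode audit log
# assumptions: the audit log path, and that data sits under /data/events/<event_type>/<year>/<month>/<day>/
import re

AUDIT_LOG = "/var/log/hadoop-hdfs/hdfs-audit.log"        # assumption
ROOT = "/data/events/"                                   # placeholder for my event-type root
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2}).*cmd=open\s+src=" + re.escape(ROOT) + r"([^/\s]+)(\S*)")

last_read = {}                                           # event_type -> (date, example partition path)
with open(AUDIT_LOG) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            day, event_type, rest = m.groups()
            # keep the most recent read date and the partition path that was touched
            if day > last_read.get(event_type, ("", ""))[0]:
                last_read[event_type] = (day, ROOT + event_type + rest)

for event_type, (day, path) in sorted(last_read.items()):
    print(event_type, "last read on", day, "->", path)

Event types that have no recent read date would then be candidates for a shorter retention.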