Created 12-05-2016 08:24 PM
Is there ever a possibility where data in HDFS gets written to a master node log (e.g. via YARN, Oozie, ZooKeeper) or another area of local disk? The reason I am asking is strict security requirements: we need to know everywhere that sensitive HDFS data may end up.
Created 12-05-2016 11:24 PM
Greg,
Yes, there is always a possibility, depending on the services deployed in the environment. For instance, if you have an audit server deployed on a so-called master node, a predefined audit policy (for example, one that logs Hive queries) may cause sensitive data to be written to a local folder that is not in HDFS. For those types of cases, you need to set up redaction, encryption, and/or service-level authentication/authorization strategies to protect sensitive data such as PII, PCI, and SSNs.
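At its core, log redaction is just pattern-based filtering applied before a line reaches local disk. As a rough, standalone illustration (the SSN pattern and the XXX-XX-XXXX placeholder are my own assumptions, not any specific product's redaction config):

```shell
#!/bin/sh
# Illustrative pattern-based redaction: mask anything that looks like a
# US SSN (NNN-NN-NNNN) before the line is written to a local log file.
redact() {
  sed -E 's/[0-9]{3}-[0-9]{2}-[0-9]{4}/XXX-XX-XXXX/g'
}

echo "2016-12-05 query by user jdoe, ssn=123-45-6789" | redact
# prints: 2016-12-05 query by user jdoe, ssn=XXX-XX-XXXX
```

In practice you would let the platform's redaction feature (or your log pipeline) apply rules like this, rather than piping logs through sed by hand; the point is that redaction happens at write time, not after the fact.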
Speaking of YARN, Oozie, and ZooKeeper in particular, you should be fine. YARN stores all the ApplicationMaster and container logs on the DataNodes; only high-level ResourceManager logs are stored on the RM nodes. Furthermore, you should configure job history logs to be written to a directory in HDFS (/user/history/, for instance) and apply HDFS native encryption on that folder if needed. Oozie service logs shouldn't contain any sensitive information either, as they should only contain high-level information, such as which part of a workflow failed; you would need to drill down to the individual service logs to get more insight. ZooKeeper is the same: only high-level information is stored in znodes, depending on the services deployed in your environment, such as Solr schemas, Kafka topic offsets, etc.
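The history-directory encryption mentioned above can be sketched roughly as follows, assuming the Hadoop KMS is already configured; the key name `historyKey` and the `/user/history` path are placeholders for whatever your cluster uses:

```shell
# 1. Create an encryption key in the Hadoop KMS.
hadoop key create historyKey

# 2. Make the job-history directory an HDFS encryption zone.
#    The directory must already exist and be empty when the zone is created.
hdfs dfs -mkdir -p /user/history
hdfs crypto -createZone -keyName historyKey -path /user/history

# 3. Verify the zone was created.
hdfs crypto -listZones
```

Anything written under that path (job history files, aggregated logs if you point yarn.nodemanager.remote-app-log-dir there) is then encrypted at rest transparently to the services writing it.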
Hope that helps.
Derek