
Is there ever a possibility where data in HDFS gets written to a master node log or disk?

Guru

Is there ever a possibility where data in HDFS gets written to a master node log (e.g. via YARN, Oozie, ZooKeeper) or to another area of the disk? The reason I am asking is strict security requirements: I need to know everywhere that sensitive HDFS data may end up.

1 ACCEPTED SOLUTION

Expert Contributor

Greg,

Yes, there is always a possibility, depending on the services deployed in the environment. For instance, if you have an audit server deployed on a so-called master node, a predefined audit policy may cause an event such as a Hive query to write sensitive data to a local folder that is not in HDFS. For those types of cases, you need to set up redaction, encryption, and/or service-level authentication/authorization strategies to protect sensitive data such as PII, PCI, and SSNs.
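
As a hedged illustration of the redaction option, log and query redaction in Cloudera Manager is driven by regex-based rules in JSON; the sketch below masks anything that looks like a US Social Security number before it reaches local log files. The specific pattern and replacement mask are assumptions for this example, so check your distribution's documentation for the exact supported fields.

```json
{
  "version": 1,
  "rules": [
    {
      "description": "Mask anything resembling a US SSN (illustrative rule)",
      "caseSensitive": false,
      "search": "\\d{3}-\\d{2}-\\d{4}",
      "replace": "XXX-XX-XXXX"
    }
  ]
}
```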

As for YARN, Oozie, and ZooKeeper in particular, you should be fine. YARN keeps the ApplicationMaster (AM) and container logs on the worker (data) nodes; only high-level ResourceManager logs are stored on the RM nodes. Furthermore, you should configure job history logs to be written to a directory in HDFS, /user/history/ for instance, and apply HDFS native encryption to that folder if needed. Oozie service logs shouldn't contain any sensitive information either, as they only record high-level information such as which part of a workflow failed; you would need to drill down into the individual service logs for more insight. ZooKeeper is the same story: only high-level metadata is stored in znodes, depending on the services deployed in your environment, such as Solr schemas and Kafka topic offsets.
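
As a minimal sketch of that last point, the commands below create an HDFS transparent-encryption zone over the job history directory. This assumes a Hadoop KMS is already configured; the key name is illustrative, and the target directory must exist and be empty when the zone is created (adjust the path to match your configured job history location).

```bash
# Create an encryption key in the Hadoop KMS (key name is an example)
hadoop key create historykey

# Turn the job history directory into an encryption zone;
# data written there is then encrypted transparently at rest
hdfs crypto -createZone -keyName historykey -path /user/history

# Verify the zone is in place
hdfs crypto -listZones
```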

Hope that helps.

Derek
