Support Questions
Is there ever a possibility where data in hdfs gets written to a master node log or disk?

Solved


Is there ever a possibility where data in HDFS gets written to a master node log (e.g. via YARN, Oozie, ZooKeeper) or another area of its disk? I'm asking because of strict security requirements: we need to know everywhere that sensitive HDFS data may end up.

1 ACCEPTED SOLUTION


Re: Is there ever a possibility where data in hdfs gets written to a master node log or disk?

Rising Star

Greg,

Yes, there is always a possibility, depending on the services deployed in the environment. For instance, if an audit server is deployed on a so-called master node, a predefined audit policy may cause a Hive query to write sensitive data to a local folder that is not in HDFS. For those kinds of cases, you need to set up redaction, encryption, and/or service-level authentication/authorization strategies to protect sensitive data such as PII, PCI, and SSNs.
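To make the redaction idea concrete, here is a minimal sketch of masking SSN-shaped values in a log stream before it lands on local disk. This is an illustration only, not the platform's built-in log redactor; the sample log line is made up:

```shell
# Minimal redaction sketch: mask anything shaped like a US SSN
# (ddd-dd-dddd) before it reaches a local log file. Production setups
# would use the platform's own log/query redaction features instead.
echo "user=alice ssn=123-45-6789" \
  | sed -E 's/[0-9]{3}-[0-9]{2}-[0-9]{4}/XXX-XX-XXXX/g'
# -> user=alice ssn=XXX-XX-XXXX
```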

Speaking of YARN, Oozie, and ZooKeeper in particular, you should be fine. YARN stores all ApplicationMaster and container logs on the worker (datanode) hosts; only high-level ResourceManager logs are stored on the RM nodes. Furthermore, you should configure job history logs to be written to a directory in HDFS, /user/history/ for instance, and apply HDFS native encryption on that folder if needed. Oozie service logs shouldn't contain any sensitive information either, as they only carry high-level information, such as which part of a workflow failed; you would need to drill down to the individual service logs for more insight. ZooKeeper is similar: only high-level information is stored in znodes, depending on the services deployed in your environment, such as Solr schemas, Kafka topic offsets, and so on.
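The "HDFS native encryption" step above can be sketched with HDFS transparent-encryption commands. This assumes a Hadoop KMS is already configured for the cluster, and the key name `history-key` is hypothetical:

```shell
# Sketch: place the job history directory inside an HDFS encryption zone
# so history files are encrypted at rest. Assumes a Hadoop KMS is
# configured; "history-key" is a hypothetical key name.
hadoop key create history-key          # create an encryption key in the KMS
hdfs dfs -mkdir -p /user/history       # the zone directory must exist and be empty
hdfs crypto -createZone -keyName history-key -path /user/history
hdfs crypto -listZones                 # verify the zone is in place
```

Note that an encryption zone can only be created on an empty directory, so this has to be done before history data accumulates (or after moving it aside).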

Hope that helps.

Derek


