
Are there any recommendations or best practices for using Anti-virus with Hadoop servers?

Expert Contributor
1 ACCEPTED SOLUTION

Contributor

The best practice is to avoid active anti-virus (AV) software that monitors access to the underlying disk systems used for data and metadata storage by the following processes:

  • Apache Hadoop
    • HDFS Namenode
    • HDFS Datanode
    • YARN Resource Manager
    • YARN Node Manager
  • Apache Accumulo
  • Apache Flume
  • Apache HBase
  • Apache Kafka
  • Apache ZooKeeper

These processes store data structures only; nothing they write is executable by the underlying OS. Because they can be very active, potentially performing continuous writes against large files, best performance requires direct, unimpeded access to the underlying filesystem, and any AV system that traps filesystem calls will degrade Hadoop performance.
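
Where policy will not allow AV to be removed outright, the usual compromise is to exclude these storage directories from on-access scanning. The sketch below is a minimal, illustrative way to assemble such an exclusion list; every path is a hypothetical default (annotated with the configuration property it typically corresponds to), so substitute the directories your cluster actually uses.

```python
# Illustrative only: typical locations for the storage directories named
# above. These are hypothetical defaults -- read the real values from your
# cluster's hdfs-site.xml, yarn-site.xml, zoo.cfg, and server.properties.
HADOOP_DATA_DIRS = [
    "/hadoop/hdfs/namenode",  # dfs.namenode.name.dir
    "/hadoop/hdfs/data",      # dfs.datanode.data.dir
    "/hadoop/yarn/local",     # yarn.nodemanager.local-dirs
    "/hadoop/yarn/log",       # yarn.nodemanager.log-dirs
    "/var/lib/zookeeper",     # ZooKeeper dataDir
    "/var/lib/kafka",         # Kafka log.dirs
]

def write_exclusion_file(path: str = "av-exclusions.txt") -> None:
    """Write one directory per line, a format many AV products accept for
    exclusion lists (check your vendor's documentation for exact syntax)."""
    with open(path, "w") as fh:
        fh.write("\n".join(HADOOP_DATA_DIRS) + "\n")

if __name__ == "__main__":
    write_exclusion_file()
```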

Some sites choose to run periodic AV scans (a weekly scan, for example) on client, gateway, and "edge node" systems where users and developers connect and run local processes. These scans do not interfere with cluster performance, but they are important for safeguarding the edge-connected systems that are the main clients of the cluster.



Rising Star

Just a note that YARN may need to execute artifacts placed into its local cache on the NodeManagers (NMs); it is not purely data storage. This is why YARN-related directories cannot live on filesystems mounted with noexec in /etc/fstab.
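
To make that concrete, here is a small illustrative check that flags fstab entries combining noexec with a mount backing a YARN directory. The YARN paths are assumed defaults; the authoritative values are yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs in yarn-site.xml.

```python
# Illustrative sketch: warn about noexec mounts that back YARN directories.
# The paths below are hypothetical defaults, not values from this thread.
YARN_DIRS = ["/hadoop/yarn/local", "/hadoop/yarn/log"]

def noexec_yarn_mounts(fstab: str = "/etc/fstab") -> list[str]:
    offenders = []
    with open(fstab) as fh:
        for line in fh:
            fields = line.split()
            # Skip blank lines, comments, and malformed entries.
            if len(fields) < 4 or fields[0].startswith("#"):
                continue
            mount_point, options = fields[1], fields[3].split(",")
            covers_yarn_dir = any(
                d == mount_point or d.startswith(mount_point.rstrip("/") + "/")
                for d in YARN_DIRS
            )
            if "noexec" in options and covers_yarn_dir:
                offenders.append(mount_point)
    return offenders

if __name__ == "__main__":
    for mp in noexec_yarn_mounts():
        print(f"noexec mount backs a YARN directory: {mp}")
```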


Sometimes the requirement to run AV on the servers is unavoidable due to security policies that cannot be challenged. In that event, prepare to add significantly more nodes, more memory, and more CPUs to get the same level of performance.
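
As a rough illustration of what that sizing might look like (the 25% throughput penalty below is an assumed placeholder, not a measured figure; benchmark your own workload before planning):

```python
# Back-of-the-envelope sizing, purely illustrative: if on-access AV costs
# some fraction of per-node throughput, estimate the node count needed to
# hold aggregate performance constant.
import math

def nodes_with_av(current_nodes: int, av_overhead: float) -> int:
    """av_overhead: assumed fraction of per-node throughput lost to AV."""
    return math.ceil(current_nodes / (1.0 - av_overhead))

# Example: a 40-node cluster with a hypothetical 25% AV penalty would need
# roughly 54 nodes to deliver the same aggregate throughput.
print(nodes_with_av(40, 0.25))  # -> 54
```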