
Best-practice for using VLANs or network access control

Cloudera Employee

A SaaS company would like to establish and document explicit access methods and controls for network access from clients to their primary HDP cluster and between their primary and backup clusters.

This approach requires that all nodes be on two VLANs, so that user traffic is isolated from internal cluster connections. Are there best practices for VLAN implementations with HDP?

Proposed Network/VLAN Solution

| Workflow | From | To | VLAN | Access Control |
| --- | --- | --- | --- | --- |
| Users to UIs (Ambari, Ranger, etc.) | HQ | Hadoop master nodes | VLAN1 | VPN, login |
| User jobs (Hive, PowerBI, etc.) | HQ | Hive server nodes | VLAN1 | fw (users) |
| Inter-Hadoop comms | Master nodes | Worker nodes | VLAN2 | none |
| Prod DB insert (Sqoop) | SQL Server | Hadoop primary | VLAN2 | no fw |
| Cluster replication (Falcon) | Hadoop primary | Hadoop backup | VLAN2 | static IP route, no fw |
| Monitoring (external) | All Hadoop nodes | Monitoring systems | VLAN1 | fw (outbound only) |
| External application jobs | Our app servers | Hadoop master nodes | VLAN1 | fw (IP range, ports) |
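One way to make the proposed table machine-checkable is to encode it as data and look up which controls apply to a given source/destination pair. This is a hypothetical sketch (the `POLICY` structure and `controls_for` helper are invented for illustration, not part of any HDP tooling):

```python
# Hypothetical sketch: the proposed VLAN/access-control table encoded as data,
# so a candidate connection can be checked against the documented policy.
POLICY = [
    # (workflow, source, destination, vlan, access_control)
    ("Users to UIs",              "HQ",               "Hadoop master nodes", "VLAN1", "VPN, login"),
    ("User jobs",                 "HQ",               "Hive server nodes",   "VLAN1", "fw (users)"),
    ("Inter-Hadoop comms",        "Master nodes",     "Worker nodes",        "VLAN2", "none"),
    ("Prod DB insert (Sqoop)",    "SQL Server",       "Hadoop primary",      "VLAN2", "no fw"),
    ("Cluster replication",       "Hadoop primary",   "Hadoop backup",       "VLAN2", "static IP route, no fw"),
    ("Monitoring (external)",     "All Hadoop nodes", "Monitoring systems",  "VLAN1", "fw (outbound only)"),
    ("External application jobs", "Our app servers",  "Hadoop master nodes", "VLAN1", "fw (IP range, ports)"),
]

def controls_for(source, destination):
    """Return the (vlan, access_control) pairs that permit source -> destination."""
    return [(vlan, ctl) for _, src, dst, vlan, ctl in POLICY
            if src == source and dst == destination]

print(controls_for("Our app servers", "Hadoop master nodes"))
```

An empty result means the table documents no permitted path for that pair, which is itself useful when auditing firewall rules against the design.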

The second question: based on the table above, if an application wants to run a Hadoop job to get a report, does it need to talk to both master nodes and all data nodes, or just to a single master node? Since the customer is a SaaS company, its internal applications would require that many servers be able to reach all Hadoop nodes, which is problematic from a security standpoint.

4 Replies

Re: Best-practice for using VLANs or network access control

Cloudera Employee

For the second question...

1. Applications that connect to services will need network access to any server running that service. (Hive clients connect to the Hive Server node, etc.)

2. Applications that access HDFS using the native HDFS RPC protocol will require network access to the NameNode and to every DataNode in the cluster. Most ETL or data-migration integrations run into this. (The external application talks to the NameNode and is returned a list of blocks, which may have to be retrieved from any DataNode.)

3. To avoid granting every client access to every node in the cluster, consider Apache Knox:

  • Knox runs on one or more edge nodes
  • Clients submit only REST API calls to the Knox endpoints
  • Knox authenticates the user and reverse-proxies each call to the correct service
  • Clients must use WebHDFS (REST) rather than the native HDFS RPC protocol
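To see why point 2 forces such wide network access, here is an illustrative sketch of the native read pattern. This is not the real HDFS client API; the hostnames, ports, and block map are invented:

```python
# Illustrative only: mimics the access pattern of a native HDFS read.
# The client asks the NameNode for block locations, then must open a
# connection to a DataNode for every block of the file.

# Hypothetical block-location response the NameNode might return for one file:
BLOCK_LOCATIONS = {
    "blk_001": ["datanode1:50010", "datanode4:50010"],  # replica hosts per block
    "blk_002": ["datanode7:50010", "datanode2:50010"],
    "blk_003": ["datanode3:50010", "datanode9:50010"],
}

def hosts_client_must_reach(block_map, namenode="namenode:8020"):
    """Every host the client needs network access to for one file read."""
    hosts = {namenode}                  # first hop: the NameNode
    for replicas in block_map.values():
        hosts.add(replicas[0])          # then one replica host per block
    return hosts

print(sorted(hosts_client_must_reach(BLOCK_LOCATIONS)))
```

Because blocks are spread across the cluster, the set of reachable hosts grows with the file, which is exactly why a firewall between external apps and the cluster ends up needing rules for every DataNode.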
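The Knox alternative in point 3 can be made concrete by comparing URL shapes. The hostnames and the `default` topology name below are assumptions; the ports are the usual HDP 2.x defaults (50070 for WebHDFS on the NameNode, 8443 for the Knox gateway):

```python
# Hedged sketch: compares a direct WebHDFS URL with the same operation
# proxied through a Knox edge node. Hostnames are illustrative.

def direct_webhdfs_url(namenode_host, path):
    # Talking straight to the NameNode: requires network access to it,
    # and WebHDFS then redirects the client to a DataNode for the data.
    return f"http://{namenode_host}:50070/webhdfs/v1{path}?op=OPEN"

def knox_webhdfs_url(knox_host, path):
    # Same operation through Knox ("default" is the topology name):
    # the client only ever needs network access to the Knox edge node.
    return f"https://{knox_host}:8443/gateway/default/webhdfs/v1{path}?op=OPEN"

print(direct_webhdfs_url("namenode", "/data/report.csv"))
print(knox_webhdfs_url("knox-edge", "/data/report.csv"))
```

From the firewall's point of view this collapses the rule set to a single host and port, at the cost of routing all client traffic through the gateway.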

Re: Best-practice for using VLANs or network access control

New Contributor

If the external apps have latency requirements, e.g., real-time analytics, and high user concurrency, I don't see how this would work well.

Ideally, this should be configured like a database: any app server behind a firewall that needs access can reach it, and access is controlled by the database itself rather than by inserting a firewall (a bottleneck) between the app server and the database. That data path is often heavily trafficked and highly latency-sensitive, so putting a firewall there is a bad idea.

Couldn't Ranger be used to allow app service account access to zones vs. using Knox to verify every call, or are these just two different ways to achieve the same objective (restricting access to data)?


Re: Best-practice for using VLANs or network access control

Knox is indeed a bottleneck (although you can use multiple Knox servers in parallel to spread load if applicable). However, requesting and receiving reports is a great use case for Knox since only a small amount of data goes in (the job request), only a small amount of data comes out (the report), and there may be many report requestors who need that level of access but don't have and shouldn't need access to the details of the cluster. Those users can be given Knox access via established Active Directory accounts, and don't have to mess with Kerberos or know about cluster details.

As David's answer notes, applications that transport large amounts of data into or out of the cluster will still need direct access to many, or perhaps all, of the cluster servers.


Re: Best-practice for using VLANs or network access control

@David Kaiser The use of multiple networks with "multi-homed" HDP cluster servers is commonplace in enterprise environments. The motivations include partitioning traffic for security, bandwidth, or other management reasons; as well as improving availability or bandwidth through redundancy. Setting up a multi-homed cluster requires careful attention to some additional parameters. For help, please see the HCC article Parameters for Multi-Homing. (If this article is useful to you, please up-vote it :-)
