A SaaS company would like to establish and document explicit access methods and controls for network access from clients to their primary HDP cluster and between their primary and backup clusters.
This approach requires that all nodes sit on two VLANs, isolating user traffic from internal cluster traffic. Are there best practices for VLAN implementations with HDP?
**Proposed Network/VLAN Solution**
| Traffic | Source | Destination | VLAN | Access control |
| --- | --- | --- | --- | --- |
| Users to UIs (Ambari, Ranger, etc.) | HQ | Hadoop master nodes | VLAN1 | VPN, login |
| User jobs (Hive, PowerBI, etc.) | HQ | Hive server nodes | VLAN1 | Firewall (users) |
| Inter-Hadoop comms | Master nodes | Worker nodes | VLAN2 | None |
| Prod DB insert (Sqoop) | SQL Server | Hadoop primary | VLAN2 | No firewall |
| Cluster replication (Falcon) | Hadoop primary | Hadoop backup | VLAN2 | Static IP route, no firewall |
| Monitoring (external) | All Hadoop nodes | Monitoring systems | VLAN1 | Firewall (outbound only) |
| External application jobs | Our app servers | Hadoop master nodes | VLAN1 | Firewall (IP range, ports) |
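As a concrete illustration of the "Firewall (users)" entry for VLAN1, the rules below sketch what host-level filtering on the Hive server nodes might look like. The HQ subnet (10.10.0.0/16) is a placeholder; 10000 is HiveServer2's default Thrift port.

```shell
# Sketch only: allow Hive traffic from the HQ user subnet (hypothetical
# 10.10.0.0/16) to HiveServer2's default Thrift port, drop other sources.
iptables -A INPUT -p tcp -s 10.10.0.0/16 --dport 10000 -j ACCEPT
iptables -A INPUT -p tcp --dport 10000 -j DROP
```

In practice these rules would typically live on a VLAN-boundary firewall rather than on each node, but the intent (source subnet plus destination port) is the same.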
The second question: based on the table above, if an application wants to run a Hadoop job to produce a report, does that application need to talk to both master nodes and all data nodes, or just to a single master node? Since the customer is a SaaS company, its internal applications would require that many servers be able to reach all Hadoop nodes, which is problematic from a security standpoint.
For the second question...
1. Applications that connect to services need network access to every server running that service. (Hive clients connect to the HiveServer2 node, etc.)
2. Applications that access the HDFS file system using the native HDFS RPC protocol require network access to the NameNode and to any DataNode in the cluster. Most ETL or data-migration integrations run into this: the external application talks to the NameNode and is returned a list of blocks to be retrieved from potentially any DataNode.
3. To avoid the need to allow access to every node in the cluster, consider Apache Knox, which exposes the cluster's REST APIs through a single gateway endpoint.
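To make point 3 concrete, a Knox topology file maps gateway endpoints to internal cluster URLs, so external clients only ever reach the Knox host. The sketch below is illustrative: hostnames, ports, and LDAP settings are placeholders, and the provider configuration is abbreviated.

```xml
<!-- Sketch of a Knox topology (hostnames and LDAP URL are hypothetical).
     External clients hit https://knox-host:8443/gateway/<topology>/...
     and never see the internal NameNode or HiveServer2 addresses. -->
<topology>
  <gateway>
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://ad.example.com:389</value>
      </param>
      <!-- additional Shiro/AD parameters omitted for brevity -->
    </provider>
  </gateway>
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode.internal:50070/webhdfs</url>
  </service>
  <service>
    <role>HIVE</role>
    <url>http://hiveserver.internal:10001/cliservice</url>
  </service>
</topology>
```

With this in place, the firewall only needs to permit external app servers to reach the Knox host, not the whole cluster.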
If the external apps have latency requirements (e.g., real-time analytics) and high user concurrency, I don't see how this would work well.
Ideally, this should be configured like a database: any app server behind a firewall that needs access can reach it, and access is controlled by the database itself rather than by a firewall inserted between the app server and the database. A firewall on that path becomes a bottleneck, which is a bad idea because this data path is often heavily trafficked and highly latency-sensitive.
Couldn't Ranger be used to grant app service accounts access to zones, instead of using Knox to verify every call? Or are these just two different ways to achieve the same objective (restricting access to data)?
Knox is indeed a potential bottleneck (although you can run multiple Knox servers in parallel to spread load where applicable). However, requesting and receiving reports is a great use case for Knox: only a small amount of data goes in (the job request), only a small amount of data comes out (the report), and there may be many report requestors who need that level of access but don't have, and shouldn't need, access to the details of the cluster. Those users can be given Knox access via established Active Directory accounts, and don't have to mess with Kerberos or know about cluster internals.
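To show what "don't have to know about cluster details" means for a report client, here is a minimal sketch of building a Knox-proxied URL. The gateway hostname and topology name are hypothetical; the URL shape follows Knox's gateway convention of `https://<knox-host>:8443/gateway/<topology>/<service>/...`.

```python
# Sketch: a report client addresses only the Knox gateway (hypothetical
# host "knox.example.com", topology "default"); NameNode and DataNode
# addresses stay hidden behind the gateway.
KNOX_GATEWAY = "https://knox.example.com:8443/gateway"
TOPOLOGY = "default"

def knox_url(service: str, path: str) -> str:
    """Build a Knox-proxied REST URL for a cluster service."""
    return "/".join([KNOX_GATEWAY, TOPOLOGY, service, path.lstrip("/")])

# Fetch a finished report over WebHDFS, proxied through Knox.
report_url = knox_url("webhdfs/v1", "/reports/daily.csv") + "?op=OPEN"
# An actual request would carry the caller's AD credentials over HTTPS,
# e.g. via urllib.request with a basic-auth handler.
print(report_url)
```

Note the client resolves exactly one hostname, which is what lets the firewall rule set stay small.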
As David's answer notes, applications transporting large amounts of data into or out of the cluster will need direct access to many, or perhaps all, of the cluster servers.
@David Kaiser The use of multiple networks with "multi-homed" HDP cluster servers is commonplace in enterprise environments. The motivations include partitioning traffic for security, bandwidth, or other management reasons; as well as improving availability or bandwidth through redundancy. Setting up a multi-homed cluster requires careful attention to some additional parameters. For help, please see the HCC article Parameters for Multi-Homing. (If this article is useful to you, please up-vote it :-)
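Among the "additional parameters" mentioned above, the most commonly adjusted multi-homing settings live in `hdfs-site.xml`. The fragment below is a sketch of the usual starting point; verify the exact set against your HDP version's documentation before applying it.

```xml
<!-- hdfs-site.xml sketch: bind NameNode listeners to all interfaces so
     the service is reachable on both VLANs. Verify against your HDP
     version; defaults shown here are the common multi-homing choices. -->
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <!-- have clients connect to DataNodes by hostname, so DNS can return
       the address appropriate to the client's VLAN -->
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
```

The bind-host settings decouple where a daemon listens from the hostname it advertises, which is the core multi-homing problem on a two-VLAN cluster like the one in the table above.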