Environment

Best practices dictate that, where possible, a Hadoop cluster should be maintained behind a firewall to minimize any potential security vulnerabilities that may arise from exposed ports and web interfaces. A common approach to enabling user access in this situation is to open up SSH into a set of gateway/edge nodes. This ensures that users must authenticate prior to accessing any pieces of the Hadoop ecosystem and implicitly encrypts all data sent between the client and the cluster. This is a common setup for vanilla cloud-based installations.

The problem with this setup is that, by default, all access is limited to the CLI on the gateway machines. Users outside of the cluster firewall cannot access valuable features such as web UIs and JDBC/ODBC connections. There are a few options to enable these capabilities securely:

  1. Enable Kerberos+SPNEGO and Knox. Then open up the appropriate ports in the firewall.
  2. Implement firewall rules to expose specific ports and hosts to a subset of known client IPs.
  3. Leverage SSH tunneling to route traffic over an SSH connection and into the cluster.

This article focuses on #3. The best solution will vary on a case-by-case basis, but SSH tunneling is the simplest option and requires no intervention from Ops staff once SSH access is enabled.

Accessing Web UIs via a SOCKS Proxy

You can use SSH to open a local port that connects to a remote environment and behaves like a SOCKS proxy. Once this tunnel is established, you can configure your web browser to use the proxy, and all web traffic will be routed over the tunnel and into the cluster environment (behind the firewall, where the web UIs are reachable). The following command will open a tunnel to the machine gateway.hdp.cluster, which has SSH enabled:

ssh -D 8080 -f -C -q -N username@gateway.hdp.cluster

Parameters map to the following:

  • -D the local port to listen on
  • -f send this ssh operation into the background after password prompts
  • -C use compression
  • -q quiet mode: suppress warnings and diagnostic messages
  • -N do not execute a remote command or wait for one to be issued
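
If you open this tunnel frequently, the same flags can be captured in an OpenSSH client config entry so the command stays short. The host alias hdp-gateway below is just an illustrative name; the other values mirror the command above:

Host hdp-gateway
    HostName gateway.hdp.cluster
    User username
    DynamicForward 8080
    Compression yes

With this entry in ~/.ssh/config, the tunnel can be opened with: ssh -f -N -q hdp-gateway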

Once the tunnel is established, open your web browser and navigate to its network/proxy settings. Under the proxy settings, enable the SOCKS proxy and enter localhost and port 8080. Now all web traffic from your browser will be routed over the tunnel and appear as if it is coming from gateway.hdp.cluster. You should be able to load web UIs that are behind the firewall, such as Ambari or the NameNode UI.
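
You can also sanity-check the tunnel from the command line, independent of any browser configuration, by sending a request through the SOCKS proxy with curl. The NameNode host and its default 50070 UI port below are examples; substitute a web UI that actually exists in your cluster:

curl --socks5-hostname localhost:8080 http://namenode.hdp.cluster:50070/

The --socks5-hostname form resolves the hostname through the proxy, which matters because cluster-internal hostnames typically do not resolve on the client machine.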

Establishing an ODBC/JDBC Connection via an SSH Tunnel

For an ODBC/JDBC connection, the behavior we want is a bit different from the previous section. We want to map a local port to a port on a remote machine inside the firewall, specifically the HiveServer2 port. We can do that as follows:

ssh -L 10000:hiveserver2.hdp.cluster:10000 username@gateway.hdp.cluster

Now, an application on the client can connect to localhost on port 10000 and, to the application, it will appear as if it is connecting directly to hiveserver2.hdp.cluster on port 10000. Under the covers, data is actually going over the SSH tunnel to gateway.hdp.cluster and then being routed to port 10000 on the hiveserver2.hdp.cluster node.
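
Before pointing an application at it, you can confirm that the local end of the forward is listening; this check assumes a netcat build on the client that supports the -z (scan-only) flag:

nc -z localhost 10000 && echo "HiveServer2 tunnel is up"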

To configure the ODBC/JDBC connection on the client, simply use localhost and port 10000 in place of the HiveServer2 host in the JDBC/ODBC connection parameters.
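
As a concrete example, with the tunnel above in place, a Beeline session started on the client machine would look something like the following (the default database and the username are placeholders for your own values):

beeline -u "jdbc:hive2://localhost:10000/default" -n username

The same substitution applies to ODBC: point the driver's host field at localhost and the port field at 10000.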
