
Hadoop cluster on private network?

From talking to Cloudera support people, I got the impression that they expect all the Hadoop nodes, including data nodes, to be on a public network, and that a setup where the Hadoop cluster is on a private network with gateways for Hue, Cloudera Manager, ssh access, etc. is not supported. They immediately say that a "multihomed" configuration is not officially supported. That sounds completely crazy. I would expect 99.9% of Hadoop clusters to be running on private networks with a few gateways. I think "multihomed" in the documentation means something else: that the Hadoop nodes themselves cannot communicate with each other over different networks, which has nothing to do with gateways between the Hadoop cluster and the public network. Any comments?

 

That raises another question. Fully enabling TLS inside Hadoop (up to level 3) requires certificates, and Cloudera recommends against using self-signed certificates. But can one get an officially signed certificate for a node that is not on a public network and not in public DNS? Or is level 3 overkill for Hadoop running on a private network, and is level 1 enough? Is level 1 sufficient to configure Kerberos? The documentation says that configuring TLS is a prerequisite for Kerberos.

 

 

12 REPLIES

Champion
Man, I posted a detailed response to your other post, but it didn't make it through.

So TLS level 2 is required for Kerberos only in the sense that it is needed so that CM can pass around the keytab files when the wizard is used to set up Kerberos. It is not needed if you do it manually.
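By "manually" I mean generating the principals and keytabs yourself with the usual MIT Kerberos tools and placing them on the hosts, rather than having the wizard push keytabs over TLS. Roughly something like this (realm, principal and host names are placeholders, not anything specific to your cluster):

kadmin.local -q "addprinc -randkey hdfs/md01.mydomain@EXAMPLE.COM"
kadmin.local -q "xst -k hdfs.keytab hdfs/md01.mydomain@EXAMPLE.COM"

You then copy the keytab to the node yourself and point the service configuration at it.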

On nodes having different names/addresses but still needing to encrypt traffic: first, CM only allows for one cert path and name. What you would need to do is create certs with subject alternative names so that the same cert works for both the public and private names.

As for multi-homing, don't do it for the masters and workers. In theory it should be fine for the gateway roles (certainly for the actual Gateways, but HS2, Oozie, etc. may have issues). CM will be a pain. I don't have any experience to help you do it, but it has to be done perfectly for all components. DNS resolution must be spot on.

I did run into an issue with just using a multi-homed node for CM. It was left over from what the machine had previously done, and not enough effort went into making sure it was perfect. So CM and the CM agents had issues because they were talking over the wrong network, etc.

My experience is that most customers just have it on their public enterprise network. So it isn't public to the masses and the Internet, but it is fully available to all business users.

But wouldn't you want the Hadoop nodes to communicate on a faster, more expensive network than your enterprise network?

In my case, the nodes communicate on a 10G network but the enterprise network is 1G.

 

Champion
You do want the masters and workers on the fastest possible network, with the majority of the traffic going east-west. This can be achieved even if an enterprise network is used; it depends on how the network is architected and configured. I am no network wiz, but if I recall from my early days, as long as the Hadoop nodes are in the same broadcast domain, traffic will stay on the physically connected 10G network as long as each node can find the others (DNS/hosts resolution), and it shouldn't go up beyond the L3 switch.
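One common way to pin resolution to the 10G side is to have every node map the cluster hostnames to the 10G addresses in /etc/hosts, for example (the subnet and host names below are just placeholders):

# /etc/hosts entries on every node, pointing at the 10G interfaces
10.10.0.11   md01.mydomain   md01
10.10.0.12   wk01.mydomain   wk01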

What if I pretend that there is only the 10G internal network and use something external to Hadoop, like a firewall, to forward traffic from the CM machine's external interface on port 7180 to the internal interface, so that I do not have to deal with multiple networks inside Hadoop? Might that work?
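I imagine the forwarding itself would look something like this on the firewall box (untested; eth0/eth1 and the internal CM address are placeholders):

sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 7180 -j DNAT --to-destination 10.10.0.11:7180
iptables -t nat -A POSTROUTING -o eth1 -p tcp -d 10.10.0.11 --dport 7180 -j MASQUERADE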

 

If I do have to generate a single certificate for two interfaces using keytool, what's the exact syntax?

 

Is it something like: 

keytool -genkeypair -keystore myhost.keystore -keyalg RSA -alias myhost -dname "CN=myhost.mydomain, O=Hadoop" -storepass <pass> -keypass <pass> -ext SAN=myhost2.mydomain2

?

When I tried to use the above command I got:

keytool error: java.lang.Exception: Key pair not generated, alias <myhost> already exists

 

Is

-alias myhost

the same as

-ext SAN=myhost

?

 

Can one use either -alias or -ext SAN, but not both?

Champion
Does myhost.keystore already exist?

No, alias is just the name assigned to the entry in the keystore and SAN is a subject alternative name within the certificate itself.

I suspect that the keystore exists and already has an entry with the alias myhost.
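You can check, and if needed remove just that entry instead of the whole keystore, with something like:

keytool -list -keystore myhost.keystore -storepass <pass>
keytool -delete -alias myhost -keystore myhost.keystore -storepass <pass>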

Yes, that was the problem. Once I deleted the existing keystore the error disappeared.

Also, one needs to prefix the hostname in SAN= with dns:
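So, putting that together, the working variant of my earlier command would be something like this (listing both names in the SAN is my own addition, in case both interfaces need to be covered):

keytool -genkeypair -keystore myhost.keystore -keyalg RSA -alias myhost -dname "CN=myhost.mydomain, O=Hadoop" -storepass <pass> -keypass <pass> -ext SAN=dns:myhost.mydomain,dns:myhost2.mydomain2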

Champion

Should it be given in plain text in the header of the pem file, before "BEGIN CERTIFICATE", or is it actually encoded in the certificate gibberish?

I do not see it in the header. There is the original hostname and the alias, but no alternative host name.

OK, found it. It is indeed encoded inside the certificate itself. To see it I used

openssl x509 -in md01.pem -text -noout
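It looks like keytool can decode it too, without openssl:

keytool -printcert -file md01.pem

That should list the certificate extensions, including SubjectAlternativeName.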

Champion
So port 7180 just serves the CM web UI. The rest of the communication between CM and the CM agents, or between the CM Service and the cluster, happens over other ports. So for this example you would need to route all of the CMS and CM agent traffic over the Hadoop network. Something similar could be done for Hue.

The resources below do a better job of covering what is needed. The gist is that you create a file with the SANs and then pass that file to the openssl command. Well, that is for generating a CSR; the keytool command above should work for a self-signed certificate. I haven't done it that way myself though, as I stay away from self-signed certs. A rough sketch of the openssl approach follows the links.

https://geekflare.com/san-ssl-certificate/
http://wiki.cacert.org/FAQ/subjectAltName
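
The gist of those links, roughly (file, key and host names below are placeholders):

san.cnf:

[req]
distinguished_name = req_distinguished_name
req_extensions = v3_req
[req_distinguished_name]
[v3_req]
subjectAltName = @alt_names
[alt_names]
DNS.1 = myhost.mydomain
DNS.2 = myhost2.mydomain2

then generate the CSR with:

openssl req -new -newkey rsa:2048 -nodes -keyout myhost.key -out myhost.csr -subj "/CN=myhost.mydomain/O=Hadoop" -config san.cnf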