Created 10-29-2015 01:12 AM
In our current production environment we have 20 data nodes. We use Sqoop to import data from Netezza into Hadoop, and we opened the firewall between the Netezza server and the 20 data nodes so Sqoop could connect.
We are planning to add 40 new data nodes. For Sqoop to keep working, we would need to open new firewall rules for all of the new nodes. We are also getting requests to import data from other databases, such as Teradata and Oracle, into Hadoop. With firewalls in place, it is hard to maintain rules between each database and every individual data node. Are there any alternative solutions to this problem, for example using a gateway node?
Created 10-29-2015 01:39 AM
I've never tried this approach, so think of it as a science experiment.
Set up a node label, apply it to the 20 existing hosts, create a queue that defaults to that node label, and submit your Sqoop jobs to that queue alone. The Sqoop jobs will then only run on the existing 20 nodes.
You could also go narrower and have just one host do the imports, but be careful: HDFS usage on that node will grow much higher if you don't rebalance.
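I haven't validated this end to end, but roughly the pieces would look like the sketch below (the label, queue, host, and database names are made up, adjust them to your environment):

    # 1. Define a node label and tag only the 20 existing (firewall-opened) hosts with it
    yarn rmadmin -addToClusterNodeLabels "sqoop_nodes(exclusive=false)"
    yarn rmadmin -replaceLabelsOnNode "datanode01.example.com=sqoop_nodes"   # repeat for each of the 20 hosts

    # 2. In capacity-scheduler.xml, give a dedicated queue access to that label and make it the default, e.g.
    #    yarn.scheduler.capacity.root.sqoopq.accessible-node-labels = sqoop_nodes
    #    yarn.scheduler.capacity.root.sqoopq.default-node-label-expression = sqoop_nodes

    # 3. Submit the Sqoop import to that queue so its mappers land only on the labeled hosts
    sqoop import -Dmapreduce.job.queuename=sqoopq \
        --connect jdbc:netezza://netezza-host:5480/proddb \
        --table SALES --username etl_user -P \
        --target-dir /data/netezza/sales -m 8

With the default node label expression set on the queue, the Sqoop mappers (and their JDBC connections) should only be scheduled on hosts that already have firewall access.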
Created 12-09-2015 03:00 PM
To simplify the firewall rules I would create one edge host to use as a gateway, using SSH tunnels, iptables, or another forwarding tool, so that all requests go out from that host's IP only. You could also approach your network team and get a NAT address assigned to your hosts so they all appear to use the same IP when making outgoing requests. A rough sketch of the iptables variant is below.
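This is only an illustration of the idea; the IP addresses and hostnames are placeholders, and 5480 is the usual Netezza port. On the edge host you forward the database port to the real server and masquerade the source, then point every Sqoop connect string at the edge host, so the firewall only needs one rule: edge host to database.

    # on the edge/gateway host: enable routing, then DNAT port 5480 to the real Netezza server (10.0.5.10 is a placeholder)
    sysctl -w net.ipv4.ip_forward=1
    iptables -t nat -A PREROUTING  -p tcp --dport 5480 -j DNAT --to-destination 10.0.5.10:5480
    iptables -t nat -A POSTROUTING -p tcp -d 10.0.5.10 --dport 5480 -j MASQUERADE

    # data nodes then connect through the gateway instead of to Netezza directly, e.g.
    sqoop import --connect jdbc:netezza://edge-host.example.com:5480/proddb ...

The same pattern works for Teradata or Oracle by forwarding their ports on the same edge host. Keep in mind the gateway becomes a single point of failure and a bandwidth bottleneck for all imports.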