We had a similar situation at our company, where the Hadoop cluster is not exposed to the internet. Luckily, we have a server in the DMZ (demilitarized zone) that is used to send and receive feeds to and from external networks. We call it the SFTP server, and it has a connection to the internet. We consume data from Google Analytics, Adobe Analytics, Facebook, etc. using their APIs. The code that invokes the APIs sits on the SFTP server, and the script for it is triggered from the Hadoop edge node. The output files generated by the API calls are copied back to the edge node and then placed into the Hadoop cluster. All of this is automated. There are firewalls between the Hadoop cluster and the SFTP server, and between the SFTP server and the internet. Specific ports had to be opened to consume the data.
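
To make the flow concrete, here is a rough sketch of what the edge-node side of such a job could look like. It is only an illustration, not our actual script: it assumes the edge node can reach the DMZ server over SSH with key-based login, and the host name, script path, file paths, and HDFS directory are all made-up placeholders.

```python
import subprocess
from typing import Callable, List, Optional

# All of the values below are hypothetical placeholders, not real hosts or paths.
DMZ_HOST = "feeds@sftp-dmz.example.internal"    # the "SFTP server" in the DMZ
REMOTE_SCRIPT = "/opt/feeds/pull_analytics.sh"  # script that calls the vendor APIs
REMOTE_OUTPUT = "/data/outbound/analytics.csv"  # file that script is assumed to write
LOCAL_STAGING = "/tmp/analytics.csv"            # staging location on the edge node
HDFS_TARGET = "/landing/analytics/"             # destination directory in HDFS


def build_commands() -> List[List[str]]:
    """Return the three steps, in order, as argument lists."""
    return [
        # 1. Trigger the API-pulling script on the DMZ server (assumes SSH access).
        ["ssh", DMZ_HOST, REMOTE_SCRIPT],
        # 2. Copy the generated output file back to the edge node.
        ["scp", f"{DMZ_HOST}:{REMOTE_OUTPUT}", LOCAL_STAGING],
        # 3. Load the file into the cluster, overwriting any earlier copy.
        ["hdfs", "dfs", "-put", "-f", LOCAL_STAGING, HDFS_TARGET],
    ]


def run_pipeline(run: Optional[Callable[[List[str]], None]] = None) -> None:
    """Run each step in order; the default runner stops at the first failure."""
    if run is None:
        def run(cmd: List[str]) -> None:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    for cmd in build_commands():
        run(cmd)
```

Calling `run_pipeline()` from a scheduler on the edge node (cron, Oozie, or similar) would automate the whole thing. Only the SSH port between the edge node and the DMZ server, and outbound HTTPS from the DMZ server to the API endpoints, would need to be opened in this setup.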