My organization has recently set up several Cloudera environments and is in the beginning phases of data ingestion. We originally predicted that most initial data sources would be hosted within our internal network, but we are quickly receiving interest from business groups that would like us to integrate data from web services such as Google Analytics, cloud-based SaaS platforms, etc. (typically accessed through API calls).
The issue is that, due to security concerns, we are required to prevent the Cloudera platform from being directly exposed to the Internet.
I've been doing some research, but have failed to find any options for ingesting data into the platform without opening direct access to one or more nodes (which will likely be impossible to get approval for, given the security concerns I mentioned).
I'm curious whether others have encountered this challenge and am interested in potential options, aside from purchasing a third-party data ingestion tool that sits outside the cluster. I'm primarily looking for low-cost or no-cost options.
Some supplementary information: we are planning to store data primarily using HDFS (obviously), Hive/Impala, Kudu, and potentially HBase if a use case arises.
We do have IBM DataStage internally for ETL to RDBMS data sources, although I personally would prefer to stay away from this tool, since its integration with Hadoop in general is lacking (and if we start down the path of utilizing it, it will likely get used for things beyond this use case, which I feel is a really bad idea).
Any advice or assistance would be greatly appreciated.
It's hard to believe no one has run into this, given that most use cases involve Internet-based sources, and I can't imagine every organization exposes its internal data to the Internet...
There was a similar situation in our company, where the Hadoop cluster is not exposed to the Internet. Luckily, we have a server placed in a DMZ (demilitarized zone) which is used to send and receive feeds to the external network. We call this the SFTP Server, and it has a connection to the Internet. We consume data from Google Analytics, Adobe Analytics, Facebook, etc., using their APIs. The code that invokes the APIs sits on the SFTP Server, and its script is triggered from the Hadoop edge node. The output files generated by the API calls are copied back to the edge node and then placed into the Hadoop cluster. All of this is automated. There are firewalls between the Hadoop cluster and the SFTP Server, and between the SFTP Server and the Internet. Specific ports had to be opened to consume the data.
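The flow above can be sketched as a small orchestration script run from the edge node. This is a minimal illustration only, assuming key-based SSH from the edge node to the DMZ host; the hostname, script names, and paths below are made-up placeholders, not the actual environment:

```python
# Hedged sketch of the edge-node orchestration described above.
# All hostnames, paths, and script names are illustrative assumptions.
import subprocess

SFTP_HOST = "sftp-dmz.example.com"          # assumed DMZ server with Internet access
REMOTE_SCRIPT = "/opt/feeds/pull_ga.sh"     # assumed API-pull script on the DMZ host
REMOTE_OUT = "/data/outbound/ga_export.csv" # file the API-pull script produces
LOCAL_STAGE = "/tmp/ga_export.csv"          # staging path on the edge node
HDFS_TARGET = "/data/raw/google_analytics/" # landing directory in HDFS

def build_pipeline_cmds():
    """Return the three commands the edge node runs, in order:
    1) trigger the API-pull script on the DMZ SFTP server over SSH,
    2) copy the generated output file back to the edge node,
    3) put the file into HDFS."""
    return [
        ["ssh", SFTP_HOST, REMOTE_SCRIPT],
        ["scp", f"{SFTP_HOST}:{REMOTE_OUT}", LOCAL_STAGE],
        ["hdfs", "dfs", "-put", "-f", LOCAL_STAGE, HDFS_TARGET],
    ]

def run_pipeline(dry_run=True):
    """Print the commands (dry run) or execute them, failing fast on errors."""
    for cmd in build_pipeline_cmds():
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

In practice this would be wrapped in a scheduler job (cron, Oozie, or similar) with logging and retries, but the three-step shape — trigger in the DMZ, pull back, land in HDFS — is the whole pattern.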