Support Questions

bulk upload to HDFS with limited access to cluster from client side

Expert Contributor

Hi, each day we will get 10-20 GB of binary files.

We need to upload these files into HDFS. We also want to limit access to the cluster from the client side (the side that delivers the 10-20 GB files).

What are the best approaches?

We have several ideas:

1. SFTP into a landing directory on our side (for example, on one of our data nodes) and then hadoop fs -put; a rough script for this is sketched below.
2. hadoop fs -put from the client side (the side that delivers the data). But we would like to forbid direct remote access to the cluster.
3. WebHDFS (does it actually work?). The problem is the same: we don't want to give the client access to the cluster or its interfaces.

*And we don't want to set up Kerberos or anything like that; the cluster sits on a private, secure network.
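
Roughly what we have in mind for idea 1 (the SFTP landing directory, the HDFS target path and the schedule below are placeholders, nothing is set up yet):

    #!/bin/bash
    # Sweep script run periodically (e.g. from cron) on the node that
    # receives the SFTP uploads; all paths are placeholders.
    LANDING=/data/landing          # local directory the SFTP account writes into
    HDFS_TARGET=/data/incoming     # HDFS directory to load the files into

    for f in "$LANDING"/*; do
        [ -f "$f" ] || continue                   # nothing to do if the directory is empty
        # -f overwrites a leftover partial file from a previous failed run
        if hadoop fs -put -f "$f" "$HDFS_TARGET"/; then
            rm -f "$f"                            # clear the landing zone only after a successful put
        fi
    done
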
1 ACCEPTED SOLUTION

Cloudera Employee

You could try using HttpFS: it acts as a trusted edge node between the cluster and external clients. It is essentially a proxy for WebHDFS, so clients never talk directly to the NameNode or DataNodes. Throughput is lower, but it should be fine for 10-20 GB of data.

See:

http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-hdfs-httpfs/
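
For example, a client-side upload through HttpFS can be done with plain curl against the WebHDFS REST API; the host name, port 14000 (the HttpFS default), target path and user.name below are only placeholders:

    # Step 1: ask HttpFS to create the file; the response returns a Location header
    curl -i -X PUT "http://httpfs-host.example.com:14000/webhdfs/v1/data/incoming/file1.bin?op=CREATE&user.name=myuser"

    # Step 2: send the file body to the returned location (still the HttpFS host, now with data=true)
    curl -i -X PUT -T file1.bin -H "Content-Type: application/octet-stream" \
         "http://httpfs-host.example.com:14000/webhdfs/v1/data/incoming/file1.bin?op=CREATE&user.name=myuser&data=true"

This way the client only ever needs HTTP access to the HttpFS host, never to the NameNode or DataNodes.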
