Had some questions about operational best practices:
a.) For external customers, what's the best way to let them upload large datasets to HDFS? Uploading via the file browser web interface may not be safe: if 10 users start uploading 20 GB of data at the same time, the server will choke (I currently have a small 5-server cluster).
b.) I was thinking of having an external jumpbox that people can SSH into and FTP their data to, with a cron job that then pushes the data to HDFS on a periodic basis. Once the data is uploaded, users can use the web interface to program with Hive/Pig.
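The cron-driven push in (b) could look something like the sketch below. The landing directory, done directory, and HDFS target path are all assumptions here, not part of my actual setup; the only real commands are `hdfs dfs -put` and `mv`.

```shell
#!/bin/sh
# Sketch of the periodic push from the jumpbox's FTP landing directory
# into HDFS. Paths and the HDFS target directory are placeholders.
push_incoming() {
  incoming=$1; done_dir=$2; hdfs_target=$3
  for f in "$incoming"/*; do
    [ -f "$f" ] || continue                 # nothing landed yet, or not a regular file
    if hdfs dfs -put -f "$f" "$hdfs_target/"; then
      mv "$f" "$done_dir/"                  # move aside only after a successful put
    fi
  done
}

# Example crontab entry (every 10 minutes):
# */10 * * * * /usr/local/bin/push_incoming.sh
push_incoming /data/incoming /data/done /user/uploads
```

One caveat with this approach: the sweep can pick up a file that is still mid-FTP-transfer, so in practice you'd want uploads to land under a temporary name (or in a staging directory) and only be renamed into the swept directory once complete.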
c.) Spark shell - Is there a way for users to launch spark-shell from a web interface?
d.) Currently the NameNode is a single point of failure. I was reading about federation vs. HA. What's recommended for a very small environment like mine?
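My understanding on (d), which I'd like confirmed: federation addresses namespace scaling (multiple independent NameNodes each owning part of the namespace) rather than availability, so for a small cluster an active/standby NameNode pair with a Quorum Journal Manager seems like the usual route. A minimal hdfs-site.xml sketch, where the nameservice `mycluster` and the `nn1`/`nn2`/journal-node hosts are placeholders:

```xml
<!-- Hypothetical hdfs-site.xml fragment for NameNode HA with QJM;
     nameservice and host names are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```

Automatic failover would additionally need a ZooKeeper quorum and the ZKFC daemons, plus the client failover proxy provider setting, per the HDFS HA documentation.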
e.) DataNode information, cluster information, and Spark jobs can all be viewed from the web UIs. Is it good practice to let users see that information? The issue is that the information isn't restricted to each user's own jobs; it's visible to everyone or no one.