My issue is that, as I understand it, any client application (in my case a Java application that connects to Hadoop for reads/writes) needs to know the HBase/HDFS connection details, so my client has hbase-site.xml, hdfs-site.xml, etc. copied from the Hadoop cluster.
In a distributed setup where I have 150 application nodes in production, each node must be aware of the updated hbase-site.xml and hdfs-site.xml; if by any chance a node is not updated with the latest Hadoop configuration details, it creates an issue.
So I am exploring what a better, more centralized solution could be:
1. Should we store such Hadoop configuration data on a shared file system? But this solution does not look good from a cloud (e.g. AWS) point of view.
2. Can't we leverage ZooKeeper for this?
3. What about using Consul for this purpose?
Can you give any suggestions or recommendations on how the client should communicate with the Hadoop cluster, and where the cluster's configuration data should reside for the client to use?
Please also give due consideration to the fact that the Hortonworks Hadoop cluster is on a non-cloud platform while the application is in the cloud, or that both the Hadoop cluster and the 100 application nodes are in the cloud.
You don't always need hbase-site.xml/core-site.xml/hdfs-site.xml on your application's classpath (the default values are usually compatible), but it is best practice to explicitly include them, so that you as a user don't have to worry about which properties are the important ones. In my experience, most applications end up running from a node within your datacenter and can pull from the de-facto copy of the configuration files in /etc/.
For your other points:
1. Nothing comes to mind that would prevent this from working, but I have not seen it done before.
2. HBase could publish the necessary configuration data to ZooKeeper, but your client would still have to find the HBase root znode and the ZooKeeper quorum from somewhere (presently these come from hbase-site.xml). It's possible to do this, but it might take some effort. Most of the time, the information your clients need from hbase-site.xml (the ZK location) never changes. HDFS might be able to do the same, but I'm not familiar enough with it to guess at how feasible that would be. Instead of changing HBase/HDFS themselves, you could also use ZooKeeper yourself to bootstrap a Configuration object in your applications, instead of constructing an instance initialized from the configuration files on the classpath.
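A minimal sketch of that bootstrap idea, in plain JDK Java so it runs standalone. The byte[] stands in for data you would read from a znode you maintain yourself (the znode path and the example values below are assumptions, not anything HBase publishes for you); with the Hadoop/HBase jars on the classpath, the resulting Properties would then be applied to a Configuration via conf.set(...):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class ZkConfigBootstrap {

    // In a real client this byte[] would come from something like
    //   zk.getData("/myapp/hadoop-client-conf", false, null)
    // where /myapp/hadoop-client-conf is a znode YOU write to (an assumption
    // about your setup) whenever the cluster configuration changes.
    static Properties parseZnodeData(byte[] znodeData) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(znodeData));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Example payload in java.util.Properties format (values are made up).
        byte[] data = ("hbase.zookeeper.quorum=zk1,zk2,zk3\n"
                + "hbase.zookeeper.property.clientPort=2181\n"
                + "zookeeper.znode.parent=/hbase\n").getBytes(StandardCharsets.UTF_8);

        Properties props = parseZnodeData(data);
        // With hbase-client/hadoop-common available you would then do:
        //   Configuration conf = HBaseConfiguration.create();
        //   props.forEach((k, v) -> conf.set((String) k, (String) v));
        System.out.println(props.getProperty("hbase.zookeeper.quorum"));
    }
}
```

The point is that only the ZK quorum address needs to be baked into the application; everything else can be fetched at startup.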
3. I don't know enough about Consul to give a good answer, but I would assume you could use it in the same way you use ZooKeeper.
Thanks for your reply.
What I am saying is: if I give my Java client the details below, without giving it hbase-site.xml and hdfs-site.xml, will it be okay?
I will tell the client the following:
2. ZooKeeper instances and ZK port
3. Oozie/Hive/HDFS connection details
So if I tell the client the above, do I still need hbase-site.xml, hdfs-site.xml, etc. to be present on the client side?
Also, one thing regarding your comment on storing configuration data on a shared file system: do you think it's a good idea?
I am trying to figure out the best way to keep the Hadoop/HBase configuration up to date on the client side, that is, better configuration management of the Hadoop cluster's settings for clients.
With an out-of-the-box installation, yes, the above would probably be enough, but not in all cases. For example, if you configure HDFS HA (multiple NameNodes), the HDFS client requires information present in hdfs-site.xml to function. This would require investigation on the configuration-management side to make sure that clients have all the properties you override, and that any future changes on the cluster are evaluated to determine whether clients need those set as well.
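To make the HA example concrete, these are roughly the client-side hdfs-site.xml entries an HA client cannot discover on its own (the nameservice ID "mycluster" and the hostnames are placeholders; the failover proxy provider class is the standard one shipped with Hadoop):

```xml
<!-- Client-side hdfs-site.xml entries required for NameNode HA.
     "mycluster" and the hostnames below are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1-host.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2-host.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- and in core-site.xml: fs.defaultFS = hdfs://mycluster -->
```

None of these can be guessed from "ZK host and port" alone, which is why the simple list above breaks down once HA is enabled.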
Regarding shared filesystems, I have nothing but bad memories about using NFS in the past to share files, so I am not quick to recommend that approach.
If your users have the ability to access the cluster, you could automate extraction of a current copy of configuration files to place on their local machine for their application classpath. There are more fancy things you could do, but they would require additional programming effort (e.g. using ZooKeeper or Consul to share and recreate Configuration data).
So I think, based on what @mqureshi replied below, it makes sense to have a gateway node handle all configuration data. But I am thinking that in my huge setup of, say, 100 nodes, it would be an overhead to maintain so many gateway nodes just for triggering Hadoop jobs from them.
Combine HBaseConfiguration here with HadoopConfiguration here and you should be able to do it without using config files. But I have always used client config files. They are easy, and moreover, if the configuration changes, all you have to do is replace the config files, not change your code.
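For illustration, the file-less approach boils down to setting a handful of properties in code (the keys below are the standard HBase/HDFS client settings; the hostnames and the HDP-default znode are placeholder assumptions). The demo keeps them in a plain Map so it runs standalone; the comment shows where they would feed into HBaseConfiguration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FileLessClientConf {
    public static void main(String[] args) {
        // The handful of settings a plain (non-HA) client actually needs.
        // Hostnames/ports are placeholders for your cluster's values;
        // /hbase-unsecure is the default znode on unsecured HDP clusters.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.put("hbase.zookeeper.property.clientPort", "2181");
        conf.put("zookeeper.znode.parent", "/hbase-unsecure");
        conf.put("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // With hbase-client on the classpath this becomes:
        //   Configuration c = HBaseConfiguration.create();
        //   conf.forEach(c::set);
        //   Connection conn = ConnectionFactory.createConnection(c);
        conf.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The trade-off mentioned above is visible here: these values are now compiled into your application (or its own config), so a cluster change means a redeploy rather than a file swap.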
In a 150-node cluster, only your client gateway nodes need to know about your config files, and about updates to them. How many gateway nodes do you have?
ZooKeeper will be leveraged, and your client will connect to ZooKeeper, getting its address from hbase-site.xml.
Thanks for your time and input.
The gateway suggestion makes sense; I have 2 edge nodes in my cluster.
You mentioned that if a config file changes, I should just replace it on all nodes. I understand that, but my concern is exactly this: how do I better manage config file changes from one place without touching all the client nodes?
I am not sure I understand. If your config changes, your client has to know; there is no way around that. Normally, when your config changes, you deploy it on the edge nodes and then restart. That's only two nodes in your cluster. If your client needs those configs, then you have to provide them somehow. The best way is to do it with config files, since you can just replace the old files with new ones.
If I am not understanding your question right, can you please elaborate?