1. I download data to the local system where Oozie is running.
2. I now have HDFS on 2 servers (I added the second one recently).
3. I have been moving data from the local file system to HDFS using the copyFromLocal command.
4. Once that is done, I load the moved data using Pig's LOAD command.
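Steps 3 and 4 above can be sketched roughly as follows. The paths and file names here are hypothetical examples, and the block only runs the real commands if an hdfs client is installed:

```shell
# Hypothetical example paths -- substitute your own.
LOCAL_FILE=/data/incoming/events.csv   # file on the local fs of the Oozie host
HDFS_DIR=/user/hadoop/input            # destination directory in HDFS

if command -v hdfs >/dev/null 2>&1; then
  # Step 3: copy from the local file system into HDFS
  hdfs dfs -mkdir -p "$HDFS_DIR"
  hdfs dfs -copyFromLocal "$LOCAL_FILE" "$HDFS_DIR/"
  # Step 4 (in Pig) then loads by HDFS path, e.g.:
  #   A = LOAD '/user/hadoop/input/events.csv' USING PigStorage(',');
else
  echo "hdfs client not found; commands shown for illustration only"
fi
```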
This has worked fine for as long as I had a single-host cluster. My question is: could this become a problem in a distributed environment? Thinking about it, now that I have 2 HDFS nodes, the data could be moved to either of them, right? How is Oozie/Pig going to know which host to pick that data from now?
Maybe my doubt is too naive, but I would appreciate an explanation. Thanks.
The great thing about Hadoop is that the details of where data is distributed across the cluster are handled by the namenode, and those details are hidden from the client, which only needs to know a file path. (The namenode URL is configured as fs.defaultFS in core-site.xml, which clients use to talk to the namenode.) The namenode decides which datanodes store each block of data and where to replicate it, and it holds the metadata on these details. Whether you are using hdfs commands on Linux, running Pig scripts, Oozie, etc., these clients all pass a file path to the namenode; the namenode knows how that data is distributed as blocks among the cluster and takes care of read and write operations. This happens dynamically as you add and remove datanodes from your cluster.
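You can see this for yourself: every client addresses data by HDFS path only, and you can ask the namenode where the blocks of that path physically live. A small sketch, with a hypothetical path, guarded so it only queries a cluster if one is available:

```shell
# Hypothetical HDFS path -- the client never names a datanode host.
HDFS_PATH=/user/hadoop/input/events.csv

if command -v hdfs >/dev/null 2>&1; then
  # Ask the namenode which datanodes hold the blocks of this file:
  hdfs fsck "$HDFS_PATH" -files -blocks -locations
  # A Pig script uses the very same path, again with no host names:
  #   A = LOAD '/user/hadoop/input/events.csv' USING PigStorage(',');
else
  echo "hdfs client not found; command shown for illustration only"
fi
```

The point is that block placement can change (rebalancing, new datanodes, replication) without any client-side change, because clients never hard-code data locations.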
So ... no concern here.
But the data to be copied into HDFS exists on the local fs of one system only. The jobs fail when running on other nodes, since the specified file is not on their local file systems.
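This follow-up is exactly the situation HDFS is meant to resolve: the copy into HDFS has to be executed on (or read from) the one host that actually holds the local file, but after that, jobs on any node reference the shared HDFS path rather than a local one. A hedged sketch of that ordering, with hypothetical paths:

```shell
# Hypothetical paths: the source file exists only on this one host's local fs.
LOCAL_FILE=/data/incoming/events.csv
HDFS_PATH=/user/hadoop/input/events.csv

if command -v hdfs >/dev/null 2>&1; then
  # Run this step on the host that has the file, BEFORE launching the job:
  hdfs dfs -copyFromLocal "$LOCAL_FILE" "$HDFS_PATH"
  # From here on, a job running on ANY node loads the HDFS path:
  #   A = LOAD '/user/hadoop/input/events.csv' USING PigStorage(',');
else
  echo "hdfs client not found; commands shown for illustration only"
fi
```

If the Pig/Oozie job was instead pointed at the local path (e.g. a file:// location), it would fail on every node that does not hold that file, which matches the symptom described.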