02-24-2015 08:35 AM
I'm comparing the configuration files in /etc/hadoop/conf with the values listed in Appendix A of Hadoop: The Definitive Guide, 4th Edition (pre-release). I assume this is the same information that was in the 3rd edition. In any case, based on Table A-1 in the book and the contents of the common, HDFS and YARN configuration files, I can't tell whether this virtual image is actually configured for standalone, pseudodistributed or fully distributed mode.
YARN: I cannot find the property yarn.resourcemanager.hostname in any file in any directory under /etc/hadoop. This leads me to believe that YARN is configured for standalone mode. Is this the case, and if so, how does the NodeManager receive work instructions? What is the workflow in this case?
HDFS: The dfs.replication property is set to 1 in all hdfs-site.xml configuration files. This tells me HDFS is configured for pseudodistributed mode.
Common: The value for property fs.defaultFS in core-site.xml is "hdfs://quickstart.cloudera:8020". This appears to be the value for fully distributed mode.
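To make the comparison above repeatable, here is a minimal Python sketch that scans the *-site.xml files in a configuration directory and reports where each of the three properties is set. It assumes the usual Hadoop layout of <property><name>/<value> elements; the function names read_site_file and find_properties are made up for this illustration and are not part of any Hadoop tooling.

```python
import glob
import os
import xml.etree.ElementTree as ET

# Properties discussed in this post; extend the tuple as needed.
PROPERTIES = ("fs.defaultFS", "dfs.replication", "yarn.resourcemanager.hostname")


def read_site_file(path):
    """Return a {name: value} dict for one Hadoop *-site.xml file."""
    root = ET.parse(path).getroot()
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def find_properties(conf_dir, names=PROPERTIES):
    """Map each property name to a list of (file, value) pairs where it is set."""
    found = {name: [] for name in names}
    for path in sorted(glob.glob(os.path.join(conf_dir, "*-site.xml"))):
        props = read_site_file(path)
        for name in names:
            if name in props:
                found[name].append((os.path.basename(path), props[name]))
    return found


if __name__ == "__main__":
    # /etc/hadoop/conf is the directory mentioned in this post.
    for name, hits in find_properties("/etc/hadoop/conf").items():
        if hits:
            for filename, value in hits:
                print(f"{name} = {value}  ({filename})")
        else:
            print(f"{name} is not set in any *-site.xml file")
```

A property that is missing from every site file is not necessarily unconfigured: Hadoop falls back to the value in its bundled *-default.xml files, so an empty result here only shows that nothing overrides the default.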
Otherwise, I haven't noticed any problems with the image. I'm just acquainting myself with CDH 5, Hadoop, etc., at this time and learning how it's put together. The UI (Hue) works and I can get to specific UIs using their URIs/ports.
02-24-2015 09:10 AM
>> standalone, pseudodistributed or fully distributed mode
Pseudo-distributed is the term usually used, although there is little difference in configuration between "pseudo-distributed" and "fully distributed" other than the fact that everything just happens to be running on the same operating system. Do note that different services sometimes use these terms slightly differently: HBase, for instance, is either in "standalone mode" (only the Master runs, and does everything) or "distributed mode" (regardless of whether it's a single-node or multi-node install). I actually don't know that HDFS even has a "standalone mode": you have to run at least a NameNode and a DataNode, and you can just choose to run both on the same machine. In the VM, everything should be running as separate daemons in their own JVMs, even though it's all on the same node.
If YARN is configured such that the NodeManager is not really doing anything separate from the ResourceManager, I'll need to look into that. But I have run numerous MapReduce jobs on the QuickStart VM, so I can tell you that things should generally work anyway.