Member since: 10-22-2015
Posts: 83
Kudos Received: 84
Solutions: 13
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 650 | 01-31-2018 07:47 PM |
 | 1421 | 02-27-2017 07:26 PM |
 | 1221 | 12-16-2016 07:56 PM |
 | 4632 | 12-14-2016 07:26 PM |
 | 2144 | 12-13-2016 06:39 PM |
03-01-2018
07:56 PM
2 Kudos
Hi @Zack Riesland , thanks for the clarification. I understand now that s3-dist-cp is used both coming and going, and I agree it doesn't seem to have any renaming capability built in. I strongly suspect there's a way to control the filenames by controlling the Hive partition name, but I'm not a Hive expert and maybe the application benefits from using the same partition name for raw and clean data.
Here's what I found that might help: a StackOverflow question that confirms your complaint that CLI rename of HDFS files is slow, and confirms my sense that it shouldn't be: https://stackoverflow.com/questions/14736017/batch-rename-in-hadoop
The second responder (Robert) wrote a 1-page Java program that uses the Hadoop Java API, and shows that it can rename several thousand files in the same 4 seconds that the CLI takes to rename one file. This suggests that the majority of the time is taken up by the connection protocol, and the Java program can use a single connection for multiple operations.
I also looked at options for using Python similarly. It appears that the hdfs3 library would do quite nicely, although you have to separately install the underlying C library open-sourced by Pivotal. Full docs for installation and API are given in the linked doc. Please note it probably would not work to use the similarly-named "hdfs" library, because the latter uses the WebHDFS interface, which is unlikely to be faster than the CLI (although I didn't test it).
Since I'm a Hortonworks employee, I should point out that the above open-source references are not supported or endorsed by the company, and are used at your own risk. I did look over the Java program and it looks fine to me, but that's my personal opinion, not the company's.
Good luck. If that doesn't get you where you need to be, and since your profile popup indicates you are a Hortonworks support customer, you'd be welcome to open a support request for advice from an SE.
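For what it's worth, here is a rough, untested Python sketch of that "single connection, many renames" idea using the hdfs3 library mentioned above. The namenode host, port, path pattern, and new-name scheme are all placeholders you would replace for your environment:

from hdfs3 import HDFileSystem

# One connection to the namenode, reused for every rename below.
hdfs = HDFileSystem(host='namenode.example.com', port=8020)

# Hypothetical source directory and chunk-file pattern.
for path in hdfs.glob('/data/clean/data_day=01-01-2017/0*_0'):
    parent, name = path.rsplit('/', 1)
    hdfs.mv(path, parent + '/renamed_' + name)  # rename in place, no data copy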
... View more
02-28-2018
08:05 PM
@zackriesland , you don't say what application you're using to pull and clean the new data on HDFS. The name pattern "data_day=01-01-2017/000000_0" is under the control of that application, and can be configured there. It's not an HDFS thing; HDFS is just a file system, and stores files under whatever filenames the app tells it to use.
If it is not acceptable to change the naming configuration in that app, there are other possible approaches. You say that renaming the files takes 3 seconds per chunk file. Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3. You can use the "hadoop fs -mv" CLI command, very much like the Linux "mv" command, to rename files on HDFS from a script, and it should not take anything like 3 seconds. You can use this command in a shell script while logged into any server in the HDFS cluster. Another way to run such a script is on a workstation with, for instance, the Python HDFS client library installed, and configured to be able to talk to the HDFS cluster.
Finally, you're probably aware that filenames in S3 are actually just strings, and the slash ("/") to delimit "folders" is purely convention, except for the delimiter between the bucket name and the rest of the filename. There aren't actually any folders below the level of the bucket. So changing the "dest" location for the s3-dist-cp command to put it in a sub-folder is really just the same as specifying a prefix (the name of the "sub-folder") on the filename. Use the date, or a serial number, or whatever convention you want, so each invocation of s3-dist-cp puts its files on S3 with a different prefix. It seems that would also fulfill the requirement above.
Hope one or another of these ideas helps.
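To make the "different prefix per run" idea concrete, here is a small, untested Python sketch that builds a date-based destination prefix and invokes s3-dist-cp. The bucket name, the HDFS source path, and the exact s3-dist-cp arguments shown are assumptions to be checked against your EMR documentation:

import subprocess
from datetime import date

# Hypothetical HDFS source and a destination prefix that changes on every run.
src = 'hdfs:///data/clean/data_day=01-01-2017'
dest = 's3://my-bucket/upload-{}/'.format(date.today())

# --src/--dest are the commonly documented s3-dist-cp options; verify locally.
subprocess.check_call(['s3-dist-cp', '--src', src, '--dest', dest])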
... View more
01-31-2018
07:47 PM
1 Kudo
@Michael Bronson - greetings to a fellow Simpsonified! If the Ambari server is down, then there are obviously problems with the system, and it is a little tricky to say what to do -- a complete answer would look more like a decision tree than a single response. Nevertheless, I'll try to provide some help, with the understanding that a complete answer is beyond the scope of a simple Q&A.
First off, why not restart the Ambari service? That's by far the simplest answer. Give it a few minutes to check its connectivity with the other services, and you should be able to proceed via Ambari. Really, this is the only good solution.
If you really need to do it the hard way, there are two basic choices: (a) if you know a little about each of the services, you can use service-specific CLIs on each of the masters to check status and/or stop each service; (b) alternatively, you can use the fact that these services essentially all run in Java and are typically installed on "hdp" paths in the filesystem, to use Linux (or OS-equivalent) commands to find and kill them.
Option (a) requires a little research, and depends on which services are running in the cluster. For instance, you can log into the HDFS Master and use the commands summarized here: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Hadoop_Shutdown to shut down the core Hadoop services. Each of the other major components has similar CLIs, which some diligent web searching on the Apache sites will find. They allow querying the current status of the services, and provide reasonably clean shutdowns.
Option (b) assumes you are a capable Linux administrator, as well as understanding how HDP was installed. If not, you shouldn't try this. It is rude, crude, and depends overmuch on the famed crash-resilience of Hadoop components. Do NOT take this answer as Hortonworks advice -- to the contrary, it is the equivalent of pulling the plug on your servers, something you should only do on a production system if there are no other alternatives. But the fact is, if you have admin privileges, doing `ps auxwww|grep java|grep -i hdp` (assuming all services have been installed on paths that include the word 'hdp' or 'HDP') on each host in the cluster should elucidate all HDP-related processes still running (and maybe some others; do check the results carefully). If you see fit to kill them ... that's your responsibility. It is very important, if at all possible, to quiesce the cluster first, at least by stopping data inputs and waiting a few minutes.
Note the Ambari Agent is typically installed as a service with auto-restart; it is resilient and stateless, so it is not necessary to stop it before rebooting the server, but you could run `ambari-agent stop` (on each server) to make sure it stays out of the way while you work on the other services. Rebooting the server should restart it too. Hope this helps.
... View more
11-30-2017
08:45 PM
There's a third option: you could use the InvokeHTTP processor. In it, use the Expression Language to construct the RemoteURL, to directly build an AWS S3 web API call to get the listing you need. The AWS web API is documented here: http://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html
However, as that document says, "Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead: [AWS SDK or AWS CLI]" -- which are the approaches I gave above.
... View more
11-30-2017
08:30 PM
1 Kudo
Hi @Identity One , You're right, it's unfortunate the ListS3 processor doesn't accept an incoming flowfile to provide a prefix. You might want to enter an Apache jira (https://issues.apache.org/jira/projects/NIFI : "Create" button) requesting such a feature.
In the meantime, you can use the ExecuteScript processor to construct and run an appropriate S3 list command, with prefix, in response to incoming flowfiles. You can use any of the ExecuteScript allowed scripting languages that have a corresponding AWS SDK client library; see https://aws.amazon.com/tools/#sdk -- looks like the choices are python, ruby, or js. If you click through any of those links on the AWS SDK page, under "Documentation" : "Code Examples" : "Amazon S3 Examples", it should help. For example, the Python S3 examples are here: https://boto3.readthedocs.io/en/latest/guide/s3-examples.html . The "list" API is here: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
If you're not into scripting to the SDK, you could use your favorite scripting language's escape to shell (like Python's 'subprocess' feature), and invoke the AWS S3 command-line-interface commands (which you're evidently already familiar with). It's crude (and slower) but it would work 🙂 Hope this helps.
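For example, here is a tiny boto3 sketch of the kind of prefixed listing such a script would do. The bucket and prefix are placeholders, and credentials are assumed to come from the usual boto3 configuration chain:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Hypothetical bucket and prefix; each page holds up to 1000 keys.
for page in paginator.paginate(Bucket='my-bucket', Prefix='incoming/2017-11-30/'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])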
... View more
09-13-2017
07:14 PM
The 7.2 instructions should work fine on 7.3.
... View more
08-09-2017
07:43 PM
@R Patel , if the above is a satisfactory answer, please accept it? Thanks.
... View more
08-07-2017
06:36 PM
@R Patel , responding to your follow-up question about adding a Java parser: When METRON-777 is added to Metron, which I expect will be within this week, there will be a true plug-in architecture for new parsers, along with a Maven Archetype and documentation for creating new parsers. Because it's coming very soon and will completely change the answer to your question, I encourage you to wait for it. However, for the sake of completeness, here is a brief answer for the current state of the Metron code. In the absence of a true plugin architecture, you simply have to add your new parser alongside the other Metron parsers.
1. Start by using git to clone the Metron codebase from https://github.com/apache/metron/
2. In your favorite IDE, navigate to metron/metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/ , and you will see that each of the existing parsers is a self-contained Java package. Add a new package titled with the name of your parser.
3. The `bro` parser is a fairly simple example to follow. It demonstrates that your parser should extend `BasicParser` and provide implementations of its abstract methods.
4. However, the `bro` parser has a null configure method. Working in depth with Metron's configuration capability is too complex to include as part of this answer, but the `csv` parser provides a simple configuration method, and the `snort` parser a more sophisticated (but still follow-able) example of configuration. The configure() method is invoked automatically by the infrastructure at instance startup, and provides the opportunity for the parser instance to read its config.
5. By "your json file" I assume you mean the configuration. In general, rather than placing it in the code base, you would add it to the Zookeeper configuration for Metron, under the PARSER ConfigurationType, using the same configuration tools you used to add your custom config to the grok parser, but under the "sensor name" of your new parser.
That should be enough to give you a start. However, I reiterate that by the time you get well into such an implementation, METRON-777 should be available and you would want to switch gears to the plugin model. Hope this helps.
... View more
08-01-2017
06:09 PM
Hi @R Patel, Sebastien provided one of my favorite three documents on this topic. The other two are:
https://cwiki.apache.org/confluence/display/METRON/2016/04/25/Metron+Tutorial+-+Fundamentals+Part+1%3A+Creating+a+New+Telemetry
Similar to the previous citation, it is an end-to-end example using Squid logs as the example data, and creating a Grok parser to parse them.
https://metron.apache.org/current-book/metron-platform/metron-parsers/index.html
Provides much more detail about how the parser fits into the overall Metron infrastructure, especially configuration, and a little guidance about Java parsers.
In a future release, parsers will be a plug-in, with a Maven archetype to help create new ones (Apache JIRA METRON-777), so keep an eye out for that. Also, you may want to join the Apache Metron user or dev mailing lists; see http://metron.apache.org/community/
... View more
03-21-2017
06:47 PM
@wzheng , This sounds like a question preparatory to using it in production. The sandbox really is mostly set up for single-node single-user exploration. However, our cloud offerings are intended for production use, and are available for HDP on your choice of AWS, Azure, and GCP; see https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-1.6.3/index.html
... View more
03-13-2017
06:21 PM
2 Kudos
@Derrick Lin Please take a look at article https://community.hortonworks.com/content/kbentry/24277/parameters-for-multi-homing.html There is more to multi-home configuration than is in the HDFS document, and the more complete discussion may help you resolve your problem, especially wrt DNS and naming. Biggest question being, do the cluster hosts have the same name on all networks? Hope this helps.
... View more
03-10-2017
12:20 AM
@dbalasundaran , Part of the point of using Ambari is so you don't need to worry about which files a client needs, which may change from release to release. The GUI allows you to specify where the downloaded tarball goes, and both Ambari and the tar extraction command will complain if the tarball file is corrupted or incomplete. So you should be able to use it with confidence. Of course it's not beyond possibility that the Ambari specification could be buggy, so if you think you're seeing a wrong result in your client download, by all means bring it up. Have you seen a problem that makes you concerned? Or are you just trying to be thorough? Thanks.
... View more
03-01-2017
10:30 PM
1 Kudo
Hi @Sedat Kestepe, regarding "third party and custom apps", this is of course outside the scope of my expertise. But I think that, yes, you will have to find out (a) whether they are even capable of being configured to use a non-default network; (b) if so, how to so configure them; and (c) if desired, whether they can use the "0.0.0.0" config, meaning "listen on all networks" -- which is often a cheap way to get desirable results, although in some environments it has security implications.
Your second question is "How is service bound to preferred interface." Let's be clear that there are two viewpoints, the service and the client.
From the service's viewpoint, "binding interface" really means, "what IP address(es) and port number shall I LISTEN on for connection requests?" Regrettably there is no uniformity in how each application answers that question, even among the Hadoop Stack components -- else that article could have been about half the length! Some applications aren't configurable and always use the host's default IP address and a fixed port number. Sometimes the port number is just the starting point, where the app and client negotiate yet another port number for each connection. Sometimes a list of IP addresses can be configured, to listen on a specific subset of the networks available on multi-homed servers. Sometimes the "magic" address "0.0.0.0" is understood to mean "listen on ALL available network IP addresses". (This concept is integrated with most TCP/IP libraries that applications might use.) Almost always, though, a single application port number is used by a given service, which means only one instance of the service per host (although it can be multi-threaded if it uses that port negotiation trick). How the IP address(es) and port number are specified, in one or several config parameters, is up to the application. As noted in the article, sometimes Hadoop Stack services even use their own hostname and run it through DNS to obtain an IP address; the result is dependent on which DNS service is queried, which in turn may be configured by yet another parameter (see for example the Kerberos _HOST resolution feature).
From the client's viewpoint, it must answer the question "where do I FIND that service?" Clients are usually much simpler and are configured with a single IP address and port number where they should attempt to contact a service. That isn't necessarily so; clients often have lists of servers that they try to connect to in sequence, so they could have multiple IP addresses for the same service. But usually we prefer to put complexity in services and keep clients as simple as possible, because clients usually run in less controlled environments than services. It never makes sense to hand "0.0.0.0" to a client, because the client is initiating, not listening. The client has to know SPECIFICALLY where to find the service, which is typically (though not necessarily) on a different host than the client's own. We encourage use of a hostname rather than an IP address for this purpose, at which point the client will obtain the IP address from its default DNS server, or (in some cases) a configuration-specified DNS server.
Regarding your third question, in the "Theory of Operations" section, items 5, 6, and 7 should be taken together.
#5 is an example of why it is useful and even necessary to have the same hostname on all networks: "external clients [can] find and talk to Datanodes on an external (client/server) network, while Datanodes find and talk to each other on an internal (intra-cluster) network." If a particular Datanode has a different hostname on the two networks, all parties become confused, because they may tell each other to talk to a different Datanode by name.
#6 actually gives you an escape hatch: "(Network interfaces excluded from use by Hadoop may allow other hostnames.)" The "one hostname for all interfaces" rule only applies to the interfaces used for the Stack. An interface used only by administrators, for example, not for submitting or processing any Hadoop jobs, could assign the server a different hostname on that network. But if you violate #5, you threaten the stability of your configuration.
#7 describes precisely the constraints within which you can use /etc/hosts files instead of DNS. It should perhaps mention that, for example, it would work to have the cluster members know their peers only via one network, and know their clients via one other network. The clients would know the cluster members by the second network, with the SAME hostnames, and that's okay. As long as clients and servers have different /etc/hosts files, and you use hostnames not IP addresses for configurations, it can often work. And as #7 points out, if these constraints don't work for your needs, it isn't that hard to add DNS.
The example you gave is a possible result, yes. More broadly, all the servers would share /etc/hosts files in which all the servers are identified with an IP address on the intra-cluster network:
192.168.1.2  namenode2
192.168.1.3  datanode3
192.168.1.4  datanode4
192.168.1.5  datanode5
# etc.
while the clients would share /etc/hosts files in which those servers are identified with an IP address on the client/server access network:
10.0.9.2  namenode2
10.0.9.3  datanode3
10.0.9.4  datanode4
10.0.9.5  datanode5
# etc.
And this is a good example of why you MUST use hostnames, not IP addresses, and why dfs.client.use.datanode.hostname must be set true: the clients and servers must be able to specify cluster servers to each other, and the IP addresses won't be mutually accessible.
... View more
02-27-2017
07:26 PM
2 Kudos
@Sedat Kestepe , Network bonding is done at the OS level, before Hadoop or other Application-level software knows anything about it. So if you have a properly configured bonded network, you can configure it into Hadoop just like any other network.
The only possibly tricky thing depends on whether the bonded network is the ONLY network available on the server (which therefore means it is the default network), or if there are other networks also available on the server, and maybe some other network is taking the "eth0" default designation. If so, you can either force the bonded network to be the default network (how you do that is dependent on the OS, the NIC, and the BIOS, so I won't try to advise; please consult your IT person), or you can configure Hadoop to use the non-default network.
For information on how to configure a non-default network, you might want to read the article at https://community.hortonworks.com/content/kbentry/24277/parameters-for-multi-homing.html This article is for "multi-homed" servers (servers with multiple networks) in which you want to use more than one of the networks. But even if you only want to use ONE of the networks (in this case, your bonded network), if it is not the default network then you still need to understand the network configuration parameters in Hadoop, etc.
To get the names and IP addresses of all networks on the server, give the command `ifconfig` while logged in on the server in question. Hope this helps. If you still have questions after reading the article, please ask follow-up questions here, and include the `ifconfig` output unless it's not relevant.
... View more
12-16-2016
10:24 PM
Hi @Arsalan Siddiqi , yes, this is what I expected for Putty. By using the Putty configuration dialogue as shown in your screenshot, you ALREADY have an ssh connection to the guest VM, so typing `ssh` in the Putty terminal window is unnecessary and `scp` won't work -- instead of asking the host OS to talk to the guest OS, you would be asking the guest OS to talk to itself! I'm glad `pscp` (in a cmd window) worked for you -- and thanks for accepting the answer. Your last question, "why when i connect via Putty to root@127.0.0.1 p 2222 in windows and do a "dir" command, i see different folders as to when i do a DIR in virtualbox terminal??" is a puzzler. I would suggest, in each window (Putty terminal and Virtualbox terminal), you type the linux command `pwd`, which stands for "print working directory". It may be that the filesystem location you are ending up at login in the two tools is different. Also you can try the linux `ls` command.
... View more
12-16-2016
08:16 PM
(Got this wrong the first time! The following should be correct.) BTW, if your clusters have been inadvertently configured for Federation, and you want to disentangle them, it can be a little tricky. You can't just suddenly change all their configurations, because that would cause the namenodes to lose track of blocks on the remote datanodes. The needed process is too complex to detail here, but basically you need to do something like this:
The test for whether this problem exists is to open each NameNode's web UI ( http://<NameNode_FQDN>:50070 ) and navigate to the DataNodes page, then see if datanodes from the OTHER cluster are shown. If so, the two clusters have Federated.
To separate them, first change the configuration of each Namenode so that it stops serving namespaces from both clusters. I believe it would be best to reconfigure both Namenodes in each HA grouping at the same time, so a brief downtime will be needed.
Then decommission each of the datanodes individually, from all clusters/namenodes, a few at a time (where "few" means a small percentage of the overall cluster size), so they flush their blocks to other nodes. Then, after each datanode is no longer referenced by any namespace, you change ITS configuration to only register to the Namenodes/Nameservices on its own cluster. Restart it, join it to its cluster, and continue with the next datanode. Alternate between the datanodes on the two clusters, so you don't run into storage space issues. Continue until all the datanodes have been so "gently" reconfigured to a single cluster's namespace(s). At the end of the process, all the nodes in each cluster will have a non-Federated config.
This can be done without data loss and only brief NN downtime, but it will require significant time commitment, depending on the cluster sizes. Please don't try this unless you thoroughly understand the issues involved. If you have any doubts, consulting assistance may be appropriate.
... View more
12-16-2016
07:56 PM
1 Kudo
Hi @jbarnett, this was really interesting to think about. The reason is that your question crosses the domains of three HDFS features: HA, Federation, and Client ViewFS. The key understanding required is that Clusters and Clients can, and sometimes should, be configured differently.
The blog you reference specifically says that the configuration he proposes is "a custom configuration you are only going to use for distcp" (my emphasis added), and he specifically proposes it as a solution for doing distcp between HA clusters, where at any given time different namenodes may be the "active" namenode within each cluster that you're distcp-ing between. [1] He does not mention a critical point, which is that this configuration is for the Client only, not for the HDFS Cluster nodes (and the Client host where this is configured must not be one of the HDFS node servers), if you wish the two clusters to remain non-Federated.
The behavior you describe for rebalancing would be the expected result if this configuration were also used for the cluster nodes themselves, causing them to Federate -- and in this case, yes, the undesired result would be balancing across the two clusters regardless of their membership. It is also probable, although I'm not certain, that if you give the `hdfs balancer` command from a Client with this configuration, it will initiate rebalancing on both clusters; whether each cluster rebalances only within itself or across cluster boundaries I'm not 100% sure, but since the balancing is delegated to the namenodes, I would expect that if the clusters are each using a non-Federated configuration, then they would balance only within themselves.
So the simple answer to your question is: to keep the two clusters separate and non-Federated from each other, the nodes of each cluster should be configured with the hdfs-site.xml file appropriate to its own cluster, not mentioning the other cluster. Since each cluster is itself HA, it will still need to use the "nameservice suffix" form of many of the configuration parameters, BUT it will only mention the nameservice(s) and namenodes hosted in that cluster; it will not mention nameservice(s)/namenode(s) hosted in the other cluster. (And, by the way, the nameservices in the two clusters should have different names -- "fooSite1" and "fooSite2" -- even if they are intended to be mirrors of each other.) Your client hosts should be separate from your cluster nodes, and the two-cluster ViewFS Client configuration you use for invoking `distcp` should be different from the single-cluster ViewFS Client configurations you use for invoking `balancer` on each individual cluster, if you want to be able to trigger rebalancing separately. Make sense?
To get a clear understanding of these features (HA, Federation, and Client ViewFS), and how they relate to each other, I encourage you to read the authoritative sources at:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/Federation.html
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ViewFs.html Hope this helps.
-------------------
[1] He doesn't seem to include the final command which is the whole point of his blog: That after configuring the Client for a ViewFS view of the two clusters, you can now give the `distcp` command in the form:
hadoop distcp viewfs://serviceId1/path viewfs://serviceId2/path
and not have to worry about which namenode is active in each cluster.
... View more
12-15-2016
10:40 PM
@Arsalan Siddiqi , Sorry you're having difficulties. It's pretty hard to debug access problems remotely, as you know, but let's see what we can do. Let me clarify a couple things.
Is it correct that you are able to use:
ssh root@127.0.0.1 -p 2222
to connect over ssh in Putty, but when you use:
scp -P 2222 mylocalfile.txt root@127.0.0.1:/root/
it rejects the connection? Are you using all the parts in the above statement?
The port: "-P 2222" (capital P, not lower case as with ssh).
The root user id: "root@"
The colon after the IP address: "127.0.0.1:"
The destination directory (/root/ in the example above, but you can use any absolute directory path)
If you've been using "localhost" as in the tutorial, try "127.0.0.1" instead. If you've been cut-and-pasting the command line, try typing it instead. The "-" symbol often doesn't cut-and-paste correctly, because formatted text may use the character "m-dash" or "n-dash" instead of "hyphen". It is much safer to type it than to paste it.
You are using Windows, correct? I'm a little confused about how you're connecting through Putty, as I remember Putty wanting the connection info in a dialogue box before the connection, whereas on a Mac or Linux box, the terminal application just opens a terminal on the box itself, and you then FURTHER connect via typing the ssh command. So, did you actually configure Putty with the "ssh" request, the port number, and the user and host info, in a dialogue box rather than typing an ssh command line? And did that work correctly?
Assuming the answer to the above is "yes" and "yes", the next question is: where are you typing the "scp" command? You can't type it into the Putty connection with the VM; that won't work. The scp command line is meant to be sent to your native box. Does Putty have a file transfer dialogue box that can use the scp protocol? Is that what you're trying to use? Or have you downloaded the "pscp.exe" file from putty.org, and are using that? See http://www.it.cornell.edu/services/managed_servers/howto/file_transfer/fileputty.cfm
The full docs for Putty PSCP are at https://the.earth.li/~sgtatham/putty/0.67/htmldoc/Chapter5.html#pscp-usage and show that pscp also takes a capital "-P" to specify the port number. Worst case, if you can't get any of these working, you've already established the Virtualbox connection for the VM. With some effort you can figure out how to configure a shared folder with your host machine, and use it to pass files back and forth.
... View more
12-14-2016
07:26 PM
2 Kudos
@Arsalan Siddiqi, I'll try to answer the several pieces of your question.
First, I encourage you to go through the Sandbox Tutorial at http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/ It will help you understand a great deal about the Sandbox, and what it is intended to do and how to use it.
The sandbox comes with Ambari and the Stack pre-installed. You shouldn't need to change OS system settings in the Sandbox VM, which is intended to act like an "appliance". Also, the sandbox already has everything Ambari needs to run successfully, over the HTTP Web interface. No need to configure graphic packages on the sandbox VM. Ambari provides a quite nice GUI for a large variety of things you might want to do with an HDP Stack, including viewing and modifying configurations, seeing the health and activity levels of HDP services, stopping and re-starting component services, and even viewing contents of HDFS files.
While you can view the component config files at (in most cases) /etc/<componentname>/conf/* in the sandbox VM's native file system, please DO NOT try to change configurations there. The configs are Ambari-managed, and like any Ambari installation, if you change the files yourself, the ambari-agent will just change them back! Instead, use the Ambari GUI to change any configurations you wish, then press the Save button, and restart the affected services (Ambari will prompt you).
The data files for HDFS are stored as usual in the native filesystem location defined by the HDFS config parameter "dfs.datanode.data.dir". However, it won't do you much good to go look there, because the blockfiles stored there are not readily human-understandable. As you may know, HDFS layers its own filesystem on top of the native file system, storing each replica of each block as a file in a datanode. If you want to check the contents of HDFS directories, you're much better off to use the HDFS file browser, as follows:
In Ambari, select the HDFS service view, and pull down the "Quick Links" menu at the top center. Select "Namenode UI".
In the Namenode UI, pull down the "Utilities" menu at the top right. Select "Browse the file system".
This will take you to the "Browse Directory" UI. You may click thru the directory names at the right edge, or type an HDFS directory path into the text box at the top of the directory listing.
If you click on a file name, you will see info about the blocks of that file (short files only have Block 0), and you may download the file if you want to see the contents.
Note that HDFS files are always immutable. HDFS files may be appended to, but cannot be edited.
For copying files to the Sandbox VM, first make sure you can access the sandbox through 'ssh', as documented near the beginning of the Tutorial under "Explore the Sandbox"; then see http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/#send-data-btwn-sandbox-local-machine . Hope this helps.
... View more
12-14-2016
06:19 PM
One way to see whether any Stack services use hardwired "/user" prefix would be to use Ambari to install the whole Stack on a lab machine. Make a change to "dfs.user.home.dir.prefix", to something other than "/user", during Install Wizard time, BEFORE letting Ambari do the actual installation, thus making sure everything from the beginning sees the non-default value. Let it install, start all the services and let them run a few minutes, then see if anything got created under /user/* in HDFS. Sorry I don't have time to do the experiment right now, but if you do please report the results back here as a comment for others to learn from. If you find services that apparently hardwire the "/user" prefix, I'll enter bugs against those components and try to get them fixed.
... View more
12-13-2016
06:48 PM
As usual, parameter changes only affect things going forward. If users have already been created in the default location, their home directories will not be magically re-created in the new location. This could cause problems, depending on whether processes use the parameter vs using hardwired "/user" prefix. For your site-defined users, you can just move their home directories with dfs commands. For the pre-defined Stack service users, you'll need to experiment to see whether they want their home directories to stay in /user or be moved to value(dfs.user.home.dir.prefix). I would start by leaving them in place, but that's just a guess.
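For the site-defined users, a minimal, hypothetical sketch of that move, scripted in Python from any cluster node with the hdfs CLI (the new prefix and the user list are placeholders):

import subprocess

new_prefix = '/home'            # whatever you set dfs.user.home.dir.prefix to
for user in ['alice', 'bob']:   # your site-defined users, not the Stack service users
    subprocess.check_call(['hdfs', 'dfs', '-mv', '/user/' + user, new_prefix + '/' + user])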
... View more
12-13-2016
06:39 PM
2 Kudos
@Sean Roberts As noted above, this is controlled by the 'dfs.user.home.dir.prefix' parameter in hdfs-site.xml. However, since this is not a commonly changed parameter, it isn't in the default Ambari configs for HDFS. (It just defaults to the value from hdfs-default.xml.) To change this value in Ambari, do the following:
1. In Ambari, select the HDFS service, then select the "Configs" tab.
2. Within Configs, select the "Advanced" tab.
3. Open the "Advanced hdfs-site" section, and confirm that this parameter isn't already present there.
4. Open the "Custom hdfs-site" section, and click the "Add property" link.
5. A dialogue pops up, inviting you to type a "key=value" pair in the text field. Enter: dfs.user.home.dir.prefix=/user
6. Press the "Add" button, and the new entry will be converted into a standard parameter entry field. Now change the value of the field to whatever you want (no blank spaces in the path, please).
Of course after changing configurations, you have to press the "Save" button at the top of the window. That should do what you need, I hope.
... View more
06-24-2016
05:47 PM
1 Kudo
Hi @Yibing Liu When you created your local repo, did you follow the instructions at http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_using_a_local_repository.html ?
Did you use tarballs, or reposync?
Did you provide an HTTP server able to serve the repo contents?
Finally, did you adjust the "baseurl" configurations in your ambari.repo, HDP.repo, and HDP-UTILS.repo files to point to the local repo? (And the "gpgkey" configuration in ambari.repo, unless you've turned it off.)
I don't really understand your statement "i also updated the version/stack to make my ambari connect to my local repo". Adjusting the .repo baseurls should be all that is needed. Is this a fresh install, or are you adding a new host to an existing cluster?
Just to eliminate another common source of error, what version of Python do you have installed on the server, and is it in the PATH for the userid performing the install?
... View more
06-23-2016
08:35 PM
@sprakash
The fact that distcp works with some configurations indicates you probably have Security set up right, as well as giving you an obvious work-around. To try to answer your question, please provide some clarifying information:
When you speak of mapred-client.xml, do you mean mapred-site.xml on the client machine?
When you speak of changing the framework, do you mean the "mapreduce.framework.name" configuration parameter in mapred-site.xml? Do you change it only on the client machine, or throughout both clusters?
The allowed values of that parameter are "local", "classic", and "yarn". When you change it to not be "yarn", what do you set it to?
Do you have "mapreduce.application.framework.path" set? If so, to what value?
... View more
04-19-2016
07:04 PM
2 Kudos
As Benjamin said, I strongly encourage you to establish your process with a small test cluster first. However, I do not expect problems with the data. Hadoop is written in Java, so the form of the data should be the same between operating systems, especially all Linux variants.
Warning: Do not upgrade both the operating system and the HDP version all at once! Change one major variable at a time, and make sure the system is stable in between. So go ahead and change the OS, but keep the HDP version the same until you are done and satisfied with the state of the new OS.
The biggest potential gotcha is if you experience ClusterID mismatch as a result of your backup and restore process. If you are backing up the data by distcp-ing it between clusters, then this won't be an issue; the namespaceID/clusterID/blockpoolID probably will change, but it won't matter since distcp actually creates new files. But if you are trying to use traditional file-based backup and restore, from tape or a SAN, then you may experience this: after you think you've fully restored, and you try to start up HDFS, it will tell you that you need to format the file system, or the HDFS file system may simply appear empty despite the files all being back in place. If this happens, "ClusterID mismatch" is the first thing to check, starting with http://hortonworks.com/blog/hdfs-metadata-directories-explained/ for background. I won't say more, because you probably won't have the problem and it would be confusing to talk about in the abstract.
... View more
04-14-2016
07:07 PM
3 Kudos
And just a few more relevant facts:
1. The write pipeline for replication is parallelized in chunks, so the time to write an HDFS block with 3x replication is NOT 3x (write time on one datanode), but rather 1x (write time on one datanode) + 2x (delta), where "delta" is approximately the time to transmit and write one chunk. Where a block is 128 or 256 MB, a chunk is something like 64KB if I recall correctly. If your network between datanodes is at least 1Gbps, then the time for delta is dominated by the disk write speed.
2. The last block of an HDFS file is typically a "short" block, since files aren't exact multiples of 128MB. HDFS only takes up as much of the native file system storage as needed (quantized by the native file system block size, typically 8KB), and does NOT take up the full 128MB block size for the final block of each file.
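To put rough numbers on fact #1 above, here is a tiny Python back-of-envelope calculation; the block size, chunk size, and disk write speed are illustrative assumptions, not measurements:

block_mb = 128.0
chunk_kb = 64.0
disk_mb_per_s = 100.0                            # assumed sequential write speed

one_dn_write = block_mb / disk_mb_per_s          # ~1.28 s to write the block on one datanode
delta = (chunk_kb / 1024.0) / disk_mb_per_s      # ~0.6 ms to write one chunk

total = one_dn_write + 2 * delta                 # pipelined 3x replication
print(round(total, 4))                           # ~1.28 s, far less than 3 * 1.28 s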
... View more
04-13-2016
07:12 PM
4 Kudos
Java 6, 7, and 8 were quite FORWARD-compatible, meaning that programs that ran in earlier versions generally would run successfully in later versions. At the same time, with Enterprise software one cannot just assume such compatibility; one must test and certify. On the other hand, each new Java version has added language features that are not BACKWARD-compatible, meaning that a program that uses new Java 8 language features will not be able to run under Java 7. Thus it would require Java 8.
To date, Hadoop has only required Java 7. I looked at the Apache docs, and I think the work being done in branch-2.8 is only to make sure it is COMPATIBLE with Java 8, not REQUIRING Java 8. At some point in the future, however, there will be a branch designated to use more efficient new Java 8 language constructs, and therefore require Java 8 to be installed on the server. Once the Hadoop community accepts the requirement for Java 8, from that version forward Hadoop will no longer run successfully in Java 7. In the meantime they are making sure that from hadoop-2.8 forward, it is at least assured of being compatible with Java 8, for users who prefer that.
Looking back over the "JDK Requirements" in the install documentation for various versions of HDP (HDP-2.1, HDP-2.2, HDP-2.3, HDP-2.4), we see that Java 6 started to be deprecated in HDP-2.1, remained usable with HDP-2.2, but was not compatible with HDP-2.3. That's because in HDP-2.3 we started using versions of the Hadoop stack components that utilized Java 7 language features and therefore were no longer compatible with Java 6. Starting with HDP-2.3 we continued supporting Java 7, but also started certifying with Java 8. No Java 8 language features were used (so that we could still support Java 7), but we tested with both Java 7 and 8 and certified that HDP-2.3 and HDP-2.4 worked with them both. That is the situation so far today.
... View more
04-05-2016
06:46 PM
@Sunile Manjee Can you please clarify what you mean by "full restore" in this case? A truly full restore of HDFS would restore to a cross-cluster consistent Point-in-Time (PiT) snapshot for all datanodes AND namenode. Therefore the namenode's idea of existing blocks would match the blocks existing on the datanodes after the restore. If the PiT snapshot was not perfect (for instance, if some HDFS files were in the midst of being written during the process of taking the snapshot), then you might need to do some cleanup of the blocks for those affected files. But not via a reformat. In general, a reformat of HDFS erases everything in it, just as with reformat of any file system. So it seems unlikely this is what you want. If you really do want a reformat, then there is no need to do a restore first; it is all going away anyway. How was your snapshot taken, and how are you doing the full restore?
... View more
03-30-2016
04:48 PM
Hi @Sotiris Madamas , if Mats's answer met your needs, please mark it accepted.
... View more
03-23-2016
06:33 PM
9 Kudos
Introduction
Motivation
In many production clusters, the servers (or VMs) have multiple physical Network Interface Cards (NICs) and/or multiple Virtual Network Interfaces (VIFs) and are configured to participate in multiple IP address spaces (“networks”). This may be done for a variety of reasons, including but not limited to the following:
to improve availability or bandwidth through redundancy
to partition traffic for security, bandwidth, or other management reasons
etc.
In many of these “multi-homed” clusters it is desirable to configure the various Hadoop-related services to “listen” on all the network interfaces rather than on just one. Clients, on the other hand, must be given a specific target to which they should initiate connections.
Note that NIC Bonding (also known as NIC Teaming or Link Aggregation), where two or more physical NICs are made to look like a single network interface, is not included in this topic. The settings discussed below are usually not applicable to a NIC bonding configuration which handles multiplexing and failover transparently while presenting only one “logical network” to applications. However, there is a hybrid case where multiple physical NICs are bonded for redundancy and bandwidth, then multiple VIFs are layered on top of them for partitioning of data flows. The multi-homing parameters then relate to the VIFs.
This Document
The purpose of this document is to gather in one place all the many parameters for all the different Hadoop-related components, related to configuring multi-homing support for services and clients. In general, to support multi-homing on the services you care about, you must set these parameters in those components and any components upon which they depend. Almost all components depend on Hadoop Core, HDFS and Yarn, so these are given first, along with Security related parameters.
This document assumes you have HDP version 2.3 or later. Most but not all of the features are available in 2.1 and 2.2 also.
Vocabulary
Server and service are carefully distinguished in this document. Servers are host machines or VMs. Services are Hadoop component processes running on servers.
Client: a process that initiates communication with a service. Client processes may be on a host outside of the Hadoop cluster, or on a server that is part of the Hadoop cluster. Sometimes Hadoop services are clients of other Hadoop services.
DNS and rDNS lookup: “Domain Name System” - The service that converts hostnames to IP addresses for clients wishing to communicate with said hostname; this is often referred to as “DNS lookup”. If properly configured, the service can also provide “reverse-DNS” (rDNS) lookup, which converts IP addresses back into hostnames. It is common for different networks to have different DNS Servers, and the same DNS or rDNS query to different name servers may get different responses. This is used for Hadoop Multi-Homing to provide a level of indirection associating different IP addresses assigned to a multi-homed server for clients on different networks.
Round-trip-able DNS: Within a single network’s DNS service, if one starts with a valid IP address and performs an rDNS lookup to get a hostname, then DNS lookup on that hostname should return the original IP address; or alternatively, if one starts with a valid hostname and performs a DNS lookup on it to get an IP address, then rDNS lookup on that IP address should return the original hostname.
FQDN: “Fully Qualified Domain Name” - the long form of hostnames, of the form “shortname.domainname”, eg, “host234.lab1.hortonworks.com”. This contrasts with the “short form” hostname, which in this example would be just “host234”.
Use Cases
To understand the semantics of configuration parameters for multi-homing, it is useful to consider a few practical use cases for why it is desirable to have multi-homed servers and how Hadoop can utilize them. Here are four use cases that cover most of the concepts, although this is not an exhaustive list:
intra-cluster network separate from external client-server network
dedicated high-bandwidth network separate from general-use network
dedicated high-security or Hadoop administrator network separate from medium-security general-use network
sysop/non-Hadoop network separate from Hadoop network(s)
The independent variables in each use case are:
which network interfaces are available on servers and client hosts, within the cluster and external to the cluster; and
which interfaces are preferred for service/service and client/service communications for each service.
<Use cases to be expanded in a later version of this document.>
Theory of Operations
In each use case, the site architects will determine which network interfaces are available on servers and client hosts, and which interfaces are preferred for service/service and client/service communications of various types. We must then provide configuration information that specifies for the cluster:
How clients will find the server hostname of each service
Whether each service will bind to one or all network interfaces on its server
How services will self-identify their server to Kerberos
Over time, the Hadoop community evolved an understanding of the configuration practices that make these multi-homing use cases manageable:
Configuration files should be shared among clients and services, as much as possible. This is highly desirable to minimize management overhead, and the desire to share common config files has guided a lot of the design choices for multi-homing support in Hadoop.
Names are important. In multi-homed cluster configuration parameters, servers should always be identified or “addressed” by hostname, not IP address. [1] This enables the next several bullet points.
FQDNs are strongly preferred. It is recommended that all hostname strings be represented as Fully Qualified Domain Names, unless short hostnames are used throughout with rigorous consistency, for both clients and servers.
Each network should have DNS service available, with forward and reverse name lookup implemented “round-trip-able” for all Hadoop servers and client hosts expected to use that network. It is common and allowable for the same hostname to resolve to various of the host’s multiple IP addresses, depending on which DNS service (on which network) is used.
DNS services are a level of indirection for deriving both IP address and hostname info from the parameters specified in shared configuration files or from advertised values, to resolve to different networks for different querying processes. This allows, for example, external clients to find and talk to Datanodes on an external (client/server) network, while Datanodes find and talk to each other on an internal (intra-cluster) network.
Each server should have one consistent hostname on all interfaces. Theoretically, DNS allows a server to have a different hostname per network interface. But it is required [2] for multi-homed Hadoop clusters that each server have only one hostname, consistent among all network interfaces used by Hadoop. (Network interfaces excluded from use by Hadoop may allow other hostnames.)
If /etc/hosts files are used in lieu of DNS, then the above constraint to have the same hostname on all interfaces means that each /etc/hosts file can only specify one interface per server. Thus, clients at least are likely to have different /etc/hosts files than servers, and the allowable choices for use of the different networks may be constrained. If these constraints are not acceptable, provide DNS services.
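As a quick sanity check of the round-trip DNS requirement above, here is a small Python sketch; the hostname is a placeholder, and you would run it from each host (and on each network) whose view of DNS matters:

import socket

hostname = 'host234.lab1.hortonworks.com'     # FQDN to verify (placeholder)
ip = socket.gethostbyname(hostname)           # forward DNS lookup
rdns_name, _, _ = socket.gethostbyaddr(ip)    # reverse (rDNS) lookup
print(hostname, '->', ip, '->', rdns_name)
print('round-trip OK' if rdns_name == hostname else 'round-trip MISMATCH')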
Prior to 2014 there were a variety of parameters proposed for specific issues related to multi-homing. These were incremental band-aids with inconsistent naming and incomplete semantics. In mid-2014 (for hadoop-2.6.0) the Hadoop Core community resolved to address Multi-Homing completely, with consistent semantics and usages. Other components also evolved multi-homing support, with a variety of parameter name and semantics choices.
In general, for every service one must consider three sets of properties, related to the three key questions at the beginning of this section:
1. “Hostname Properties” relate to either specifying or deriving the server hostname by which clients will find each service, and the appropriate network for accessing it.
In some services (eg, HDFS Namenode and YARN ResourceManager) there may just be a parameter in a shared config file that hard-wires a specified server name. This approach is only practical for singleton services like masters, otherwise the config files would have to be edited differently on every node, instead of shared.
Many services (eg, Datanode, HBase services) use generic parameters or other means to perform a self-name-lookup via reverse-DNS, followed by registration (with Master or ZK) where the hostname gets advertised to clients.
Sometimes the hostname is neither specified nor advertised. Clients just need to know somehow, external to the config system, "based on human interaction".
Each client does a DNS lookup on the server hostname to get the IP address (and the associated network) on which to communicate with that service. So this client aspect of the network architecture decisions is expressed in the DNS server configuration rather than the Hadoop configuration info.
2. “Bind Address Properties” provides optional additional addressing information which is used only by the service itself, and only in multi-homed servers. By default, services will bind only to the default network interface [3] on each server. The Bind Address Properties can cause the service to bind to a non-default network interface if desired, or most commonly the bind address can be set to “0.0.0.0”, causing the service to bind to all available network interfaces on its server.
The bind address usually takes the form of an IP address rather than a network interface name, to be type-compatible with the conventional “0.0.0.0” notation. However, some services allow specifying a network interface name.
The default interface choice, if bind address is unspecified, varies for different services. Choices include:
the default network interface
all interfaces
some other specific interface defined by other parameters.
3. “Kerberos Host Properties” provides the information needed to resolve the “_HOST” substitution in each service’s own host-dependent Kerberos principal name. This is separate from the “Hostname Property” because not all services advertise their hostname to clients, and/or may need to use a different means to determine hostname for security purposes.
Usually the Kerberos Host substitution name is derived from another self-name-lookup, similar to hostname from #1, but possibly on a different DNS service.
Sometimes the hostname info from #1 is used for Kerberos also.
Absent other information, the Kerberos Host will be assigned the default hostname from the default network interface, obtained from the Java call, InetAddress.getLocalHost().getCanonicalHostName()
Different services evolved different parameters, or combinations of parameters, and parameter naming conventions, but all services need to somehow present the above three property sets.
The “listener” port number for each service must also be managed, but this property does not change in single- vs multi-homed clusters. So we will not discuss it further here, except to mention that in HDFS and YARN it is conjoined with the “address” parameters (Hostname Property), while in other services such as HBase, it is in separate “port” parameters.
Parameters
HDFS Service Parameters
For historical reasons, in HDFS and YARN most of the parameters related to Hostname Properties end in “-address” or “.address”. Indeed, in non-multi-homed clusters it is acceptable and was once common to simply use IP addresses rather than hostname strings. In multi-homed clusters, however, all server addressing must be hostname-based. In HDFS and YARN most of the parameters related to Bind Address Properties end in “-bind-host” or “.bind-host”.
The four Namenode services each have a
dfs.namenode.${service}-address parameter, where ${service} is rpc, servicerpc, http, or https. The Namenode server’s hostname should be assigned here, to establish the Hostname Property for each service. This information is provided to clients via shared config files.
For each, there is an optional corresponding
dfs.namenode.${service}-bind-host parameter to establish the Bind Address Properties. It may be set to the server’s IP address on a particular interface, or to “0.0.0.0” to cause the Namenode services to listen on all available network interfaces. If unspecified, the default interface will be used. These parameters are documented at: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
dfs.namenode.rpc-bind-host -- controls binding for dfs.namenode.rpc-address
dfs.namenode.servicerpc-bind-host -- controls binding for dfs.namenode.servicerpc-address
dfs.namenode.http-bind-host -- controls binding for dfs.namenode.http-address
dfs.namenode.https-bind-host -- controls binding for dfs.namenode.https-address
The Datanodes also present various services; however, their Hostname Properties are established via self-name-lookup as follows:
If present, the hadoop.security.dns.interface and hadoop.security.dns.nameserver parameters are used in the same way as documented in the Security Parameters section below, to determine the Datanode’s hostname. (If the rDNS lookup fails, it will fall back to /etc/hosts.) These two parameters are relatively new, available in Apache Hadoop 2.7.2 and later, and back-ported to HDP-2.2.9 and later, and HDP-2.3.4 and later.
Otherwise, if the dfs.datanode.dns.interface and dfs.datanode.dns.nameserver parameters are present, they are used in the same way to determine the Datanode’s hostname. (If the rDNS lookup fails there, it will not fall back to /etc/hosts.) This pair of parameters is (or will shortly be) deprecated, but many field installations currently use them.
If neither set of parameters is present, then the Datanode calls Java InetAddress.getLocalHost().getCanonicalHostName() to determine its hostname.
The Datanode registers its hostname with the Namenode, and the Namenode also captures the Datanode’s IP address from which it registers. (This will be an IP address on the network interface the Datanode determined it should use to communicate with the Namenode.) The Namenode then provides both hostname and IP address to clients, for particular Datanodes, whenever a client operation is needed. Which one the client uses is determined by the dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname parameter values; see below.
Datanodes also have a set of dfs.datanode.${service}-address parameters. However, confusingly, these relate only to the Bind Address Property, not the Hostname Property, for Datanodes. They default to “0.0.0.0”, and it is okay to leave them that way unless it is desired to bind to only a single network interface.
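As a sketch, binding the Datanode services to a single interface might look like the following, assuming a hypothetical interface IP of 10.0.1.15. The concrete parameter names are dfs.datanode.address, dfs.datanode.http.address, and dfs.datanode.ipc.address; the ports shown are the usual Hadoop 2.x defaults and should be adjusted to match your cluster.
<!-- hdfs-site.xml on the Datanode: bind only to the 10.0.1.15 interface -->
<property>
  <name>dfs.datanode.address</name>
  <value>10.0.1.15:50010</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>10.0.1.15:50075</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>10.0.1.15:50020</value>
</property>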
Both Namenodes and Datanodes resolve their Kerberos _HOST bindings by using the hadoop.security.dns.interface and hadoop.security.dns.nameserver parameters, rather than their various *-address parameters. See below under “Security Parameters” for documentation.
Clients probably need to use Hostnames when connecting to Datanodes
By default, HDFS clients connect to Datanodes using an IP address provided by the Namenode. In a multi-homed network, depending on the network configuration, the Datanode IP address known to the Namenode may be unreachable by the clients. The fix is to let clients perform their own DNS resolution of the Datanode hostname. The following setting enables this behavior. The reason this is optional is to allow skipping DNS resolution for clients on networks where the feature is not required (and DNS might not be a reliable/low-latency service).
dfs.client.use.datanode.hostname should be set to true when this behavior is desired.
Datanodes may need to use Hostnames when connecting to other Datanodes
More rarely, the Namenode-resolved IP address for a Datanode may be unreachable from other Datanodes. The fix is to force Datanodes to perform their own DNS resolution for inter-Datanode connections. The following setting enables this behavior.
dfs.datanode.use.datanode.hostname should be set to true when this behavior is desired.
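A minimal sketch of both settings in hdfs-site.xml; enable only the one(s) your network topology actually requires:
<!-- Clients resolve Datanode hostnames themselves -->
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
<!-- Datanodes resolve other Datanodes' hostnames themselves -->
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>true</value>
</property>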
Client interface specification
Usually only one network is used for external clients to connect to HDFS servers. However, if the client can “see” the HDFS servers on more than one interface, you can control which interface or interfaces are used for HDFS communication by setting this parameter:
<property>
<name>dfs.client.local.interfaces</name>
<description> A comma separated list of network interface names to use for data transfer
between the client and datanodes. When creating a connection to read from or
write to a datanode, the client chooses one of the specified interfaces at random
and binds its socket to the IP of that interface. Individual names may be
specified as either an interface name (eg "eth0"), a subinterface name (eg
"eth0:0"), or an IP address (which may be specified using CIDR notation to match
a range of IPs).
</description>
</property>
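For example (a sketch, with a hypothetical interface name), a client could be pinned to a single interface like this; a subinterface name or a CIDR range such as 10.0.1.0/24 would work the same way:
<property>
  <name>dfs.client.local.interfaces</name>
  <value>eth1</value>
</property>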
Yarn & MapReduce Service Parameters
The yarn and job history “*.address” parameters specify hostname to be used for each service and sub-service. This information is provided to clients via shared config files.
The optional yarn and job history “*.bind-host” parameters control bind address for sets of related sub-services, as follows; a combined example appears after the list. Setting them to 0.0.0.0 causes the corresponding services to listen on all available network interfaces.
yarn.resourcemanager.bind-host -- controls binding for:
yarn.resourcemanager.address
yarn.resourcemanager.webapp.address
yarn.resourcemanager.webapp.https.address
yarn.resourcemanager.resource-tracker.address
yarn.resourcemanager.scheduler.address
yarn.resourcemanager.admin.address
yarn.nodemanager.bind-host -- controls binding for:
yarn.nodemanager.address
yarn.nodemanager.webapp.address
yarn.nodemanager.webapp.https.address
yarn.nodemanager.localizer.address
yarn.timeline-service.bind-host -- controls binding for:
yarn.timeline-service.address
yarn.timeline-service.webapp.address
yarn.timeline-service.webapp.https.address
mapreduce.jobhistory.bind-host -- controls binding for:
mapreduce.jobhistory.address
mapreduce.jobhistory.admin.address
mapreduce.jobhistory.webapp.address
mapreduce.jobhistory.webapp.https.address
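Putting this together, a minimal sketch that makes all of these services listen on every interface would be the following, split across yarn-site.xml and mapred-site.xml:
<!-- yarn-site.xml -->
<property>
  <name>yarn.resourcemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.nodemanager.bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>yarn.timeline-service.bind-host</name>
  <value>0.0.0.0</value>
</property>
<!-- mapred-site.xml -->
<property>
  <name>mapreduce.jobhistory.bind-host</name>
  <value>0.0.0.0</value>
</property>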
Yarn and job history services resolve their Kerberos _HOST bindings by using the hadoop.security.dns.interface and hadoop.security.dns.nameserver parameters, rather than their various *-address parameters. See below under “Security Parameters” for documentation.
Security Parameters
Parameters for _HOST resolution
Each secure host needs to be able to perform _HOST resolution within its own parameter values, to correctly manage Kerberos principals. For this it must determine its own name. By default it will use the canonical name, which is associated with its first network interface. If it is desired to use the hostname associated with a different interface, use these two parameters in core-site.xml.
Core Hadoop services perform a self-name-lookup using these parameters. Many other Hadoop-related services define their own means of determining the Kerberos host substitution name.
These two parameters are relatively new, available in Apache Hadoop 2.7.2 and later, and back-ported to HDP-2.2.9 and later, and HDP-2.3.4 and later. In previous versions of HDFS, the dfs.datanode.dns.{interface,nameserver} parameter pair was used.
<property>
<name>hadoop.security.dns.interface</name>
<description>
The name of the Network Interface from which the service should determine
its host name for Kerberos login. e.g. “eth2”. In a multi-homed environment,
the setting can be used to affect the _HOST substitution in the service
Kerberos principal. If this configuration value is not set, the service
will use its default hostname as returned by
InetAddress.getLocalHost().getCanonicalHostName().
Most clusters will not require this setting.
</description>
</property>
<property>
<name>hadoop.security.dns.nameserver</name>
<description>
The hostname or IP address of the DNS name server which a service Node
should use to determine its own hostname for Kerberos Login, via a
reverse-DNS query on its own IP address associated with the interface
specified in hadoop.security.dns.interface.
Requires hadoop.security.dns.interface.
Most clusters will not require this setting.
</description>
</property>
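As a sketch, a host whose Kerberos principal should be derived from the hostname on a second interface might set the following in core-site.xml; the interface name “eth2” and the nameserver address 10.0.2.5 are hypothetical and must match your environment:
<property>
  <name>hadoop.security.dns.interface</name>
  <value>eth2</value>
</property>
<property>
  <name>hadoop.security.dns.nameserver</name>
  <value>10.0.2.5</value>
</property>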
Parameters for Security Token service host resolution
In secure multi-homed environments, the following parameter will need to be set to false (it is true by default) on both cluster servers and clients (see HADOOP-7733), in core-site.xml. If it is not set correctly, the symptom will be inability to submit an application to YARN from an external client (with error “client host not a member of the Hadoop cluster”), or even from an in-cluster client if server failover occurs.
<property>
<name>hadoop.security.token.service.use_ip</name>
<description>
Controls whether tokens always use IP addresses. DNS changes will not
be detected if this option is enabled. Existing client connections that
break will always reconnect to the IP of the original host. New clients
will connect to the host's new IP but fail to locate a token.
Disabling this option will allow existing and new clients to detect
an IP change and continue to locate the new host's token.
Defaults to true, for efficiency in single-homed systems.
If setting to false, be sure to change in both clients and servers.
Most clusters will not require this setting.
</description>
</property>
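A sketch of the setting as it would appear in core-site.xml, on both servers and clients:
<property>
  <name>hadoop.security.token.service.use_ip</name>
  <value>false</value>
</property>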
Secure Hostname Lookup when /etc/hosts is used instead of DNS service
To prevent a security loophole described in http://ietf.org/rfc/rfc1535.txt, the Hadoop Kerberos libraries in secure environments with hadoop.security.token.service.use_ip set to false will sometimes do a DNS lookup on a special form of the hostname FQDN that has a “.” appended. For instance, instead of using the FQDN “host1.lab.mycompany.com” it will use the form “host1.lab.mycompany.com.” [sic]. This is called the “absolute rooted” form of the hostname, sometimes written FQDN+".". For an explanation of the need for this form, see the above-referenced RFC 1535.
If DNS services are set up as required for multi-homed Hadoop clusters, no special additional setup is needed for this concern, because all modern DNS servers already understand the absolute rooted form of hostnames. However, /etc/hosts processing of name lookup does not. If DNS is not available and /etc/hosts is being used in its place, then the following additional setup is required:
In the /etc/hosts file of all cluster servers and client hosts, for the entries related to all hostnames of cluster servers, add an additional alias consisting of the FQDN+".". For example, a line in /etc/hosts that would normally look like
39.0.64.1 selena1.labs.mycompany.com selena1 hadoopvm8-1
should be extended to look like
39.0.64.1 selena1.labs.mycompany.com selena1.labs.mycompany.com. selena1 hadoopvm8-1
If the absolute rooted hostname is not included in /etc/hosts when required, the symptom will be an “Unknown Host” error (see https://wiki.apache.org/hadoop/UnknownHost#UsingHostsFiles).
HBase
The documentation of multi-homing related parameters in the HBase docs is largely lacking, and in some cases incorrect. The following is based on the most up-to-date information available as of January 2016.
Credit: Josh Elser worked out the correct information for this document by studying the source code. Thanks, Josh!
Service Parameters
Both HMaster and RegionServers advertise their hostname with the cluster’s Zookeeper (ZK) nodes. These ZK nodes are found using the hbase.zookeeper.quorum parameter value. The ZK server names found here are looked up via DNS on the default interface, by all HBase ZK clients. (Note the DNS server on the default interface may return an address for the ZK node that is on some other network interface.)
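For example (with hypothetical ZK hostnames), in hbase-site.xml:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>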
The HMaster determines its IP address on the interface specified in the hbase.master.dns.interface parameter, then does a reverse-DNS query against the DNS server specified by hbase.master.dns.nameserver. However, it does not use the resulting hostname directly. Instead, it does a forward-DNS lookup of this name on the default (or bind address) network interface, followed by a reverse-DNS lookup of the resulting IP address on the same interface’s DNS server. The resulting hostname is what the HMaster actually advertises with Zookeeper.
This sequence may be an attempt to correctly handle environments that don’t adhere to the “one consistent hostname on all interfaces” rule.
Regardless, if the “one consistent hostname on all interfaces” rule has been followed, it should be easy to predict the resulting Hostname Property value.
Each RegionServer obtains the HMaster’s hostname from Zookeeper, and looks it up with DNS on the default network. It then registers itself with the HMaster. When it does so, the HMaster captures the IP address of the RegionServer, and does a reverse-DNS lookup on the default (or HMaster bind address) interface to obtain a hostname for the RegionServer. It then delivers this hostname back to the RegionServer, which in turn advertises it on Zookeeper.
Thus, no config parameters are relevant to the RegionServer’s Hostname Property.
Again, it is important that each RegionServer’s hostname be the same on all interfaces if different clients are to use different interfaces, because the HMaster looks it up on only one.
Regardless of the above determination of hostname, the Bind Address Property may change the IP address binding of each service, to a non-default interface, or to all interfaces (0.0.0.0). The parameters for the Bind Address Property are as follows, for each service; a combined example follows the list.
HMaster RPC -- hbase.master.ipc.address
HMaster WebUI -- hbase.master.info.bindAddress
RegionServer RPC -- hbase.regionserver.ipc.address
RegionServer WebUI -- hbase.regionserver.info.bindAddress
REST RPC -- hbase.rest.host
REST WebUI -- hbase.rest.info.bindAddress
Thrift1 RPC -- hbase.regionserver.thrift.ipaddress
Thrift2 RPC -- hbase.thrift.info.bindAddress
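As a sketch, using the parameters above, binding the HMaster and RegionServer RPC and WebUI services to all interfaces in hbase-site.xml might look like this; binding to a single interface would instead use that interface’s IP address:
<property>
  <name>hbase.master.ipc.address</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hbase.master.info.bindAddress</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hbase.regionserver.ipc.address</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>hbase.regionserver.info.bindAddress</name>
  <value>0.0.0.0</value>
</property>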
Notes:
The WebUI services do not advertise their hostnames separately from the RPC services.
The REST and Thrift services do not advertise their hostnames at all; their clients must simply know “from human interaction” where to find the corresponding servers.
Finally, services and clients use a dns.{interface,nameserver} parameter pair to determine their Kerberos Host substitution names, as follows; a sketch follows the list. See hadoop.security.dns.interface and hadoop.security.dns.nameserver in the “Security Parameters” section for documentation of the semantics of these parameter pairs.
HMaster RPC -- hbase.master.dns.{interface,nameserver}
HMaster WebUI -- N/A
RegionServer RPC -- hbase.regionserver.dns.{interface,nameserver}
RegionServer WebUI -- N/A
REST RPC -- N/A
REST WebUI -- N/A
Thrift1 RPC -- hbase.thrift.dns.{interface,nameserver}
Thrift2 RPC -- hbase.thrift.dns.{interface,nameserver}
HBase Client -- hbase.client.dns.{interface,nameserver}
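As a sketch, pinning the HMaster’s and RegionServers’ Kerberos host resolution to a hypothetical “eth2” interface and nameserver 10.0.2.5 would look like this in hbase-site.xml:
<property>
  <name>hbase.master.dns.interface</name>
  <value>eth2</value>
</property>
<property>
  <name>hbase.master.dns.nameserver</name>
  <value>10.0.2.5</value>
</property>
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth2</value>
</property>
<property>
  <name>hbase.regionserver.dns.nameserver</name>
  <value>10.0.2.5</value>
</property>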
Lastly, there is a bit of complexity with ZK node names, if the HBase-embedded ZK quorum is used. When HBase is instantiating an embedded set of ZK nodes, it is necessary to correlate the node hostnames with the integer indices in the “zoo.cfg” configuration. As part of this, ZK has an hbase.zookeeper.dns.{interface,nameserver} parameter pair. Each ZK node uses these in the usual way to determine its own hostname, which is then fed into the embedded ZK configuration process.
Superseded Parameters
Superseded by
hadoop.security.dns.interface and hadoop.security.dns.nameserver, in Apache Hadoop 2.7.2 and later (back-ported to HDP-2.2.9 and later, and HDP-2.3.4 and later):
dfs.datanode.dns.interface
dfs.datanode.dns.nameserver
Superseded by Namenode HA:
dfs.namenode.secondary.http-address and https-address
only used in non-HA systems that have two NN masters yet have not configured HA [4]
Superseded HDFS/YARN/MapReduce parameters used by older systems:
yarn.nodemanager.webapp.bind-host: "0.0.0.0"
yarn.nodemanager.webapp.https.bind-host: "0.0.0.0"
just use yarn.nodemanager.bind-host: "0.0.0.0"
yarn.resourcemanager.admin.bind-host: "0.0.0.0"
yarn.resourcemanager.resource-tracker.bind-host: "0.0.0.0"
yarn.resourcemanager.scheduler.bind-host: "0.0.0.0"
just use yarn.resourcemanager.bind-host: "0.0.0.0"
yarn.timeline-service.webapp.bind-host: "0.0.0.0"
yarn.timeline-service.webapp.https.bind-host: "0.0.0.0"
just use yarn.timeline-service.bind-host: "0.0.0.0"
mapreduce.jobhistory.admin.bind-host: "0.0.0.0"
mapreduce.jobhistory.webapp.bind-host: "0.0.0.0"
mapreduce.jobhistory.webapp.https.bind-host: "0.0.0.0"
just use mapreduce.jobhistory.bind-host: "0.0.0.0"
Older historical items superseded:
dfs.namenode.master.name
predecessor of namenode bind-host parameters, and/or rule of “one consistent hostname on all interfaces”.
dfs.datanode.hostname
predecessor of “one consistent hostname on all interfaces” rule.
Note
Regarding the blog at:
http://hortonworks.com/blog/multihoming-on-hadoop-yarn-clusters/
In the “Taking a Test Drive” section of this blog, it suggests using EC2 with Public and Private networks. We now know this doesn’t work, since by default in AWS, Public addresses are just NAT’ed to Private ones. Instead, one must use the VPC capability with two Private networks. That blog post should be corrected accordingly.
[1] Unless you really, really know what you’re doing, and want to work through every little nuance of the semantics of all the parameters. About the only situation in which you might get away with using IP addresses for server addressing is if ALL services are being bound to specific single interfaces, not using 0.0.0.0 at all.
[2] Having one consistent hostname on all interfaces is not a strict technical requirement of Hadoop. But it is a major simplifying assumption, without which the Hadoop administrator is likely to land in a morass of very complex and mutually conflicting parameter choices. For this reason, Hortonworks only supports using one consistent hostname per server on all interfaces used by Hadoop.
[3] In non-virtual Linux systems, the default network interface is typically “eth0” or “eth0:0”, but this conventional naming can be changed in system network configuration or with network virtualization.
[4] It is unclear why resources would be committed to a Secondary Namenode without also setting up HA; however, this is still a supported configuration.