About mfoley

mfoley · ‎03-01-2018

Hi @Zack Riesland , thanks for the clarification. I understand now that s3-dist-cp is used both coming and going, and I agree it doesn't seem to have any renaming capability built in. I strongly suspect there's a way to control the filenames by controlling the Hive partition name, but I'm not a Hive expert and maybe the application benefits from using the same partition name for raw and clean data. Here's what I found that might help: a StackOverflow query that confirms your complaint that CLI rename of HDFS files is slow, and confirms my sense that it shouldn't be: https://stackoverflow.com/questions/14736017/batch-rename-in-hadoopThe second responder (Robert) wrote a 1-page java program that uses the hadoop java api, and shows that it can rename several thousand files in the same 4 seconds that the CLI takes to rename one file. This suggests that the majority of time is taken up by the connection protocol, and the java program can use a single connection for multiple operations. I also looked at options for using Python similarly. It appears that the hdfs3 library would do quite nicely, although you have to separately install the underlying C library opensourced by Pivotal. Full docs for installation and api are given in the linked doc. Please note it probably would not work to use the similarly-named "hdfs" library, because the latter uses the WebHDFS interface, which is unlikely to be faster than the CLI (although I didn't test it). Since I'm a Hortonworks employee, I should point out that the above opensource references are not supported or endorsed by the company, and are used at your own risk. I did look over the Java program and it looks fine to me, but that's my personal opinion, not the company's. Good luck. If that doesn't get you where you need to be, and since your profile popup indicates you are a Hortonworks support customer, you'd be welcome to open a support request for advice from an SE.

mfoley · ‎02-28-2018

@zackriesland , you don't say what application you're using to pull and clean the new data on HDFS. The name pattern "data_day=01-01-2017/000000_0" is under control of that application, and can be configured there. It's not an HDFS thing; HDFS is just a file system, and stores files under whatever filenames the app tells it to use. If it is not acceptable to change the naming configuration in that app, there are other possible approaches. You say that renaming the files takes 3 seconds per chunk file. Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3. You can use the "hadoop fs -mv" CLI command, very much like the linux "mv" command, to rename files on HDFS from a script, and it should not take any 3 seconds. You can use this command in a shellscript while logged into any server in the HDFS cluster. Another way to run such a script is on a workstation with, for instance, the Python HDFS Client library installed, and configured to be able to talk to the HDFS cluster. Finally, you're probably aware that filenames in S3 are actually just strings, and the slash ("/") to delimit "folders" is purely convention, except for the delimiter between the bucket name and the rest of the filename. There aren't actually any folders below the level of the bucket. So changing the "dest" location for the s3-dist-cp command to put it in a sub-folder is really just the same as specifying a prefix (the name of the "sub-folder") on the filename. Use the date, or a serial number, or whatever convention you want so each invocation of s3-dist-cp puts its files on S3 with a different prefix. It seems that would also fulfill the requirement above. Hope one or another of these ideas helps.

mfoley · ‎01-31-2018

@Michael Bronson - greetings to a fellow Simpsonified! If the Ambari server is down, then there are obviously problems with the system, and it is a little tricky to say what to do -- A complete answer would look more like a decision tree than a single response. Nevertheless, I'll try to provide some help, with the understanding that a complete answer is beyond the scope of a simple Q&A. First off, why not restart the Ambari service? That's by far the simplest answer. Give it a few minutes to check its connectivity with the other services, and you should be able to proceed via Ambari. Really, this is the only good solution. If you really need to do it the hard way, there are two basic choices: (a) If you know a little about each of the services, you can use service-specific CLIs on each of the masters to check status and/or stop each service. (b) Alternatively, you can use the fact that these services essentially all run in Java and are typically installed on "hdp" paths in the filesystem, to use linux (or OS equivalent) commands to find and kill them. Option (a) requires a little research, and depends on which services are running in the cluster. For instance, you can log into the HDFS Master and use the commands summarized here: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Hadoop_Shutdown to shutdown the core Hadoop services. Each of the other major components have similar CLIs, which some diligent websearching on the Apache sites will find. They allow querying the current status of the services, and provide reasonably clean shutdowns. The final option assumes you are a capable Linux administrator, as well as understanding how HDP was installed. If not, you shouldn't try this. Option (b) is rude, crude, and depends overmuch on the famed crash-resilience of Hadoop components. Do NOT take this answer as Hortonworks advice -- to the contrary, it is the equivalent of pulling the plug on your servers, something you should only do on a production system if there are no other alternatives. But the fact is, if you have admin privileges, doing `ps auxwww|grep java|grep -i hdp` (assuming all services have been installed on paths that include the word 'hdp' or 'HDP') on each host in the cluster should elucidate all HDP-related processes still running (and maybe some others; do check the results carefully). If you see fit to kill them ... that's your responsibility. Very important if at all possible to quiesce the cluster first, at least by stopping data inputs and waiting a few minutes. Note the Ambari Agent is typically installed as a service with auto-restart; it is resilient and stateless, so not necessary to stop it before rebooting the server, but you could run `ambari-agent stop` (on each server) to make sure it stays out of the way while you work on the other services. Rebooting the server should restart it too. Hope this helps.

mfoley · ‎11-30-2017

There's a third option: You could use the InvokeHTTP processor. In it, use the Expression Language to construct the RemoteURL, to directly build an AWS S3 web API call, to get the listing you need. AWS web API is documented here: http://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html However, as that document says, "Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead: \[AWS SDK or AWS CLI\]" -- which are the approaches I gave above.

mfoley · ‎11-30-2017

Hi @Identity One , You're right, it's unfortunate the ListS3 processor doesn't accept an incoming flowfile to provide a prefix. You might want to enter an Apache jira (https://issues.apache.org/jira/projects/NIFI : "Create" button) requesting such a feature. In the meantime, you can use the ExecuteScript processor to construct and run an appropriate s3 list command, with prefix, in response to incoming flowfiles. You can use any of the ExecuteScript allowed scripting languages that have a corresponding AWS SDK client library; see https://aws.amazon.com/tools/#sdk -- looks like the choices are python, ruby, or js. If you click through any of those links in the aws sdk page, under "Documentation" : "Code Examples" : "Amazon S3 Examples", it should help. For example, the Python S3 examples are here: https://boto3.readthedocs.io/en/latest/guide/s3-examples.html . The "list" api is here: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 If you're not into scripting to the SDK, you could use your favorite scripting language's escape to shell (like Python's 'subprocess' feature), and invoke the AWS S3 command-line-interface commands (which you're evidently already familiar with). It's crude (and slower) but it would work 🙂 Hope this helps.

mfoley · ‎03-01-2017

Hi @Sedat Kestepe, regarding "third party and custom apps", this is of course outside the scope of my expertise. But I think that, yes, you will have to find out (a) whether they are even capable of being configured to use non-default network; (b) if so, how to so configure them; and (c) if desired, whether they can use the "0.0.0.0" config, meaning "listen on all networks" -- which is often a cheap way to get desirable results, altho in some environments it has security implications. Your second question is "How is service bound to preferred interface." Let's be clear that there are two viewpoints, the service and the client. From the service's viewpoint, "binding interface" really means, "what ip address(es) and port number shall I LISTEN on for connection requests?" Regrettably there is no uniformity in how each application answers that question, even among the Hadoop Stack components -- else that article could have been about half the length! Some applications aren't configurable and always use the host's default ip address, and a fixed port number. Sometimes the port number is just the starting point, where the app and client negotiate yet another port number for each connection. Sometimes a list of ip addresses can be configured, to listen on a specific subset of networks available on multi-homed servers. Sometimes the "magic" address "0.0.0.0" is understood to mean "listen on ALL available network ip addresses". (This concept is integrated with most tcp/ip libraries that applications might use.) Almost always, though, a single application port number is used by a given service, which means only one instance of the service per host (although it can be multi-threaded if it uses that port negotiation trick). How the ip address(es) and port number are specified, in one or several config parameters, is up to the application. As noted in the article, sometimes Hadoop Stack services even use their own hostname and run it through DNS to obtain an ip address; the result is dependent on which DNS service is queried, which in turn may be configured by yet another parameter (see for example the Kerberos _HOST resolution feature). From the client's viewpoint, it must answer the question "where do I FIND that service?" Clients are usually much simpler and are configured with a single ip address and port number where it should attempt to contact a service. That isn't necessarily so; clients often have lists of servers that they try to connect to in sequence, so they could have multiple ip addresses for the same server. But usually we prefer to put complexity in services and keep clients as simple as possible, because clients usually run in less controlled environments than services. It never makes sense to hand "0.0.0.0" to a client, because the client is initiating, not listening. The client has to know SPECIFICALLY where to find the service, and it is typically (tho not necessarily) different than the client's host server. We encourage use of a hostname rather than ip address for this purpose, at which point the client will obtain the ip address from its default DNS server, or (in some cases) a configuration-specified DNS server. Regarding your third question, in the "Theory of Operations" section, items 5, 6, and 7 should be taken together. #5 is an example of why it is useful and even necessary to have the same hostname on all networks: "external clients [can] find and talk to Datanodes on an external (client/server) network, while Datanodes find and talk to each other on an internal (intra-cluster) network." If the particular Datanode has a different hostname on the two networks, all parties become confused, because they may tell each other to talk to a different Datanode by name. #6 actually gives you an escape hatch: "(Network interfaces excluded from use by Hadoop may allow other hostnames.)" The "one hostname for all interfaces" only applies to the interfaces used for the Stack. An interface used only by administrators, for example, not for submitting or processing any Hadoop jobs, could assign the server a different hostname on that network. But if you violate #5, you threaten the stability of your configuration. #7 describes precisely the constraints within which you can use /etc/hosts files instead of DNS. It should perhaps mention that, for example, it would work to have the cluster members know their peers only via one network, and know its clients via one other network. The clients would know the cluster members by the second network, with the SAME hostnames, and that's okay. As long as clients and servers have different /etc/host files, and you use hostnames not IP addresses for configurations, it can often work. And as #7 points out, if these constraints don't work for your needs, it isn't that hard to add DNS. The example you gave is a possible result, yes. More broadly, all the servers would share /etc/hosts files in which all the servers are identified with an ip address on the intra-cluster network: namenode2 192.168.1.2 datanode3 192.168.1.3 datanode4 192.168.1.4 datanode5 192.168.1.5 # etc. while the clients would share /etc/hosts files in which those servers are identified with an ip address on the client/server access network: namenode2 10.0.9.2 datanode3 10.0.9.3 datanode4 10.0.9.4 datanode5 10.0.9.5 # etc. And this is a good example of why you MUST use hostnames not ip addresses, and dfs.client.use.datanode.hostname must be set true, because the clients and servers must be able to specify cluster servers to each other, and the ip addresses won't be mutually accessible.

mfoley · ‎02-27-2017

@Sedat Kestepe , Network bonding is done at the OS level, before Hadoop or other Application-level software knows anything about it. So if you have a properly configured bonded network, you can configure it into Hadoop just like any other network. The only possibly tricky thing depends on whether the bonded network is the ONLY network available on the server (which therefore means it is the default network), or if there are other networks also available in the server, and maybe some other network is taking the "eth0" default designation. If so, you can either force the bonded network to be the default network (how you do that is dependent on the OS, the NIC, and the BIOS, so I won't try to advise; please consult your IT person), or you can configure Hadoop to use the non-default network. For information on how to configure a non-default network, you might want to read the article at https://community.hortonworks.com/content/kbentry/24277/parameters-for-multi-homing.html This article is for "multi-homed" servers (servers with multiple networks) in which you want to use more than one of the networks. But even if you only want to use ONE of the networks (in this case, your bonded network), if it is not the default network then you still need to understand the network configuration parameters in Hadoop, etc. To get the name and ip addresses of all networks on the server, give the command `ifconfig` while logged in on the server in question. Hope this helps. If you still have questions after reading the article, please ask follow-up questions here, and include the `ifconfig` output unless it's not relevant.

mfoley · ‎12-16-2016

Hi @Arsalan Siddiqi , yes, this is what I expected for Putty. By using the Putty configuration dialogue as shown in your screenshot, you ALREADY have an ssh connection to the guest VM, so typing `ssh` in the Putty terminal window is unnecessary and `scp` won't work -- instead of asking the host OS to talk to the guest OS, you would be asking the guest OS to talk to itself! I'm glad `pscp` (in a cmd window) worked for you -- and thanks for accepting the answer. Your last question, "why when i connect via Putty to root@127.0.0.1 p 2222 in windows and do a "dir" command, i see different folders as to when i do a DIR in virtualbox terminal??" is a puzzler. I would suggest, in each window (Putty terminal and Virtualbox terminal), you type the linux command `pwd`, which stands for "print working directory". It may be that the filesystem location you are ending up at login in the two tools is different. Also you can try the linux `ls` command.

mfoley · ‎12-16-2016

(Got this wrong the first time! The following should be correct.) BTW, if your clusters have been inadvertently configured for Federation, and you want to disentangle them, it can be a little tricky. You can't just suddenly change all their configurations, because that would cause the namenodes to lose track of blocks on the remote datanodes. The needed process is too complex to detail here, but basically you need to do something like this: The test for whether this problem exists is to open each NameNode's web UI ( http://<NameNode_FQDN>:50070 ) and navigate to the DataNodes page, then see if datanodes from the OTHER cluster are shown. If so, the two clusters have Federated. To separate them, first, change the configuration of each Namenode so that it stops serving namespaces from both clusters. I believe it would be best to reconfigure both Namenodes in each HA grouping at the same time, so a brief downtime will be needed. Then decommission each of the datanodes individually, from all clusters/namenodes, a few at a time (where "few" means a small percentage of the overall cluster size), so they flush their blocks to other nodes. Then, after each datanode is no longer referenced by any namespace, you change ITS configuration to only register to the Namenodes/Nameservices on its own cluster. Restart it, join it to its cluster, and continue with the next datanode. Alternate between the datanodes on the two clusters, so you don't run into storage space issues. Continue until all the datanodes have been so "gently" reconfigured to a single cluster's namespace(s). At the end of the process, all the nodes in each cluster will have a non-Federated config. This can be done without data loss and only brief NN downtime, but it will require significant time commitment, depending on the cluster sizes. Please don't try this unless you thoroughly understand the issues involved. If you have any doubts, consulting assistance may be appropriate.

mfoley · ‎12-16-2016

Hi @jbarnett, this was really interesting to think about. The reason is that your question crosses the domains of three HDFS features: HA, Federation, and Client ViewFS. The key understanding required is that Clusters and Clients can, and sometimes should, be configured differently. The blog you reference specifically says that the configuration he proposes is "a custom configuration you are only going to use for distcp" (my emphasis added), and he specifically proposes it as a solution for doing distcp between HA clusters, where at any given time different namenodes may be the "active" namenode within each cluster that you're distcp-ing between.[1] He does not mention a critical point, which is that this configuration is for the Client only, not for the HDFS Cluster nodes (and the Client host where this is configured must not be one of the HDFS node servers), if you wish the two clusters to remain non-Federated. The behavior you describe for rebalancing would be the expected result if this configuration were also used for the cluster nodes themselves, causing them to Federate -- and in this case, yes, the undesired result would be balancing across the two clusters regardless of their membership. It is also probable, although I'm not certain, that if you give the `hdfs balancer` command from a Client with this configuration, that it will initiate rebalancing on both clusters; whether each cluster rebalances only within itself or across cluster boundaries I'm not 100% sure, but since the balancing is delegated to the namenodes, I would expect that if the clusters are each using a non-Federated configuration, then they would balance only within themselves. So the simple answer to your question is: To keep the two clusters separate and non-Federated from each other, the nodes of each cluster should be configured with the hdfs-site.xml file appropriate to its own cluster, and not mentioning the other cluster. Since each cluster is itself HA, it will still need to use the "nameservice suffix" form of many of the configuration parameters, BUT it will only mention the nameservice(s) and namenodes hosted in that cluster; it will not mention nameservice(s)/namenode(s) hosted in the other cluster. (And, btw, the nameservices in the two clusters should have different names -- "fooSite1" and "fooSite2" -- even if they are intended to be mirrors of each other.) Your client hosts should be separate from your cluster nodes, and the two-cluster ViewFS Client configuration you use for invoking `distcp` should be different from the single-cluster ViewFS Client configurations you use for invoking `balancer` on each individual cluster, if you want to be able to trigger rebalancing separately. Make sense? To get a clear understanding of these features (HA, Federation, and Client ViewFS), and how they relate to each other, I encourage you to read the authoritative sources at: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/Federation.html https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ViewFs.html Hope this helps. ------------------- [1] He doesn't seem to include the final command which is the whole point of his blog: That after configuring the Client for a ViewFS view of the two clusters, you can now give the `distcp` command in the form: hdfs distcp viewfs://serviceId1/path viewfs://serviceId2/path and not have to worry about which namenode is active in each cluster.

Online	Offline
Last Visited	‎08-14-2019 06:45 PM

Member Since	‎10-22-2015 04:49 PM
Last Visited	‎08-14-2019 06:45 PM
Posts	83
Kudos received	79

Cloudera Community

Re: how to check all component on master are stop ...

Re: Network Bonding on Hadoop Cluster with Centos ...

Re: With two HA clusters configured for cross-clus...

Re: file location in HDP

Re: Is it possible to change the default "home" di...

Re: Can I control naming patterns for HDFS chunks

Re: Can I control naming patterns for HDFS chunks

Re: how to check all component on master are stop ...

Re: How to list from S3 recursively on a prefix in...

Re: How to list from S3 recursively on a prefix in...

Re: Network Bonding on Hadoop Cluster with Centos ...

Re: Network Bonding on Hadoop Cluster with Centos ...

Re: file location in HDP

Re: With two HA clusters configured for cross-clus...

Re: With two HA clusters configured for cross-clus...