Member since
10-22-2015
83
Posts
84
Kudos Received
13
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 358 | 01-31-2018 07:47 PM
 | 743 | 02-27-2017 07:26 PM
 | 539 | 12-16-2016 07:56 PM
 | 2461 | 12-14-2016 07:26 PM
 | 1153 | 12-13-2016 06:39 PM
03-01-2018
07:56 PM
2 Kudos
Hi @Zack Riesland, thanks for the clarification. I understand now that s3-dist-cp is used both coming and going, and I agree it doesn't seem to have any renaming capability built in. I strongly suspect there's a way to control the filenames by controlling the Hive partition name, but I'm not a Hive expert, and maybe the application benefits from using the same partition name for raw and clean data. Here's what I found that might help.

A StackOverflow question confirms your complaint that CLI rename of HDFS files is slow, and confirms my sense that it shouldn't be: https://stackoverflow.com/questions/14736017/batch-rename-in-hadoop The second responder (Robert) wrote a one-page Java program that uses the Hadoop Java API, and shows that it can rename several thousand files in the same 4 seconds the CLI takes to rename one file. This suggests that the majority of the time is taken up by the connection protocol, and the Java program can use a single connection for multiple operations.

I also looked at options for using Python similarly. It appears that the hdfs3 library would do quite nicely, although you have to separately install the underlying C library open-sourced by Pivotal. Full docs for installation and the API are given in the linked doc. (A rough sketch of this approach follows at the end of this reply.) Please note it probably would not work to use the similarly named "hdfs" library, because the latter uses the WebHDFS interface, which is unlikely to be faster than the CLI (although I didn't test it).

Since I'm a Hortonworks employee, I should point out that the above open-source references are not supported or endorsed by the company, and are used at your own risk. I did look over the Java program and it looks fine to me, but that's my personal opinion, not the company's.

Good luck. If that doesn't get you where you need to be, and since your profile popup indicates you are a Hortonworks support customer, you'd be welcome to open a support request for advice from an SE.
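For what it's worth, here is roughly what the hdfs3 approach might look like -- a sketch only, untested, with placeholder host, port, paths, and naming scheme (and covered by the same use-at-your-own-risk caveat as above):

```
# Minimal sketch: batch-rename files over a single libhdfs3 connection,
# analogous to the Java program referenced above. Host, port, and paths are placeholders.
from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='namenode.example.com', port=8020)   # one connection, reused for every rename

src_dir = '/data/clean/data_day=01-01-2017'                   # placeholder directory
for old_path in hdfs.ls(src_dir, detail=False):               # e.g. .../000000_0, .../000001_0, ...
    base = old_path.rsplit('/', 1)[-1]
    new_path = '{}/mydata-2017-01-01-{}'.format(src_dir, base)  # substitute whatever naming scheme you need
    hdfs.mv(old_path, new_path)                                # rename is a metadata-only operation on the NameNode
```

Because all the renames share one connection, this should behave much more like Robert's Java program than like repeated CLI calls.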
02-28-2018
08:05 PM
@zackriesland, you don't say what application you're using to pull and clean the new data on HDFS. The name pattern "data_day=01-01-2017/000000_0" is under the control of that application, and can be configured there. It's not an HDFS thing; HDFS is just a file system, and stores files under whatever filenames the app tells it to use.

If it is not acceptable to change the naming configuration in that app, there are other possible approaches. You say that renaming the files takes 3 seconds per chunk file. Am I correct in assuming that is using S3 commands to rename the file after uploading to S3? If so, you could try renaming the files on HDFS before uploading to S3. You can use the "hadoop fs -mv" CLI command, very much like the Linux "mv" command, to rename files on HDFS from a script, and it should not take anywhere near 3 seconds. You can use this command in a shell script while logged into any server in the HDFS cluster. Another way to run such a script is on a workstation with, for instance, the Python HDFS client library installed and configured to be able to talk to the HDFS cluster. (A rough sketch of the scripted approach follows at the end of this reply.)

Finally, you're probably aware that filenames in S3 are actually just strings, and the slash ("/") that delimits "folders" is purely convention, except for the delimiter between the bucket name and the rest of the filename. There aren't actually any folders below the level of the bucket. So changing the "dest" location for the s3-dist-cp command to put it in a sub-folder is really just the same as specifying a prefix (the name of the "sub-folder") on the filename. Use the date, or a serial number, or whatever convention you want, so each invocation of s3-dist-cp puts its files on S3 with a different prefix. It seems that would also fulfill the requirement above.

Hope one or another of these ideas helps.
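To illustrate the scripted rename, here is a rough sketch (untested) that shells out to the Hadoop CLI; the directory and naming convention are placeholders. Note that each CLI invocation pays JVM startup cost, so for many thousands of files an HDFS-API-based approach (see the later follow-up in this thread) will be faster:

```
# Sketch: rename the chunk files on HDFS before running s3-dist-cp.
import subprocess

src_dir = '/data/clean/data_day=01-01-2017'                     # placeholder directory
# "-C" prints bare paths; if your Hadoop version lacks it, parse the regular -ls output instead.
listing = subprocess.check_output(['hadoop', 'fs', '-ls', '-C', src_dir]).decode().split()
for old_path in listing:                                        # e.g. .../000000_0
    base = old_path.rsplit('/', 1)[-1]
    new_path = '{}/mydata-2017-01-01-{}'.format(src_dir, base)  # substitute your own naming convention
    subprocess.check_call(['hadoop', 'fs', '-mv', old_path, new_path])
```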
01-31-2018
07:47 PM
1 Kudo
@Michael Bronson - greetings to a fellow Simpsonified! If the Ambari server is down, then there are obviously problems with the system, and it is a little tricky to say what to do -- a complete answer would look more like a decision tree than a single response. Nevertheless, I'll try to provide some help, with the understanding that a complete answer is beyond the scope of a simple Q&A.

First off, why not restart the Ambari service? That's by far the simplest answer. Give it a few minutes to check its connectivity with the other services, and you should be able to proceed via Ambari. Really, this is the only good solution.

If you really need to do it the hard way, there are two basic choices: (a) If you know a little about each of the services, you can use service-specific CLIs on each of the masters to check status and/or stop each service. (b) Alternatively, you can use the fact that these services essentially all run in Java and are typically installed on "hdp" paths in the filesystem, and use Linux (or OS-equivalent) commands to find and kill them.

Option (a) requires a little research, and depends on which services are running in the cluster. For instance, you can log into the HDFS master and use the commands summarized here: https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/ClusterSetup.html#Hadoop_Shutdown to shut down the core Hadoop services. Each of the other major components has similar CLIs, which some diligent web searching on the Apache sites will find. They allow querying the current status of the services, and provide reasonably clean shutdowns.

The final option assumes you are a capable Linux administrator, as well as understanding how HDP was installed. If not, you shouldn't try this. Option (b) is rude, crude, and depends overmuch on the famed crash-resilience of Hadoop components. Do NOT take this answer as Hortonworks advice -- to the contrary, it is the equivalent of pulling the plug on your servers, something you should only do on a production system if there are no other alternatives. But the fact is, if you have admin privileges, doing `ps auxwww|grep java|grep -i hdp` (assuming all services have been installed on paths that include the word 'hdp' or 'HDP') on each host in the cluster should reveal all HDP-related processes still running (and maybe some others; do check the results carefully). If you see fit to kill them ... that's your responsibility. It is very important, if at all possible, to quiesce the cluster first, at least by stopping data inputs and waiting a few minutes. (A rough survey script is sketched at the end of this reply.)

Note the Ambari Agent is typically installed as a service with auto-restart; it is resilient and stateless, so it is not necessary to stop it before rebooting the server, but you could run `ambari-agent stop` (on each server) to make sure it stays out of the way while you work on the other services. Rebooting the server should restart it too. Hope this helps.
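If you do go the option (b) route, here is a rough survey sketch (untested, and again not Hortonworks advice) for seeing what is still running before you decide what to kill by hand. The hostnames are placeholders, and it assumes passwordless ssh from wherever you run it:

```
# Survey leftover HDP-related Java processes across the cluster hosts.
import subprocess

hosts = ['master1.example.com', 'worker1.example.com', 'worker2.example.com']  # placeholder hostnames
for host in hosts:
    print('=== {} ==='.format(host))
    result = subprocess.run(
        ['ssh', host, 'ps auxwww | grep java | grep -i hdp | grep -v grep'],
        capture_output=True, text=True)          # requires Python 3.7+ and passwordless ssh
    print(result.stdout or '(no matching processes)')
```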
11-30-2017
08:45 PM
There's a third option: You could use the InvokeHTTP processor. In it, use the Expression Language to construct the Remote URL, to directly build an AWS S3 web API call that gets the listing you need. The AWS web API is documented here: http://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html However, as that document says, "Making REST API calls directly from your code can be cumbersome. It requires you to write the necessary code to calculate a valid signature to authenticate your requests. We recommend the following alternatives instead: [AWS SDK or AWS CLI]" -- which are the approaches I gave above.
11-30-2017
08:30 PM
1 Kudo
Hi @Identity One, You're right, it's unfortunate the ListS3 processor doesn't accept an incoming flowfile to provide a prefix. You might want to enter an Apache jira (https://issues.apache.org/jira/projects/NIFI : "Create" button) requesting such a feature.

In the meantime, you can use the ExecuteScript processor to construct and run an appropriate S3 list call, with prefix, in response to incoming flowfiles. You can use any of the ExecuteScript-allowed scripting languages that have a corresponding AWS SDK client library; see https://aws.amazon.com/tools/#sdk -- it looks like the choices are Python, Ruby, or JavaScript. If you click through any of those links on the AWS SDK page, under "Documentation" : "Code Examples" : "Amazon S3 Examples", it should help. For example, the Python S3 examples are here: https://boto3.readthedocs.io/en/latest/guide/s3-examples.html . The "list" API is here: https://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.list_objects_v2 (a minimal sketch of that call follows at the end of this reply).

If you're not into scripting to the SDK, you could use your favorite scripting language's escape to shell (like Python's 'subprocess' feature), and invoke the AWS S3 command-line-interface commands (which you're evidently already familiar with). It's crude (and slower) but it would work 🙂 Hope this helps.
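For illustration, here is a minimal standalone boto3 sketch (untested) of the prefix-scoped listing such a script would perform. The bucket and prefix are placeholders, credentials come from the normal AWS credential chain, and adapting it to run inside ExecuteScript (and to paginate past 1,000 keys) is left to you:

```
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='my-bucket',              # placeholder bucket
                          Prefix='incoming/2017-11-30/')   # placeholder prefix, e.g. taken from the incoming flowfile
for obj in resp.get('Contents', []):                       # 'Contents' is absent when nothing matches
    print(obj['Key'], obj['Size'])
# list_objects_v2 returns at most 1000 keys per call; use ContinuationToken to page through more.
```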
09-13-2017
07:14 PM
The 7.2 instructions should work fine on 7.3.
08-09-2017
07:43 PM
@R Patel , if the above is a satisfactory answer, please accept it? Thanks.
08-07-2017
06:36 PM
@R Patel, responding to your follow-up question about adding a Java parser: When METRON-777 is added to Metron, which I expect will be within this week, there will be a true plug-in architecture for new parsers, along with a Maven archetype and documentation for creating new parsers. Because it's coming very soon and will completely change the answer to your question, I encourage you to wait for it. However, for the sake of completeness, here is a brief answer for the current state of the Metron code. In the absence of a true plugin architecture, you simply have to add your new parser alongside the other Metron parsers:
1. Start by using git to clone the Metron codebase from https://github.com/apache/metron/
2. In your favorite IDE, navigate to metron/metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/ , and you will see that each of the existing parsers is a self-contained Java package. Add a new package titled with the name of your parser.
3. The `bro` parser is a fairly simple example to follow. It demonstrates that your parser should extend `BasicParser` and provide implementations of its abstract methods.
4. However, the `bro` parser has a null configure method. Working in depth with Metron's configuration capability is too complex to include as part of this answer, but the `csv` parser provides a simple configuration method, and the `snort` parser a more sophisticated (but still followable) example of configuration. The configure() method is invoked automatically by the infrastructure at instance startup, and provides the opportunity for the parser instance to read its config.
5. By "your json file" I assume you mean the configuration. In general, rather than placing it in the code base, you would add it to the Zookeeper configuration for Metron, under the PARSER ConfigurationType, using the same configuration tools you used to add your custom config to the grok parser, but under the "sensor name" of your new parser.

That should be enough to give you a start. However, I reiterate that by the time you get well into such an implementation, METRON-777 should be available and you would want to switch gears to the plugin model. Hope this helps.
08-01-2017
06:09 PM
Hi @R Patel, Sebastien provided one of my favorite three documents on this topic. The other two are:
- https://cwiki.apache.org/confluence/display/METRON/2016/04/25/Metron+Tutorial+-+Fundamentals+Part+1%3A+Creating+a+New+Telemetry -- Similar to the previous citation, it is an end-to-end example using Squid logs as the example data, and creating a Grok parser to parse them.
- https://metron.apache.org/current-book/metron-platform/metron-parsers/index.html -- Provides much more detail about how the parser fits into the overall Metron infrastructure, especially configuration, and a little guidance about Java parsers.

In a future release, parsers will be a plug-in, with a Maven archetype to help create new ones (Apache Jira METRON-777), so keep an eye out for that. Also you may want to join the Apache Metron user or dev mailing lists; see http://metron.apache.org/community/
03-21-2017
06:47 PM
@wzheng, this sounds like a question preparatory to using the sandbox in production. The sandbox really is set up mostly for single-node, single-user exploration. However, our cloud offerings are intended for production use, and are available for HDP on your choice of AWS, Azure, and GCP; see https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-1.6.3/index.html
03-13-2017
06:21 PM
2 Kudos
@Derrick Lin Please take a look at the article https://community.hortonworks.com/content/kbentry/24277/parameters-for-multi-homing.html There is more to multi-home configuration than is in the HDFS document, and the more complete discussion may help you resolve your problem, especially with respect to DNS and naming. The biggest question: do the cluster hosts have the same name on all networks? (A quick check is sketched below.) Hope this helps.
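As one quick way to answer that question, here is a small sketch (untested; the addresses are placeholders for one node's intra-cluster and client-network IPs) that reverse-resolves each address and compares the names that come back:

```
import socket

addresses = ['192.168.1.3', '10.0.9.3']          # placeholder: the same node's IP on each network
for ip in addresses:
    try:
        name, _aliases, _addrs = socket.gethostbyaddr(ip)
        print(ip, '->', name)
    except socket.herror as err:
        print(ip, '-> no reverse DNS entry:', err)
# If the names differ (or a lookup fails), the "one hostname on all Hadoop-facing
# interfaces" guidance from the multi-homing article is not being met.
```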
03-10-2017
12:20 AM
@dbalasundaran , Part of the point of using Ambari is so you don't need to worry about which files a client needs, which may change from release to release. The GUI allows you to specify where the downloaded tarball goes, and both Ambari and the tar extraction command will complain if the tarball file is corrupted or incomplete. So you should be able to use it with confidence. Of course it's not beyond possibility that the Ambari specification could be buggy, so if you think you're seeing a wrong result in your client download, by all means bring it up. Have you seen a problem that makes you concerned? Or are you just trying to be thorough? Thanks.
03-01-2017
10:30 PM
1 Kudo
Hi @Sedat Kestepe, regarding "third party and custom apps", this is of course outside the scope of my expertise. But I think that, yes, you will have to find out (a) whether they are even capable of being configured to use a non-default network; (b) if so, how to so configure them; and (c) if desired, whether they can use the "0.0.0.0" config, meaning "listen on all networks" -- which is often a cheap way to get desirable results, although in some environments it has security implications.

Your second question is "How is service bound to preferred interface." Let's be clear that there are two viewpoints, the service's and the client's.

From the service's viewpoint, "binding interface" really means, "what ip address(es) and port number shall I LISTEN on for connection requests?" Regrettably there is no uniformity in how each application answers that question, even among the Hadoop Stack components -- else that article could have been about half the length! Some applications aren't configurable and always use the host's default ip address and a fixed port number. Sometimes the port number is just the starting point, and the app and client negotiate yet another port number for each connection. Sometimes a list of ip addresses can be configured, to listen on a specific subset of the networks available on multi-homed servers. Sometimes the "magic" address "0.0.0.0" is understood to mean "listen on ALL available network ip addresses". (This concept is integrated with most tcp/ip libraries that applications might use.) Almost always, though, a single application port number is used by a given service, which means only one instance of the service per host (although it can be multi-threaded if it uses that port-negotiation trick). How the ip address(es) and port number are specified, in one or several config parameters, is up to the application. As noted in the article, sometimes Hadoop Stack services even use their own hostname and run it through DNS to obtain an ip address; the result depends on which DNS service is queried, which in turn may be configured by yet another parameter (see for example the Kerberos _HOST resolution feature).

From the client's viewpoint, it must answer the question "where do I FIND that service?" Clients are usually much simpler, and are configured with a single ip address and port number where they should attempt to contact a service. That isn't necessarily so; clients often have lists of servers that they try to connect to in sequence, so they could have multiple ip addresses for the same service. But usually we prefer to put complexity in services and keep clients as simple as possible, because clients usually run in less controlled environments than services. It never makes sense to hand "0.0.0.0" to a client, because the client is initiating, not listening. The client has to know SPECIFICALLY where to find the service, which is typically (though not necessarily) on a different host than the client's. We encourage use of a hostname rather than an ip address for this purpose, at which point the client will obtain the ip address from its default DNS server, or (in some cases) a configuration-specified DNS server.

Regarding your third question, in the "Theory of Operations" section, items 5, 6, and 7 should be taken together.
#5 is an example of why it is useful and even necessary to have the same hostname on all networks: "external clients [can] find and talk to Datanodes on an external (client/server) network, while Datanodes find and talk to each other on an internal (intra-cluster) network." If a particular Datanode has a different hostname on the two networks, all parties become confused, because they may tell each other to talk to a different Datanode by name.

#6 actually gives you an escape hatch: "(Network interfaces excluded from use by Hadoop may allow other hostnames.)" The "one hostname for all interfaces" rule only applies to the interfaces used for the Stack. An interface used only by administrators, for example, not for submitting or processing any Hadoop jobs, could assign the server a different hostname on that network. But if you violate #5, you threaten the stability of your configuration.

#7 describes precisely the constraints within which you can use /etc/hosts files instead of DNS. It should perhaps mention that, for example, it would work to have the cluster members know their peers only via one network, and know their clients via one other network. The clients would know the cluster members by the second network, with the SAME hostnames, and that's okay. As long as clients and servers have different /etc/hosts files, and you use hostnames, not IP addresses, in configurations, it can often work. And as #7 points out, if these constraints don't work for your needs, it isn't that hard to add DNS.

The example you gave is a possible result, yes. More broadly, all the servers would share /etc/hosts files in which all the servers are identified with an ip address on the intra-cluster network:
192.168.1.2   namenode2
192.168.1.3   datanode3
192.168.1.4   datanode4
192.168.1.5   datanode5
# etc.

while the clients would share /etc/hosts files in which those servers are identified with an ip address on the client/server access network:
10.0.9.2   namenode2
10.0.9.3   datanode3
10.0.9.4   datanode4
10.0.9.5   datanode5
# etc.

And this is a good example of why you MUST use hostnames, not ip addresses, and why dfs.client.use.datanode.hostname must be set true: the clients and servers must be able to specify cluster servers to each other, and the ip addresses won't be mutually accessible.
02-27-2017
07:26 PM
2 Kudos
@Sedat Kestepe, Network bonding is done at the OS level, before Hadoop or other application-level software knows anything about it. So if you have a properly configured bonded network, you can configure it into Hadoop just like any other network.

The only possibly tricky thing depends on whether the bonded network is the ONLY network available on the server (which therefore means it is the default network), or if there are other networks also available on the server, and maybe some other network is taking the "eth0" default designation. If so, you can either force the bonded network to be the default network (how you do that is dependent on the OS, the NIC, and the BIOS, so I won't try to advise; please consult your IT person), or you can configure Hadoop to use the non-default network.

For information on how to configure a non-default network, you might want to read the article at https://community.hortonworks.com/content/kbentry/24277/parameters-for-multi-homing.html This article is for "multi-homed" servers (servers with multiple networks) in which you want to use more than one of the networks. But even if you only want to use ONE of the networks (in this case, your bonded network), if it is not the default network then you still need to understand the network configuration parameters in Hadoop, etc.

To get the names and ip addresses of all networks on the server, give the command `ifconfig` while logged in on the server in question. Hope this helps. If you still have questions after reading the article, please ask follow-up questions here, and include the `ifconfig` output unless it's not relevant.
01-03-2017
06:34 PM
I just noticed your reference to article 40293, which correctly states that file and block deletion is suspended during an un-finalized upgrade. This will only cause a large impact on storage space if your cluster usage involves a lot of file deletes. If your HDP cluster is used in a "data lake" model, where data is seldom deleted, then this is a non-issue. However, if your cluster is used in a mode where new data comes in and old data is regularly purged, then indeed the failure to actually delete the old data could cause an accumulation of unexpected usage.
01-03-2017
06:26 PM
Hi @Joshua Adeleke. What is your total disk space on the cluster, and what was the disk usage BEFORE the upgrade? Did you do a pre-upgrade listing of `hdfs dfs -ls /`, and can you compare that to what you have today? Also, in your question 74893 which you reference, I see several questions from people wanting to help that have not yet been answered and that may help isolate your problem.

Regarding this issue, an un-finalized upgrade would not typically be expected to take up a lot of storage space. The saved snapshot is not a copy of the whole storage system, but only of the directory structure (at the level of the native Linux file systems that underpin the HDFS filesystem). This can take up a lot of inodes, but takes very little actual storage space. Then any blockfiles that get modified (by append or delete operations) are copied on an as-needed basis via copy-on-write logic. So the only part of an upgrade that actually uses storage is when individual components need to re-write data as part of the upgrade. This usually only affects metadata files, which again are typically small compared to the vast amount of raw data. (Very rarely a component changes its storage format and needs to rewrite raw data, but this is very much avoided, for obvious reasons.)

So cleaning up un-finalized upgrades is good practice, and will free some space, but should not be expected to free a lot of space in most cases.
12-16-2016
10:24 PM
Hi @Arsalan Siddiqi , yes, this is what I expected for Putty. By using the Putty configuration dialogue as shown in your screenshot, you ALREADY have an ssh connection to the guest VM, so typing `ssh` in the Putty terminal window is unnecessary and `scp` won't work -- instead of asking the host OS to talk to the guest OS, you would be asking the guest OS to talk to itself! I'm glad `pscp` (in a cmd window) worked for you -- and thanks for accepting the answer. Your last question, "why when i connect via Putty to root@127.0.0.1 p 2222 in windows and do a "dir" command, i see different folders as to when i do a DIR in virtualbox terminal??" is a puzzler. I would suggest, in each window (Putty terminal and Virtualbox terminal), you type the linux command `pwd`, which stands for "print working directory". It may be that the filesystem location you are ending up at login in the two tools is different. Also you can try the linux `ls` command.
12-16-2016
08:16 PM
(Got this wrong the first time! The following should be correct.) BTW, if your clusters have been inadvertently configured for Federation, and you want to disentangle them, it can be a little tricky. You can't just suddenly change all their configurations, because that would cause the namenodes to lose track of blocks on the remote datanodes. The needed process is too complex to detail here, but basically you need to do something like this:
The test for whether this problem exists is to open each NameNode's web UI ( http://<NameNode_FQDN>:50070 ) and navigate to the DataNodes page, then see if datanodes from the OTHER cluster are shown. If so, the two clusters have Federated. (A scriptable version of this check is sketched at the end of this reply.) To separate them:
1. First, change the configuration of each Namenode so that it stops serving namespaces from both clusters. I believe it would be best to reconfigure both Namenodes in each HA grouping at the same time, so a brief downtime will be needed.
2. Then decommission the datanodes individually, from all clusters/namenodes, a few at a time (where "few" means a small percentage of the overall cluster size), so they flush their blocks to other nodes.
3. Then, after each datanode is no longer referenced by any namespace, change ITS configuration to only register to the Namenodes/Nameservices on its own cluster. Restart it, join it to its cluster, and continue with the next datanode. Alternate between the datanodes on the two clusters, so you don't run into storage space issues.
4. Continue until all the datanodes have been so "gently" reconfigured to a single cluster's namespace(s). At the end of the process, all the nodes in each cluster will have a non-Federated config.

This can be done without data loss and only brief NN downtime, but it will require significant time commitment, depending on the cluster sizes. Please don't try this unless you thoroughly understand the issues involved. If you have any doubts, consulting assistance may be appropriate.
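For what it's worth, here is a rough sketch (untested) of a scriptable version of that web-UI check, assuming the NameNodes expose the standard /jmx endpoint on the default HTTP port 50070; the hostnames are placeholders:

```
import json
import urllib.request

# Placeholder NameNode hosts, one per cluster.
for nn in ['nn-cluster1.example.com', 'nn-cluster2.example.com']:
    url = 'http://{}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'.format(nn)
    with urllib.request.urlopen(url) as resp:
        beans = json.loads(resp.read().decode())['beans']
    live = json.loads(beans[0]['LiveNodes'])   # LiveNodes is a JSON string keyed by datanode host
    print(nn, 'sees datanodes:', sorted(live.keys()))
# A NameNode that lists datanodes belonging to the other cluster has Federated with it.
```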
12-16-2016
07:56 PM
1 Kudo
Hi @jbarnett, this was really interesting to think about. The reason is that your question crosses the domains of three HDFS features: HA, Federation, and Client ViewFS. The key understanding required is that Clusters and Clients can, and sometimes should, be configured differently.
The blog you reference specifically says that the configuration he proposes is "a custom configuration you are only going to use for distcp" (my emphasis added), and he specifically proposes it as a solution for doing distcp between HA clusters, where at any given time different namenodes may be the "active" namenode within each cluster that you're distcp-ing between.[1] He does not mention a critical point, which is that this configuration is for the Client only, not for the HDFS Cluster nodes (and the Client host where this is configured must not be one of the HDFS node servers), if you wish the two clusters to remain non-Federated. The behavior you describe for rebalancing would be the expected result if this configuration were also used for the cluster nodes themselves, causing them to Federate -- and in this case, yes, the undesired result would be balancing across the two clusters regardless of their membership. It is also probable, although I'm not certain, that if you give the `hdfs balancer` command from a Client with this configuration, it will initiate rebalancing on both clusters; whether each cluster rebalances only within itself or across cluster boundaries I'm not 100% sure, but since the balancing is delegated to the namenodes, I would expect that if the clusters are each using a non-Federated configuration, then they would balance only within themselves.
So the simple answer to your question is: To keep the two clusters separate and non-Federated from each other, the nodes of each cluster should be configured with the hdfs-site.xml file appropriate to its own cluster, not mentioning the other cluster. Since each cluster is itself HA, it will still need to use the "nameservice suffix" form of many of the configuration parameters, BUT it will only mention the nameservice(s) and namenodes hosted in that cluster; it will not mention nameservice(s)/namenode(s) hosted in the other cluster. (And, btw, the nameservices in the two clusters should have different names -- "fooSite1" and "fooSite2" -- even if they are intended to be mirrors of each other.)

Your client hosts should be separate from your cluster nodes, and the two-cluster ViewFS Client configuration you use for invoking `distcp` should be different from the single-cluster ViewFS Client configurations you use for invoking `balancer` on each individual cluster, if you want to be able to trigger rebalancing separately. Make sense?
To get a clear understanding of these features (HA, Federation, and Client ViewFS), and how they relate to each other, I encourage you to read the authoritative sources at:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/Federation.html
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ViewFs.html
Hope this helps.
-------------------
[1] He doesn't seem to include the final command which is the whole point of his blog: That after configuring the Client for a ViewFS view of the two clusters, you can now give the `distcp` command in the form:
hadoop distcp viewfs://serviceId1/path viewfs://serviceId2/path
and not have to worry about which namenode is active in each cluster.
12-15-2016
10:40 PM
@Arsalan Siddiqi, Sorry you're having difficulties. It's pretty hard to debug access problems remotely, as you know, but let's see what we can do. Let me clarify a couple of things. Is it correct that you are able to use:
ssh root@127.0.0.1 -p 2222
to connect over ssh in Putty, but when you use:
scp -P 2222 mylocalfile.txt root@127.0.0.1:/root/
it rejects the connection? Are you using all the parts in the above statement?
- The port: "-P 2222" (capital P, not lower case as with ssh).
- The root user id: "root@"
- The colon after the IP address: "127.0.0.1:"
- The destination directory (/root/ in the example above, but you can use any absolute directory path).

If you've been using "localhost" as in the tutorial, try "127.0.0.1" instead. If you've been cut-and-pasting the command line, try typing it instead. The "-" symbol often doesn't cut-and-paste correctly, because formatted text may use an "m-dash" or "n-dash" character instead of a hyphen. It is much safer to type it than to paste it.

You are using Windows, correct? I'm a little confused about how you're connecting through Putty, as I remember Putty wanting the connection info in a dialogue box before the connection, whereas on a Mac or Linux box the terminal application just opens a terminal on the box itself, and you then FURTHER connect by typing the ssh command. So, did you actually configure Putty with the "ssh" request, the port number, and the user and host info in a dialogue box, rather than typing an ssh command line? And did that work correctly?

Assuming the answer to the above is "yes" and "yes", the next question is: where are you typing the "scp" command? You can't type it into the Putty connection with the VM; that won't work. The scp command line is meant to be sent to your native box. Does Putty have a file transfer dialogue box that can use the scp protocol? Is that what you're trying to use? Or have you downloaded the "pscp.exe" file from putty.org, and are using that? See http://www.it.cornell.edu/services/managed_servers/howto/file_transfer/fileputty.cfm
The full docs for Putty PSCP are at https://the.earth.li/~sgtatham/putty/0.67/htmldoc/Chapter5.html#pscp-usage and show that pscp also takes a capital "-P" to specify the port number.

Worst case, if you can't get any of these working, you've already established the Virtualbox connection for the VM. With some effort you can figure out how to configure a shared folder with your host machine, and use it to pass files back and forth.
12-14-2016
07:26 PM
2 Kudos
@Arsalan Siddiqi, I'll try to answer the several pieces of your question.

First, I encourage you to go through the Sandbox tutorial at http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/ It will help you understand a great deal about the Sandbox, what it is intended to do, and how to use it.

The sandbox comes with Ambari and the Stack pre-installed. You shouldn't need to change OS system settings in the Sandbox VM, which is intended to act like an "appliance". Also, the sandbox already has everything Ambari needs to run successfully, over the HTTP Web interface; there is no need to configure graphics packages on the sandbox VM. Ambari provides a quite nice GUI for a large variety of things you might want to do with an HDP Stack, including viewing and modifying configurations, seeing the health and activity levels of HDP services, stopping and re-starting component services, and even viewing the contents of HDFS files.

While you can view the component config files at (in most cases) /etc/<componentname>/conf/* in the sandbox VM's native file system, please DO NOT try to change configurations there. The configs are Ambari-managed, and like any Ambari installation, if you change the files yourself, the ambari-agent will just change them back! Instead, use the Ambari GUI to change any configurations you wish, then press the Save button, and restart the affected services (Ambari will prompt you).

The data files for HDFS are stored as usual in the native filesystem location defined by the HDFS config parameter "dfs.datanode.data.dir". However, it won't do you much good to go look there, because the blockfiles stored there are not readily human-understandable. As you may know, HDFS layers its own filesystem on top of the native file system, storing each replica of each block as a file in a datanode. If you want to check the contents of HDFS directories, you're much better off using the HDFS file browser, as follows:
1. In Ambari, select the HDFS service view, and pull down the "Quick Links" menu at the top center. Select "Namenode UI".
2. In the Namenode UI, pull down the "Utilities" menu at the top right. Select "Browse the file system".
3. This will take you to the "Browse Directory" UI. You may click through the directory names at the right edge, or type an HDFS directory path into the text box at the top of the directory listing.
4. If you click on a file name, you will see info about the blocks of that file (short files only have Block 0), and you may download the file if you want to see the contents.

Note that HDFS files are essentially immutable: they may be appended to, but cannot be edited.

For copying files to the Sandbox VM, first make sure you can access the sandbox through 'ssh', as documented near the beginning of the tutorial under "Explore the Sandbox"; then see http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/#send-data-btwn-sandbox-local-machine . Hope this helps.
12-14-2016
06:19 PM
One way to see whether any Stack services use a hardwired "/user" prefix would be to use Ambari to install the whole Stack on a lab machine. Change "dfs.user.home.dir.prefix" to something other than "/user" at Install Wizard time, BEFORE letting Ambari do the actual installation, making sure everything sees the non-default value from the beginning. Let it install, start all the services and let them run a few minutes, then see if anything got created under /user/* in HDFS. Sorry I don't have time to do the experiment right now, but if you do, please report the results back here as a comment for others to learn from. If you find services that apparently hardwire the "/user" prefix, I'll enter bugs against those components and try to get them fixed.
12-13-2016
06:48 PM
As usual, parameter changes only affect things going forward. If users have already been created in the default location, their home directories will not be magically re-created in the new location. This could cause problems, depending on whether processes use the parameter or a hardwired "/user" prefix. For your site-defined users, you can just move their home directories with dfs commands. For the pre-defined Stack service users, you'll need to experiment to see whether they want their home directories to stay in /user or be moved to value(dfs.user.home.dir.prefix). I would start by leaving them in place, but that's just a guess.
12-13-2016
06:39 PM
2 Kudos
@Sean Roberts As noted above, this is controlled by the 'dfs.user.home.dir.prefix' parameter in hdfs-site.xml. However, since this is not a commonly changed parameter, it isn't in the default Ambari configs for HDFS. (It just defaults to the value from hdfs-default.xml.) To change this value in Ambari, do the following:
1. In Ambari, select the HDFS service, then select the "Configs" tab.
2. Within Configs, select the "Advanced" tab.
3. Open the "Advanced hdfs-site" section, and confirm that this parameter isn't already present there.
4. Open the "Custom hdfs-site" section, and click the "Add property" link.
5. A dialogue pops up, inviting you to type a "key=value" pair in the text field. Enter: dfs.user.home.dir.prefix=/user
6. Press the "Add" button, and the new entry will be converted into a standard parameter entry field. Now change the value of the field to whatever you want (no blank spaces in the path, please).
7. After changing configurations, press the "Save" button at the top of the window.

That should do what you need, I hope.
06-24-2016
05:47 PM
1 Kudo
Hi @Yibing Liu When you created your local repo, did you follow the instructions at http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_Installing_HDP_AMB/content/_using_a_local_repository.html ?
- Did you use tarballs, or reposync?
- Did you provide an HTTP server able to serve the repo contents?
- Finally, did you adjust the "baseurl" configurations in your ambari.repo, HDP.repo, and HDP-UTILS.repo files to point to the local repo? (And the "gpgkey" configuration in ambari.repo, unless you've turned it off.)

I don't really understand your statement "i also updated the version/stack to make my ambari connect to my local repo". Adjusting the .repo baseurls should be all that is needed. Is this a fresh install, or are you adding a new host to an existing cluster? Just to eliminate another common source of error, what version of Python do you have installed on the server, and is it in the PATH for the userid performing the install?
06-23-2016
08:35 PM
@sprakash
The fact that distcp works with some configurations indicates you probably have Security set up right, as well as giving you an obvious work-around. To try to answer your question, please provide some clarifying information:
- When you speak of mapred-client.xml, do you mean mapred-site.xml on the client machine?
- When you speak of changing the framework, do you mean the "mapreduce.framework.name" configuration parameter in mapred-site.xml? Do you change it only on the client machine, or throughout both clusters?
- The allowed values of that parameter are "local", "classic", and "yarn". When you change it to not be "yarn", what do you set it to?
- Do you have "mapreduce.application.framework.path" set? If so, to what value?
04-26-2016
04:58 PM
Yes @Zaher Mahdhi, your intermediate machine is effectively a fifth machine in the cluster (as far as clients are concerned). Just replace "the publicly addressable node" with "the intermediate machine" in my answer, and it still applies in all regards.
04-26-2016
12:25 AM
@Zaher Mahdhi, first let me restate your question to make sure I understand the problem. You said that you have a 4-node cluster but only 1 node has a public IP addressable from outside the cluster, right? And when you initiate communication with a master node, you have set up port forwarding on the publicly addressable node to forward to the desired master (tricky!). However, when you try to access a file in HDFS (whether by WebHDFS or the regular HDFS client), the HDFS master (namenode) redirects you to the correct slave (datanode), and the IP address it gives you for the datanode is of course not addressable from outside the cluster. Right?

Definitely the most straightforward solution is to run Knox on your publicly addressable node. Knox provides secure proxying for all other services in the cluster, while hiding the internal structure of the cluster from outside users. WebHDFS will work, and as a side benefit you will no longer have to set up your port forwarding scheme. Knox is also reasonably easy to configure. The only small negative is that Knox bottlenecks all communication through the Knox server. But that's the way your port forwarding scheme works too, right? And there's no other evident solution if you can't give public IP addresses to the other nodes. So not a negative for you.

The other, traditional but non-secure and functionally limited, solution is to require logging in on the publicly addressable node and doing all your work from there. I assume you've thought of that and don't find it satisfactory.

The only other possible approach I have not tried in this context, so I'm not certain it will work, but I think it can. This would be to set the parameter 'dfs.client.use.datanode.hostname' to "true" in hdfs-site.xml on all namenodes, datanodes, and clients, and then set up the publicly addressable node as an IP proxy for each of the FQDN-named datanodes, for access by your Client node. See https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html#Clients_use_Hostnames_when_connecting_to_DataNodes for the semantics of dfs.client.use.datanode.hostname. Using this parameter requires that:
- the cluster nodes have real FQDNs and not just IP addresses,
- DNS and rDNS services accessible to the cluster are set up, and
- all cluster nodes know their own FQDN, so that "round-trip" DNS and rDNS calls work for all cluster nodes from within the cluster.

All the proxying has to be done down in the network layer, but if you know enough to set this up you can skip learning to configure Knox 🙂 The article Parameters for Multi-Homing is not directly related, but may help you understand the intended use of dfs.client.use.datanode.hostname (which is not exactly this use) and the assumptions that go with it. Hope this helps. Do consider the Knox solution, for both ease of setup and maintainability.
04-19-2016
07:04 PM
2 Kudos
As Benjamin said, I strongly encourage you to establish your process with a small test cluster first. However, I do not expect problems with the data. Hadoop is written in Java, so the form of the data should be the same between operating systems, especially all Linux variants.

Warning: Do not upgrade both the operating system and the HDP version all at once! Change one major variable at a time, and make sure the system is stable in between. So go ahead and change OS, but keep the HDP version the same until you are done and satisfied with the state of the new OS.

The biggest potential gotcha is if you experience ClusterID mismatch as a result of your backup and restore process. If you are backing up the data by distcp-ing it between clusters, then this won't be an issue; the namespaceID/clusterID/blockpoolID probably will change, but it won't matter since distcp actually creates new files. But if you are trying to use traditional file-based backup and restore, from tape or a SAN, then you may experience this: after you think you've fully restored and you try to start up HDFS, it will tell you that you need to format the file system, or the HDFS file system may simply appear empty despite the files all being back in place. If this happens, "ClusterID mismatch" is the first thing to check, starting with http://hortonworks.com/blog/hdfs-metadata-directories-explained/ for background. I won't say more because you probably won't have the problem and it would be confusing to talk about in the abstract.
04-19-2016
05:38 PM
@rock one How are you currently protecting the photo files in public_html from unauthorized access by non-owners? To do so you must mediate access to the files. I hope you are proxying them for your users, and not actually handing off raw URIs in a redirect. As long as you are already doing that, you can use any storage system without changing your program logic much. Just make sure you have a proper abstraction layer for storage-system access. Then you can store and read files with whatever client calls are appropriate to your storage system, and you can even migrate to different storage systems over time.

With HDFS it would probably be simplest for you to use the WebHDFS REST API for writing and reading the photo files (a rough sketch follows at the end of this reply). But before committing to HDFS you should probably also look at Amazon's S3 + Glacier, Microsoft's Azure Storage, or Google's Cloud Storage + Nearline Storage. Any of them offer unlimited storage for $10-25/Terabyte/month, depending on redundancy level. It is unlikely you can match this cost (including operational costs, not just the price of cheap disk!) with an on-site storage system of comparable redundancy and availability, unless your data is only Terabyte-scale, or you work for a big company with an expert IT group. Of course you should run the numbers yourself, for your particular use case.

As the number of individual photos stored in your system increases past several million, you may indeed need a big database to store the metadata. This is a separate issue from whether the blob storage is in HDFS or not. Cassandra or HBase would work fine for this. But neither of them will intrinsically solve your permissions and security needs; you still need server logic to handle that, just as you do today.
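To give a feel for it, here is a rough sketch (untested) of the WebHDFS calls such an abstraction layer might make. The NameNode host, port, user, and paths are placeholders, and a secured cluster would need authentication (for example via Knox) on top of this:

```
import requests

NN = 'http://namenode.example.com:50070/webhdfs/v1'   # placeholder NameNode host and default HTTP port

def put_photo(hdfs_path, data):
    # Step 1: the NameNode answers op=CREATE with a 307 redirect to a DataNode location.
    r = requests.put(NN + hdfs_path,
                     params={'op': 'CREATE', 'user.name': 'webuser', 'overwrite': 'true'},
                     allow_redirects=False)
    # Step 2: send the bytes to the DataNode named in the redirect.
    requests.put(r.headers['Location'], data=data).raise_for_status()

def get_photo(hdfs_path):
    # op=OPEN is redirected to a DataNode automatically and returns the file contents.
    r = requests.get(NN + hdfs_path, params={'op': 'OPEN', 'user.name': 'webuser'})
    r.raise_for_status()
    return r.content

# e.g. put_photo('/photos/user123/img001.jpg', open('img001.jpg', 'rb').read())
```

The two-step PUT is just how WebHDFS separates namespace operations (NameNode) from data transfer (DataNode); your application code never needs to know which DataNode holds the blocks.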