Member since
10-22-2015
83
Posts
84
Kudos Received
13
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1406 | 01-31-2018 07:47 PM
 | 3491 | 02-27-2017 07:26 PM
 | 2761 | 12-16-2016 07:56 PM
 | 9180 | 12-14-2016 07:26 PM
 | 4164 | 12-13-2016 06:39 PM
01-26-2016
07:33 PM
Hi
@sivasaravanakumar k, versions of Hive used in HDP:
HDP version | Hive version
---|---
2.0 | 0.12.0
2.1 | 0.13.0
2.2 | 0.14.0
2.3 | 1.2.1
The recommendation is simply to upgrade your whole stack to HDP-2.3. You'll be much happier with the performance and manageability of the newer version, for both Hive and the rest of the stack.
If your original install wasn't with Ambari, then you can't use Ambari to do the upgrade. (Installing Ambari after a manual HDP install is extremely complex; see
https://community.hortonworks.com/questions/6703/ambari-server-installation-after-cluster-setup.html.) To manually upgrade from HDP-2.0 to HDP-2.3, follow the documentation at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_upgrading_hdp_manually/content/ch_upgrade_2_0.html
If the original install was via Ambari, then you're in great shape, but you should do the upgrade in two steps. The reason for the two-step process is that the newest Ambari-2.2 gives the best upgrade experience, but it only supports upgrades back to HDP-2.1, not HDP-2.0. So steps 1(a) and 1(b) use your current Ambari to get you to HDP-2.1 first, which is a fairly simple step.
1(a) Your current Ambari is probably version 1.5.1 or better. If so, leave it as-is. If not, upgrade Ambari to version 1.6.1. See
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.1.0/bk_upgrading_Ambari/content/index.html
1(b) Then use Ambari to upgrade from HDP-2.0 to HDP-2.1; instructions are also at
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.1.0/bk_upgrading_Ambari/content/index.html, or see the version of the instructions for your current version of Ambari.
2. Finally, upgrade Ambari to the latest version, Ambari-2.2, and use it to upgrade the stack to HDP-2.3. See
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_upgrading_Ambari/content/_ambari_upgrade_guide.html
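As a rough sketch of what the Ambari server upgrade in step 2 involves on an RPM-based system (the repo setup and exact package steps vary by version, so follow the linked guide), the core commands look like this:

```shell
ambari-server stop
# point yum at the newer Ambari repo first, per the upgrade guide, then:
yum upgrade ambari-server
ambari-server upgrade     # migrates the Ambari database schema
ambari-server start
```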
Hope this helps. If so, please mark it accepted. Regards.
12-18-2015
06:11 PM
Hadoop version 2.7.1.2.3.0.0-2557 is the version of Apache Hadoop included in our product release HDP-2.3.0.0. As Dhruv mentioned, source code for all our product releases is publicly available on GitHub, under "github.com/hortonworks/<component>-release/". Then you need to look under "Tags" for the specific release version desired. In this case, you can look under the HDP-2.3.0.0-tag. (The link Dhruv provided is accidentally to version 2.3.3.0 rather than 2.3.0.0, although his pointing you at the pom.xml file was a good idea.) Our Windows and Linux releases share the same source code.
12-15-2015
10:43 PM
In the meantime, the Namenode is aware of all complete blocks and the block in progress for each open file, and will deliver that information to clients that ask for it. The clients can successfully read as much data as has been hflush'ed so far, even if it goes beyond the length the Namenode knows about so far.
12-15-2015
10:42 PM
1 Kudo
Hi Mark, good catch. We're both right. It's complicated 🙂 The file browser UI is served by the Namenode. For as long as it takes to fill up a block, the Datanodes handle all the communication with the client, and the replication. It would be too expensive to update the Namenode every time a Datanode receives a write operation. Only when the block fills up and a new block allocation is needed, or when the file is closed and the block is finalized, is the Namenode updated. And blocks are usually 128 MB or more.
12-14-2015
07:52 PM
2 Kudos
In this response, I will assume your ingest application is using the standard HDFS Client library.

If your client is running on a server that is not part of the Hadoop cluster, then there is almost no practical limit to the number of open HDFS files. Each Datanode can have several thousand simultaneous open connections, and the open files will be distributed randomly among all Datanodes. Of course you should also consider whether there are other loads on your cluster that might also be enthusiastic readers or writers.

BTW, if your client is actually running on a Datanode, which is not unusual in small operations or in laboratory setups, you should be aware that the first copy of all blocks of the files being written will be directed to the local Datanode as an optimization. In this case, you might want to limit the number of open HDFS files to 1000 or so, and/or distribute the ingest among several client instances on multiple Datanodes.

You should probably be more concerned about resources on your client, which will be subject to the local OS limit on the number of simultaneously open files being read for ingestion, and the number of simultaneously open connections to HDFS streams being written.

You mention "the more [streams] I keep open, the more small files I can pack into one HDFS file." This implies you are holding HDFS files open while waiting for ingest data to become available. This isn't really necessary, as you can use the append operation to extend previously closed files, in HDP-2.1 or newer. If you will have multiple client instances potentially trying to simultaneously append to the same HDFS file, however, I recommend you use HDP-2.3, as its version of Apache Hadoop is more efficient and has several bug fixes in append compared to previous versions. Of course, in all cases (create/write and append), only one client at a time has write access to any given file; any number of simultaneous readers are allowed, including during write.
Regarding buffering, for each output stream, the HDFS client buffers 64 KB chunks of data locally in RAM before flushing to the Datanode, unless you use explicit hflush calls. When the buffer fills, hflush is automatically called. After an hflush call returns successfully, the data is guaranteed to be in the datanodes, safe from client failures, and available for reading by other clients.
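To make the buffering behavior concrete, here is a minimal, self-contained Python sketch of the pattern described above: data accumulates in a local buffer and is shipped out when the buffer reaches 64 KB, or immediately when the caller flushes. The class and callback names are illustrative only; this is not the real HDFS client code.

```python
# Illustrative sketch of buffer-until-full-or-flush semantics (not the HDFS client).
BUFFER_SIZE = 64 * 1024  # 64 KB, per the description above

class BufferedStreamSketch:
    def __init__(self, send_to_datanodes):
        # send_to_datanodes is a callback standing in for the datanode pipeline
        self._send = send_to_datanodes
        self._buf = bytearray()

    def write(self, data: bytes):
        # Accumulate locally; ship full 64 KB chunks automatically.
        self._buf.extend(data)
        while len(self._buf) >= BUFFER_SIZE:
            self._send(bytes(self._buf[:BUFFER_SIZE]))
            del self._buf[:BUFFER_SIZE]

    def hflush(self):
        # After this returns, everything written so far has been shipped
        # and would be visible to readers, even if the writer crashes later.
        if self._buf:
            self._send(bytes(self._buf))
            self._buf.clear()

shipped = []
stream = BufferedStreamSketch(shipped.append)
stream.write(b"x" * 100_000)   # > 64 KB, so one full chunk ships automatically
stream.hflush()                # the 34,464-byte remainder ships on explicit flush
```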
11-25-2015
08:06 PM
4 Kudos
Search of Hortonworks documentation indicates the following three requirements, besides a generally correct Kerberos setup:

1. Both clusters must be using Java 1.7 or better if you are using MIT Kerberos. Java 1.6 has too many known bugs with cross-realm trust; e.g. see http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7061379
2. The same principal name must be assigned to the NameNodes in both the source and the destination cluster. For example, if the Kerberos principal name of the NameNode in cluster A is nn/host1@realm, the Kerberos principal name of the NameNode in cluster B must be nn/host2@realm, not, for example, nn2/host2@realm; see http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ref-263ee41f-a0a9-4dea-ad4a-b3c257b8e188.1.html
3. Bi-directional cross-realm trust must be set up. Correct trust setup can be tested by running an HDFS client on a node from cluster A and seeing whether you can put a file or list a directory on cluster B, and vice versa (credit: Robert Molina in the old Hortonworks Forums, post 49303).

Note: the key statement behind items #2 and #3 is that "It is important that each NodeManager can reach and communicate with both the source and destination file systems"; see https://hadoop.apache.org/docs/r2.7.1/hadoop-distcp/DistCp.html. Therefore the trust must be bi-directional.
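To illustrate the trust test in item 3 and the eventual copy (all host, realm, and path names here are hypothetical), the commands look along these lines once the trust is in place:

```shell
# authenticate in your local realm
kinit user@REALM-A.EXAMPLE.COM

# test bi-directional access: from a cluster-A node, touch cluster B
# (and repeat the mirror-image test from a cluster-B node)
hdfs dfs -ls hdfs://nn-b.clusterb.example.com:8020/tmp
hdfs dfs -put testfile hdfs://nn-b.clusterb.example.com:8020/tmp/

# then run the actual copy
hadoop distcp hdfs://nn-a.clustera.example.com:8020/data/src \
              hdfs://nn-b.clusterb.example.com:8020/data/dst
```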
11-24-2015
12:35 AM
1 Kudo
Regarding "you won't see any new mount points": It's important to distinguish between the NFS Gateway service and the NFS client, even though they can both be on the same machine. The NFS Gateway service exports mountpoints, i.e., makes them available for clients to mount. NFS clients mount them and make use of them as filesystems. It is true that for some applications it would be convenient to have the NFS mountpoint mounted on the cluster nodes, but this is client functionality, not part of Gateway setup. And for many other applications, it is more important to have the NFS mountpoint available for use by other hosts outside the Hadoop cluster -- which can't be managed by Ambari.
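For concreteness (the gateway hostname and local mount path here are hypothetical), the client-side step is the standard NFSv3 mount of the gateway's export, run on whichever host wants to see HDFS as a filesystem:

```shell
# run as root on any NFS client host, inside or outside the cluster;
# this is client-side work, not part of the Gateway setup
mkdir -p /hdfs_mount
mount -t nfs -o vers=3,proto=tcp,nolock,sync gateway-host.example.com:/ /hdfs_mount
ls /hdfs_mount   # browse HDFS like an ordinary filesystem
```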
11-20-2015
06:44 AM
1 Kudo
Note: Credit for the key piece of information to solve this problem goes to Phil D’Amore.

A customer had a problem writing a Java application that used the Hive client libraries to talk to two secure Hadoop clusters that resided in different Kerberos realms. The same problem could be encountered by a client connecting to a single secure Hadoop cluster that happened not to be in the Kerberos “default_realm” specified in the client host’s krb5.conf file. The same problem could occur for any Hadoop ecosystem client, not just Hive clients.

In order to communicate with two different secure Hadoop clusters in different Kerberos realms, the client application did the following things correctly:

- It harvested the needed configuration files (in this case, core-site.xml, hdfs-site.xml, and hive-site.xml) from each target cluster, and used the appropriate configuration when communicating with each respective cluster.
- Its application user id had two Kerberos principals, one registered and authenticated with each of the two KDCs, and it used the appropriate principal when authenticating to each respective cluster.
- On the client host, it had a krb5.conf file that correctly specified Kerberos kdc and admin_server values for each of the two target realms in the [realms] section, and set one of the realms as the “default_realm” in the [libdefaults] section. (It could also have set a third realm as the default_realm; it would just mean that both target clusters would be in non-default realms, which is also fine.)

However, when they ran the application, they had a puzzling problem: they were able to authenticate to the target cluster in the default realm, but failed with the target cluster in the non-default realm. Indeed, after the failure they found logs in the default_realm KDC that showed an incorrect attempt to authenticate to the wrong KDC. They knew they had not made a coding error, because changing the default_realm to the other target cluster caused the situation to reverse. Depending on the setting of default_realm in the krb5.conf file, they could talk to either cluster, but not both at once.

The problem was fixed by adding a [domain_realm] section to the krb5.conf file. It turns out that the Thrift libraries underlying the client have APIs that do not communicate the target “realm”, but only the target server. The Kerberos libraries are responsible for translating from the target server’s domain to the target realm. If the domain and the realm have identical string values (except for upper/lower case), which is common but not required, it will use that. Failing that, it will use the default realm. It will not infer from the domain of the KDC servers. In this case the domain and realm were different, so the authentication request for the non-default realm was being sent to the default realm’s KDC. Adding a [domain_realm] section to the krb5.conf file allows arbitrary mappings from target domains to target realms, so Kerberos was finally able to translate from the desired target domain to the correct target realm. See http://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html#domain-realm for details of the krb5.conf file sections and contents.
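As a hedged illustration of the fix (realm names, KDC hosts, and domains below are hypothetical), a krb5.conf along these lines maps each cluster's domain to its realm explicitly, so the translation no longer falls back to the default realm:

```ini
[libdefaults]
  default_realm = REALM-A.EXAMPLE.COM

[realms]
  REALM-A.EXAMPLE.COM = {
    kdc = kdc-a.clustera.example.com
    admin_server = kdc-a.clustera.example.com
  }
  REALM-B.EXAMPLE.COM = {
    kdc = kdc-b.clusterb.example.com
    admin_server = kdc-b.clusterb.example.com
  }

[domain_realm]
  .clustera.example.com = REALM-A.EXAMPLE.COM
  .clusterb.example.com = REALM-B.EXAMPLE.COM
```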
11-10-2015
06:19 PM
2 Kudos
I don't know from personal involvement, but it may be that all useful parts of the MLBase project have been absorbed into Spark ML, and no one chose to continue MLBase as a separate project. The MLBase project page itself says that MLlib is just the Spark project's MLlib; see https://github.com/apache/spark/tree/master/mllib. The MLBase project page also says, "Many features in MLlib have been borrowed from ML Optimizer and MLI." That suggests that a process of absorption was already happening in 2013, and perhaps after that there was insufficient motivation to continue developing ML Optimizer and MLI as separate components. In support of this idea, it appears that https://github.com/amplab/MLI/tree/master/src/main/scala/ml is a subset of the contents of https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml. In my brief effort I was not able to similarly track down remnants of the "ML Optimizer" code, but certainly there are optimizers throughout the Spark ML code, and they tend to be algorithm-specific, so there wouldn't be much motivation for grouping them into a discrete component. Hope this helps.
11-04-2015
07:39 PM
3 Kudos
Also, robustness of NFS implementation varies a lot by vendor, and consequently the level of parallelism allowed. A low-end NAS box will only be able to handle a few simultaneous accesses. Something higher-end, like a big NetApp box, can probably handle a lot more, although with performance impact as Neeraj said. Also, don't forget to look at the effective network bandwidth between individual VMs and the NFS server, and the total bandwidth utilizable by the NFS server. VMs often have severely restricted bandwidth allocations configured, and both VMs and the NFS server will be limited by the mathematical constraints of sharing physical resources.