Member since 10-22-2015 | 83 Posts | 84 Kudos Received | 13 Solutions
01-26-2016
07:33 PM
Hi
@sivasaravanakumar k, versions of Hive used in HDP:
HDP version: Hive version
2.0: 0.12.0
2.1: 0.13.0
2.2: 0.14.0
2.3: 1.2.1
My recommendation is simply to upgrade your whole stack to HDP-2.3. You'll be much happier with the performance and manageability of the newer version, for both Hive and the rest of the stack.
If your original install wasn't with Ambari, then you can't use Ambari to do the upgrade. (Installing Ambari after a manual HDP install is extremely complex; see
https://community.hortonworks.com/questions/6703/ambari-server-installation-after-cluster-setup.html). To manually upgrade from HDP-2.0 to HDP-2.3, follow the documents at http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_upgrading_hdp_manually/content/ch_upgrade_2_0.html
If the original install was via Ambari, then you're in great shape, but you should do the upgrade in two steps. The reason for this two-step process is that the newest Ambari-2.2 gives the best user experience of the upgrade, but it only goes back to HDP-2.1, not HDP-2.0. So steps 1(a) and 1(b) use your current Ambari to get you to HDP-2.1 first, which is a fairly simple step.
1(a) Your current Ambari is probably version 1.5.1 or better. If so, leave it. If not, upgrade Ambari to version 1.6.1. See
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.1.0/bk_upgrading_Ambari/content/index.html
(b) Then use Ambari to upgrade from HDP-2.0 to HDP-2.1, instructions also at
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.1.0/bk_upgrading_Ambari/content/index.html, or see the version of instructions for your current version of Ambari.
2. Finally, upgrade Ambari to the latest version Ambari-2.2, and use it to upgrade the stack to HDP-2.3. See
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_upgrading_Ambari/content/_ambari_upgrade_guide.html
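To make the decision path above easier to follow, here is a minimal Python sketch of it. The function name `plan_hdp_upgrade` and the step wording are hypothetical, for illustration only; the version table and the ordering of steps come from this reply, and the linked upgrade guides remain the authority.

```python
from typing import List

# Hive version shipped with each HDP release, as listed in this reply.
HIVE_BY_HDP = {"2.0": "0.12.0", "2.1": "0.13.0", "2.2": "0.14.0", "2.3": "1.2.1"}


def plan_hdp_upgrade(current_hdp: str, installed_with_ambari: bool) -> List[str]:
    """Return the upgrade steps suggested in this reply (illustrative only)."""
    if not installed_with_ambari:
        # A manual install has to be upgraded manually.
        return ["Manually upgrade to HDP-2.3 using the manual upgrade guide"]
    steps = []
    if current_hdp == "2.0":
        # Ambari 2.2 only goes back to HDP-2.1, so get to HDP-2.1 first.
        steps.append("Ensure Ambari is 1.5.1 or later (otherwise upgrade it to 1.6.1)")
        steps.append("Use the current Ambari to upgrade HDP-2.0 to HDP-2.1")
    steps.append("Upgrade Ambari to 2.2")
    steps.append("Use Ambari 2.2 to upgrade the stack to HDP-2.3")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan_hdp_upgrade("2.0", True), start=1):
        print(number, step)
```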
Hope this helps. If so, please mark it accepted. Regards.
12-18-2015
06:11 PM
Hadoop version 2.7.1.2.3.0.0-2557 is the version of Apache Hadoop included in our product release HDP-2.3.0.0. As Dhruv mentioned, source code for all our product releases is publicly available on GitHub, under "github.com/hortonworks/<component>-release/". Then you need to look under "Tags" for the specific release version desired. In this case, you can look under HDP-2.3.0.0-tag. (The link Dhruv provided is accidentally to version 2.3.3.0 rather than 2.3.0.0, although his pointing you at the pom.xml file was a good idea.) Our Windows and Linux releases share the same source code.
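As a rough illustration, the version string can be taken apart and turned into a tag location with a few lines of Python. Both helper names are hypothetical. The sketch assumes the Apache version is always the first three dot-separated numbers, and that the tag can be browsed with GitHub's usual `/tree/<tag>` URL form; neither assumption is stated in this reply.

```python
from typing import Tuple


def split_hdp_version(full_version: str) -> Tuple[str, str, str]:
    """Split e.g. '2.7.1.2.3.0.0-2557' into (Apache version, HDP version, build).

    Assumption: the Apache version is the first three dot-separated numbers.
    """
    numbers, _, build = full_version.partition("-")
    parts = numbers.split(".")
    return ".".join(parts[:3]), ".".join(parts[3:]), build


def release_tag_url(component: str, hdp_version: str) -> str:
    """Build a browse URL for a release tag (assumed GitHub /tree/<tag> form)."""
    return f"https://github.com/hortonworks/{component}-release/tree/HDP-{hdp_version}-tag"


if __name__ == "__main__":
    apache, hdp, build = split_hdp_version("2.7.1.2.3.0.0-2557")
    print(apache, hdp, build)
    print(release_tag_url("hadoop", hdp))
```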
12-15-2015
10:43 PM
In the meantime, the Namenode is aware of all complete blocks and the block in progress for each open file, and will deliver that information to clients that ask for it. The clients can successfully read as much data as has been hflush'ed so far, even if it goes beyond the length the Namenode knows about so far.
12-15-2015
10:42 PM
1 Kudo
Hi Mark, good catch. We're both right. It's complicated 🙂 The file browser UI is served by the Namenode. For as long as it takes to fill up a block, the Datanodes handle all the communication with the client, as well as replication. It is too expensive to update the Namenode every time a Datanode receives a write operation. Only when the block fills up and a new block allocation is needed, or when the file is closed and the block is terminated, is the Namenode updated. And blocks are usually 128MB or more.
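A small worked example may make the effect concrete. This is a simplified model of the behavior described here, not HDFS code; the function name and the default 128MB block size are assumptions for illustration.

```python
MB = 1024 * 1024


def namenode_reported_length(bytes_written: int,
                             block_size: int = 128 * MB,
                             closed: bool = False) -> int:
    """Length the Namenode knows about, under the simplified model above.

    While the file is open, only completed (full) blocks are counted.
    Once the file is closed, the full length is known.
    """
    if closed:
        return bytes_written
    return (bytes_written // block_size) * block_size


if __name__ == "__main__":
    written = 300 * MB
    # Two full 128MB blocks are complete; the third is still in progress.
    print(namenode_reported_length(written) // MB, "MB while open")
    print(namenode_reported_length(written, closed=True) // MB, "MB after close")
```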
12-14-2015
07:52 PM
2 Kudos
In this response, I will assume your ingest application is using the standard HDFS Client library. If your client is running on a server that is not part of the Hadoop cluster, then there is almost no practical limit to the number of open HDFS files. Each Datanode can have several thousand simultaneous open connections, and the open files will be distributed randomly among all Datanodes. Of course you should also consider if there are other loads on your cluster that might also be enthusiastic readers or writers.
BTW, if your client is actually running on a Datanode, which is not unusual in small operations or in laboratory setups, you should be aware that the first copy of all blocks of the files being written will be directed to the local Datanode as an optimization. In this case, you might want to limit the number of open HDFS files to 1000 or so, and/or distribute the ingest among several client instances on multiple Datanodes.
You should probably be more concerned about resources on your client, which will be subject to the local OS limit on the number of simultaneously open files being read for ingestion, and the number of simultaneously open connections to HDFS streams being written.
You mention "the more [streams] I keep open, the more small files I can pack into one HDFS file." This implies you are holding HDFS files open while waiting for ingest data to become available. This isn't really necessary, as you can use the append operation to extend previously closed files, in HDP-2.1 or newer. If you will have multiple client instances potentially trying to simultaneously append to the same HDFS file, however, I recommend you use HDP-2.3, as its version of Apache Hadoop is more efficient and has several bug fixes in append compared to previous versions. Of course, in all cases (create/write and append), only one client at a time has write access to any given file; any number of simultaneous readers are allowed, including during write.
Regarding buffering, for each output stream, the HDFS client buffers 64 KB chunks of data locally in RAM before flushing to the Datanode, unless you use explicit hflush calls. When the buffer fills, hflush is automatically called. After an hflush call returns successfully, the data is guaranteed to be in the Datanodes, safe from client failures, and available for reading by other clients.
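For the client-side OS limit mentioned earlier in this reply, a rough budget can be estimated from the process's open-file limit. The sketch below is only an estimate under stated assumptions: the helper names are hypothetical, each ingest stream is assumed to hold two descriptors (one input file plus one HDFS connection), and 64 descriptors are held back for everything else the process needs. It uses Python's standard `resource` module, which is available on Unix-like systems only.

```python
import resource


def max_ingest_streams(soft_limit: int, fds_per_stream: int = 2, reserved: int = 64) -> int:
    """Estimate how many ingest streams fit under an open-file limit.

    Assumptions (not from the original reply): each stream uses
    `fds_per_stream` descriptors, and `reserved` descriptors are kept free.
    """
    usable = max(soft_limit - reserved, 0)
    return usable // fds_per_stream


def current_soft_limit() -> int:
    """Return this process's soft limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    soft = current_soft_limit()
    if soft == resource.RLIM_INFINITY:
        print("No soft limit on open files")
    else:
        print("Soft limit:", soft, "-> about", max_ingest_streams(soft), "streams")
```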
11-25-2015
08:06 PM
4 Kudos
A search of the Hortonworks documentation indicates the following three requirements, besides a generally correct Kerberos setup:
1. Both clusters must be using Java 1.7 or better if you are using MIT Kerberos. Java 1.6 has too many known bugs with cross-realm trust; e.g., see ref http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7061379
2. The same principal name must be assigned to the NameNodes in both the source and the destination cluster. For example, if the Kerberos principal name of the NameNode in cluster A is nn/host1@realm, the Kerberos principal name of the NameNode in cluster B must be nn/host2@realm, not, for example, nn2/host2@realm; see ref http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Sys_Admin_Guides/content/ref-263ee41f-a0a9-4dea-ad4a-b3c257b8e188.1.html
3. Bi-directional cross-realm trust must be set up. Correct trust setup can be tested by running an hdfs client on a node from cluster A and seeing if you can put a file or list a directory on cluster B, and vice versa; credit Robert Molina in the old Hortonworks Forums, post-49303.
Note: the key statement for items #2 and #3 is that "It is important that each NodeManager can reach and communicate with both the source and destination file systems"; see ref https://hadoop.apache.org/docs/r2.7.1/hadoop-distcp/DistCp.html. Therefore the trust must be bi-directional.
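The principal-name requirement can be checked mechanically. Here is a minimal Python sketch, assuming the usual `primary/instance@REALM` principal layout; both function names are hypothetical, and the check compares only the primary (service) component, which is how the nn versus nn2 example reads.

```python
def principal_primary(principal: str) -> str:
    """Return the primary component of 'primary/instance@REALM'."""
    name, _, _realm = principal.partition("@")
    primary, _, _instance = name.partition("/")
    return primary


def namenode_principals_compatible(principal_a: str, principal_b: str) -> bool:
    """True if both NameNode principals use the same primary (service) name."""
    return principal_primary(principal_a) == principal_primary(principal_b)


if __name__ == "__main__":
    print(namenode_principals_compatible("nn/host1@realm", "nn/host2@realm"))
    print(namenode_principals_compatible("nn/host1@realm", "nn2/host2@realm"))
```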
11-24-2015
12:35 AM
1 Kudo
Regarding "you won't see any new mount points": It's important to distinguish between the NFS Gateway service and the NFS client, even though they can both be on the same machine. NFS services export mountpoints, i.e., make them available for clients to mount. NFS clients mount them and make use of them as filesystems. It is true that for some applications, it would be convenient to have the NFS mountpoint mounted on the cluster nodes, but this is client functionality, not part of Gateway setup. And for many other applications, it is more important to have the NFS mountpoint available for use by other hosts outside the Hadoop cluster -- which can't be managed by Ambari.
11-20-2015
06:44 AM
1 Kudo
Note: Credit for the key piece of information to solve this problem goes to Phil D’Amore.
A customer had a problem writing a Java application that used the Hive client libraries to talk to two secure Hadoop clusters that resided in different Kerberos realms. The same problem could be encountered by a client connecting to a single secure Hadoop cluster that happened not to be in the Kerberos “default_realm” as specified in the client host’s krb5.conf file. The same problem could occur for any Hadoop ecosystem client, not just Hive clients.
In order to communicate with two different secure Hadoop clusters, in different Kerberos realms, the client application did the following things correctly:
1. It harvested the needed configuration files (in this case, core-site.xml, hdfs-site.xml, and hive-site.xml) from each target cluster, and used the appropriate configuration when communicating with each respective cluster.
2. Its application user id had two Kerberos principals, one registered and authenticated with each of the two KDCs, and used the appropriate principal when authenticating to each respective cluster.
3. On the client host, it had a krb5.conf file that correctly specified Kerberos kdc and admin_server values for each of the two target realms in the [realms] section, and set one of the realms as the “default_realm” in the [libdefaults] section. (It could also have set a third realm as the default_realm; it would just mean that both target clusters would be in non-default realms, which is also fine.)
However, when they ran the application, they had a puzzling problem: They were able to authenticate to the target cluster in the default realm, but failed with the target cluster in the non-default realm. Indeed, after the failure they found logs in the default_realm KDC that showed an incorrect attempt to authenticate to the wrong KDC. They knew they had not made a coding error, because changing the default_realm to the other target cluster caused the situation to reverse. Depending on the setting of default_realm in krb5.conf file, they could talk to either cluster, but not both at once.
The problem was fixed by adding a [domain_realm] section to the krb5.conf file. It turns out that the Thrift libraries underlying the client have APIs that do not communicate the target “realm”, but only the target server. The Kerberos libraries are responsible for translating from the target server’s domain to the target realm. If the domain and the realm have identical string values (except for upper/lower case), which is common but not required, it will use that. Failing that, it will use the default realm. It will not infer from the domain of the KDC servers. In this case the domain and realm were different, so the authentication request for the non-default realm was being sent to the default realm’s KDC. Adding a [domain_realm] section to the krb5.conf file allows arbitrary mappings from target domains to target realms, so Kerberos was finally able to translate from the desired target domain to the correct target realm.
See http://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html#domain-realm for details of the krb5.conf file sections and contents.
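The lookup order described in this post can be modeled in a few lines of Python. This is a simplified illustration of the behavior as described here, not the actual MIT Kerberos implementation. The function name, host names, and realm names are all hypothetical, and mapping keys are assumed to be lowercase host names or domain names with a leading period, as in a krb5.conf [domain_realm] section.

```python
from typing import Dict, Set


def resolve_realm(hostname: str,
                  domain_realm: Dict[str, str],
                  known_realms: Set[str],
                  default_realm: str) -> str:
    """Pick the realm for a target server (simplified model, see lead-in)."""
    host = hostname.lower()
    labels = host.split(".")
    # 1. Explicit [domain_realm] mapping: exact host first, then
    #    ".domain" suffixes from most specific to least specific.
    if host in domain_realm:
        return domain_realm[host]
    for i in range(1, len(labels)):
        suffix = "." + ".".join(labels[i:])
        if suffix in domain_realm:
            return domain_realm[suffix]
    # 2. The server's domain matches a realm name, ignoring case.
    domain = ".".join(labels[1:]).upper()
    if domain in known_realms:
        return domain
    # 3. Otherwise fall back to the default realm.
    return default_realm


if __name__ == "__main__":
    realms = {"CORP.EXAMPLE.COM", "ANALYTICS.REALM"}
    target = "hive1.hadoop.example.net"
    # Without a mapping, the request goes to the default realm's KDC.
    print(resolve_realm(target, {}, realms, "CORP.EXAMPLE.COM"))
    # Equivalent krb5.conf fragment for the fix:
    #   [domain_realm]
    #       .hadoop.example.net = ANALYTICS.REALM
    mapping = {".hadoop.example.net": "ANALYTICS.REALM"}
    print(resolve_realm(target, mapping, realms, "CORP.EXAMPLE.COM"))
```

In this model the first call falls through to the default realm, which mirrors the failure the customer saw, and the second call resolves correctly once the mapping exists.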
11-10-2015
06:19 PM
2 Kudos
I don't know from personal involvement, but it may be that all useful parts of the MLBase project have been absorbed into Spark ML, and no one chose to continue MLBase as a separate project. The MLBase project page itself says that MLlib is just the Spark project's MLlib; see https://github.com/apache/spark/tree/master/mllib. The MLBase project page also says, "Many features in MLlib have been borrowed from ML Optimizer and MLI." That suggests that there was already a process of absorption happening in 2013, and perhaps after that there was insufficient motivation to continue developing ML Optimizer and MLI as separate components.
In support of this idea, it appears that https://github.com/amplab/MLI/tree/master/src/main/scala/ml is a subset of the contents of https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml. In my brief effort I was not able to similarly track down remnants of the "ML Optimizer" code, but certainly there are optimizers throughout the Spark ML code, and they tend to be algorithm-specific, so there wouldn't be much motivation for grouping them into a discrete component. Hope this helps.
11-04-2015
07:39 PM
3 Kudos
Also, the robustness of NFS implementations varies a lot by vendor, and consequently so does the level of parallelism they allow. A low-end NAS box will only be able to handle a few simultaneous accesses. Something higher-end, like a big NetApp box, can probably handle a lot more, although with performance impact as Neeraj said. Also, don't forget to look at the effective network bandwidth between individual VMs and the NFS server, and the total bandwidth utilizable by the NFS server. VMs often have severely restricted bandwidth allocations configured, and both VMs and the NFS server will be limited by the mathematical constraints of sharing physical resources.