Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 6987 | 09-21-2018 09:54 PM |
| | 8753 | 03-31-2018 03:59 AM |
| | 2629 | 03-31-2018 03:55 AM |
| | 2758 | 03-31-2018 03:31 AM |
| | 6188 | 03-27-2018 03:46 PM |
09-16-2016
08:33 PM
@Jitendra Yadav Correct. That is a maintenance release dealing with bugs in features that are already part of the release.
09-16-2016
08:19 PM
5 Kudos
@P D Here is the public repo for Ambari 2.4.0.1: http://s3.amazonaws.com/public-repo-1.hortonworks.com/index.html#/ambari/centos6/2.x/updates/2.4.0.1 Be patient! It may come up slowly. The Ambari 2.4.1 public repo is not yet available. You can build from source as documented here: https://cwiki.apache.org/confluence/display/AMBARI/Installation+Guide+for+Ambari+2.4.1 I'll update this response when the public repo becomes available. It should be soon.
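If you want to script the "be patient" part, here is a minimal sketch that polls the repo over plain HTTP until it answers. It assumes direct internet access (no proxy), and the direct path below is assumed to mirror the browsable index page linked above; it may differ slightly in your environment.

```python
# Minimal sketch: poll a repo URL until it responds with HTTP 200.
# The path is assumed from the browsable index page and may need adjusting.
import time
import urllib.request

REPO_URL = ("http://s3.amazonaws.com/public-repo-1.hortonworks.com/"
            "ambari/centos6/2.x/updates/2.4.0.1/")

def wait_for_repo(url, retries=10, delay=30):
    """Return True once the repo URL answers with HTTP 200, else False."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except Exception as exc:
            print(f"Attempt {attempt + 1}: repo not reachable yet ({exc})")
        time.sleep(delay)
    return False

if __name__ == "__main__":
    print("Repo available!" if wait_for_repo(REPO_URL) else "Still unavailable.")
```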
09-16-2016
08:08 PM
4 Kudos
@Anil Bagga, you can follow the upgrade documentation that @ssathish specified; however, I would like to emphasize a few important tasks that should be on your checklist (a sketch for scripting some of these checks with the Ambari REST API follows this reply). They may look like no-brainers, but unhealthy installations are often upgraded, and when issues occur it is much more difficult to debug what happened. As such, I always recommend checking your current installation and identifying and addressing issues before the upgrade. The purpose is to confirm that the cluster is healthy and will experience minimal service disruption before attempting an upgrade.

- Use the Ambari Web UI to ensure that all services in the cluster are running.
- For each service in the cluster, run the Service Check on the service's Service Actions menu to confirm that the service is operational. Service checks are used extensively during a rolling upgrade, so if they fail when run manually, they will likely fail during the upgrade too.
- For each service, use the Stop and Start buttons in the Ambari Web UI to verify that the service can be stopped and started. Services are repeatedly stopped and started during an upgrade; if they fail when initiated manually, they will likely fail during the upgrade too.
- Understand and, as necessary, remediate any Ambari alerts.
- Ensure that you have an up-to-date backup of any supporting databases, including Hive, Ranger, Oozie, and any others.
- Enable HDFS, YARN, HBase, and Hive HA to minimize service disruption.
- Ensure that each cluster node has at least 2.5 GB of disk space available for each HDP version. The target installation directory is /usr/hdp/<version>.
- Operationally, be aware that new service user accounts might be created to support new software projects that were not installed as part of the earlier HDP release. For example, these new user accounts might need to be added to an LDAP server or created as Kerberos principals.

Another thing that is extremely important is to understand the issues with your existing version and the workarounds in place, as well as the known issues of the new release and, if existing issues are not fixed, how you would port the fixes. Last but not least: test, test, test, and finalize your upgrade only when you are convinced that you did everything necessary to reduce risk. Until you FINALIZE you can always roll back; after that it is more difficult. As you know, with the recent versions of Ambari you can do either a Rolling or an Express Upgrade. It depends on your business requirements, but an Express Upgrade can be done during a maintenance window, while a Rolling Upgrade for a large cluster can take significant time. The current release of Ambari performs the upgrade sequentially, one node at a time; there is no parallelism. Good luck!
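As mentioned above, here is a minimal sketch of scripting the pre-upgrade health checks against the Ambari REST API. The server URL, cluster name ("mycluster"), and credentials are placeholders, and the exact fields returned can vary by Ambari version, so treat this as a starting point rather than a drop-in script.

```python
# Minimal sketch of a pre-upgrade health check against the Ambari REST API.
# Assumptions: Ambari 2.x, HTTP basic auth, and a cluster named "mycluster".
import base64
import json
import urllib.request

AMBARI_URL = "http://ambari-server:8080/api/v1/clusters/mycluster"
AUTH_HEADER = "Basic " + base64.b64encode(b"admin:admin").decode()  # placeholder creds

def ambari_get(path):
    """GET a path under the cluster resource and return the parsed JSON."""
    req = urllib.request.Request(AMBARI_URL + path,
                                 headers={"Authorization": AUTH_HEADER})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def services_not_started():
    """Return the names of services whose state is not STARTED."""
    data = ambari_get("/services?fields=ServiceInfo/state")
    return [s["ServiceInfo"]["service_name"]
            for s in data.get("items", [])
            if s["ServiceInfo"].get("state") != "STARTED"]

def outstanding_alerts():
    """Return labels of alerts currently in WARNING or CRITICAL state."""
    data = ambari_get("/alerts?fields=Alert/label,Alert/state")
    return [a["Alert"]["label"]
            for a in data.get("items", [])
            if a["Alert"].get("state") in ("WARNING", "CRITICAL")]

if __name__ == "__main__":
    print("Services not STARTED:", services_not_started() or "none")
    print("Outstanding alerts:", outstanding_alerts() or "none")
```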
09-16-2016
04:49 PM
2 Kudos
@Anil Bagga As @jk answered, Configuration Groups are the way to go. I will not repeat the references and the paragraph he pasted, but I would like to elaborate on the practical aspects of using Configuration Groups, not only for heterogeneous infrastructure but also for cases when an existing homogeneous infrastructure becomes heterogeneous due to hardware failures, e.g. failed drives, which can make your current infrastructure behave like a heterogeneous one. That is another scenario for using Configuration Groups. From the configuration point of view, your best approach is to use Configuration Groups to manage the different data node hardware profiles; however, there is more to it. With the new servers you may have more and faster storage on those nodes. Your YARN container sizing is defined globally (RAM and cores), so if some nodes can store more data, you need more cores and RAM on those nodes to process the data on all nodes with similar performance and avoid long-running tasks on some of the nodes, which in the case of MapReduce can lead to a loss of overall performance. Also, don't forget to balance the data across all nodes after you add the new data nodes. Third and not least, test your applications and monitor resource use across your infrastructure. There are always ways to improve performance by improving the design of the SQL or the application to leverage the infrastructure evenly for the best parallelism.
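To make the sizing point concrete, here is a rough sketch of how per-group NodeManager resources translate into container capacity. The node profiles and container size are made-up illustration values; the YARN property names in the comments are the standard ones, but your actual numbers will differ.

```python
# Rough sketch: estimate how many containers each node profile can run,
# given per-config-group NodeManager resources. All numbers are illustrative.

CONTAINER_MB = 4096      # global container size (e.g. yarn.scheduler.minimum-allocation-mb)
CONTAINER_VCORES = 1

node_profiles = {
    # config group     yarn.nodemanager.resource.memory-mb / .cpu-vcores
    "original_nodes": {"memory_mb": 96 * 1024,  "vcores": 24},
    "new_nodes":      {"memory_mb": 192 * 1024, "vcores": 48},
}

for group, res in node_profiles.items():
    by_memory = res["memory_mb"] // CONTAINER_MB
    by_vcores = res["vcores"] // CONTAINER_VCORES
    # A node can only run as many containers as its scarcer resource allows.
    print(f"{group}: {min(by_memory, by_vcores)} containers "
          f"(memory allows {by_memory}, vcores allow {by_vcores})")
```

The point of the sketch: if the new nodes hold twice the data but you only double their memory without the matching vcores (or the reverse), the scarcer resource caps the container count and the bigger nodes become the stragglers.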
09-16-2016
04:31 PM
4 Kudos
@Anil Bagga As jk also mentioned, the most recent Ambari version was just launched: 2.4.1. Documentation is on Ambari's website: https://ambari.apache.org and the roadmap is on the wiki: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30755705. However, I'd like you to look at the great presentation that Jeff Sposetti delivered at our New York meetup on 9/13/2016: http://www.slideshare.net/dbist/past-present-and-future-of-apache-ambari . As you can see in the presentation, a few of the features included in 2.4.1 are very exciting, e.g. Log Search (a centralized view of logs) and Automated Cluster Upgrade, with many more included or to come. Jeff's nearly 100-page presentation also includes the roadmap.
09-12-2016
05:59 PM
@gkeys Before discussing ORC, let's keep in mind that S3 is a very good storage option when it comes to cold storage. S3 data moves at about 50 Mbps (it could be more or less, but it is much slower than HDFS). It is a trade-off you have to make between speed and cost. Optimizations will only alleviate some of the performance difference between ORC in HDFS and ORC on S3; the data-movement limitations will still prevail.
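To put that throughput figure in perspective, a back-of-the-envelope sketch, taking the quoted ~50 Mbps at face value and a hypothetical 1 TB ORC dataset (both the rate and the dataset size are only illustrative):

```python
# Back-of-the-envelope: how long does it take to move data at the quoted rate?
# The 50 Mbps figure is from the reply above; the 1 TB dataset is hypothetical.
THROUGHPUT_MBPS = 50                    # megabits per second
DATASET_TB = 1

dataset_bits = DATASET_TB * 8 * 10**12  # terabytes -> bits (decimal units)
seconds = dataset_bits / (THROUGHPUT_MBPS * 10**6)
print(f"~{seconds / 3600:.1f} hours to move {DATASET_TB} TB at {THROUGHPUT_MBPS} Mbps")
# ~44.4 hours, which is why format-level optimizations can only go so far
# when the data has to cross the S3 link anyway.
```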
09-11-2016
03:54 PM
@Suzanne Dimant Good for you! I guess that if you look retrospectively, I responded to most of your question as it was stated.
09-10-2016
02:37 AM
3 Kudos
@kishore sanchina Watch this: http://scala-ide.org/download/current.html In principle the steps are:
1. Go to http://scala-ide.org/ and download the 32- or 64-bit Linux version to your Linux server/workstation. The latest version is 4.4.1.
2. Meet the JDK requirements: JDK 6, 7, or 8.
3. Copy the archive to your preferred folder and decompress it.
4. Find the eclipse executable and run it.
5. Follow the steps suggested in the splash screens.
Good luck! If a response addressed your question, don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.
09-10-2016
02:09 AM
5 Kudos
@mliem I will tackle your questions in order, for your 2 + 3 cluster:
1. If you want HA for your 2 + 3 cluster, then 3 ZooKeepers; if HA is not a requirement, then 1. The number needs to be odd to meet quorum requirements; even numbers can lead to split-brain.
2. ZooKeeper Server is considered a MASTER component in Ambari terminology. As such, use your two master nodes and one data node to meet the three-ZooKeeper requirement for HA. If non-HA, place it on one of your master nodes.
3. There are no other options in your case. If three master nodes were available, the third ZooKeeper should go on the third master node.
I simplified my responses using the assumption that you would use just basic services like HDFS and Hive. Even if you were using Kafka and Storm, the responses wouldn't change for your very small cluster. If the cluster were larger, you could consider allocating separate ZooKeepers for Kafka and Storm. If you were using HBase, the story would be slightly different, even for your small cluster: Apache HBase by default manages a ZooKeeper "cluster" for you, starting and stopping the ZooKeeper ensemble as part of the HBase start/stop process. For HBase, it's better to have the ZooKeepers on the region servers, which in your case would probably be what you call data nodes. A 2 + 3 cluster that runs many services is a stretch for best practices. If any of the responses to your question helped, don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.
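On the "odd number" point, a tiny sketch of the standard ZooKeeper majority-quorum arithmetic (nothing cluster-specific assumed here):

```python
# ZooKeeper majority quorum: an ensemble of n servers tolerates n - (n // 2 + 1)
# failures. Adding a fourth server buys no extra tolerance over three, which is
# why odd ensemble sizes are recommended.
def quorum(n):
    return n // 2 + 1

for n in (1, 3, 4, 5):
    print(f"{n} servers: quorum={quorum(n)}, tolerated failures={n - quorum(n)}")
# 1 servers: quorum=1, tolerated failures=0
# 3 servers: quorum=2, tolerated failures=1
# 4 servers: quorum=3, tolerated failures=1   <- no better than 3
# 5 servers: quorum=3, tolerated failures=2
```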
09-09-2016
11:46 PM
3 Kudos
@Deepesh Bhatia Before anything else: many containers AVAILABLE and SIZED to do the job is a GOOD THING! Many containers needed and underused (BADLY SIZED) is not a GOOD THING! What you want is the least amount of resources for the fastest-executed job. I intentionally did not say the least number of containers; all that matters is using optimal resources for performance. Let's get to your questions:

1. One container is used for each task. A mapper is a task; a reducer is a task. If a query is translated into 10 mappers and 1 reducer, that is 10 + 1 = 11 containers to finish the job. If your default queue (let's assume) has 64 GB of RAM and 64 cores, and you set the memory per container to 1 GB and the number of cores per container to 1, YARN can allocate up to 64 containers across all the jobs running at the same time. Containers can be reused (that setting is true by default) to reduce the overhead of creating new containers. If you understand how much data is processed per task and determine that it is, let's say, up to 512 MB, then you could reduce the memory allocated per container, but that won't make sense if you don't have more cores to take advantage of 128 containers x 512 MB. If you had 128 cores, then you could have 128 containers. You can reduce the number of containers by increasing the RAM and cores per container, but why would you do that? The point is that you need to create containers of the best size to handle your job mix. If they have too many resources allocated and underused, you waste resources; if they are too small to process the data per task, you have bottlenecks. The best practice is to set a container size globally that meets the majority of requirements; at the individual job level you can override it based on the job's needs. (See the sketch after this reply for the container-count arithmetic.)

2. You have the root queue, and default is a child of the root queue. Assuming that you create another queue and split the resources of the root queue as 50% for default and 50% for the new queue (simplified), if you don't specify which queue to submit the job to, the job will be submitted to default. Assuming that you delete the default queue, you always have to set the execution queue; if you forget, since there is no default queue, your job will just hang, because it has no resources allocated, until you kill it and resubmit it to a specific queue. The Resource Manager UI shows the load per queue, including the number of containers and their RAM and core utilization. Each job shows up in the queue it was submitted to.

3. Have enough resources available for those tasks. If you have applications that cannot wait for execution, you need to either create a queue that guarantees those resources, increase the overall resources of your cluster, or optimize your jobs to use fewer resources.

References: YARN (what it is, what it does, etc.): http://hortonworks.com/apache/yarn/#section_1 The Resource Manager is also covered in the documentation and is available from the Ambari UI via YARN and Quick Links at the top/center of the screen. If any of the responses addressed your question, please don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.
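Here is the minimal sketch of the container arithmetic from answer 1, assuming only the illustrative queue used in the reply (64 GB and 64 vcores) and nothing about your actual cluster:

```python
# Sketch of the arithmetic in answer 1: a queue with 64 GB and 64 vcores,
# and a per-container size that you choose. All values are illustrative.
QUEUE_MEMORY_GB = 64
QUEUE_VCORES = 64

def max_concurrent_containers(container_gb, container_vcores):
    # The queue is limited by whichever resource runs out first.
    return min(QUEUE_MEMORY_GB // container_gb, QUEUE_VCORES // container_vcores)

def containers_for_query(mappers, reducers):
    # One container per task: each mapper and each reducer needs one.
    return mappers + reducers

print(max_concurrent_containers(1, 1))    # 64 containers of 1 GB / 1 vcore
print(max_concurrent_containers(0.5, 1))  # still 64: vcores are now the bottleneck
print(containers_for_query(10, 1))        # the 10 + 1 = 11 example from the reply
```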