Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4036 | 10-11-2017 09:33 PM |
| | 3562 | 10-11-2017 07:46 PM |
| | 2569 | 08-04-2017 01:37 PM |
| | 2207 | 08-03-2017 03:36 PM |
| | 2235 | 08-03-2017 12:52 PM |
10-08-2016
04:17 PM
1 Kudo
Plan and Assess
This is purely a planning step; the expected deliverable is an Upgrade Plan.
Gather all details about the existing environment to plan the upgrade path and
associated upgrade tasks.
1) Determine Upgrade Path
Based on the current and target versions of the HDP stack, and on whether
Ambari is used, select the supported upgrade guide from the Hortonworks
documentation site. Identify key requirements, such as whether NameNode HA
(or other HA configurations) or security needs to be disabled during the upgrade.
Current version:
● HDP Stack version
● Ambari version (if Ambari is used)
● OS Version
Target version:
● HDP Stack version
● Ambari version (if Ambari is used)
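A minimal sketch for collecting the current-version details above on a cluster node (assumes an RPM-based OS and an HDP installation managed by hdp-select; run on the relevant hosts):
hdp-select versions                  # installed HDP stack version(s)
rpm -q ambari-server ambari-agent    # Ambari versions, on the hosts where each is installed
cat /etc/*release                    # OS version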
Below are some useful links:
HDP Stacks Managed by Different Ambari Versions:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.0/bk_ambari-installation/content/determine_stack_compatibility.html
Upgrading to Ambari 2.4:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.1/bk_ambari-upgrade/content/upgrading_ambari.html
Upgrading HDP Using Ambari:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.1/bk_ambari-upgrade/content/upgrading_hdp_stack.html
Upgrading HDP Manually (without Ambari):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_command-line-upgrade/content/ch_upgrade_2_4.html
2) Review Known Issues in Target Version Release
Review the following items:
● Behavioral changes that will affect applications
● Unsupported features
● Known issues
● New features added to the release
HDP 2.5 Release Notes:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_release-notes/content/ch_relnotes_v250.html
HDP 2.5 Known Issues:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_release-notes/content/known_issues.html
3) Select Validation Applications
Select two groups of validation applications.
First group: industry-standard benchmarks such as Teragen & Terasort,
TestDFSIO, Hive TPC-DS, and HBase performance tests. At a minimum, use
Teragen & Terasort, with multiple mappers for Teragen and multiple reducers
for Terasort (a sketch follows this step).
Second group (optional): user-defined validation applications. Identify
representative applications (together with their input data) that are used
most often. Be sure to include at least one for every Hadoop component in
use, such as MapReduce, Hive, Pig, HBase, Oozie, Storm, and Kafka.
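As a minimal sketch, a Teragen & Terasort run might look like the following (the examples-jar path is typical for HDP installs but may differ; the row count, task counts, and HDFS paths are placeholders to adjust):
EXAMPLES_JAR=/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
# generate 10^9 rows (~100 GB at 100 bytes per row) with 100 map tasks
hadoop jar "$EXAMPLES_JAR" teragen -Dmapreduce.job.maps=100 1000000000 /tmp/teragen
# sort the generated data with 100 reduce tasks
hadoop jar "$EXAMPLES_JAR" terasort -Dmapreduce.job.reduces=100 /tmp/teragen /tmp/terasort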
4) Finalize Project Management Items
● Scope: Identify clusters to be upgraded and components to be upgraded or newly installed (if any).
● HR: Staff the upgrade teams. Some validation applications can also be run by the developers themselves.
● Time: Identify upgrade tasks, the timeline, and task owners.
● QA: Carefully identify validation tasks.
● Risk: Estimate the downtime for each cluster upgrade.
● Resources: Prepare the cluster on which the upgrade will be tested (e.g., Dev). When upgrading production clusters, it is strongly recommended to attempt the upgrade first on a test cluster.
**See Also**
HDP Upgrade Best Practices - 2) Do the Upgrade
HDP Upgrade Best Practices - 3) Documentation and Learnings
09-23-2016
01:36 PM
Thanks @deepak sharma. We're still not on HDP 2.5. Does this apply to HDP 2.3.4 and 2.4.2, or is it only 2.5+? Also, can we connect to a secure Solr instance rather than SolrCloud?
09-23-2016
01:29 PM
My client is using Solr for Ranger audit logs. It appears that enabling Solr results in a Solr instance devoid of any security. What are the recommended paths to secure this particular instance of Solr?
Labels: Apache Ranger, Apache Solr
09-09-2016
01:44 AM
14 Kudos
**Disable Transparent Huge Pages (THP)**
Transparent Huge Pages (THP) is a Linux memory management feature that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on machines with large amounts of memory by using larger memory pages. However, THP is known to perform poorly on Hadoop clusters and can result in excessively high CPU utilization. Disable THP to reduce system CPU utilization on your worker nodes. This can be done by ensuring that both THP proc entries are set to [never] instead of [always].
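A minimal sketch for disabling THP at runtime (the path assumes a mainline kernel; on RHEL/CentOS 6 it is /sys/kernel/mm/redhat_transparent_hugepage instead):
echo never > /sys/kernel/mm/transparent_hugepage/enabled   # stop allocating huge pages
echo never > /sys/kernel/mm/transparent_hugepage/defrag    # stop defragmenting into huge pages
These settings do not survive a reboot, so also add them to an init script such as /etc/rc.local to make them persistent.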
**Use Recommended File System Types**
Some file systems offer better performance and stability than others. As such, the HDFS dfs.datanode.data.dir and YARN yarn.nodemanager.local-dirs properties should be configured to use mount points that are formatted with the most optimal file systems. Take a look at this article on file system choices: https://community.hortonworks.com/articles/14508/best-practices-linux-file-systems-for-hdfs.html
**Disable Host Swappiness**
The Linux kernel provides a tunable setting, called swappiness, that controls how often swap space is used. A swappiness of 0 means the disk will be avoided unless absolutely necessary (when the host runs out of memory), while a swappiness of 100 means programs will be swapped to disk almost instantly. Reducing swappiness reduces the likelihood that the kernel will push application memory out to swap space, which is much slower than RAM because it is backed by disk. Processes that are swapped to disk are likely to experience pauses, which may cause issues and missed SLAs. Add `vm.swappiness=0` to /etc/sysctl.conf and reboot for the change to take effect, or change the value while the system is running with `sysctl -w vm.swappiness=0`. You can also clear swap without rebooting by running `swapoff -a` and then `swapon -a` as root.
**Improve Virtual Memory Usage**
The vm.dirty_background_ratio and vm.dirty_ratio parameters control the percentage of system memory that can be filled with memory pages still waiting to be written to disk. Ratios that are too small force frequent I/O operations, while ratios that are too large leave too much data in volatile memory, so optimizing this ratio is a careful balance between efficient I/O and reducing the risk of data loss. Update vm.dirty_background_ratio=20 and vm.dirty_ratio=50 in /etc/sysctl.conf and reboot for the changes to take effect, or apply them to the running system with `sysctl -p`.
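As a minimal sketch, the virtual-memory settings above can be persisted and applied in one pass (the values are this article's recommendations; tune them for your workload; run as root):
# append the recommended VM settings to /etc/sysctl.conf, then load them
cat >> /etc/sysctl.conf <<'EOF'
vm.swappiness=0
vm.dirty_background_ratio=20
vm.dirty_ratio=50
EOF
sysctl -p   # apply to the running kernel without a reboot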
**Configure CPUs for Performance Scaling**
CPU frequency scaling is configurable and commonly defaults to favoring power saving over performance. For Hadoop clusters, it is important to configure the CPUs for performance. Set the scaling governors to performance, which means running the CPUs at maximum frequency. To do so, run `cpufreq-set -r -g performance`, or edit /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor and set the content to 'performance'.
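A minimal sketch of the sysfs approach, for hosts without cpufrequtils installed (assumes the cpufreq driver is loaded and exposes these files; run as root):
# set every CPU's scaling governor to performance
for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$gov"
done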
**Tune SSD Configurations**
SSDs provide a great performance boost, and if configured optimally for Hadoop workloads they can provide even better results. The I/O scheduler, read-ahead buffer, and number of queued requests are the parameters to consider for tuning. Refer to the following link for further details: https://wiki.archlinux.org/index.php/Solid_State_Drives#I.2FO_Scheduler
For each SSD device, set the following ({{device}} is the device's sysfs path, e.g. /sys/block/sdb):
echo 'deadline' > {{device}}/queue/scheduler
echo '256' > {{device}}/queue/read_ahead_kb
echo '256' > {{device}}/queue/nr_requests
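As a minimal sketch over a hypothetical device list (replace sda/sdb with your actual SSD device names; run as root):
# apply the SSD tuning above to each listed device
for dev in sda sdb; do
  echo deadline > /sys/block/$dev/queue/scheduler
  echo 256 > /sys/block/$dev/queue/read_ahead_kb
  echo 256 > /sys/block/$dev/queue/nr_requests
done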
**You might also be interested in the following articles:** HDFS Settings for Better Hadoop Performance
09-01-2016
02:13 PM
For Secure Impersonation / proxyusers, is there a way to blacklist certain users so even if they are added to the group, they won’t be allowed to be impersonated?
09-01-2016
05:51 AM
1 Kudo
Neither. Cloudbreak 2, which will be launched in a few weeks, is the appropriate version for deploying HDP 2.5.
09-01-2016
05:50 AM
2 Kudos
While Cloudbreak 1.3 is available, the first link is more accurate in that it is in Technical Preview. Cloudbreak 2 is expected to launch in a few weeks, so if possible I would wait for that, especially if you're interested in deploying HDP 2.5.
08-31-2016
02:40 PM
1 Kudo
@sandrine G HDP 2.5 includes both Hive 1.2.1 and Hive 2.1. However, Hive 2.1 is in technical preview and is not supported. It can be enabled from Ambari if you'd like to give it a try: http://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
08-29-2016
04:51 PM
1 Kudo
Can someone please share how to use distcp + Oozie (not Falcon) for cluster DR/replication? My understanding is that the entire distcp job will fail if any file in the path is being written to, and that the best way around that is to run distcp against snapshots (see the sketch below). But what is the entire end-to-end process? Also, what checks can be done on the DR cluster to ensure the job succeeded and that the data is in sync with the metastore?
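To illustrate, a minimal sketch of the snapshot-based approach I have in mind (assumes both directories are snapshottable and the target already matches source snapshot s1; hostnames, paths, and snapshot names are placeholders):
# take a new snapshot on the source, then copy only the delta since the last sync
hdfs dfs -createSnapshot /data/src s2
hadoop distcp -update -diff s1 s2 hdfs://source-nn:8020/data/src hdfs://dr-nn:8020/data/src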
Labels: Apache Hadoop
08-29-2016
04:08 PM
1 Kudo
We have an application (Datameer) that requires superuser access by being a member of the HDFS supergroup. What options are available for securing/restricting that user's access to files and folders on HDFS? With Ranger 0.6+ (HDP 2.5+) we can use Deny or Exclude Conditions (https://cwiki.apache.org/confluence/display/RANGER/Deny-conditions+and+excludes+in+Ranger+policies), but what do we do with previous versions like HDP 2.4 (Ranger 0.5.2)?
Labels: Apache Ranger