Member since: 09-29-2015
Posts: 286
Kudos Received: 601
Solutions: 60
04-13-2018
03:08 PM
@Dominika Bialek Looks like the CloudBreak 2.5 docs were removed.
02-10-2017
05:21 AM
This was awesome, Tim.
01-19-2017
04:10 PM
3 Kudos
HDB 2.1.1 References:
http://hdb.docs.pivotal.io/211
http://hdb.docs.pivotal.io/211/hdb/releasenotes/HAWQ211ReleaseNotes.html
http://hdb.docs.pivotal.io/211/hdb/install/install-ambari.html
Download HDB from Hortonworks at http://hortonworks.com/downloads/ or directly from Pivotal at https://network.pivotal.io/products/pivotal-hdb (you need to create a Pivotal account).

What to look out for:
If you use only 1 master node, you cannot have both a HAWQ Master and a Standby Master.
If you install the HAWQ Master on the same node as Ambari, you need to change the Postgres port from 5432 during install prep.
Ensure that httpd is installed and running:
yum install httpd
sudo service httpd status
sudo service httpd start

Get and Install the Repo
Log onto Pivotal and download hdb-2.1.1.0-7.tar.

/* On the Ambari node */
1. mkdir /staging
2. chmod a+rx /staging
3. scp -i <<your key>> -o 'StrictHostKeyChecking=no' hdb-2.1.1.0-7.tar root@<<ambarinode>>:/staging
4. cd /staging
   tar -xvf hdb-2.1.1.0-7.tar
   cd /staging/hdb-2.1.1.0
   ./setup_repo.sh
/* You should see the message "hdb-2.1.1.0 Repo file successfully created at /etc/yum.repos.d/hdb-2.1.1.0.repo." */
5. yum install -y hawq-ambari-plugin
6. cd /var/lib/hawq
7. ./add-hawq.py --user admin --password admin --stack HDP-2.5
/* Use the form above if the repo is on the same node as Ambari; otherwise point to where the repo lives: */
   ./add-hawq.py --user <admin-username> --password <admin-password> --stack HDP-2.5 --hawqrepo <hdb-2.1.x-url> --addonsrepo <hdb-add-ons-2.1.x-url>
8. ambari-server restart

Configurations During Install with Ambari
Set vm.overcommit_memory to 0 if you plan to run Hive and/or LLAP on the same cluster; don't follow the Pivotal docs and set this to 2, else your Hive processes will have memory issues.

Advanced hdfs-site:
Property                                  Setting
dfs.allow.truncate                        true
dfs.block.access.token.enable             false for an unsecured HDFS cluster, or true for a secure cluster
dfs.block.local-path-access.user          gpadmin
dfs.client.read.shortcircuit              true
dfs.client.socket-timeout                 300000000
dfs.client.use.legacy.blockreader.local   false
dfs.datanode.handler.count                60
dfs.datanode.socket.write.timeout         7200000
dfs.namenode.handler.count                600
dfs.support.append                        true

Advanced core-site:
Property                            Setting
ipc.client.connection.maxidletime   3600000
ipc.client.connect.timeout          300000
ipc.server.listen.queue.size        3300

Some HAWQ Resources
Date type formatting functions: https://www.postgresql.org/docs/8.2/static/functions-formatting.html
Date/time functions: https://www.postgresql.org/docs/8.2/static/functions-datetime.html
HAWQ date functions: http://tapoueh.org/blog/2013/08/20-Window-Functions
HAWQ is better with dates; it can automatically handle '08/01/2016' and '01-Aug-2016'.
PostgreSQL cheat sheet commands: http://www.postgresonline.com/downloads/special_feature/postgresql83_psql_cheatsheet.pdf
System catalog tables: http://hdb.docs.pivotal.io/131/docs-hawq-shared/ref_guide/system_catalogs/catalog_ref-tables.html

HAWQ Toolkit
Make sure to make use of the HAWQ Toolkit: http://hdb.docs.pivotal.io/211/hawq/reference/toolkit/hawq_toolkit.html
How to find the data files for specific tables: https://discuss.pivotal.io/hc/en-us/articles/204072646-Pivotal-HAWQ-find-data-files-for-specific-tables
Size of a table on disk:
select * from hawq_toolkit.hawq_size_of_table_disk;
Size of a database:
select sodddatname, sodddatsize/(1024*1024) from hawq_toolkit.hawq_size_of_database;
Size of partitioned tables:
select * from hawq_toolkit.hawq_size_of_partition_and_indexes_disk;
Tip to find how many segments a HAWQ table is spread across:
SELECT gp_segment_id, COUNT(*)
FROM <<table>>
GROUP BY gp_segment_id
ORDER BY 1;

Creating Tables <<TBD>
Make SURE you ANALYZE AFTER you create the table. As an example:
vacuum analyze device.priority_counter_hist_rand;

Loading Data to Tables <<TBD>

Potential HAWQ Errors

"Too many open files in system"
To fix this, check the value of fs.file-max in /etc/sysctl.conf. If it is configured lower than the total number of open files for the entire system at a given point (lsof | wc -l), you need to increase it. To do so, follow these steps:
1. Check the open file counts:
   lsof | wc -l
   ulimit -a | grep open
2. Edit the following line in the /etc/sysctl.conf file:
   fs.file-max = value   # value is the new file descriptor limit that you want to set
3. Apply the change by running:
   /sbin/sysctl -p

We can disable memory over-commit temporarily:
echo 0 > /proc/sys/vm/overcommit_memory
For a permanent solution, add vm.overcommit_memory = 0 in /etc/sysctl.conf:
#fs.file-max=65536
fs.file-max=2900000
#Added for Hortonworks HDB
kernel.threads-max=798720
vm.overcommit_memory=0
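As a sanity check before editing /etc/sysctl.conf, the sizing decision can be sketched in shell. The suggest_file_max helper and its 4x headroom factor are my own illustration, not from any official guide; the 2900000 floor is simply the value this post settles on.

```shell
#!/bin/sh
# suggest_file_max <current_open_files>
# Suggest an fs.file-max with ~4x headroom over the current open-file count,
# never below the 2900000 used above. (Hypothetical helper for illustration.)
suggest_file_max() {
  open=$1
  floor=2900000
  suggested=$(( open * 4 ))
  if [ "$suggested" -lt "$floor" ]; then
    suggested=$floor
  fi
  echo "$suggested"
}

# Feed it the live count; compare against the current kernel limit.
current_open=$(lsof 2>/dev/null | wc -l)
echo "open files now:  $current_open"
echo "current limit:   $(cat /proc/sys/fs/file-max 2>/dev/null)"
echo "suggested limit: $(suggest_file_max "$current_open")"
```

You would then put the suggested value into /etc/sysctl.conf and run /sbin/sysctl -p as shown above.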
11-11-2016
06:18 PM
5 Kudos
Here are the requirements:
Total data size - uncompressed: 13.5 TB; compressed: 2 TB
A large virtual fact table: a view containing a UNION ALL of 3 large tables, 11 billion records in total
Another view taking the large virtual fact table, with consecutive LEFT OUTER JOINs on 8 dimension tables, so that no matter what, 11 billion records is always the result
There is timestamp data that you can use to filter rows by

Suppose you were given the above. How would you begin configuring Hortonworks for Hive? Would you focus on storage? How can we configure for compute?

Let's assume:
Platform: AWS
Data node instance: r3.4xlarge
Cores: 16
RAM: 122 GB
EBS storage: 2 x 1 TB disks

So where do we begin? First, some quick calculations:
Memory per core: 122 GB / 16 = 7.625; approximately 8 GB per CPU core
This means our largest container size per core on a node is 8 GB.

However, we should not reserve all 16 cores for Hadoop; some cores are needed for the OS and other processes. Let's assume 14 cores are reserved for YARN.

Memory allocated for all YARN containers on a node = number of virtual cores x memory per core
114688 MB = 14 * 8192 MB (8 * 1024)

Note also that at 8 GB we can run 14 tasks (mappers or reducers) in parallel, one per CPU, without wasting RAM. We can certainly run container sizes smaller than 8 GB if we wish. Since our optimal container size per node is 8 GB, our YARN minimum container size must be a factor of 8 GB to prevent wastage of memory, that is: 1, 2, 4 or 8 GB. The Tez container size for Hive, in turn, is a multiple of the YARN minimum container size.
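The sizing arithmetic above can be sketched as a quick shell calculation. The figures are this example's r3.4xlarge numbers (16 cores, 122 GB), not defaults from any tool:

```shell
#!/bin/sh
# YARN memory sizing for the example node: 16 cores, 122 GB RAM.
cores_total=16
cores_for_yarn=14       # leave ~2 cores for the OS and other processes
mem_per_core_mb=8192    # 122 GB / 16 cores = 7.625 GB, rounded to 8 GB

# Total memory YARN may allocate to containers on one node
yarn_node_mem_mb=$(( cores_for_yarn * mem_per_core_mb ))
echo "yarn.nodemanager.resource.memory-mb = $yarn_node_mem_mb"   # 114688

# With 8 GB containers we can run one task per reserved core in parallel
echo "parallel 8 GB tasks per node = $(( yarn_node_mem_mb / mem_per_core_mb ))"
```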
08-12-2016
01:11 AM
Here is also a good article:
https://community.hortonworks.com/articles/22756/quickly-enable-ssl-encryption-for-hadoop-component.html
05-20-2016
05:23 PM
See also https://github.com/steveloughran/kerberos_and_hadoop/blob/master/sections/errors.md
03-11-2016
12:43 AM
40 Kudos
How does Tez determine the number of reducers? How can I control this for performance? In this article I will attempt to answer this while executing and tuning an actual query to illustrate the concepts. Then I will provide a summary with a full explanation. If you wish, you can skip ahead to the summary.
-------------
0. Prep Work and Checklist

We followed the Tez memory tuning steps outlined in https://community.hortonworks.com/content/kbentry/14309/demystify-tez-tuning-step-by-step.html and set up our environment, turning CBO and vectorization on:
set hive.support.sql11.reserved.keywords=false;
set hive.execution.engine=tez;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.vectorized.execution.reduce.groupby.enabled = true;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
We created ORC tables and did an INSERT OVERWRITE into a table with partitions:

set hive.exec.dynamic.partition.mode=nonstrict;
# With many partition columns there is a danger of generating many broken files in ORC. To prevent that:
set hive.optimize.sort.dynamic.partition=true;
# If Hive jobs previously ran much faster than in the current release, look into setting this property to false instead:
# set hive.optimize.sort.dynamic.partition=false;
insert overwrite table benchmark_rawlogs_orc partition (p_silo,p_day,p_clienthash)
select * FROM <original table>;
We generated the statistics we needed for the query execution:

-- generate statistics for the ORC table
set hive.stats.autogather=true;
-- to generate statistics for the entire table and all columns for all days (longer):
ANALYZE TABLE rawlogs.benchmark_rawlogs_orc partition (p_silo, p_day, p_clienthash) COMPUTE STATISTICS;
ANALYZE TABLE rawlogs.benchmark_rawlogs_orc partition (p_silo, p_day, p_clienthash) COMPUTE STATISTICS for columns;
--------------------------------
1. First Execution of the Query

Here we can see that 61 mappers were created. This is determined by the grouped splits and, if not grouped, most likely corresponds to the number of files or split sizes in the ORC table. For a discussion of how Tez determines the number of mappers, see "How are Mappers Determined For a Query" and "How initial task parallelism works".

The mappers complete quickly, but the execution is stuck at 89% for a long time. We observe that there are three vertices in this run: one mapper stage and two reducer stages. The first reducer stage has ONLY two reducers, and they have been running forever? Hmmmm...

The query finally completed in 60 secs. What gives? Why only 2 reducers? Let's look at the explain plan.

-------------------------------------------------------
2. The LONGGGGGG Explain Plan

Let's look at the relevant portions of this explain plan. We see in red that in the reducer stage, 14.5 TB of data, across 13 million rows, is processed. This is a lot of data to funnel through just two reducers. The final output of the reducers is just 190944 bytes (in yellow), after the initial group-bys of count, min and max. We need to increase the number of reducers.

-------------------------------------------
3. Set Tez Performance Tuning Parameters

When Tez executes a query, it initially determines the number of reducers it needs and automatically adjusts it as needed based on the number of bytes processed.

- Manually setting the number of reducers (not recommended)

To manually set the number of reducers we can use the parameter mapred.reduce.tasks. By default it is set to -1, which lets Tez automatically determine the number of reducers. You can manually set it to the number of reducer tasks (not recommended):
> set mapred.reduce.tasks = 38;
It is better to let Tez determine this and make the proper changes within its framework, instead of using this brute-force method:
> set mapred.reduce.tasks = -1;
- How to Properly Set the Number of Reducers

First we double-check that auto reducer parallelism is on. The parameter is hive.tez.auto.reducer.parallelism (see https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.tez.auto.reducer.parallelism). It is set to true.

# Turn on Tez' auto reducer parallelism feature. When enabled, Hive will still estimate data sizes and set parallelism estimates. Tez will sample the source vertices' output sizes and adjust the estimates at runtime as necessary.
> set hive.tez.auto.reducer.parallelism;
> set hive.tez.auto.reducer.parallelism = true;

This is the first property that determines the initial number of reducers once Tez starts the query. Then there are two boundary parameters:
hive.tez.min.partition.factor
hive.tez.max.partition.factor

# When auto reducer parallelism is enabled, this factor will be used to put a lower limit on the number of reducers that Tez specifies.
> hive.tez.min.partition.factor=0.25;
# When auto reducer parallelism is enabled, this factor will be used to over-partition data in shuffle edges.
> hive.tez.max.partition.factor=2.0;
More on this parameter later.
The third property is hive.exec.reducers.max, which determines the maximum number of reducers. By default it is 1099. The final parameter that determines the initial number of reducers is hive.exec.reducers.bytes.per.reducer. By default it is set to 256 MB, specifically 258998272 bytes.

The FORMULA

So, to put it all together, Hive/Tez estimates the number of reducers using the following formula and then schedules the Tez DAG:

Max(1, Min(hive.exec.reducers.max [1099], ReducerStage estimate / hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor [2]

------------------
So in our example, since the reducer-stage output is 190944 bytes, the number of reducers will be:
> Max(1, Min(1099, 190944/258998272)) x 2
> Max(1, Min(1099, 0.00073)) x 2 = 1 x 2 = 2
Hence the 2 reducers we initially observe.

---------------------
4. Increasing the Number of Reducers, the Proper Way

Let's lower hive.exec.reducers.bytes.per.reducer to 10432 bytes. The new reducer count is:
> Max(1, Min(1099, 190944/10432)) x 2
> Max(1, Min(1099, 18.3)) x 2 = 19 (rounded up) x 2 = 38
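The two calculations above can be scripted as a sanity check. This is my own sketch of the estimate, using ceiling division for the "rounded up" step, with hive.exec.reducers.max=1099 and hive.tez.max.partition.factor=2 as in this article:

```shell
#!/bin/sh
# est_reducers <reducer_stage_bytes> <bytes_per_reducer>
# Max(1, Min(1099, ceil(estimate / bytes_per_reducer))) x 2
est_reducers() {
  rs=$1; bpr=$2
  reducers_max=1099    # hive.exec.reducers.max
  factor=2             # hive.tez.max.partition.factor
  n=$(( (rs + bpr - 1) / bpr ))            # ceiling division, the "rounded up"
  if [ "$n" -gt "$reducers_max" ]; then n=$reducers_max; fi
  if [ "$n" -lt 1 ]; then n=1; fi
  echo $(( n * factor ))
}

est_reducers 190944 258998272   # default 256 MB per reducer -> 2
est_reducers 190944 10432       # lowered per-reducer bytes  -> 38
```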
The query now takes 32.69 seconds, an improvement.

---------------------------------------------------
5. More Reducers Does Not Always Mean Better Performance

Let's set hive.exec.reducers.bytes.per.reducer to 15872 bytes. The new reducer count is:
> Max(1, Min(1099, 190944/15872)) x 2
> Max(1, Min(1099, 12)) x 2 = 12 x 2 = 24
Performance is BETTER with 24 reducers than with 38 reducers.
----------------------------
6. Reducing the Number of Reducer Stages

Since we have BOTH a GROUP BY and an ORDER BY in our query, looking at the explain plan, perhaps we can combine those into one reducer stage. The parameter for this is hive.optimize.reducededuplication.min.reducer, which is 4 by default. Setting it to 1 and executing the query, performance is BETTER with ONE reducer stage, at 15.88 s.

NOTE: This also worked because we had a LIMIT 20 in the statement. When the LIMIT was removed, we had to resort to estimating the right number of reducers instead to get better performance.
------------------------------------------------
Summary

While we can manually set the number of reducers with mapred.reduce.tasks, this is NOT RECOMMENDED:
set mapred.reduce.tasks = 38;

Tez does not actually have a reducer count when a job starts; it always has a maximum reducer count, and that's the number you see in the initial execution. This is controlled by 4 parameters in Hive:

hive.tez.auto.reducer.parallelism=true;
hive.tez.min.partition.factor=0.25;
hive.tez.max.partition.factor=2.0;
hive.exec.reducers.bytes.per.reducer=1073741824; // 1 GB

You can get a wider or narrower distribution by adjusting the last 3 parameters (preferably only the min/max factors, which are merely guard rails to prevent bad guesses). Hive/Tez estimates the number of reducers using the following formula and then schedules the Tez DAG:

Max(1, Min(hive.exec.reducers.max [1099], ReducerStage estimate / hive.exec.reducers.bytes.per.reducer)) x hive.tez.max.partition.factor [2]

Then, as map tasks finish, it inspects the output size counters of the tasks to estimate the final output size, and reduces that number to a lower number by combining adjacent reducers. The total number of mappers which have to finish before it decides and runs the reducers of the next stage is determined by the following parameters:

tez.shuffle-vertex-manager.min-src-fraction=0.25;
tez.shuffle-vertex-manager.max-src-fraction=0.75;

This indicates that the decision will be made between 25% of mappers finishing and 75% of mappers finishing, provided there's at least 1 GB of data being output (i.e. if 25% of mappers don't send 1 GB of data, we will wait till at least 1 GB is sent out). Once a decision has been made, it cannot be changed, as some reducers will already be running and might lose state if we did that. You can get more and more accurate predictions by increasing the fractions.

------------------------------------
APPENDIX

Hive 2.0 (only) improvements

Now that we have a total number of reducers, we might not have the capacity to run all of them at the same time, so we need to pick a few to run first. The ideal situation would be to start the reducers which (already) have the most data to fetch first, so that they can start doing useful work, instead of starting reducer #0 first (like MRv2), which may have very little data pending.

tez.runtime.report.partition.stats=true;
tez.runtime.pipelined-shuffle.enabled=true;

The first flag there is pretty safe, but the second one is a bit more dangerous, as it allows reducers to fetch from tasks which haven't even finished (i.e. mappers failing causes reducer failure, which is optimistically fast, but slower when there are failures; bad for consistent SLAs).

Finally, we have the sort buffers, which are usually tweaked and tuned to fit, but you can make them much faster by making those allocations lazy (i.e. allocating 1800 MB contiguously in a 4 GB container will cause a 500-700 ms GC pause, even if there are only 100 rows to be processed):

tez.runtime.pipelined.sorter.lazy-allocate.memory=true;

References:
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
http://hortonworks.com/blog/apache-tez-dynamic-graph-reconfiguration/
http://www.slideshare.net/t3rmin4t0r/hivetez-a-performance-deep-dive
http://www.slideshare.net/ye.mikez/hive-tuning (Mandatory)

See also:
http://www.slideshare.net/AltorosBY/altoros-practical-steps-to-improve-apache-hive-performance
http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup
http://www.slideshare.net/InderajRajBains/using-apache-hive-with-high-performance

Special thanks to Gopal for assisting me with understanding this.
03-10-2016
06:14 PM
16 Kudos
Microsoft Azure General Sizing Guidelines

You need to size and price machine and storage separately.
Use Linux VMs on Azure (not to be confused with the Ubuntu beta offering on HDInsight).
If performance is a must, especially with Kafka and Storm, use Premium Storage, not Standard. Make sure to request Premium Storage (see link below).
Do not use A8 machines. Use either A10s or A11s. The A8 is backed by InfiniBand, which is more expensive and unnecessary for Hadoop.
Recommended: the D series, and the newer D_v2 series for solid state drives if needed. For Premium Storage use the DS_v2 series.
It is recommended to use Page Blob Storage for HBase, as opposed to Block Blob Storage. See link below.
Both options will need attached Blob Storage. The 382 GB local disk that comes with the VM is just for temp storage. Blob Storage comes in 1023 GB sizes, and each VM has a maximum number of Blob Storage disks that can be attached; e.g. A10 VMs can have a maximum of 16 x 1 TB storage. See the links below for more details.
Use availability sets for master and worker nodes.
Use one storage account for every node in the cluster in order to bypass IOPS limits for multiple VMs on the same storage account.
You can also try Azure Data Lake Store (with adl://) in order to check the performance of the new Azure service.
You also need to keep in mind the maintenance windows of each Azure region relative to your customers: some regions could be a good choice for new service availability (e.g. US East 2) but not from a maintenance point of view (especially for European customers).

---------------------------------------
Recommendation 1 - Best Compute Performance for Batch and Real-Time Use Cases
For head/master nodes use:
Standard_D13_v2 (8 CPU, 56 GB), Standard_D5_v2 (16 CPU, 56 GB) or Standard_D14_v2 (16 CPU, 112 GB)
For data nodes use:
Standard_D14_v2 (16 CPU, 112 GB), Standard_DS14_v2 (16 CPU, 112 GB with Premium Storage) or Standard_DS15_v2 (20 CPU, 140 GB with Premium Storage)
If testing Kafka and Storm, use Standard_DS13_v2, Standard_DS14_v2 or Standard_DS15_v2 with Premium Storage, especially if performance is needed to meet SLAs.
Pros: CPU is 35% faster than the D series; local SSD disks; VMs are cheaper per hour than the A or D series.

Recommendation 2 - Good Compute Performance
Use Standard_D13 (8 CPU, 56 GB) or Standard_D14 (16 CPU, 112 GB) for head/master nodes, and Standard_D14 (16 CPU, 112 GB) for data nodes.
If testing Kafka and Storm, use Standard_DS13 (8 CPU, 56 GB) or Standard_DS14 (16 CPU, 112 GB) with Premium Storage, especially if performance is needed to meet SLAs.
Pros: 60% faster than the A series; local SSD disks. Why pick this, though, if it is slightly more expensive per hour than the D_v2 series?

Recommendation 3 - Mostly for Batch Performance
Use A10 or A11 for head/master nodes and A11 for data nodes. Microsoft is pricing effectively so that you use the D_v2 series.

------------
Microsoft Links

Storage pricing: https://azure.microsoft.com/en-us/pricing/details/storage/
Premium Storage: https://azure.microsoft.com/en-us/documentation/articles/storage-premium-storage/
VM pricing: https://azure.microsoft.com/en-us/pricing/details/virtual-machines/#Linux
VM size specs: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-size-specs/
Page vs. Block Blob Storage: https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Page_Blob_Support_and_Configuration
Azure Data Lake: https://azure.microsoft.com/en-us/blog/introducing-azure-data-lake/
02-29-2016
03:02 PM
Additional white papers from EMC: http://www.criticism.com/white-papers/white-papers.php
Latest EMC best practices (January 2015 version): https://www.emc.com/collateral/white-papers/h13926-wp-emc-isilon-hadoop-best-practices-onefs72.pdf
02-13-2016
04:50 PM
57 Kudos
This article is for those who want a cheat sheet for a smooth installation of HDP in a Dev or Test environment, with one or more of the following requirements:
Place all the log data into a different directory, not /var/log
All your service user names must be prefixed with the cluster name. The requirement is that these users must be centrally managed by AD or an LDAP.
You do not have any local users in the Hadoop cluster, including Hadoop service users. This becomes important if you wish to have Centrify deployed also, or if you would be deploying multiple clusters with a single LDAP/ AD integration. Once again, these service names should have a cluster-prefix.
You want to set appropriate YARN, Tez, MapReduce and Ambari Metrics memory parameters during install.
Side Note: It is always prudent to get Professional Services assistance to install or configure your production deployment, to make sure all the pre-requisites unique to your environment are covered and met.
--------------------------------------------------------------------------------------------------------
Step 1: Do Your Research..... Plan, Plan, Plan, Do it Right the First time, or Risk Doing it Over, and Over Again
This article is not intended to replace the Hortonworks docs or all the excellent resources here in HCC or elsewhere.
Apart from the Hortonworks docs, review:
Hortonworks Operational Best Practices Webcast and Slides
Typical Hadoop Cluster Networking Practices
Best Practice Linux File System for Hadoop and ext4 vs. XFS
Yarn Directories Recommended Size and Disk.
Best Practice Zookeeper Placement
Best Practice for Storm and Kafka Deployment and Unofficial Storm and Kafka Best practices Guide
Name Node Garbage Collection Best Practice
Tools to test the Performance, Scale and Reliability of Your Cluster
--------------------------------------------------------------------------------------------------------
Step 2: Get your Disk partitions Right
See the following for some guidance. Take note of the hadoop properties and default locations. You need to have this done ahead of time.
Disk Partition Baseline
--------------------------------------------------------------------------------------------------------
Name Nodes Disk Partitioning
--------------------------------------------------------------------------------------------------------
Data Nodes Disk Partition
--------------------------------------------------------------------------------------------------------
Ambari/ Edge/ Ranger/ Knox Nodes Disk Partition
--------------------------------------------------------------------------------------------------------
Storm and Kafka Nodes Disk Partition
--------------------------------------------------------------------------------------------------------
Step 3: Don't Scrimp on Master Nodes. Know the Placement of Your Master Services
If you want to do yourself an injustice, just allocate one or two master nodes.
If you want to do things properly, and you want to be set for up to 50 nodes, then please have at least 3 master nodes, better 4 if you are doing HA, with at least 1 edge node and 1 Admin/Ambari server.
It is a PAIN, and some effort is involved, to move master services if you don't get this right.
Figure out where you are placing your master services. Use the following as a guide:
--------------------------------------------------------------------------------------------------------
Step 4: Get a Dedicated Database Server with HA for Ambari, Hive, Metastore, Oozie, Ranger
Oozie by default installs on Derby. You do not want Derby in your cluster.
Ambari by default installs on Postgres. You can decide to keep it there.
Hive's metastore uses MySQL. You can use a dedicated MySQL Database for Hive, Ranger Admin, and Oozie. Bear in mind though that if you restart Hive's metastore, it may affect Ranger and Oozie.
The instructions for setting up the databases before an Ambari install are located at Using Non-Default Databases.
--------------------------------------------------------------------------------------------------------
Step 5: Create Service Accounts Beforehand in your LDAP
Decide what your cluster prefix will be. Do not put an underscore "_" or a hyphen "-" in your prefix.
The list of service accounts you need to create are located here.
Solr is missing from the list. You need this user if you want to install Ranger, for Ranger uses Solr from HDP 2.3 and above for auditing and to show audit events in the UI.
Create a solr user with default group solr, with membership in the hadoop group also.
IMPORTANT: On each node, get the AD or LDAP UID for hdfs and the GID for group hadoop; edit /etc/passwd and /etc/group and add the users there with the CORRECT UID from AD or LDAP. I have found that even if you choose the Skip Group Modifications option so that the Linux groups in the cluster are not modified, and you tell Ambari not to manage the HDFS user, some of the yum installs still try to create them; Ambari will respect your wishes, but yum will not.
Make sure the entries in your /etc/passwd and /etc/group have your cluster prefix.
When you install through Ambari, it is very important that you configure the right properties so that Ambari is aware of your centrally managed, cluster-prefixed service names:
Set Skip Group Modifications
Tell Ambari NOT to manage the HDFS user
Follow the instructions at
Setting properties that depend on service usernames/groups
There is one property missing from the doc:
Also set the HDFS User to your <cluster-prefix>-hdfs in Advanced hadoop-env.
--------------------------------------------------------------------------------------------------------
Step 6: Use Hortonworks Handy Scripts to Automatically Prepare the Environment Across all Nodes
So you have your disk partitions, your network is setup, you have decided on your master services placement, you have created the service names in LDAP with a cluster prefix, you have edited your /etc/passwd and /etc/groups.
Here comes the fun part.
Go to your Ambari node and perform the following:
# Install Hortonworks Public Tools
> yum install wget
> wget -qO- --no-check-certificate https://github.com/hortonworks/HDP-Public-Utilities/raw/master/Installation/install_tools.sh | bash
>./install.sh
>cd hdp
#Everything will be installed to /root/hdp; create the /root/hdp/Hostdetail.txt file with the hostnames of all the nodes in your cluster.
# On each node: hostname -f
vi /root/hdp/Hostdetail.txt
#To set up Password-less SSH
> ssh-keygen
>chmod 700 ~/.ssh
>chmod 600 ~/.ssh/id_rsa
# Distribute the keys to the other nodes. The copy command is needed because the ./distribute_ssh_keys.sh script expects the private key at /tmp/ec2_keypair. Alternatively, if you set up your nodes with a root password, just enter it when prompted by the script.
> cp <your nodes private key> /tmp/ec2_keypair
> ./distribute_ssh_keys.sh ~/.ssh/id_rsa.pub
#Optional: Copy the private key to all nodes if you want password-less ssh from any node to any node. Don't do this if you want password-less ssh ONLY from the Ambari node. Password-less ssh is only needed for Ambari to install the agents on all nodes; without it you need to install and configure the agents yourself.
>./copy_file.sh ~/.ssh/id_rsa ~/.ssh/id_rsa
# Test passwordless SSH
> ssh <node>
#Now run a script to set all the OS pre-requisites for a cluster install. You may have to edit ./run_command.sh and add -tty to the ssh command, since the ./hdp_preinstall.sh script contains sudo commands.
> ./run_command.sh 'mkdir /root/hdp'
> ./copy_file.sh /root/hdp/hdp_preinstall.sh /root/hdp/hdp_preinstall.sh
> vi run_command.sh (add "-tty" to the ssh call)
# Now set the OS parameters in one swoop
> ./run_command.sh '/root/hdp/hdp_preinstall.sh'
REBOOT ALL NODES
#DOUBLE CHECK That all the Nodes retain all the OS Environment Configuration Changes for HDP Install
> ./pre_install_check.sh | tee report.txt
#View the report. Ignore the repo warnings for Ambari and HDP if you are connected to the internet and will pull the repos from there during install.
> vi report.txt
# Now get your YARN Parameters to use when you install the cluster via Ambari
# Download Hortonworks Companion files
> wget http://public-repo-1.hortonworks.com/HDP/tools/2.3.4.0/hdp_manual_install_rpm_helper_files-2.3.4.0.3485.tar.gz
> tar -zxvf hdp_manual_install_rpm_helper_files-2.3.4.0.3485.tar.gz
> cd hortonworks-HDP-Public-Utilities-d617f44
# Now run the script to determine the memory parameters that you will set in Ambari during the Customize Services step. Put your number of cores (-c), memory per node in GB (-m), disks per node for HDFS (-d), and whether HBase will be installed or not (-k) into the python call:
> python yarn-utils.py -c 16 -m 64 -d 4 -k True
See Determine YARN and HDP memory
Make a note of these memory settings to plug in during the Ambari install.
--------------------------------------------------------------------------------------------------------
Step 7: Installing Ambari
Now you can start installing Ambari and HDP from the doc at
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Installing_HDP_AMB/content/_using_a_local_repository.html
Don't forget about setting your cluster-prefixed service name for hdfs and hbase
Don't choose a cluster name that has an underscore (_) because HDFS HA does not like it.
Don't forget to change all the directory locations as per the disk partition diagrams above.
You can change the directory for Hadoop logs upon install if you wish. See https://community.hortonworks.com/questions/4329/log-file-location-is-there-a-way-to-change-varlog.html
Don't forget to set the YARN and MapReduce Memory Parameters found from the python script.
Don't forget to set the Name Node garbage collection.
You can do the following to get Ambari running better during install: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_reference_guide/content/ch_tuning_ambari_performance.html
During Install you can configure Ambari Metrics: See https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning and http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html
You can follow this to tune Tez During the Install. See https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html
IMPORTANT: For fewer than 10 data nodes,
set mapred.submit.replication=3 in mapred-site.xml.
This prevents the job-related staging files from being created with the default replication factor of 10, which would lead to under-replicated block warnings.
--------------------------------------------------------------------------------------------------------
Step 8: Install SmartSense, Only Offered by Hortonworks
Finally, INSTALL SMARTSENSE if you are a Hortonworks customer. If you are not, why not? You are missing all the value from SmartSense to auto-tune your cluster. (In Ambari 2.2 it is available as a service.)
--------------------------------------------------------------------------------------------------------
Step 9 Security Tips
If you plan to install Ranger, INSTALL SOLR FIRST. Don't add the Ranger service right away after you install the cluster.
Make sure that you use the <cluster-prefix>-solr user in your install, so that the process runs under that user.
Enable Kerberos if you can BEFORE adding Ranger. If not, that is fine; you will have to configure Ranger and all the plug-ins after the fact, but it is easier if you enable Kerberos first.
Storm, Kafka and Solr need Kerberos before you authorize with Ranger.
There is no Security without Kerberos.
--------------------------------------------------------------------------------------------------------
Finally
Most issues are due to a rogue process running with a local UID rather than the LDAP/AD UID, so double-check using ps -ef. If you set up your /etc/passwd and /etc/group properly beforehand, you should not have this issue.
Some issues come up if your files and/or logs are owned by the local hdfs user. Again, if you did not choose the 'Skip Group Modifications' option, did not tell Ambari to not manage the HDFS user, did not set the hdfs user properly during install to <cluster-prefix>-hdfs, or did not set up your /etc/passwd and /etc/group, you will get this problem.
Remember, some yum installs do not care what you set in Ambari for the hdfs user, so you may have to run those manually; look out for that.
--------------------------------------------------------------------------------------------------------
Update:
A good resource:
https://martin.atlassian.net/wiki/pages/viewpage.action?pageId=45580306
https://community.hortonworks.com/questions/21405/where-to-write-fsimage-files-when-running-qjm-nn-h.html