Member since: 01-19-2017
Posts: 3679
Kudos Received: 632
Solutions: 372

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 998 | 06-04-2025 11:36 PM |
| | 1567 | 03-23-2025 05:23 AM |
| | 783 | 03-17-2025 10:18 AM |
| | 2817 | 03-05-2025 01:34 PM |
| | 1860 | 03-03-2025 01:09 PM |
01-01-2020
05:17 AM
@ssk26 queueMaxAMShareDefault and maxAMShare are mutually exclusive: the maxAMShare element in each queue overrides queueMaxAMShareDefault. Can you decrease queueMaxAMShareDefault (or maxAMShare) to 0.1 and set weight to 2.0?

For Spark, create fairscheduler.xml from fairscheduler.xml.template; your path might be different depending on your 3.1.x.x.x version:

# cp /usr/hdp/3.1.x.x-xx/etc/spark2/conf/fairscheduler.xml.template fairscheduler.xml

Please check the file permissions. Then set the spark.scheduler.allocation.file property in your SparkConf, or put a file named fairscheduler.xml on the classpath. Note that pools not configured in the XML file simply get default values for all settings (scheduling mode FIFO, weight 1, and minShare 0). There are two default pools in fairscheduler.xml.template, notably production and test, using FAIR and FIFO respectively:

<allocations>
  <pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="test">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>3</minShare>
  </pool>
</allocations>

Without any intervention, newly submitted jobs go into a default pool, but a job's pool can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that submits it. This is done as follows:

// Assuming sc is your SparkContext variable, pick the FAIR "production" pool
sc.setLocalProperty("spark.scheduler.pool", "production")

Please let me know
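If you prefer to set it in SparkConf rather than via the classpath, here is a minimal Scala sketch; the file path below is an assumption, point it at wherever you copied fairscheduler.xml:

import org.apache.spark.{SparkConf, SparkContext}

// enable FAIR scheduling and point Spark at the pool definitions
val conf = new SparkConf()
  .setAppName("fair-pool-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/usr/hdp/current/spark2-client/conf/fairscheduler.xml") // assumed path
val sc = new SparkContext(conf)

// submit work from this thread into the "production" pool defined in the XML
sc.setLocalProperty("spark.scheduler.pool", "production")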
01-01-2020
04:16 AM
@alialghamdi Your issue is being generated by the Python script /usr/lib/ambari-agent/lib/ambari_agent/ConfigurationBuilder.py, see line 38: "host_level_params_cache = self.host_level_params_cache[cluster_id]"

Solution 1, on node 6: stop the ambari-agent, then delete the tmp files to empty the cache:

node6 # ambari-agent stop
node6 # rm -rf /var/lib/ambari-agent/*

Then restart the ambari-agent on node 6:

node6 # ambari-agent start

Solution 2, on node 6: remove and re-install the agent:

node6 # ambari-agent stop
# yum erase ambari-agent
# rm -rf /var/lib/ambari-agent
# rm -rf /var/run/ambari-agent
# rm -rf /usr/lib/ambari-agent
# rm -rf /etc/ambari-agent
# rm -rf /var/log/ambari-agent
# rm -rf /usr/lib/python2.6/site-packages/ambari*

Re-install the Ambari agent:

# yum install ambari-agent

Change the hostname to point to the Ambari server:

# vi /etc/ambari-agent/conf/ambari-agent.ini

Start the ambari-agent:

# ambari-agent start

Please revert
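For reference, the relevant part of /etc/ambari-agent/conf/ambari-agent.ini looks roughly like this; the FQDN is a placeholder and the ports shown are the usual defaults:

[server]
hostname=ambari-server.example.com
url_port=8440
secured_url_port=8441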
12-31-2019
07:57 PM
@alialghamdi I have an idea. Depending on your backend Ambari database, please first take a backup. We are not going to make any changes yet, just validate my suspicion.

DB backup, assuming you are on MySQL/MariaDB:

mysqldump -u[user name] -p[password] [database name] > [dump file]

Check the cluster state:

select * from clusterstate;

The value found above should also be present in the stage table's "cluster_id" column:

select stage_id, request_id, cluster_id from stage;

Identify the troublesome host:

select host_id, host_name from hosts;

Assuming you got host_id 3 for the troublesome host:

select cluster_id, component_name from hostcomponentdesiredstate where host_id=3;
select cluster_id, component_name from hostcomponentstate where host_id=3;
select cluster_id, service_name from hostconfigmapping where host_id=3;

Share your output for all the above steps; please tokenize your hostname.domain
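If you want a quick cross-check before posting the output, here is a hedged sketch using only the tables above, assuming clusterstate exposes a cluster_id column as implied earlier; any rows returned would confirm a stale cluster_id on that host:

select host_id, component_name, cluster_id
from hostcomponentstate
where cluster_id not in (select cluster_id from clusterstate);

select host_id, component_name, cluster_id
from hostcomponentdesiredstate
where cluster_id not in (select cluster_id from clusterstate);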
12-30-2019
10:12 AM
@RobertCare Check the value in Ambari ---> Ranger ---> Configs ---> Advanced ---> Advanced ranger-ugsync-site and verify it is appropriate for your HDP version; adjust it otherwise. ranger.usersync.policymanager.mockrun=true means usersync is disabled because it only does a mock run; set it to false and that will trigger the usersync. If not, can you share the ambari-server log? Ranger could also be looking for the uid attribute: if your users have cn rather than uid, it retrieved the users and groups from LDAP but did not insert them into the database. Hope that helps
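For reference, these are the kind of ranger-ugsync-site properties to double-check; the values below are only illustrative and depend on your LDAP schema:

ranger.usersync.policymanager.mockrun=false
ranger.usersync.ldap.user.nameattribute=uid
ranger.usersync.ldap.user.searchfilter=(objectclass=person)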
12-28-2019
02:43 AM
@sheelstera There is a great YARN tuning spreadsheet here that will help you correctly calculate your YARN settings. It applies to YARN clusters only and describes how to tune and optimize YARN for your cluster. Please revert
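To give an idea of what the spreadsheet calculates, these are the kind of yarn-site.xml and mapred-site.xml values it produces; the numbers below are purely hypothetical (a worker with 128 GB RAM and 32 vcores), not a recommendation:

yarn.nodemanager.resource.memory-mb=106496
yarn.nodemanager.resource.cpu-vcores=28
yarn.scheduler.minimum-allocation-mb=4096
yarn.scheduler.maximum-allocation-mb=106496
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
mapreduce.map.java.opts=-Xmx3276m
mapreduce.reduce.java.opts=-Xmx6554m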
12-28-2019
02:26 AM
1 Kudo
@Cl0ck That is possible. Please follow this Cloudera documentation, have a look, and see if it suits your situation. Keep me posted.
12-28-2019
02:13 AM
@saivenkatg55 Sorry, festive period. Can you do the following: delete the old rotated messages files, i.e. all files with the extension /var/log/messages.x; that should leave you with only one /var/log/messages. Then truncate that file so you will have only new entries:

# truncate --size 0 /var/log/messages

Do the same for /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log.x and also truncate /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log:

# truncate --size 0 /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log

Start the node manager manually:

# su -l yarn -c "/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager"

Then share the latest files created below:

/var/log/messages
/var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log
/var/lib/ambari-agent/data/errors-xxx.txt

Please revert
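For the rotated-log cleanup described above, a minimal shell sketch, assuming the usual numeric rotation suffixes; double-check the globs before running them:

# remove only the rotated copies, keeping the live log files
# rm -f /var/log/messages.*
# rm -f /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log.*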
12-27-2019
04:33 PM
1 Kudo
@ssk26 I successfully configured the fair scheduler on the HDP version below.

(Screenshots: original YARN UI with the default capacity scheduler after HDP deployment, pre-emption enabled before the change to the fair scheduler.)

I grabbed the fair-scheduler.xml template here, changed a few values for testing purposes, and made sure the XML is valid using an XML validator. I then copied fair-scheduler.xml to the $HADOOP_CONF directory and changed the owner and permissions:

# cd /usr/hdp/3.1.0.0-78/hadoop/conf
# chown hdfs:hadoop fair-scheduler.xml
# chmod 644 fair-scheduler.xml

Changed the scheduler class in yarn-site.xml, see the attached screenshot:

From: yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
To: yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

Added this new value in Custom yarn-site, using the relative path which defaults to /usr/hdp/3.1.0.0-78/hadoop/conf:

yarn.scheduler.fair.allocation.file=fair-scheduler.xml

(Screenshot: Custom yarn-site.xml.)

Changed the mandatory parameter below to enable the ReservationSystem in the ResourceManager, which is not enabled by default:

yarn.resourcemanager.reservation-system.enable=true

Disabled pre-emption and set the properties as shown. The yarn-site.xml file contains parameters that determine scheduler-wide options; if the properties below don't exist, add them in the custom yarn-site. Note: the property below was already available, so I didn't add it:

yarn.scheduler.capacity.ordering-policy.priority-utilization.underutilized-preemption.enabled=false

Properties to verify: for my testing I didn't add the properties below. You will notice above that, despite disabling pre-emption in the Ambari UI, the fair scheduler shows it as enabled [True] and my queues aren't showing, so I need to check my fair-scheduler.xml; attached is the template I used.

yarn.scheduler.fair.assignmultiple=false
yarn.scheduler.fair.sizebasedweight=false
yarn.scheduler.fair.user-as-default-queue=true
yarn.scheduler.fair.preemption=false

Note: Do not use preemption when FairScheduler DominantResourceFairness is in use and node labels are present.

All in all, this shows the fair-scheduler configuration is doable and my RM is up and running! I also noticed that the above fair-scheduler template was overwritten when I checked the YARN Queue Manager, so that now allows me to configure a new, valid fair-scheduler.

Happy Hadooping
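For illustration, a minimal fair-scheduler.xml along these lines is what the XML validator needs to accept; queue names and values here are only examples, the actual template I used is attached above:

<?xml version="1.0"?>
<allocations>
  <!-- example default queue, fair policy -->
  <queue name="default">
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <!-- hypothetical second queue with a cap on concurrent apps and AM share -->
  <queue name="analytics">
    <weight>2.0</weight>
    <maxRunningApps>10</maxRunningApps>
    <maxAMShare>0.5</maxAMShare>
  </queue>
  <queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>
  <defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>
</allocations>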
12-27-2019
07:52 AM
@Prakashcit There is a Jira, https://issues.apache.org/jira/browse/HIVE-16575, last updated on 05/Dec/19. Hive does not enforce foreign keys to refer to primary keys or unique keys. In your previous thread, I explained what a NOVALIDATE constraint is: "A NOVALIDATE constraint is basically a constraint that can be enabled but for which Hive will not check the existing data to determine whether there might be data that currently violates the constraint."

The difference between a UNIQUE constraint and a Primary Key is that per table you may only have one Primary Key, but you may define more than one UNIQUE constraint. Primary Key constraints are not nullable; UNIQUE constraints may be nullable. Oracle also implements the NOVALIDATE constraint; here is a write-up by Richard Foote.

When you create a UNIQUE constraint, the database automatically creates a UNIQUE index. For RDBMS databases, a PRIMARY KEY will generate a unique CLUSTERED INDEX, while a UNIQUE constraint will generate a unique NON-CLUSTERED INDEX.
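To illustrate on the Hive side, this is roughly how such constraints are declared on Hive 3.x; the table and column names below are made up. Because of DISABLE NOVALIDATE, existing rows are not checked against either constraint:

CREATE TABLE customers (
  id INT,
  email STRING,
  PRIMARY KEY (id) DISABLE NOVALIDATE,
  CONSTRAINT uq_email UNIQUE (email) DISABLE NOVALIDATE
);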
12-26-2019
12:30 PM
@Prakashcit To ensure data from multiple sources is ingested so business insights can be discovered at a later stage, we usually dump everything. Comparing source data with ingested data simply validates that all the data has been pushed and that the correct data files were generated and loaded into HDFS in the desired location. A smart data lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.

/landing_Zone/Raw_data/ [corresponding to stage 1]
/landing_Zone/Raw_data/refined [corresponding to stage 2]
/landing_Zone/Raw_data/refined/Trusted Data [corresponding to stage 3]
/landing_Zone/Raw_data/refined/Trusted Data/sandbox [corresponding to stage 4]

The data lake can also be used to feed upstream systems for real-time monitoring, or long-term storage like HDFS or Hive for analytics.

Data quality is often seen as the unglamorous component of working with data. Ironically, it usually makes up the majority of a data engineer's time. Data quality might very well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboards and analysis generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; this is a continuous data engineering task as more data sources are incorporated into the pipeline.

Typically Hive is plugged in at stage 3 and tables are created after the data validation of stage 2; this ensures that data scientists have cleansed data to run their models on and analysts can use their BI tools. At least these have been my tasks through many projects.

HTH
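As a small example of the count validation mentioned above, a hedged shell sketch; the paths, the dataset name, and the control file are hypothetical:

# record count reported by the source extract (hypothetical control file)
SRC_COUNT=$(cat /tmp/extract_20191226.cnt)

# record count of what actually landed in the raw zone for that load
HDFS_COUNT=$(hdfs dfs -cat /landing_Zone/Raw_data/sales/2019-12-26/* | wc -l)

# flag the load for investigation if the counts do not match
[ "$SRC_COUNT" -eq "$HDFS_COUNT" ] || echo "Count mismatch: source=$SRC_COUNT hdfs=$HDFS_COUNT"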