Member since: 01-19-2017
Posts: 3676
Kudos Received: 632
Solutions: 372

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 528 | 06-04-2025 11:36 PM |
|  | 1062 | 03-23-2025 05:23 AM |
|  | 552 | 03-17-2025 10:18 AM |
|  | 2057 | 03-05-2025 01:34 PM |
|  | 1289 | 03-03-2025 01:09 PM |
12-30-2019
10:12 AM
@RobertCare Check the value in Ambari ---> Ranger ---> Configs ---> Advanced ---> Advanced ranger-ugsync-site and confirm it is appropriate for your HDP version; adjust it if not. ranger.usersync.policymanager.mockrun=true means usersync is disabled (it only performs a mock run); set it to false to trigger the usersync. If that doesn't help, can you share the ambari-server log? Ranger could also be looking for the attribute uid: if your users have cn rather than uid, usersync may have retrieved the users and groups from LDAP but not inserted them into the database. Hope that helps
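If it helps, here is a quick way to confirm those two settings on the usersync host; the file path and property names I grep for are assumptions based on a typical HDP layout and may differ on your cluster:

```
# Assumed location of the effective usersync config on an HDP node
grep -E 'mockrun|user.nameattribute' /etc/ranger/usersync/conf/ranger-ugsync-site.xml
# Expect ranger.usersync.policymanager.mockrun=false and
# ranger.usersync.ldap.user.nameattribute matching your LDAP schema (uid vs cn)
```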
12-28-2019
02:43 AM
@sheelstera There is a great YARN tuning spreadsheet here that will help you calculate your YARN settings correctly. It applies to YARN clusters only and describes how to tune and optimize YARN for your cluster. Please revert.
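As a rough illustration of the kind of numbers such a spreadsheet produces, here is a sketch for a hypothetical worker node with 128 GB RAM and 32 vcores, reserving about 16 GB and 4 vcores for the OS and other daemons; the figures are assumptions, not recommendations for your cluster:

```
# Example yarn-site values for a 128 GB / 32-vcore worker (illustrative only)
yarn.nodemanager.resource.memory-mb=114688     # 112 GB left for containers
yarn.nodemanager.resource.cpu-vcores=28        # 4 vcores reserved for OS/daemons
yarn.scheduler.minimum-allocation-mb=4096      # smallest container the RM will grant
yarn.scheduler.maximum-allocation-mb=114688    # a single container may use a whole node
```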
12-28-2019
02:26 AM
1 Kudo
@Cl0ck That is possible; please follow this Cloudera documentation and see if it suits your situation. Keep me posted.
12-28-2019
02:13 AM
@saivenkatg55 Sorry, festive period. Can you do the following?

1. Delete the old rotated logs, i.e. everything with the extension /var/log/messages.x, so that you are left with only one /var/log/messages, then truncate that file so it contains only new entries: # truncate --size 0 /var/log/messages

2. Do the same for /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log.x and also truncate /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log: # truncate --size 0 /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log

3. Start the NodeManager manually: # su -l yarn -c "/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager"

4. Then share the latest files created below:
/var/log/messages
/var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log
/var/lib/ambari-agent/data/errors-xxx.txt

Please revert.
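For steps 1 and 2, a minimal sketch of the cleanup commands, assuming the rotated copies really are named with a numeric suffix as above (adjust the glob if your logrotate uses date suffixes):

```
# Remove rotated copies only; the live files are truncated, not deleted
rm -f /var/log/messages.*
rm -f /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log.*
truncate --size 0 /var/log/messages
truncate --size 0 /var/log/hadoop-yarn/yarn/hadoop-yarn-nodemanager-<node_name>.log
```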
12-27-2019
04:33 PM
1 Kudo
@ssk26 I successfully configured the fair scheduler on the HDP version below.

Original scheduler (YARN UI): the default capacity scheduler after deployment of HDP, with pre-emption enabled before the change to fair-scheduler.

Grabbed the template fair-scheduler.xml here, then changed a few values for testing purposes, but ensured the XML is valid using an XML validator.

I then copied fair-scheduler.xml to the $HADOOP_CONF directory and changed the owner and permissions:
# cd /usr/hdp/3.1.0.0-78/hadoop/conf
# chown hdfs:hadoop fair-scheduler.xml
# chmod 644 fair-scheduler.xml

Changed the scheduler class in yarn-site.xml (see the attached screenshot).
From: yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
To: yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler

Added this new value in Custom yarn-site, using a path relative to the default /usr/hdp/3.1.0.0-78/hadoop/conf:
yarn.scheduler.fair.allocation.file=fair-scheduler.xml

Changed the parameter below to enable the ReservationSystem in the ResourceManager, which is not enabled by default:
yarn.resourcemanager.reservation-system.enable=true

Disable pre-emption: set the properties below as shown. The yarn-site.xml file contains parameters that determine scheduler-wide options; if these properties don't exist, add them in the custom yarn-site. Note: the property below was already available, so I didn't add it:
yarn.scheduler.capacity.ordering-policy.priority-utilization.underutilized-preemption.enabled=false

Properties to verify. For my testing I didn't add the properties below; you will notice above that, despite disabling pre-emption in the Ambari UI, the fair scheduler shows it as enabled [True] and my queues aren't showing, so I need to check my fair-scheduler.xml (attached is the template I used):
yarn.scheduler.fair.assignmultiple=false
yarn.scheduler.fair.sizebasedweight=false
yarn.scheduler.fair.user-as-default-queue=true
yarn.scheduler.fair.preemption=false

Note: Do not use preemption when FairScheduler DominantResourceFairness is in use and node labels are present.

All in all, this shows the fair-scheduler configuration is doable and my RM is up and running! I also noticed that the above fair-scheduler template was overwritten when I checked the YARN Queue Manager, so I can now use that to configure a new valid fair-scheduler.

Happy Hadooping
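A minimal sketch of the validate-and-deploy step described above, assuming the HDP 3.1.0.0-78 paths from this post and that xmllint is installed on the ResourceManager host:

```
# Validate the edited allocation file before handing it to YARN
xmllint --noout fair-scheduler.xml && echo "fair-scheduler.xml is well-formed"

# Install it where yarn.scheduler.fair.allocation.file will find it
cp fair-scheduler.xml /usr/hdp/3.1.0.0-78/hadoop/conf/
chown hdfs:hadoop /usr/hdp/3.1.0.0-78/hadoop/conf/fair-scheduler.xml
chmod 644 /usr/hdp/3.1.0.0-78/hadoop/conf/fair-scheduler.xml
```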
12-27-2019
07:52 AM
@Prakashcit There is a Jira, https://issues.apache.org/jira/browse/HIVE-16575, last updated on 05/Dec/19: Hive does not enforce that foreign keys refer to primary keys or unique keys. In your previous thread, I explained what a NOVALIDATE constraint is: "A NOVALIDATE constraint is basically a constraint that can be enabled but for which Hive will not check the existing data to determine whether there might be data that currently violates the constraint." The difference between a UNIQUE constraint and a PRIMARY KEY is that per table you may have only one PRIMARY KEY, but you may define more than one UNIQUE constraint. PRIMARY KEY constraints are not nullable; UNIQUE constraints may be nullable. Oracle also implements the NOVALIDATE constraint; here is a write-up by Richard Foote. When you create a UNIQUE constraint, the database automatically creates a UNIQUE index. For RDBMS databases, a PRIMARY KEY will generate a unique CLUSTERED INDEX, while a UNIQUE constraint will generate a unique NON-CLUSTERED INDEX.
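For illustration, a sketch of how such non-validated constraints look in Hive DDL; the table, column, and constraint names are made up, and the exact syntax should be checked against your Hive version:

```
# Hypothetical example of NOVALIDATE constraints in Hive (names are illustrative)
beeline -u "jdbc:hive2://localhost:10000" -e "
CREATE TABLE customers (
  id    INT,
  email STRING,
  PRIMARY KEY (id) DISABLE NOVALIDATE,
  CONSTRAINT uq_email UNIQUE (email) DISABLE NOVALIDATE
);"
```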
12-26-2019
12:30 PM
@Prakashcit To ensure data from multiple data sources is ingested so that business insights can be discovered at a later stage, we usually dump everything. Comparing source data with ingested data simply validates that all the data has been pushed and verifies that the correct data files are generated and loaded into HDFS in the desired location. A smart data lake ingestion tool or solution like Kylo should enable self-service data ingestion, data wrangling, data profiling, data validation, and data cleansing/standardization; see the attached architecture.
/landing_Zone/Raw_data/ [corresponding to stage 1]
/landing_Zone/Raw_data/refined [corresponding to stage 2]
/landing_Zone/Raw_data/refined/Trusted Data [corresponding to stage 3]
/landing_Zone/Raw_data/refined/Trusted Data/sandbox [corresponding to stage 4]
The data lake can also be used to feed upstream systems for real-time monitoring, or long-term storage like HDFS or Hive for analytics. Data quality is often seen as the unglamorous component of working with data; ironically, it usually makes up the majority of a data engineer's time. Data quality might very well be the single most important component of a data pipeline, since without a level of confidence and reliability in your data, the dashboards and analysis generated from it are useless. The challenge with data quality is that there are no clear and simple formulas for determining whether data is correct; it is a continuous data engineering task as more data sources are incorporated into the pipeline. Typically Hive is plugged in at stage 3 and tables are created after the data validation of stage 2; this ensures that data scientists have cleansed data to run their models, and likewise analysts using BI tools. At least, these have been the tasks I have done through many projects. HTH
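As a small illustration of bootstrapping such a zone layout in HDFS, here is a sketch; the directory names are simplified (no spaces) and are assumptions, not the exact paths from the post:

```
# Illustrative zone layout (adapt names to your own standards)
hdfs dfs -mkdir -p /landing_zone/raw_data                          # stage 1: raw, as ingested
hdfs dfs -mkdir -p /landing_zone/raw_data/refined                  # stage 2: refined
hdfs dfs -mkdir -p /landing_zone/raw_data/refined/trusted          # stage 3: trusted, Hive tables
hdfs dfs -mkdir -p /landing_zone/raw_data/refined/trusted/sandbox  # stage 4: sandbox
```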
12-26-2019
02:10 AM
@hud When you run NiFi as a microservice, you can configure PVCs [Persistent Volume Claims] using Helm in AKS or Kubernetes, which ensures that even if the NiFi pods restart they will always have the same volume mounted. Under the persistence configuration, the parameter persistence.enabled should be set to true; see the Helm Chart for Apache Nifi. HTH
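For example, with a community NiFi chart this could look roughly like the following; the chart repo, release name, and value keys are assumptions and may differ for the chart version you use:

```
# Illustrative only: install NiFi with persistence enabled via Helm
helm repo add cetic https://cetic.github.io/helm-charts
helm repo update
helm install nifi cetic/nifi \
  --set persistence.enabled=true \
  --set persistence.size=8Gi
```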
12-25-2019
12:38 PM
@saivenkatg55 From the screenshots, the two notebooks you created are Untitled Note 1 and Untitled Note 2, which should appear in the drop-down list under Notebook on the top menu. Below, I create a Spark interpreter notebook named saivenkatg55 from step 2 above; it should appear under Notebook as well. I launched a test and can see the job was accepted and running in the RM UI. So where exactly are you encountering issues? Happy hadooping!
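If it helps to verify this without the RM UI, you can also check from the command line on a cluster node; the grep filter is just an assumption about how the Zeppelin-launched application is named:

```
# List applications YARN has accepted or is running and look for the Zeppelin/Spark job
yarn application -list -appStates ACCEPTED,RUNNING | grep -i zeppelin
```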
12-25-2019
09:44 AM
@kiranpune DistCp (distributed copy) is a tool used for large inter-/intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into the input to map tasks, each of which copies a partition of the files specified in the source list. That is the basic description, but you can use different command-line options when running DistCp; see the official distcp documentation. Below are a few options for your different use cases:

-append: incremental copy of a file with the same name but different length
-update: overwrite if source and destination differ in size, block size, or checksum
-overwrite: overwrite the destination
-delete: delete files existing in the destination but not in the source

I think you can schedule or script a daily copy, for example as sketched below.
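A minimal sketch of what such a scripted daily copy could look like; the NameNode addresses, paths, and cron schedule are placeholders you would adapt:

```
# Mirror /data from the source cluster to the target, copying only changed files
hadoop distcp -update -delete hdfs://source-nn:8020/data hdfs://target-nn:8020/data

# Example cron entry to run it nightly at 01:00 (crontab of a user with HDFS access)
# 0 1 * * * hadoop distcp -update -delete hdfs://source-nn:8020/data hdfs://target-nn:8020/data >> /var/log/distcp-daily.log 2>&1
```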