About aarongrubb

aarongrubb · 08-18-2023

This was caused by me overlooking "root" as an actual queue and not giving it the proper permissions for label and capacity to pass on to the child queues. The configuration in the writeup here tipped me off: https://www.ibm.com/support/pages/yarn-node-labels-label-based-scheduling-and-resource-isolation-hadoop-dev

Here is the full configuration that gives me the desired behaviour:

<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>1.0</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
    <value>*</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.node.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.node.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,spark</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.default-node-label-expression</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.node.capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.node.maximum-capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.default-application-priority</name>
    <value>9</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.disable_preemption</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.maximum-capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.accessible-node-labels</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.node.capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.node.maximum-capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.default-application-priority</name>
    <value>9</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.disable_preemption</name>
    <value>true</value>
  </property>
</configuration>
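For comparison with the capacity-scheduler.xml posted in the original question (08-11-2023), the excerpt below isolates the root-level entries that the earlier configuration did not contain. It is only a sketch pulled from the full configuration in this post to make the difference easier to see. The values are copied from it as-is and are not meant as general recommendations, and the XML comment is an added annotation.

```xml
<!-- Root-queue entries present in the working configuration
     but missing from the configuration in the original question -->
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
  <value>*</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.node.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.node.maximum-capacity</name>
  <value>100</value>
</property>
```

All remaining properties are the same in both configurations, apart from the order in which they are listed.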

aarongrubb · 08-11-2023

Hello, I'm new to using labels on YARN nodes. I have successfully set up the labels, but the scheduler is allocating all resources to the DEFAULT_PARTITION under "Effective Capacity" and 0 resources to the labeled partition. As the screenshots illustrate, the NodeManager is launching with the correct label and has the correct resources assigned to that label. However, applications will not start when assigned to that label because, although the partition has resources assigned to it, the queue under the partition does not.

Here's my capacity-scheduler.xml:

<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>1.0</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,spark</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.node.capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.node.maximum-capacity</name>
    <value>[memory=11776,vcores=4]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.default-node-label-expression</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.default-application-priority</name>
    <value>9</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.default.disable_preemption</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.maximum-capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.accessible-node-labels</name>
    <value>node</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.node.capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.node.maximum-capacity</name>
    <value>[memory=4096,vcores=1]</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.default-application-priority</name>
    <value>9</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.spark.disable_preemption</name>
    <value>true</value>
  </property>
</configuration>

And here are the relevant parts of yarn-site.xml:

<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.configuration-type</name>
  <value>distributed</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://xxx:9000/user/yarn/node-labels/</value>
</property>
<property>
  <name>yarn.nodemanager.node-labels.provider</name>
  <value>config</value>
</property>
<property>
  <name>yarn.nodemanager.node-labels.provider.configured-node-partition</name>
  <value>node</value>
</property>

I'm using Hadoop 3.3.4 built from source. In case it matters, this is in my dev environment with a single ResourceManager and NodeManager. Any suggestions are much appreciated. Thanks!

aarongrubb · 11-08-2019

This problem is caused by "mapreduce.framework.name=local" (default in Hadoop 3.2.1). Solved with "set mapreduce.framework.name=yarn".
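The "set" command above applies to the Hive session in which it is run. A minimal sketch of making the same setting persistent is shown below, assuming the cluster picks up mapred-site.xml from its Hadoop configuration directory. The file location and the XML comment are assumptions added for illustration and are not taken from the post.

```xml
<!-- mapred-site.xml (assumed location: the Hadoop configuration directory) -->
<!-- Submit MapReduce jobs to YARN instead of using the local job runner -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```

Clients typically read this file at startup, so a Hive session that is already open would still need the "set" command.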

aarongrubb · 11-06-2019

I'm running a from-scratch cluster on AWS EC2. I have an external table (partitioned) defined with data on S3. I'm able to query this table and receive results to the console with a simple select * statement:

hive> set hive.execution.engine=tez;
hive> select * from external_table where partition_1='1' and partition_2='2';
[correct results returned]

Running a query that requires Tez doesn't return the results to the console:

hive> set hive.execution.engine=tez;
hive> select count(*) from external_table where partition_1='1' and partition_2='2';
Status: Running (Executing on YARN cluster with App id application_1572972524483_0012)
OK
+------+
| _c0  |
+------+
+------+
No rows selected (8.902 seconds)

However, if I dig in the logs and on the filesystem, I can find the results from that query:

(yarn.resourcemanager.log) org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572972524483_0022 CONTAINERID=container_1572972524483_0022_01_000002 RESOURCE=<memory:1024, vCores:1> QUEUENAME=default
(container_folder/syslog_attempt) [TezChild] |exec.FileSinkOperator|: New Final Path: FS file:/tmp/[REALLY LONG FILE PATH]/000000_0

[root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Textl▒ꩇ1som}▒▒j¹▒ 2060

2060 is the correct count for the partition.

Now, oddly enough, I'm able to get the results from the application if I insert overwrite directory on HDFS:

hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE DIRECTORY '/tmp/local_out' select count(*) from external_table where partition_1='1' and partition_2='2';

[root #] hdfs dfs -cat /tmp/local_out/000000_0
2060

However, attempting to insert overwrite local directory fails:

hive> set hive.execution.engine=tez;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' select count(*) from external_table where partition_1='1' and partition_2='2';

[root #] cat /tmp/local_out/000000_0
cat: /tmp/local_out/000000_0: No such file or directory

If I cat the container result file for this query, it's only the number, no class name or special characters:

[root #] cat /tmp/[REALLY LONG FILE PATH]/000000_0
2060

The only out-of-place log message I can find comes from the YARN ResourceManager log:

(yarn.resourcemanager.log) INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1572972524483_0023 CONTAINERID=container_1572972524483_0023_01_000004 RESOURCE=<memory:1024, vCores:1> QUEUENAME=default
(yarn.resourcemanager.log) WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root IP=NMIP OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id. PERMISSIONS=Unauthorized access or invalid container APPID=application_1572972524483_0023 CONTAINERID=container_1572972524483_0023_01_000004

I've also tried creating a table and inserting data into it. The table is created just fine, but when I try to insert data, it throws an error:

hive> set hive.execution.engine=tez;
hive> insert into test_table (test_col) values ('blah'), ('blahblah');
Query ID = root_20191106172949_5301b127-7219-46d1-8fd2-dc80ca7e96ee
Total jobs = 1
Launching Job 1 out of 1
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1573060958692_0001_1_00, diagnostics=[Vertex vertex_1573060958692_0001_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: _dummy_table initializer failed, vertex=vertex_1573060958692_0001_1_00 [Map 1], org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/root/a9b76683-8e19-446a-be74-7a5daedf70e5/hive_2019-11-06_17-29-49_820_224977921325223208-2/dummy_path
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
    at org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:134)
    at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
    at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:76)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:321)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:444)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:564)
    at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:488)
    at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:337)
    at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:122)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

My versions are as follows:

Hadoop 3.2.1
Hive 3.1.2
Tez 0.9.2

Any help is much appreciated!

Last Visited	‎08-25-2023 04:35 AM

Member Since	‎11-06-2019 09:36 AM
Posts	4

Cloudera Community

Re: YARN Node Labels - Effective Capacity is 0% on...

Re: Hive Not Returning YARN Application Results Co...

YARN Node Labels - Effective Capacity is 0% on lab...

Hive Not Returning YARN Application Results Correc...