Cannot access the Atlas service or UI when implementing/setting up Atlas.
If an examination of the Atlas logs reveals numerous and seemingly unrelated errors, check whether the JVM (Java Virtual Machine) heap is behaving correctly. Three primary commands provide real-time insight into the configuration and performance of any active JVM heap:
Identify the PID and full real-time startup configuration of the JVM heap you wish to analyze:
ps -ef | grep -i <Reg-Ex to target the specific heap>
Investigate JVM utilization over time:
jstat -gcutil -t -h10 <JVM Heap PID> 1000
Identify overall JVM capacity:
jstat -gccapacity -t -h10 <JVM Heap PID> 1000
Before making any amendments, record the current configuration and heap stats to quantify improvement later on, or to inform your backout plan should that be necessary.
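A minimal way to capture that baseline to files (a sketch, not part of the original procedure; it assumes a single Atlas server JVM on the host that is matched by the pattern org.apache.atlas.Atlas, and the /tmp paths are just examples):

# Hypothetical baseline-capture sketch; adjust paths and sample counts as needed.
ATLAS_PID=$(pgrep -f org.apache.atlas.Atlas)
ps -fp "$ATLAS_PID" > /tmp/atlas_jvm_baseline_ps.txt                                # startup flags, for the backout plan
jstat -gcutil -t "$ATLAS_PID" 1000 60 > /tmp/atlas_jvm_baseline_gcutil.txt          # 60 one-second utilization samples
jstat -gccapacity -t "$ATLAS_PID" 1000 60 > /tmp/atlas_jvm_baseline_gccapacity.txt  # 60 one-second capacity samples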
Identify the PID/startup-config of the JVM heap you want to analyze:
[root@atlas-001 ~]# ps -ef | grep -i atlas
atlas 228380 228372 99 Sep22 ? 10-05:23:23 /usr/java/jdk1.8.0_181/bin/java -Dproc_atlasmetadata -Datlas.log.dir=/var/log/atlas -Datlas.log.file=application.log -Datlas.home=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas -Datlas.conf=/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf -Xms24576m -Xmx24576m -XX:MaxNewSize=614m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m -server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:/var/log/atlas/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -Dzookeeper.sasl.client.username=zookeeper -Dlog4j.configuration=file:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf/atlas-log4j.properties -Djava.security.auth.login.config=/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf/atlas_jaas.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/atlas_atlas-ATLAS_SERVER-edacd435b585705bd4df6b88ac0c1614_pid228380.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -classpath /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/conf:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/classes:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/lib/*:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/libext/*:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/hbase-conf:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf org.apache.atlas.Atlas -app /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas
The highlighted configuration above shows that the heap was started with -Xms/-Xmx of 24 GB (24576 MB), while -XX:MaxNewSize capped the new generation at only 614 MB, leaving nearly the entire heap for the old (tenured) generation. This is out of line with JVM best practice, which suggests a NewSize-to-OldSize tuning ratio of roughly 1:2.
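If you only need the sizing flags rather than the full command line, something along these lines can pull them out of the running process (a sketch; it assumes the pattern org.apache.atlas.Atlas matches only the Atlas server JVM):

ATLAS_PID=$(pgrep -f org.apache.atlas.Atlas)
# Print just the heap-sizing flags from the JVM's command line
ps -o command= -p "$ATLAS_PID" | tr ' ' '\n' | grep -E '^-Xms|^-Xmx|^-XX:(Max)?NewSize'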
Investigate JVM utilization over time:
[root@atlas-001 ~]# jstat -gcutil -t -h10 228380 1000
Timestamp      S0     S1     E      O      M     CCS    YGC     YGCT     FGC    FGCT      GCT
  62114.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62115.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62116.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62117.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62118.4    0.00 100.00  92.39  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
The highlighted text above shows how long the Atlas service had been running (62114.4 seconds) and the GCT (Garbage Collection Time), which is how long the heap spent in garbage collection (10073.801 seconds). In other words, the JVM was paused for GC roughly 16% of the time (10073/62114*100). Garbage collection is normal and expected in any JVM heap, but it should typically account for only 0-5% of uptime.
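To turn the last sample into a percentage automatically, a one-liner along the following lines can be used (a sketch, not part of the original procedure; with -t, column 1 is the Timestamp and the last column is GCT):

# Prints GC overhead as a percentage of JVM uptime, based on the final of 5 one-second samples
jstat -gcutil -t 228380 1000 5 | awk 'END {printf "GC overhead: %.2f%%\n", $NF/$1*100}'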
Identify overall JVM capacity:
[root@atlas-001 ~]# jstat -gccapacity -t -h10 228380 1000
Timestamp     NGCMN     NGCMX       NGC       S0C      S1C       EC        OGCMN       OGCMX        OGC         OC       MCMN      MCMX       MC     CCSMN     CCSMX     CCSC     YGC    FGC
  64654.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64655.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64656.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64657.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64658.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
Increase Atlas' overall heap size to 30 GB. This keeps the heap below the point at which compressed oops (ordinary object pointers) stop working: compressed oops allow a 64-bit JVM to address heap sizes up to roughly 32 GB using 4-byte pointers, while larger heaps require 8-byte pointers and lose that efficiency.
Increase Atlas' NewSize to 10 GB to achieve the 1:2 NewSize-to-OldSize best-practice ratio (10 GB new generation to 20 GB old generation); example settings are shown below.
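In this environment the change was applied through Cloudera Manager; the exact field names vary by version, so treat the following as an illustration rather than the precise UI path. The Atlas Server maximum heap size was raised to 30 GiB and the NewSize override was appended via the Java options safety valve, so the effective JVM flags become:

# Resulting heap-sizing flags after the change (values taken from the ps output below)
-Xms30720m -Xmx30720m -XX:MaxNewSize=10240m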
Compare configurations by rerunning the JVM commands, first to identify the PID/startup-config of the JVM heap:
[root@atlas-001 ~]# ps -ef | grep -i atlas
atlas 158680 158671 40 Sep26 ? 06:41:28 /usr/java/jdk1.8.0_181/bin/java -Dproc_atlasmetadata -Datlas.log.dir=/var/log/atlas -Datlas.log.file=application.log -Datlas.home=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas -Datlas.conf=/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf -Xms30720m -Xmx30720m -XX:MaxNewSize=614m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m -server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:/var/log/atlas/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -Dzookeeper.sasl.client.username=zookeeper -Dlog4j.configuration=file:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf/atlas-log4j.properties -Djava.security.auth.login.config=/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf/atlas_jaas.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/atlas_atlas-ATLAS_SERVER-edacd435b585705bd4df6b88ac0c1614_pid158680.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -XX:MaxNewSize=10240m -classpath /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/conf:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/classes:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/lib/*:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/libext/*:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/hbase-conf:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf org.apache.atlas.Atlas -app /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas
The highlighted configuration above shows that the overall heap size increased to 30 GB (30720 MB) and the NewSize increased to 10 GB (10240 MB) after the safety valve override was applied. That leaves roughly 20 GB for the old generation, so the NewSize-to-OldSize split now matches the best-practice ratio of 1:2.
Investigate JVM utilization over time:
[root@atlas-001 ~]# jstat -gcutil -t -h10 158680 1000
Timestamp      S0     S1     E      O      M     CCS    YGC    YGCT    FGC    FGCT     GCT
  59794.0    0.00  62.91  21.01  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59795.0    0.00  62.91  26.83  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59796.0    0.00  62.91  30.71  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59797.0    0.00  62.91  34.59  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59798.0    0.00  62.91  38.47  35.90  96.25  92.35    699  80.724     4   0.867  81.591
The highlighted text above shows that the Atlas service had been running for 59794.0 seconds and that Garbage Collection Time was reduced to 81.5 seconds. The JVM was paused for GC only 0.13% of the time (81/59794*100), exactly what you want to see. GC is now so infrequent that the heap no longer blocks the normal operation of the Atlas service.
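The same overhead one-liner from earlier can be rerun against the new PID to confirm the improvement:

jstat -gcutil -t 158680 1000 5 | awk 'END {printf "GC overhead: %.2f%%\n", $NF/$1*100}'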
Identify overall JVM capacity:
[root@atlas-001 ~]# jstat -gccapacity -t -h10 158680 1000
Timestamp      NGCMN        NGCMX        NGC         S0C        S1C         EC         OGCMN       OGCMX        OGC          OC       MCMN      MCMX        MC      CCSMN     CCSMX     CCSC    YGC   FGC
  59817.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   699     4
  59818.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   699     4
  59819.3   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   700     4
  59820.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   700     4
  59821.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   700     4
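If you prefer to read the generation split directly from jstat rather than from the startup flags, a one-liner along these lines prints the configured maxima and their ratio (a sketch; with -t, NGCMX and OGCMX are columns 3 and 9):

# Prints the configured maximum new and old generation sizes (KB) and their ratio
jstat -gccapacity -t 158680 1000 1 | awk 'END {printf "NGCMX=%.0f KB  OGCMX=%.0f KB  ratio=1:%.1f\n", $3, $9, $9/$3}'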
Once you restart the Atlas service, you should be able to access the Atlas UI and service.
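To confirm the UI is reachable without opening a browser, you can query the Atlas admin version endpoint (a sketch that assumes the default port 21000 and a Kerberos-enabled cluster; on non-kerberized clusters use -u <user>:<password> instead of --negotiate):

# Returns the Atlas build version as JSON if the service is up and reachable
curl --negotiate -u : -s http://atlas-001:21000/api/atlas/admin/version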