Cannot access the Atlas service or UI when implementing/setting up Atlas.
If an examination of the Atlas logs reveals numerous and seemingly unrelated errors, check whether the JVM (Java Virtual Machine) heap is behaving correctly. Three primary commands provide real-time insight into the configuration and performance of any active JVM heap:
Identify the PID and full real-time startup configuration of the JVM heap you wish to analyze:
ps -ef | grep -i <Reg-Ex to target the specific heap>
Investigate JVM utilization over time:
jstat -gcutil -t -h10 <JVM Heap PID> 1000
Identify overall JVM capacity:
jstat -gccapacity -t -h10 <JVM Heap PID> 1000
Before making any amendments, record the current configuration and heap stats to quantify improvement later on, or to inform your backout plan should that be necessary.
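A minimal way to capture that baseline to files (a sketch, not part of the original procedure; it assumes a single Atlas server JVM on the host that is matched by the pattern org.apache.atlas.Atlas, and the /tmp paths are just examples):

# Hypothetical baseline-capture sketch; adjust paths and sample counts as needed.
ATLAS_PID=$(pgrep -f org.apache.atlas.Atlas)
ps -fp "$ATLAS_PID" > /tmp/atlas_jvm_baseline_ps.txt                                # startup flags, for the backout plan
jstat -gcutil -t "$ATLAS_PID" 1000 60 > /tmp/atlas_jvm_baseline_gcutil.txt          # 60 one-second utilization samples
jstat -gccapacity -t "$ATLAS_PID" 1000 60 > /tmp/atlas_jvm_baseline_gccapacity.txt  # 60 one-second capacity samples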
Identify the PID/startup-config of the JVM heap you want to analyze:
[root@atlas-001 ~]# ps -ef | grep -i atlas
atlas 228380 228372 99 Sep22 ? 10-05:23:23 /usr/java/jdk1.8.0_181/bin/java -Dproc_atlasmetadata -Datlas.log.dir=/var/log/atlas -Datlas.log.file=application.log -Datlas.home=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas -Datlas.conf=/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf -Xms24576m -Xmx24576m -XX:MaxNewSize=614m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m -server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:/var/log/atlas/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -Dzookeeper.sasl.client.username=zookeeper -Dlog4j.configuration=file:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf/atlas-log4j.properties -Djava.security.auth.login.config=/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf/atlas_jaas.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/atlas_atlas-ATLAS_SERVER-edacd435b585705bd4df6b88ac0c1614_pid228380.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -classpath /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/conf:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/classes:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/lib/*:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/libext/*:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/hbase-conf:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf org.apache.atlas.Atlas -app /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas
The highlighted configuration above shows that the heap was started with -Xms/-Xmx of 24 GB (24576 MB), while -XX:MaxNewSize capped the new generation at only 614 MB, leaving nearly the entire heap for the old (tenured) generation. This is out of line with JVM best practice, which suggests a NewSize-to-OldSize tuning ratio of roughly 1:2.
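If you only need the sizing flags rather than the full command line, something along these lines can pull them out of the running process (a sketch; it assumes the pattern org.apache.atlas.Atlas matches only the Atlas server JVM):

ATLAS_PID=$(pgrep -f org.apache.atlas.Atlas)
# Print just the heap-sizing flags from the JVM's command line
ps -o command= -p "$ATLAS_PID" | tr ' ' '\n' | grep -E '^-Xms|^-Xmx|^-XX:(Max)?NewSize'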
Investigate JVM utilization over time:
[root@atlas-001 ~]# jstat -gcutil -t -h10 228380 1000
Timestamp      S0     S1     E      O      M     CCS    YGC     YGCT     FGC    FGCT      GCT
  62114.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62115.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62116.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62117.4    0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
  62118.4    0.00 100.00  92.39  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801
The highlighted text above shows how long the Atlas service had been running (62114.4 seconds) and the GCT (Garbage Collection Time), which is how long the heap spent in garbage collection (10073.801 seconds). In other words, the JVM was paused for GC roughly 16% of the time (10073/62114*100). Garbage collection is normal and expected in any JVM heap, but it should typically account for only 0-5% of uptime.
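To turn the last sample into a percentage automatically, a one-liner along the following lines can be used (a sketch, not part of the original procedure; with -t, column 1 is the Timestamp and the last column is GCT):

# Prints GC overhead as a percentage of JVM uptime, based on the final of 5 one-second samples
jstat -gcutil -t 228380 1000 5 | awk 'END {printf "GC overhead: %.2f%%\n", $NF/$1*100}'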
Identify overall JVM capacity:
[root@atlas-001 ~]# jstat -gccapacity -t -h10 228380 1000
Timestamp     NGCMN     NGCMX       NGC       S0C      S1C       EC        OGCMN       OGCMX        OGC         OC       MCMN      MCMX       MC     CCSMN     CCSMX     CCSC     YGC    FGC
  64654.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64655.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64656.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64657.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
  64658.4   628736.0  628736.0  628736.0  62848.0  62848.0  503040.0  24537088.0  24537088.0  24537088.0  24537088.0    0.0  1130496.0  92288.0    0.0  1048576.0  10624.0  106122   384
Increase Atlas' overall heap size to 30 GB. This keeps the heap below the point at which compressed oops (ordinary object pointers) stop working: compressed oops allow a 64-bit JVM to address heap sizes up to roughly 32 GB using 4-byte pointers, while larger heaps require 8-byte pointers and lose that efficiency.
Increase Atlas' NewSize to 10 GB to achieve the 1:2 NewSize-to-OldSize best-practice ratio (10 GB new generation to 20 GB old generation); example settings are shown below.
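In this environment the change was applied through Cloudera Manager; the exact field names vary by version, so treat the following as an illustration rather than the precise UI path. The Atlas Server maximum heap size was raised to 30 GiB and the NewSize override was appended via the Java options safety valve, so the effective JVM flags become:

# Resulting heap-sizing flags after the change (values taken from the ps output below)
-Xms30720m -Xmx30720m -XX:MaxNewSize=10240m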
Compare configurations by rerunning the JVM commands, first to identify the PID/startup-config of the JVM heap:
[root@atlas-001 ~]# ps -ef | grep -i atlas
atlas 158680 158671 40 Sep26 ? 06:41:28 /usr/java/jdk1.8.0_181/bin/java -Dproc_atlasmetadata -Datlas.log.dir=/var/log/atlas -Datlas.log.file=application.log -Datlas.home=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas -Datlas.conf=/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf -Xms30720m -Xmx30720m -XX:MaxNewSize=614m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m -server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:/var/log/atlas/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -Dzookeeper.sasl.client.username=zookeeper -Dlog4j.configuration=file:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf/atlas-log4j.properties -Djava.security.auth.login.config=/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf/atlas_jaas.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/atlas_atlas-ATLAS_SERVER-edacd435b585705bd4df6b88ac0c1614_pid158680.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -XX:MaxNewSize=10240m -classpath /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/conf:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/classes:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/lib/*:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/libext/*:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/hbase-conf:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf org.apache.atlas.Atlas -app /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas
The highlighted configuration above shows that the overall heap size increased to 30 GB (30720 MB) and the NewSize increased to 10 GB (10240 MB) after the safety valve override was applied. That leaves roughly 20 GB for the old generation, so the NewSize-to-OldSize split now matches the best-practice ratio of 1:2.
Investigate JVM utilization over time:
[root@atlas-001 ~]# jstat -gcutil -t -h10 158680 1000
Timestamp      S0     S1     E      O      M     CCS    YGC    YGCT    FGC    FGCT     GCT
  59794.0    0.00  62.91  21.01  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59795.0    0.00  62.91  26.83  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59796.0    0.00  62.91  30.71  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59797.0    0.00  62.91  34.59  35.90  96.25  92.35    699  80.724     4   0.867  81.591
  59798.0    0.00  62.91  38.47  35.90  96.25  92.35    699  80.724     4   0.867  81.591
The highlighted text above shows that the Atlas service had been running for 59794.0 seconds and that Garbage Collection Time was reduced to 81.5 seconds. The JVM was paused for GC only 0.13% of the time (81/59794*100), exactly what you want to see. GC is now so infrequent that the heap no longer blocks the normal operation of the Atlas service.
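The same overhead one-liner from earlier can be rerun against the new PID to confirm the improvement:

jstat -gcutil -t 158680 1000 5 | awk 'END {printf "GC overhead: %.2f%%\n", $NF/$1*100}'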
Identify overall JVM capacity:
[root@atlas-001 ~]# jstat -gccapacity -t -h10 158680 1000
Timestamp      NGCMN        NGCMX        NGC         S0C        S1C         EC         OGCMN       OGCMX        OGC          OC       MCMN      MCMX        MC      CCSMN     CCSMX     CCSC    YGC   FGC
  59817.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   699     4
  59818.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   699     4
  59819.3   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   700     4
  59820.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   700     4
  59821.2   10485760.0  10485760.0  10485760.0  1048576.0  1048576.0  8388608.0  20971520.0  20971520.0  20971520.0  20971520.0    0.0  1157120.0  122128.0    0.0  1048576.0  13868.0   700     4
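If you prefer to read the generation split directly from jstat rather than from the startup flags, a one-liner along these lines prints the configured maxima and their ratio (a sketch; with -t, NGCMX and OGCMX are columns 3 and 9):

# Prints the configured maximum new and old generation sizes (KB) and their ratio
jstat -gccapacity -t 158680 1000 1 | awk 'END {printf "NGCMX=%.0f KB  OGCMX=%.0f KB  ratio=1:%.1f\n", $3, $9, $9/$3}'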
Once you restart the Atlas service, you should be able to access the Atlas UI and service.
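To confirm the UI is reachable without opening a browser, you can query the Atlas admin version endpoint (a sketch that assumes the default port 21000 and a Kerberos-enabled cluster; on non-kerberized clusters use -u <user>:<password> instead of --negotiate):

# Returns the Atlas build version as JSON if the service is up and reachable
curl --negotiate -u : -s http://atlas-001:21000/api/atlas/admin/version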