Cloudera Data Analytics (CDA) Articles

Summary

The Atlas service or UI cannot be accessed when implementing or setting up Atlas.

Investigation

If the Atlas logs show numerous, seemingly unrelated errors, check whether the JVM (Java Virtual Machine) heap is sized and performing correctly. Three commands provide real-time insight into the configuration and performance of any running JVM heap:

 

Identify the PID/full real time startup config of the JVM heap you wish to analyze:

ps -ef | grep -i <Reg-Ex to target the specific heap>

 

Investigate JVM utilization over time:

jstat -gcutil -t -h10 <JVM Heap PID> 1000

 

Identify overall JVM capacity:

jstat -gccapacity -t -h10 <JVM Heap PID> 1000

 

Before making any amendments, record the current configuration and heap stats to quantify improvement later on, or to inform your backout plan should that be necessary.
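
One way to record that baseline is sketched below. The helper name save_baseline, the output file names, and the choice of five one-second samples are assumptions made for illustration; only the ps and jstat commands themselves come from this article.

```shell
# save_baseline PID [DIR]: hypothetical helper that snapshots the JVM's startup
# options and a few jstat samples into timestamped files under DIR.
save_baseline() {
  pid=$1
  dir=${2:-.}
  stamp=$(date +%Y%m%d-%H%M%S)
  prefix="$dir/jvm-baseline-$pid-$stamp"

  # Full command line, including -Xms/-Xmx and any -XX options
  ps -o args= -p "$pid" > "$prefix-config.txt"
  # Five one-second samples of heap utilization and heap capacity
  jstat -gcutil -t "$pid" 1000 5 > "$prefix-gcutil.txt"
  jstat -gccapacity -t "$pid" 1000 5 > "$prefix-gccapacity.txt"

  # Print the common file prefix so the caller can find the snapshot
  echo "$prefix"
}
```

For example, save_baseline 228380 /tmp would write three files whose names start with /tmp/jvm-baseline-228380-. Running the same helper after the change gives a second set to compare against.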

Identify the PID/startup-config of the JVM heap you want to analyze:

[root@atlas-001 ~]# ps -ef | grep -i atlas

atlas    228380 228372 99 Sep22 ?        10-05:23:23 /usr/java/jdk1.8.0_181/bin/java -Dproc_atlasmetadata -Datlas.log.dir=/var/log/atlas -Datlas.log.file=application.log -Datlas.home=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas -Datlas.conf=/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf -Xms24576m -Xmx24576m -XX:MaxNewSize=614m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m -server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:/var/log/atlas/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -Dzookeeper.sasl.client.username=zookeeper -Dlog4j.configuration=file:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf/atlas-log4j.properties -Djava.security.auth.login.config=/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf/atlas_jaas.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/atlas_atlas-ATLAS_SERVER-edacd435b585705bd4df6b88ac0c1614_pid228380.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -classpath /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/conf:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/classes:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/lib/*:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/libext/*:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/hbase-conf:/var/run/cloudera-scm-agent/process/1546765845-atlas-ATLAS_SERVER/conf org.apache.atlas.Atlas -app 
/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas

The -Xms24576m, -Xmx24576m, and -XX:MaxNewSize=614m options in the output above show that the overall heap was 24GB (24576 MB) while the NewSize (young generation) was capped at 614MB, leaving almost all of the heap to the OldSize (tenured generation). That is a NewSize:OldSize ratio of roughly 1:39, far from the 1:2 ratio suggested by JVM tuning best practice.

Investigate JVM utilization over time:

[root@atlas-001 ~]# jstat -gcutil -t -h10 228380 1000

Timestamp         S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT

        62114.4   0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801

        62115.4   0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801

        62116.4   0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801

        62117.4   0.00 100.00  92.38  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801

        62118.4   0.00 100.00  92.39  18.08  97.50  93.48 105017 10050.229   380   23.572 10073.801

 

In the output above, the Timestamp column shows how long the Atlas JVM has been running (62114.4 seconds) and the GCT (Garbage Collection Time) column shows the total time spent in garbage collection (10073.801 seconds). The JVM heap was therefore stuck in garbage collection about 16% of the time (10073.801/62114.4*100). Garbage collection is normal and expected within a JVM heap, but it should account for only 0-5% of the running time.
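
The percentage above can also be computed directly from the jstat output. The following is a minimal sketch; the function name gc_overhead is an assumption for illustration, and it relies on jstat being run with -t so that the Timestamp column is present.

```shell
# gc_overhead: read `jstat -gcutil -t` output on stdin and print GCT as a
# percentage of JVM uptime, based on the last data row seen.
gc_overhead() {
  awk '
    # Header row (repeated by -h10): map column names to positions
    $1 == "Timestamp" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: remember the most recent uptime and total GC time
    NF && ("GCT" in col) { up = $(col["Timestamp"]); gct = $(col["GCT"]) }
    END { if (up + 0 > 0) printf "%.2f\n", gct / up * 100 }
  '
}
```

A typical call would be jstat -gcutil -t 228380 1000 5 | gc_overhead. For the first sample shown above, the function prints 16.22.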

Identify overall JVM capacity:

[root@atlas-001 ~]# jstat -gccapacity -t -h10 228380 1000

Timestamp        NGCMN    NGCMX     NGC     S0C   S1C       EC      OGCMN      OGCMX       OGC         OC       MCMN     MCMX      MC     CCSMN    CCSMX     CCSC    YGC    FGC

        64654.4 628736.0 628736.0 628736.0 62848.0 62848.0 503040.0 24537088.0 24537088.0 24537088.0 24537088.0      0.0 1130496.0  92288.0      0.0 1048576.0  10624.0 106122   384

        64655.4 628736.0 628736.0 628736.0 62848.0 62848.0 503040.0 24537088.0 24537088.0 24537088.0 24537088.0      0.0 1130496.0  92288.0      0.0 1048576.0  10624.0 106122   384

        64656.4 628736.0 628736.0 628736.0 62848.0 62848.0 503040.0 24537088.0 24537088.0 24537088.0 24537088.0      0.0 1130496.0  92288.0      0.0 1048576.0  10624.0 106122   384

        64657.4 628736.0 628736.0 628736.0 62848.0 62848.0 503040.0 24537088.0 24537088.0 24537088.0 24537088.0      0.0 1130496.0  92288.0      0.0 1048576.0  10624.0 106122   384

        64658.4 628736.0 628736.0 628736.0 62848.0 62848.0 503040.0 24537088.0 24537088.0 24537088.0 24537088.0      0.0 1130496.0  92288.0      0.0 1048576.0  10624.0 106122   384
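
The NGC (current new generation capacity) and OGC (current old generation capacity) columns above are reported in KB, so the NewSize:OldSize ratio can be derived from them. The sketch below assumes jstat was run with -t; the function name gen_ratio is hypothetical.

```shell
# gen_ratio: read `jstat -gccapacity -t` output on stdin and print the current
# new and old generation sizes in MB plus the NewSize:OldSize ratio.
gen_ratio() {
  awk '
    # Header row (repeated by -h10): map column names to positions
    $1 == "Timestamp" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: keep the most recent NGC and OGC values (both in KB)
    NF && ("NGC" in col) { ngc = $(col["NGC"]); ogc = $(col["OGC"]) }
    END {
      if (ngc + 0 > 0)
        printf "NewSize=%.0fMB OldSize=%.0fMB ratio=1:%.1f\n", ngc / 1024, ogc / 1024, ogc / ngc
    }
  '
}
```

Run against the capacity output above (for example, jstat -gccapacity -t 228380 1000 1 | gen_ratio), this prints NewSize=614MB OldSize=23962MB ratio=1:39.0, which makes the imbalance between the two generations explicit.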

Resolution

Increase Atlas' overall heap size to 30GB, which stays safely below the roughly 32GB limit above which the JVM can no longer use compressed oops (ordinary object pointers). Compressed oops enable a 64-bit JVM to address heap sizes up to about 32GB using 4-byte pointers. For larger heap sizes, 8-byte pointers are required, which increases the memory consumed by every object reference.
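
Whether compressed oops remain enabled at a given heap size can be checked with the JVM's -XX:+PrintFlagsFinal option. The sketch below is one way to read the result; the function name compressed_oops_enabled is an assumption, and the java path in the usage note is taken from the ps output earlier in this article.

```shell
# compressed_oops_enabled: read `java -XX:+PrintFlagsFinal -version` output on
# stdin and print the value of the UseCompressedOops flag (true or false).
compressed_oops_enabled() {
  # Flag lines look like: <type> <name> <:= or => <value> {<category>}
  awk '$2 == "UseCompressedOops" { print $4 }'
}
```

For example, /usr/java/jdk1.8.0_181/bin/java -Xmx30720m -XX:+PrintFlagsFinal -version 2>/dev/null | compressed_oops_enabled should print true, whereas the same command with -Xmx32g should print false on a JVM with default object alignment.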

 

Increase Atlas NewSize to 10GB to achieve the 1:2 NewSize-to-OldSize best practice ratio.

Compare configurations by rerunning the JVM commands, first to identify the PID/startup-config of the JVM heap:

[root@atlas-001 ~]# ps -ef | grep -i atlas

atlas    158680 158671 40 Sep26?        06:41:28 /usr/java/jdk1.8.0_181/bin/java -Dproc_atlasmetadata -Datlas.log.dir=/var/log/atlas -Datlas.log.file=application.log -Datlas.home=/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas -Datlas.conf=/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf -Xms30720m -Xmx30720m -XX:MaxNewSize=614m -XX:MetaspaceSize=100M -XX:MaxMetaspaceSize=512m -server -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof -Xloggc:/var/log/atlas/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -Dsun.security.krb5.disableReferrals=true -Djdk.tls.ephemeralDHKeySize=2048 -Dzookeeper.sasl.client.username=zookeeper -Dlog4j.configuration=file:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf/atlas-log4j.properties -Djava.security.auth.login.config=/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf/atlas_jaas.conf -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/atlas_atlas-ATLAS_SERVER-edacd435b585705bd4df6b88ac0c1614_pid158680.hprof -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh -XX:MaxNewSize=10240m -classpath /opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/conf:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/classes:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas/WEB-INF/lib/*:/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/libext/*:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/hbase-conf:/var/run/cloudera-scm-agent/process/1546775123-atlas-ATLAS_SERVER/conf org.apache.atlas.Atlas -app 
/opt/cloudera/parcels/CDH-7.1.7-1.cdh7.1.7.p96.22699993/lib/atlas/server/webapp/atlas

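The -Xms30720m, -Xmx30720m, and trailing -XX:MaxNewSize=10240m options in the output above show that the overall heap size increased to 30GB (30720 MB) and the NewSize increased to 10GB (10240 MB) after applying the safety valve override. The earlier -XX:MaxNewSize=614m is still present on the command line, but the later value takes effect, as the capacity figures further down confirm. With 10GB for the new generation and the remaining 20GB for the old generation, the heap now matches the 1:2 NewSize:OldSize best practice ratio (NewSize is one third of the overall heap).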

Investigate JVM utilization over time:

[root@atlas-001 ~]# jstat -gcutil -t -h10 158680 1000

Timestamp         S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT

        59794.0   0.00  62.91  21.01  35.90  96.25  92.35    699   80.724     4    0.867   81.591

        59795.0   0.00  62.91  26.83  35.90  96.25  92.35    699   80.724     4    0.867   81.591

        59796.0   0.00  62.91  30.71  35.90  96.25  92.35    699   80.724     4    0.867   81.591

        59797.0   0.00  62.91  34.59  35.90  96.25  92.35    699   80.724     4    0.867   81.591

        59798.0   0.00  62.91  38.47  35.90  96.25  92.35    699   80.724     4    0.867   81.591

 

The output above shows that the Atlas JVM had been running for 59794.0 seconds and that total Garbage Collection Time (GCT) fell to 81.591 seconds. The JVM heap was stuck in garbage collection only about 0.14% of the time (81.591/59794.0*100), well inside the 0-5% target. Garbage collection now takes up so little of the running time that the heap is no longer blocking the normal operation of the Atlas service.


Identify overall JVM capacity:

[root@atlas-001 ~]# jstat -gccapacity -t -h10 158680 1000

Timestamp        NGCMN    NGCMX     NGC     S0C   S1C       EC      OGCMN      OGCMX       OGC         OC       MCMN     MCMX      MC     CCSMN    CCSMX     CCSC    YGC    FGC

        59817.2 10485760.0 10485760.0 10485760.0 1048576.0 1048576.0 8388608.0 20971520.0 20971520.0 20971520.0 20971520.0      0.0 1157120.0 122128.0      0.0 1048576.0  13868.0    699     4

        59818.2 10485760.0 10485760.0 10485760.0 1048576.0 1048576.0 8388608.0 20971520.0 20971520.0 20971520.0 20971520.0      0.0 1157120.0 122128.0      0.0 1048576.0  13868.0    699     4

        59819.3 10485760.0 10485760.0 10485760.0 1048576.0 1048576.0 8388608.0 20971520.0 20971520.0 20971520.0 20971520.0      0.0 1157120.0 122128.0      0.0 1048576.0  13868.0    700     4

        59820.2 10485760.0 10485760.0 10485760.0 1048576.0 1048576.0 8388608.0 20971520.0 20971520.0 20971520.0 20971520.0      0.0 1157120.0 122128.0      0.0 1048576.0  13868.0    700     4

        59821.2 10485760.0 10485760.0 10485760.0 1048576.0 1048576.0 8388608.0 20971520.0 20971520.0 20971520.0 20971520.0      0.0 1157120.0 122128.0      0.0 1048576.0  13868.0    700     4

 

Once the Atlas service has been restarted with the new heap settings, you should be able to access the Atlas UI and service.
