Member since: 09-29-2015
Posts: 123
Kudos Received: 216
Solutions: 47
12-30-2016
07:02 PM
@Joshua Adeleke , there is a statement in the article that "block deletion activity in HDFS is asynchronous". This statement also applies when finalizing an upgrade. Since this processing happens asynchronously, it's difficult to put an accurate wall clock estimate on it. In my experience, I've generally seen the physical on-disk block deletions start happening 2-5 minutes after finalizing an upgrade.
06-16-2016
11:41 PM
9 Kudos
Summary

HDFS Rolling Upgrade facilitates software upgrades of individual components in an HDFS cluster. During the upgrade window, HDFS will not physically delete blocks. Normal block deletion resumes after the administrator finalizes the upgrade. A common source of operational problems is forgetting to finalize an upgrade. If left unaddressed, HDFS will run out of storage capacity, and attempts to delete files will not free space. To avoid this problem, always finalize HDFS rolling upgrades in a timely fashion. This information applies to both Ambari Rolling and Express Upgrades.

Rolling Upgrade Block Handling

The high-level workflow of a rolling upgrade for the administrator is:

1. Initiate the rolling upgrade.
2. Perform the software upgrade on individual nodes.
3. Run typical workloads and validate that the new software works.
4. If validation is successful, finalize the upgrade.
5. If validation discovers a problem, revert to the prior software via one of 2 options:
   - Rollback - Restore the prior software and restore cluster data to its pre-upgrade state.
   - Downgrade - Restore the prior software, but preserve data changes that occurred during the upgrade window.

The Apache Hadoop documentation on HDFS Rolling Upgrade covers the specific commands in more detail.

To satisfy the requirements of Rollback, HDFS will not delete blocks during a rolling upgrade window, which is the time between initiating the rolling upgrade and finalizing it. During this window, DataNodes handle block deletions by moving the blocks to a special directory named "trash" instead of physically deleting them. While the blocks reside in trash, they are not visible to clients performing reads. Thus, the files are logically deleted, but the blocks still consume physical space on the DataNode volumes. If the administrator chooses to rollback, the DataNodes restore these blocks from the trash directory to return the cluster's data to its pre-upgrade state.

After the upgrade is finalized, normal block deletion processing resumes. Blocks previously saved to trash will be physically deleted, and new deletion activity will result in a physical delete instead of a move to trash. Block deletion is asynchronous, so there may be a propagation delay between the user deleting a file and the space being freed as reported by tools like "hdfs dfsadmin -report".

Impact on HDFS Space Utilization

An important consequence of this behavior is that during a rolling upgrade window, HDFS space utilization can only rise, never fall. Attempting to free space by deleting files will be ineffective, because the blocks will be moved to the trash directory instead of physically deleted. Note that this behavior applies not only to files that existed before the upgrade, but also to new files created during the upgrade window. All deletes are handled by moving the blocks to trash.
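To make the bookkeeping concrete, here is a toy model (plain Python, not Hadoop code) of the delete behavior described above: during the upgrade window a delete only moves a block to trash, so volume usage does not drop until the upgrade is finalized.

```python
# Illustrative model of DataNode block deletion during a rolling upgrade
# window. All names here are made up for the sketch; this is not HDFS code.

class DataNodeVolumeModel:
    def __init__(self, blocks):
        self.finalized = dict(blocks)   # block id -> size in bytes
        self.trash = {}
        self.upgrade_in_progress = False

    def used_bytes(self):
        # Blocks in trash still consume physical space on the volume.
        return sum(self.finalized.values()) + sum(self.trash.values())

    def delete(self, block_id):
        size = self.finalized.pop(block_id)
        if self.upgrade_in_progress:
            self.trash[block_id] = size   # logical delete only
        # else: physical delete frees the space immediately

    def finalize_upgrade(self):
        self.upgrade_in_progress = False
        self.trash.clear()                # trash is physically deleted

vol = DataNodeVolumeModel({"blk_1": 128, "blk_2": 64})
vol.upgrade_in_progress = True
vol.delete("blk_1")
print(vol.used_bytes())   # 192: space is not freed during the window
vol.finalize_upgrade()
print(vol.used_bytes())   # 64
```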
An administrator might notice that even after deleting a large number of files, various tools continue to report high space consumption. This includes "hdfs dfsadmin -report", JMX metrics (which are consumed by Apache Ambari) and the NameNode web UI. If a cluster shows these symptoms, check whether a rolling upgrade has been left unfinalized. There are multiple ways to check this. The "hdfs dfsadmin -rollingUpgrade query" command will report "Proceed with rolling upgrade", and the "Finalize Time" will be unspecified.

> hdfs dfsadmin -rollingUpgrade query
QUERY rolling upgrade ...
Proceed with rolling upgrade:
Block Pool ID: BP-1273075337-10.22.2.98-1466102062415
Start Time: Thu Jun 16 14:55:09 PDT 2016 (=1466114109053)
Finalize Time: <NOT FINALIZED>

The NameNode web UI will display a banner at the top stating "Rolling upgrade started". JMX metrics also expose "RollingUpgradeStatus", which will have a "finalizeTime" of 0 if the upgrade has not been finalized.

> curl 'http://10.22.2.98:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'
...
"RollingUpgradeStatus" : {
"blockPoolId" : "BP-1273075337-10.22.2.98-1466102062415",
"createdRollbackImages" : true,
"finalizeTime" : 0,
"startTime" : 1466114109053
},
...

DataNode Disk Layout

This section explores the on-disk layout of DataNodes that have logically deleted blocks during a rolling upgrade window. The following discussion uses a small testing cluster containing only one file. This is a typical disk layout on a DataNode volume hosting exactly one block replica:

data/dfs/data/current
├── BP-1273075337-10.22.2.98-1466102062415
│ ├── current
│ │ ├── VERSION
│ │ ├── dfsUsed
│ │ ├── finalized
│ │ │ └── subdir0
│ │ │ └── subdir0
│ │ │ ├── blk_1073741825
│ │ │ └── blk_1073741825_1001.meta
│ │ └── rbw
│ ├── scanner.cursor
│ └── tmp
└── VERSION

The block file and its corresponding metadata file are in the "finalized" directory. If this file were deleted during a rolling upgrade window, then the block file and its metadata file would move to the trash directory:

data/dfs/data/current
├── BP-1273075337-10.22.2.98-1466102062415
│ ├── RollingUpgradeInProgress
│ ├── current
│ │ ├── VERSION
│ │ ├── dfsUsed
│ │ ├── finalized
│ │ │ └── subdir0
│ │ │ └── subdir0
│ │ └── rbw
│ ├── scanner.cursor
│ ├── tmp
│ └── trash
│ └── finalized
│ └── subdir0
│ └── subdir0
│ ├── blk_1073741825
│ └── blk_1073741825_1001.meta
└── VERSION

As a reminder, block deletion activity in HDFS is asynchronous. It may take several minutes after running the "hdfs dfs -rm" command before the block moves from finalized to trash. One way to measure the extra space consumed by logically deleted files is to run a "du" command on the trash directory.

> du -hs data/dfs/data/current/BP-1273075337-10.22.2.98-1466102062415/trash
8.0K data/dfs/data/current/BP-1273075337-10.22.2.98-1466102062415/trash

Assuming relatively even data distribution across the nodes in the cluster, if this shows that a significant proportion of the volume's capacity is consumed by the trash directory, then that is a sign that an unfinalized rolling upgrade is the source of the space consumption.

Conclusion

Finalize those upgrades!
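As a postscript, the JMX check described earlier is easy to automate. A minimal sketch, assuming the "RollingUpgradeStatus" entry has already been extracted from the NameNodeInfo bean (the real /jmx response wraps beans in a "beans" array, which is omitted here for brevity):

```python
# Sketch: decide from NameNodeInfo JMX data whether a rolling upgrade is
# still unfinalized. The JSON shape follows the sample curl output above;
# fetching it over HTTP is left out.
import json

def upgrade_unfinalized(namenode_info_json):
    """Return True if RollingUpgradeStatus is present with finalizeTime == 0."""
    bean = json.loads(namenode_info_json)
    status = bean.get("RollingUpgradeStatus")
    if not status:
        return False          # no rolling upgrade in progress
    return status.get("finalizeTime", 0) == 0

sample = '''{"RollingUpgradeStatus": {
    "blockPoolId": "BP-1273075337-10.22.2.98-1466102062415",
    "createdRollbackImages": true,
    "finalizeTime": 0,
    "startTime": 1466114109053}}'''
print(upgrade_unfinalized(sample))   # True
```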
06-11-2016
08:09 PM
@Robert Levas, thanks for the great article! May I also suggest adding information about the "hadoop kerbname" or "hadoop org.apache.hadoop.security.HadoopKerberosName" shell command? This is a helpful debugging tool that prints the current principal's short name after Hadoop applies the currently configured auth_to_local rules. If you'd like, feel free to copy-paste my text from this answer: https://community.hortonworks.com/questions/38573/pig-view-hdfs-test-failing-service-hdfs-check-fail.html .
06-08-2016
10:09 PM
2 Kudos
Hello @Mingliang Liu. Nice article! I'd like to add that in step 7, when doing a distro build, I often like to speed it up a little more by passing the argument -Dmaven.javadoc.skip=true. As long as I don't need to inspect JavaDoc changes, this can make the build complete faster.
06-08-2016
06:51 PM
13 Kudos
LDAP Usage
Hadoop may be configured to use LDAP as the source for resolving an authenticated user's group memberships. A common example of where Hadoop needs to resolve group memberships is the permission checking performed by HDFS at the NameNode. The Apache documentation's HDFS Permissions Guide contains further discussion of how the group mapping works: the NameNode calls a configurable plugin to get the user's group memberships before checking permissions. Despite that document's focus on group resolution at the NameNode, many other Hadoop processes also call the group mapping, so the information in this document applies to the entire ecosystem of Hadoop-related components.

As described in that document, the exact implementation of the group mapping is configurable. Here is the documentation of the configuration property from core-default.xml and its default value.

<property>
<name>hadoop.security.group.mapping</name>
<value>org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback</value>
<description>
Class for user to group mapping (get groups for a given user) for ACL.
The default implementation,
org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback,
will determine if the Java Native Interface (JNI) is available. If JNI is
available the implementation will use the API within hadoop to resolve a
list of groups for a user. If JNI is not available then the shell
implementation, ShellBasedUnixGroupsMapping, is used. This implementation
shells out to the Linux/Unix environment with the
<code>bash -c groups</code> command to resolve a list of groups for a user.
</description>
</property>

LDAP integration arises from several possible configuration scenarios:

- hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback, and the host OS integrates directly with LDAP, such as via pam_ldap. A Hadoop process will look up group memberships via standard syscalls, and those syscalls will delegate to pam_ldap.
- hadoop.security.group.mapping=org.apache.hadoop.security.LdapGroupsMapping. A Hadoop process will call the LDAP server directly. This can be useful if the host OS cannot integrate with LDAP for some reason. As a side effect, it is possible that Hadoop will see a different list of group memberships for a user compared to what the host OS reports, such as by running the "groups" command at the shell.
- Since group mapping is pluggable, it is possible (though rare) that a deployment has configured hadoop.security.group.mapping as a custom implementation of the org.apache.hadoop.security.GroupMappingServiceProvider interface. In that case, the integration pattern will vary depending on the implementation details.

Troubleshooting Group Membership

If there is any doubt about how Hadoop is resolving a user's group memberships, then a helpful troubleshooting step is to run the following command while logged in as that user. It prints authentication information for the current user, including group memberships, exactly as the Hadoop code sees them.

> hadoop org.apache.hadoop.security.UserGroupInformation
Getting UGI for current user
User: chris
Group Ids:
Groups: staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer com.apple.access_screensharing com.apple.access_ssh
UGI: chris (auth:SIMPLE)
Auth method SIMPLE
Keytab false
============================================================

However, in the case of HDFS file permissions, recall that group resolution really occurs at the NameNode before it checks authorization for the user. If configuration differs between the NameNode and the client host, then it's possible that the NameNode will see different results for the group memberships. To see the NameNode's view of the user's group memberships, run the following command.

> hdfs groups
chris : staff everyone localaccounts _appserverusr admin _appserveradm _lpadmin _appstore _lpoperator _developer com.apple.access_screensharing com.apple.access_ssh

Load Patterns

Because Hadoop is a distributed system running across hundreds or thousands of nodes, all independently resolving users' group memberships, this usage pattern may generate unexpectedly high call volume against the LDAP infrastructure. Typical symptoms are slow responses from the LDAP server, perhaps resulting in timeouts. If group resolution takes too long, then the Hadoop process may log a message like this:

2016-06-07 13:07:00,831 WARN security.Groups (Groups.java:getGroups(181)) - Potential performance problem: getGroups(user=chris) took 13018 milliseconds.

The threshold for this warning is configurable, with a default value of 5 seconds.

<property>
<name>hadoop.security.groups.cache.warn.after.ms</name>
<value>5000</value>
<description>
If looking up a single user to group takes longer than this amount of
milliseconds, we will log a warning message.
</description>
</property>

Impacts

The exact impact on the Hadoop process varies. In many cases, such as execution of a YARN container running a map task, the delay simply increases the total latency of that container's execution. A more harmful case is slow lookup at the HDFS JournalNode. The NameNode must be able to log edits to a quorum of JournalNodes (i.e. 2 out of 3 JournalNodes). If multiple JournalNodes simultaneously experience a long delay in group resolution, then it's possible to exceed the NameNode's timeout for JournalNode calls. If the calls time out to 2 or more JournalNodes, then that is a fatal condition: the NameNode must be able to log transactions successfully, and if it cannot, it aborts intentionally. This condition would trigger an unwanted HA failover, and the problem might recur after failover, resulting in flapping. If this happens, the JournalNode logs will show the "performance problem" warning mentioned above, and the NameNode logs will show a message about "Timed out waiting for a quorum of nodes to respond" before a FATAL shutdown error.

Tuning

If your cluster is encountering problems due to high load on the LDAP infrastructure, then there are several possible ways to mitigate this by tuning the Hadoop deployment.

In-Process Caching

Hadoop supports in-process caching of group membership resolution data. Several configuration properties control the behavior of the cache, and tuning them may help mitigate LDAP load issues.

<property>
<name>hadoop.security.groups.cache.secs</name>
<value>300</value>
<description>
This is the config controlling the validity of the entries in the cache
containing the user->group mapping. When this duration has expired,
then the implementation of the group mapping provider is invoked to get
the groups of the user and then cached back.
</description>
</property>

<property>
<name>hadoop.security.groups.negative-cache.secs</name>
<value>30</value>
<description>
Expiration time for entries in the the negative user-to-group mapping
caching, in seconds. This is useful when invalid users are retrying
frequently. It is suggested to set a small value for this expiration, since
a transient error in group lookup could temporarily lock out a legitimate
user.
Set this to zero or negative value to disable negative user-to-group caching.
</description>
</property>

The NameNode and ResourceManager provide administrative commands for forcing invalidation of the in-process group cache. This can be useful for propagating group membership changes without requiring a restart of the NameNode or ResourceManager process.

> hdfs dfsadmin -refreshUserToGroupsMappings
Refresh user to groups mapping successful

> yarn rmadmin -refreshUserToGroupsMappings
16/06/08 11:38:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8033

External Caching with Name Service Cache Daemon

If the host OS integrates with LDAP (e.g. hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback and the host OS uses pam_ldap), then the Name Service Cache Daemon (nscd) is an effective approach for caching group memberships at the OS layer. This approach is superior to Hadoop's in-process caching, because nscd allows multiple Hadoop processes running on the same host to share a common cache, avoiding repeated lookups across different processes. However, nscd is unlikely to be beneficial if hadoop.security.group.mapping=org.apache.hadoop.security.LdapGroupsMapping, because Hadoop processes will issue their own LDAP calls directly instead of delegating to the host OS.

Static Group Mapping

Hadoop also supports specifying a static mapping of users to their group memberships in core-site.xml.

<property>
<name>hadoop.user.group.static.mapping.overrides</name>
<value>dr.who=;</value>
<description>
Static mapping of user to groups. This will override the groups if
available in the system for the specified user. In otherwords, groups
look-up will not happen for these users, instead groups mapped in this
configuration will be used.
Mapping should be in this format.
user1=group1,group2;user2=;user3=group2;
Default, "dr.who=;" will consider "dr.who" as user without groups.
</description>
</property>

This approach completely bypasses LDAP (or any other group lookup mechanism) for the specified users. A drawback is that administrators lose centralized management of group memberships through LDAP for those users. In practice, this is not a significant drawback for the HDP service principals, which generally don't change their group memberships. For example:

<property>
<name>hadoop.user.group.static.mapping.overrides</name>
<value>hive=hadoop,hive;hdfs=hadoop,hdfs;oozie=users,hadoop,oozie;knox=hadoop;mapred=hadoop,mapred;zookeeper=hadoop;falcon=hadoop;sqoop=hadoop;yarn=hadoop;hcat=hadoop;ams=hadoop;root=hadoop;ranger=hadoop;rangerlogger=hadoop;rangeradmin=hadoop;ambari-qa=hadoop,users;</value>
</property>

Static mapping is particularly effective at mitigating the problem of slow group lookups at the JournalNode discussed earlier. JournalNode calls are almost exclusively performed by the hdfs service principal, so specifying it in the static mapping removes the need for the JournalNode to call LDAP. Note that any of this configuration tuning requires a restart of the relevant Hadoop process (such as the NameNode or JournalNode) to take effect.
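For illustration, the override format documented above ("user1=group1,group2;user2=;user3=group2;") can be parsed as follows. This is a sketch of the documented syntax only, not Hadoop's own parser:

```python
# Parse hadoop.user.group.static.mapping.overrides into a dict of
# user -> list of groups, following the format quoted in the property
# description above.

def parse_static_mapping(value):
    mapping = {}
    for entry in value.strip().strip(";").split(";"):
        if not entry:
            continue
        user, _, groups = entry.partition("=")
        # "user2=" means the user has no groups.
        mapping[user] = [g for g in groups.split(",") if g]
    return mapping

print(parse_static_mapping("dr.who=;"))
# {'dr.who': []}
print(parse_static_mapping("hive=hadoop,hive;hdfs=hadoop,hdfs"))
# {'hive': ['hadoop', 'hive'], 'hdfs': ['hadoop', 'hdfs']}
```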
02-24-2016
07:36 PM
3 Kudos
Apache JIRA HDFS-9552 is a documentation patch that clarifies exactly what kinds of permission checks are enforced by HDFS for each kind of user operation on a file system path. This documentation is not yet live on hadoop.apache.org, so I am posting the same content here in the interim until HDFS-9552 ships in an official Apache release.

Permission Checks

Each HDFS operation demands that the user has specific permissions (some combination of READ, WRITE and EXECUTE), granted through file ownership, group membership or the other permissions. An operation may perform permission checks at multiple components of the path, not only the final component. Additionally, some operations depend on a check of the owner of a path.

All operations require traversal access. Traversal access demands the EXECUTE permission on all existing components of the path, except for the final path component. For example, for any operation accessing /foo/bar/baz, the caller must have EXECUTE permission on /, /foo and /foo/bar.

The following table describes the permission checks performed by HDFS for each component of the path.
- Ownership: Whether or not to check if the caller is the owner of the path. Typically, operations that change the ownership or permission metadata demand that the caller is the owner.
- Parent: The parent directory of the requested path. For example, for the path /foo/bar/baz, the parent is /foo/bar.
- Ancestor: The last existing component of the requested path. For example, for the path /foo/bar/baz, the ancestor path is /foo/bar if /foo/bar exists. The ancestor path is /foo if /foo exists but /foo/bar does not exist.
- Final: The final component of the requested path. For example, for the path /foo/bar/baz, the final path component is /foo/bar/baz.
- Sub-tree: For a path that is a directory, the directory itself and all of its child sub-directories, recursively. For example, for the path /foo/bar/baz, which has 2 sub-directories named buz and boo, the sub-tree is /foo/bar/baz, /foo/bar/baz/buz and /foo/bar/baz/boo.

| Operation | Ownership | Parent | Ancestor | Final | Sub-tree |
|---|---|---|---|---|---|
| append | NO | N/A | N/A | WRITE | N/A |
| concat | NO [2] | WRITE (sources) | N/A | READ (sources), WRITE (destination) | N/A |
| create | NO | N/A | WRITE | WRITE [1] | N/A |
| createSnapshot | YES | N/A | N/A | N/A | N/A |
| delete | NO [2] | WRITE | N/A | N/A | READ, WRITE, EXECUTE |
| deleteSnapshot | YES | N/A | N/A | N/A | N/A |
| getAclStatus | NO | N/A | N/A | N/A | N/A |
| getBlockLocations | NO | N/A | N/A | READ | N/A |
| getContentSummary | NO | N/A | N/A | N/A | READ, EXECUTE |
| getFileInfo | NO | N/A | N/A | N/A | N/A |
| getFileLinkInfo | NO | N/A | N/A | N/A | N/A |
| getLinkTarget | NO | N/A | N/A | N/A | N/A |
| getListing | NO | N/A | N/A | READ, EXECUTE | N/A |
| getSnapshotDiffReport | NO | N/A | N/A | READ | READ |
| getStoragePolicy | NO | N/A | N/A | READ | N/A |
| getXAttrs | NO | N/A | N/A | READ | N/A |
| listXAttrs | NO | EXECUTE | N/A | N/A | N/A |
| mkdirs | NO | N/A | WRITE | N/A | N/A |
| modifyAclEntries | YES | N/A | N/A | N/A | N/A |
| removeAcl | YES | N/A | N/A | N/A | N/A |
| removeAclEntries | YES | N/A | N/A | N/A | N/A |
| removeDefaultAcl | YES | N/A | N/A | N/A | N/A |
| removeXAttr | NO [2] | N/A | N/A | WRITE | N/A |
| rename | NO [2] | WRITE (source) | WRITE (destination) | N/A | N/A |
| renameSnapshot | YES | N/A | N/A | N/A | N/A |
| setAcl | YES | N/A | N/A | N/A | N/A |
| setOwner | YES [3] | N/A | N/A | N/A | N/A |
| setPermission | YES | N/A | N/A | N/A | N/A |
| setReplication | NO | N/A | N/A | WRITE | N/A |
| setStoragePolicy | NO | N/A | N/A | WRITE | N/A |
| setTimes | NO | N/A | N/A | WRITE | N/A |
| setXAttr | NO [2] | N/A | N/A | WRITE | N/A |
| truncate | NO | N/A | N/A | WRITE | N/A |

[1] WRITE access on the final path component during create is only required if the call uses the overwrite option and there is an existing file at the path.
[2] Any operation that checks WRITE permission on the parent directory also checks ownership if the sticky bit is set.
[3] Calling setOwner to change the user that owns a file requires HDFS super-user access. HDFS super-user access is not required to change the group, but the caller must be a member of the specified group.
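As a worked example of the traversal rule, this small sketch (illustrative, not HDFS code) computes which path components receive the EXECUTE check, assuming all components exist:

```python
# Compute the components that require EXECUTE for traversal access:
# every existing component of the path except the final one.

def traversal_components(path):
    parts = [p for p in path.split("/") if p]
    comps = ["/"]                         # the root is always traversed
    for i in range(len(parts) - 1):       # stop before the final component
        comps.append("/" + "/".join(parts[:i + 1]))
    return comps

print(traversal_components("/foo/bar/baz"))
# ['/', '/foo', '/foo/bar']
```

This matches the example in the text: an operation on /foo/bar/baz needs EXECUTE on /, /foo and /foo/bar, but not on /foo/bar/baz itself.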
02-03-2016
07:53 PM
29 Kudos
Garbage Collection Best Practice

GC Configuration

This is an example implementation of our current recommendation for best practice GC tuning, driven by the requirements of the NameNode:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote -XX:+UseConcMarkSweepGC -XX:ParallelGCThreads=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -Xms1G -Xmx1G -XX:NewSize=128M -XX:MaxNewSize=128M -XX:PermSize=128M -XX:MaxPermSize=256M -verbose:gc -Xloggc:/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/gc.log-`date +'%Y%m%d%H%M'` -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:ErrorFile=/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError $HADOOP_NAMENODE_OPTS"

This is from a small development environment, so the heap is much smaller (1 GB) than what you'll see in a real production cluster. In the past, we've shared with you what to use for GC settings, but have we ever explained why?

GC Tuning Rationale

Let's address each setting individually and explain why we recommend it.

-XX:+UseConcMarkSweepGC

This flag enables the Concurrent Mark Sweep garbage collector. Why do we use this garbage collector? The JVM offers multiple garbage collector implementations, each making different engineering trade-offs to serve different kinds of applications. Literature on garbage collection discusses a trade-off between responsiveness (how quickly an application responds) and throughput (maximizing the total amount of work done by an application in a period of time). Quoting Oracle's garbage collection tuning guide:

The concurrent collector is designed for applications that prefer shorter garbage collection pauses and that can afford to share processor resources with the garbage collector while the application is running.
Typically applications which have a relatively large set of long-lived data (a large tenured generation), and run on machines with two or more processors tend to benefit from the use of this collector.

These characteristics should sound familiar for our Hadoop daemons. Taking the NameNode as an example:

- We prefer shorter pauses, because we don't want a client's RPC call blocked waiting for a garbage collection, which would harm latency.
- It's fine to share CPU resources between the garbage collector and our application logic. Realistic Hadoop deployments run on multi-core/multi-processor machines where the full resources are dedicated to running Hadoop and no other applications.
- We have a large set of long-lived data. Every file is represented in memory as an INode object, and it's alive for the duration of the process unless someone deletes the file.

CMS suits our usage patterns well. Is there anything wrong with CMS? If poorly configured, fragmentation can accumulate in the tenured generation over time. CMS collections do not perform periodic compaction (rearranging existing object allocations into contiguous space) to fix the fragmentation. In the degenerate case, CMS is forced to "stop the world" to do a compaction, which is not a concurrent operation. We do not expect this to happen with our current recommendations for GC configuration.

Personal horror story: I once worked on an application (not Hadoop) with a unique memory allocation pattern that drove CMS into a 2-hour stop-the-world collection and compaction. I've been meaning to contact Guinness to find out if I've set the world record for longest GC pause.

-XX:ParallelGCThreads=8

Sets the number of threads to use for garbage collection (the "concurrency" in "Concurrent Mark Sweep"). Running jstack on the JVM process will show this number of threads named like:
Running jstack on the JVM process will show this number of threads named like: "Gang worker#0 (Parallel GC Threads)" We expect tuning this correctly to shorten the amount of time required to complete a garbage collection. In practice, we’ve seen that a value of 8 works well on typical hardware. -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 By default, CMS tracks internal heuristics of the application’s usage pattern to determine when to trigger a garbage collection. Typically, this tends to delay collection until the old generation is nearly full. In the worst case, this can degrade to a stop-the-world full garbage collection. By setting these properties, we tell CMS to initiate garbage collection based solely on fraction of heap usage instead of its internal heuristics. The fractional value must be chosen carefully. If it is too large, then it still might not prevent costly full garbage collections. If it is too small, then garbage collection will run too frequently, reducing the application’s overall throughput. In practice, we have found that a value of 70 works well, though it’s possible that individual deployments might need tuning. -Xms1G -Xmx1G -Xms sets the minimum heap size, and -Xmx sets the maximum heap size. When a JVM process starts, it initially allocates the minimum heap size. If the need for memory arises at runtime, it can go back to the OS and allocate more, up to the maximum heap size. Why do we recommend setting these to the same value? Those incremental memory allocations can cause unpredictable runtime performance. Additionally, starting with a low minimum heap size could cause instability as the need for a larger heap grows over time. If it needs to grow to the maximum, but the OS doesn’t actually have that amount of memory available to hand out, then the process could be subject to OutOfMemoryError conditions, or possibly even the OOM killer stopping the whole process on Linux. 
Setting minimum and maximum to the same value eliminates these potential sources of instability.

-XX:NewSize=128M -XX:MaxNewSize=128M

These control the minimum and maximum size of the new generation in the generational garbage collector architecture. We've found that for large heaps, the JVM defaults are too small. Our current recommendation is to set this to 1/8 to 1/6 of the total heap size. We recommend setting both to the same value for stability reasons, similar to the discussion of total heap size above.

-XX:PermSize=128M -XX:MaxPermSize=256M

These control the minimum and maximum size of the permanent generation, which is a portion of the heap dedicated to application metadata, such as Java class definitions and interned strings. If this is too small, then you'll see errors like:

java.lang.OutOfMemoryError: PermGen space

These settings are our current recommendation for what works well for our usage patterns.

GC Troubleshooting

Additionally, some of the settings are helpful for troubleshooting.

-verbose:gc

This enables verbose garbage collection logging. For example:

2014-10-31T10:37:54.313+0800: 2.211: [GC2014-10-31T10:37:54.313+0800: 2.211: [ParNew: 104960K->9207K(118016K), 0.0099670 secs] 104960K->9207K(1035520K), 0.0100520 secs] [Times: user=0.04 sys=0.01, real=0.01 secs]
2014-10-31T10:37:55.307+0800: 3.204: [GC2014-10-31T10:37:55.307+0800: 3.204: [ParNew: 114167K->13056K(118016K), 0.0388680 secs] 114167K->26862K(1035520K), 0.0389190 secs] [Times: user=0.18 sys=0.03, real=0.04 secs]
2014-10-31T11:44:31.748+0800: 3999.646: [Full GC2014-10-31T11:44:31.748+0800: 3999.646: [CMS: 13806K->22721K(917504K), 0.0854000 secs] 79787K->22721K(1035520K), [CMS Perm : 26935K->26918K(131072K)], 0.0854930 secs] [Times: user=0.08 sys=0.01, real=0.09 secs]

Each log line shows:

- Date/time of the garbage collection.
- Type of garbage collection (i.e. GC vs. Full GC).
- Which garbage collector ran (i.e. ParNew on the new generation vs. CMS for the full GC).
- <used heap size before> -> <used heap size after>.
- Timing information. Generally "real" is the most useful metric, because it's actual wall clock time.

Possible warning signs that heap or GC configuration is incorrect:

- Very frequent garbage collections. In particular, we expect full GC to be rare.
- Individual garbage collections take a long "real" time.
- Full GC happens frequently and does not significantly reduce the amount of used heap. This is often accompanied by OutOfMemoryError in the main logs, and likely means that the process needs a larger heap. For example, the NameNode may be trying to store more inodes than will fit into its current heap size.

-Xloggc:/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/gc.log-`date +'%Y%m%d%H%M'`

This controls the path to the garbage collection log file, with a date/timestamp at the end of the file name corresponding to process start time.

-XX:+PrintGCDetails

Logs even more details about garbage collection.

-XX:+PrintGCTimeStamps

Adds a timestamp to each line in the garbage collection log.

-XX:+PrintGCDateStamps

Adds the date to each line in the garbage collection log.

-XX:ErrorFile=/Users/chris/hadoop-deploy-trunk/hadoop-3.0.0-SNAPSHOT/logs/hs_err_pid%p.log

If the JVM process crashes, it logs a crash report to this path. This includes useful information like a list of the threads that were running at the time, and possibly a native code backtrace. If not specified, the default is hs_err_pid<pid>.log in the process working directory.
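If you want to mine these logs for the warning signs listed above, the minor-GC lines can be parsed mechanically. A rough sketch, with a regex tuned only to the sample format shown here (other JVM versions and GC flags format these lines differently):

```python
# Extract real pause time and heap before/after from ParNew minor-GC
# lines in the format of the sample log above.
import re

LINE = re.compile(
    r"\[ParNew: (\d+)K->(\d+)K\(\d+K\), [\d.]+ secs\] "
    r"(\d+)K->(\d+)K\(\d+K\), [\d.]+ secs\] "
    r"\[Times: user=[\d.]+ sys=[\d.]+, real=([\d.]+) secs\]")

def parse_minor_gc(line):
    m = LINE.search(line)
    if not m:
        return None
    _young_before, _young_after, heap_before, heap_after, real = m.groups()
    return {"heap_before_k": int(heap_before),
            "heap_after_k": int(heap_after),
            "real_secs": float(real)}

sample = ("2014-10-31T10:37:54.313+0800: 2.211: [GC2014-10-31T10:37:54.313"
          "+0800: 2.211: [ParNew: 104960K->9207K(118016K), 0.0099670 secs] "
          "104960K->9207K(1035520K), 0.0100520 secs] "
          "[Times: user=0.04 sys=0.01, real=0.01 secs]")
print(parse_minor_gc(sample))
# {'heap_before_k': 104960, 'heap_after_k': 9207, 'real_secs': 0.01}
```

A long "real_secs" value, or a heap_after that stays close to heap_before across many collections, would correspond to the warning signs described above.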
-XX:+HeapDumpOnOutOfMemoryError

Dumps the state of the heap to a file when OutOfMemoryError is thrown. This can be useful for post-mortem analysis to see exactly what kinds of objects were consuming a lot of the heap.

-XX:HeapDumpPath=<path>

If specified, this is the path to the file where the heap dump will be written. The default is java_pid<pid>.hprof in the process working directory.

Tools

JConsole

JConsole is a general-purpose monitoring tool for displaying JMX metrics. It offers a few helpful things for watching garbage collection, particularly if a currently running process was started without GC logging enabled and you want to get a glimpse of things without restarting the process. The Memory tab charts information on memory usage and can break it down by the different GC pools. There is also a "Perform GC" button here. That will force a stop-the-world full GC, so don't click it unless you really mean it. The VM Summary tab has a section that states the total amount of time spent in each kind of garbage collector. Keep in mind that this refers to the whole lifetime of the JVM process, so the time spent is ever-growing, and you'll see larger numbers for longer-lived processes.

JVisualVM

JVisualVM is a Java Virtual Machine monitoring, troubleshooting, and profiling tool. It offers multiple plugins, including the Visual GC plugin, which lets you interactively watch activity move across the various generations of the garbage collector.

Future Directions: The G1 Garbage Collector

The G1 Garbage Collector is a revamped garbage collection architecture that aims to replace CMS. I do not know of anyone who has done extensive testing of Hadoop with G1, and we do not test and certify with G1. However, we would like to investigate G1 further in the future and perform testing. One of its most significant benefits over CMS is that it is a compacting collector. (See the personal horror story earlier.)
Quoting the documentation:

The Garbage-First (G1) collector is a server-style garbage collector, targeted for multi-processor machines with large memories. It meets garbage collection (GC) pause time goals with a high probability, while achieving high throughput. The G1 garbage collector is fully supported in Oracle JDK 7 update 4 and later releases. The G1 collector is designed for applications that:

- Can operate concurrently with application threads like the CMS collector.
- Compact free space without lengthy GC induced pause times.
- Need more predictable GC pause durations.
- Do not want to sacrifice a lot of throughput performance.
- Do not require a much larger Java heap.

G1 is planned as the long term replacement for the Concurrent Mark-Sweep Collector (CMS). Comparing G1 with CMS, there are differences that make G1 a better solution. One difference is that G1 is a compacting collector. G1 compacts sufficiently to completely avoid the use of fine-grained free lists for allocation, and instead relies on regions. This considerably simplifies parts of the collector, and mostly eliminates potential fragmentation issues. Also, G1 offers more predictable garbage collection pauses than the CMS collector, and allows users to specify desired pause targets.

Personal horror story, part 2: After the 2-hour GC pause I discussed earlier, we investigated use of G1. It was considered beta at the time, but it was designed specifically to address our kind of allocation pattern. What we saw was that G1 out-performed every other garbage collection configuration we tried, including CMS. However, it also caused the JVM process to segfault approximately every 2 days. Based on stability concerns, we couldn't put this configuration in production. That was several years ago, so perhaps G1 has stabilized by now.
References

- Java Garbage Collection Basics: http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html
- Java SE 6 HotSpot[tm] Virtual Machine Garbage Collection Tuning: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
- Java Memory Management: http://javabook.compuware.com/content/memory/how-garbage-collection-works.aspx
- JVM performance optimization, Part 3: Garbage collection: http://www.javaworld.com/article/2078645/java-se/jvm-performance-optimization-part-3-garbage-collection.html
- OOM Killer: http://linux-mm.org/OOM_Killer
- Java HotSpot VM Options: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
- Using JConsole: http://docs.oracle.com/javase/6/docs/technotes/guides/management/jconsole.html
- jvisualvm - Java Virtual Machine Monitoring, Troubleshooting, and Profiling Tool: http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
- Getting Started with the G1 Garbage Collector: http://www.oracle.com/technetwork/tutorials/tutorials-1876574.html