Saturday, May 29, 2010

ConcurrentModeFailure on Full GCs - An analysis

Hit a problem on our production environment - summarized below.

We have 1 admin + 3 agent processes (all Java processes) running per box, across 16 boxes. Over the past couple of weeks, it has been observed that on the 6th or 7th day after application restart, multiple full GCs occur (sometimes pushing 300+ in a day). This is accompanied by decreased throughput and an increase in CPU usage. One point worth mentioning is that the framework we use relies heavily on multi-threading.

Specs are
OS: SunOS Rel.5.10
Number of CPUs: 32
System: Sun Sparc
JDK Version: JDK 1.4.2_18

JVM settings in use are
Performance Options

-Xmx1536m -Xms1536m
Min and Max Heap Size - Excludes PermGen space. Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine.

-XX:NewSize=512m -XX:MaxNewSize=512m
Min and Max Size of New Generation

-XX:PermSize=64m -XX:MaxPermSize=128m
Min and Max Size of the Permanent Generation, which holds the JVM's class and method objects. It is recommended to set PermSize equal to MaxPermSize to avoid resizing of the PermGen as it grows. This is one change we plan to make, i.e. set -XX:PermSize to 128m.

-XX:SurvivorRatio=8 
Young generation space is divided into eden and 2 survivor spaces (SS)
The size of each is calculated as
Eden = NewSize - (NewSize / (SurvivorRatio + 2)) * 2 = 409.6MB
From space = (NewSize - Eden) / 2 = 51.2MB
To space = (NewSize - Eden) / 2 = 51.2MB
Old gen space = 1536 - 512 = 1024MB
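
As a quick sanity check of this arithmetic, a throwaway Java snippet (the constants simply mirror the flags listed above):

public class GenSizes {
    public static void main(String[] args) {
        double totalHeap = 1536.0;     // -Xms / -Xmx, in MB
        double newSize = 512.0;        // -XX:NewSize / -XX:MaxNewSize, in MB
        int survivorRatio = 8;         // -XX:SurvivorRatio
        double survivor = newSize / (survivorRatio + 2);   // each survivor space
        double eden = newSize - 2 * survivor;
        System.out.println("Eden       = " + eden + " MB");                  // 409.6
        System.out.println("From space = " + survivor + " MB");              // 51.2
        System.out.println("To space   = " + survivor + " MB");              // 51.2
        System.out.println("Old gen    = " + (totalHeap - newSize) + " MB"); // 1024.0
    }
}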

-XX:TargetSurvivorRatio=90
Desired survivor space occupancy after a minor collection, used when computing the tenuring threshold. In this case it allows 90% of the survivor space to be occupied instead of the default 50%.

-XX:ParallelGCThreads=4
The default value is equal to the number of CPUs; tuned down in our case to 4.

-XX:MaxTenuringThreshold=10 
The maximum number of times an object is aged (copied between survivor spaces) in the young generation before it is promoted to the old generation.

Debugging Options

-XX:+PrintGCDetails    
-XX:+PrintGCTimeStamps 
-XX:+PrintTenuringDistribution 
-XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCApplicationStoppedTime 

Behavioral Options

-XX:+UseConcMarkSweepGC
Use concurrent mark-sweep collection for the old generation

-XX:+DisableExplicitGC    
Disable calls to System.gc()

-XX:+UseParNewGC          
Use multiple threads for parallel young generation collection

-XX:-UseAdaptiveSizePolicy 
Specifically disable UseAdaptiveSizePolicy since it should be used with UseParallelGC, not UseParNewGC

-Xloggc:/log/file1.gc
Log output file name
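
Putting the options together, the startup line for each process looks roughly like the following (the classpath and main class are placeholders, not our actual values):

java -Xmx1536m -Xms1536m -XX:NewSize=512m -XX:MaxNewSize=512m \
     -XX:PermSize=64m -XX:MaxPermSize=128m -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 \
     -XX:ParallelGCThreads=4 -XX:MaxTenuringThreshold=10 \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC -XX:-UseAdaptiveSizePolicy \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
     -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime \
     -Xloggc:/log/file1.gc -cp <classpath> <MainClass>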


A word about how the generations are structured before we progress.

The young generation is eden + 2 survivor spaces. An object, when created, is allocated in eden. At any given point in time, one survivor space is empty and serves as the destination of the next copy. During a minor collection, the live objects in eden are copied to the first survivor space. If all the live objects in eden do not fit into one survivor space, the remaining live objects are promoted into the old generation. During the next minor collection, the live objects from eden and the first survivor space are copied to the second survivor space.

As more objects become tenured, the old generation begins to reach maximum occupancy. The full-collection algorithm used there is mark-compact: it scans all objects, marks the reachable ones, then compacts the live objects together, squeezing out the gaps left by dead objects. The advantage this algorithm has over copying collection is that it requires less memory and eliminates memory fragmentation.

Also, the concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified value (by default 68%). It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.

The concurrent collector uses a background thread that runs concurrently with the application threads, allowing garbage collection and application work to proceed at the same time. The collector collects garbage in phases: two are stop-the-world phases, and four are concurrent, i.e. they run alongside the application threads. The phases are:

1. Initial-Mark-Phase (Stop-the-world) - Stops all Java threads, marks all the objects directly reachable from the roots, and restarts the Java threads.
2. Mark-Phase (Concurrent) - Start scanning from marked objects and transitively mark all objects reachable from the roots. The mutators are executing during the concurrent phases 2,3 and 5 below and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live.
3. Pre-cleaning Phase (Concurrent) - During Phase 2, mutators may be modifying objects. Any object that has been modified since the start of the concurrent marking phase (and which was not subsequently scanned during that phase) must be rescanned. This phase scans objects that have been modified concurrently. Due to continuing mutator activity, the scanning for modified cards may be done multiple times.
4. Final Checkpoint or Remark-Phase (Stop-the-world) - With mutators stopped, the final marking is done by scanning objects reachable from the roots and by rescanning any modified objects. Note that after this phase there may be objects that have been marked but are no longer live. Such objects will survive the current collection but will be collected in the next collection.
5. Sweep-Phase (Concurrent) - Collects dead objects. Collecting a dead object adds its space to a free list for later allocation. Coalescing of adjacent dead objects may occur at this point. Note that live objects are not moved.
6. Reset-Phase (Concurrent) - Clears data structures in preparation for the next collection.

Normally the concurrent low pause collector does not copy or compact the live objects. In 1.4.2, if fragmentation in the tenured generation becomes a problem, a compaction of the tenured generation will be done although not concurrently.

Observations

1. Continuous CMS runs
The CMS logs indicated that CMS-initial-mark, i.e. the beginning of a tenured generation collection, triggers when the tenured generation reaches 54% of capacity (could this be triggered when PermGen space is full? Update: the understanding is that when using CMS, perm gen collection is turned off by default and a serial full GC collects the perm gen). The sweep is unable to clear tenured space, hence this cycle is repeated a number of times.

2. Concurrent mode failure
Full GCs have resulted in concurrent mode failure (snippet from the log below). Here PermGen had touched its max but was cleared during the full GC [CMS Perm : 131009K->27464K(131072K)]. However, the tenured space, which was at 470672K, has only fallen to 409399K: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs]

Snippet from the log indicating 'Concurrent mode failure'
Application time: 21.5362235 seconds
198867.489: [Full GC 198867.489: [ParNew
Desired survivor size 48306584 bytes, new threshold 10 (max 10)
- age   1:    2517240 bytes,    2517240 total
- age   2:    2457600 bytes,    4974840 total
- age   3:     336264 bytes,    5311104 total
- age   4:     418808 bytes,    5729912 total
- age   5:    3270656 bytes,    9000568 total
- age   6:    2665920 bytes,   11666488 total
- age   7:    2297312 bytes,   13963800 total
- age   8:     951136 bytes,   14914936 total
- age   9:    1267616 bytes,   16182552 total
- age  10:     469848 bytes,   16652400 total
: 305816K->16323K(471872K), 0.1235504 secs]198867.613: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs] 775463K->409399K(1520448K), [CMS Perm : 131009K->27464K(131072K)], 26.9079389 secs]
Total time for which application threads were stopped: 26.9141975 seconds

3. Promotion Failed
This shows that a ParNew collection was requested, but was not attempted. The reason is that it was estimated that there was not enough space in the CMS generation to promote the worst-case surviving young generation objects. This failure is termed a "full promotion guarantee failure".
As a result, the concurrent mode of CMS is interrupted and a full GC invoked.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp)

Snippet from the log indicating 'Promotion Failed'
803895.914: [GC 803895.914: [ParNew (promotion failed)
Desired survivor size 48306584 bytes, new threshold 7 (max 10)
- age   1:   18086800 bytes,   18086800 total
- age   2:    1270960 bytes,   19357760 total
- age   3:    4698768 bytes,   24056528 total
- age   4:   19146288 bytes,   43202816 total
- age   5:    2225224 bytes,   45428040 total
- age   6:    2812752 bytes,   48240792 total
- age   7:    1029632 bytes,   49270424 total
- age   8:     805968 bytes,   50076392 total
- age   9:     930080 bytes,   51006472 total
- age  10:    1159576 bytes,   52166048 total
: 462419K ->462419K(471872K), 0.5391725 secs]803896.453: [CMS803899.534: [CMS-concurrent-mark: 13.208/14.159 secs]
 (concurrent mode failure): 1030597K->1048536K(1048576K), 35.8992699 secs] 1491791K->1055004K(1520448K), 36.4391620 secs]
Total time for which application threads were stopped: 36.4773697 seconds

803936.523: [Full GC 803936.523: [ParNew: 419456K->419456K(471872K), 0.0000468 secs]803936.523: [CMS803946.382: [CMS-concurrent-mark: 10.841/10.968 secs]
 (concurrent mode failure): 1048536K->1044407K(1048576K), 35.0106907 secs] 1467992K->1044407K(1520448K), [CMS Perm : 28343K->28337K(131072K)], 35.0114078 secs]
Total time for which application threads were stopped: 35.1389167 seconds

Recommendations

1. As per the Java forums on Sun's site, a concurrent mode failure can occur for 2 reasons (listed below).

• If the collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up - in our case, the tenured generation capacity is 1048576K and it is less than half full

• If allocation cannot be satisfied with the available free space blocks in the tenured generation.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp )

If the concurrent collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. This suggests that even if the logs indicate concurrent mode failure, the collection should still have completed. But in this case, the tenured space could not be cleared, which indicates that the objects in the tenured space are live objects.

2. Try each of the options listed below in turn.

• Try a larger total heap and/or smaller young generation. But often it just delays the problem.

• Make the application do a full, compacting collection at a time which will not disturb users. If the application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night (note that our -XX:+DisableExplicitGC setting would have to be dropped for the call to take effect). That will compact the heap and we can hopefully go another day without hitting the fragmentation problem. A minimal sketch of such a scheduled collection appears after this list.

• If most of the data in the tenured generation is read in when the application first starts up and a System.gc() can be done after complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Not true in our case.

• Start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. Try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalesce adjacent free blocks, so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction=NNN, where NNN is the percentage of the tenured generation that is in use above which a concurrent collection is started. This will increase the overall time spent doing GC but may avoid the fragmentation problem. This will be more effective with 5.0 because a single contiguous chunk of space is not required for promotions.
(Source: http://blogs.sun.com/jonthecollector/entry/when_the_sum_of_the)
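
A minimal sketch of the scheduled System.gc() idea from the second bullet above, using java.util.Timer. The 3 AM window is an assumption, and as noted, -XX:+DisableExplicitGC would have to be removed for the call to have any effect:

import java.util.Calendar;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

public class NightlyCompaction {
    public static void schedule() {
        Calendar cal = Calendar.getInstance();
        cal.set(Calendar.HOUR_OF_DAY, 3);   // assumed quiet period
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        if (cal.getTime().before(new Date())) {
            cal.add(Calendar.DAY_OF_MONTH, 1);   // first run is tomorrow if 3 AM has already passed
        }
        Timer timer = new Timer(true);           // daemon thread, does not block shutdown
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                System.gc();                     // triggers a full, compacting collection
            }
        }, cal.getTime(), 24L * 60 * 60 * 1000); // repeat every 24 hours
    }
}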

This post will be updated based on results from further tests shortly.

Analyze heap dumps

The first step towards analyzing a memory leak in Java is to pull a heap dump while full GCs are in progress and then run the dump through a memory analyzer to uncover the leaks. Reasons for continuous full GCs will be covered in the next post.
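
To rehearse the steps below before pointing them at production, a toy leak like this one (purely illustrative, nothing to do with our application) gives the tools something obvious to find:

import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    private static final List RETAINED = new ArrayList();   // static root keeps everything reachable

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            RETAINED.add(new byte[64 * 1024]);   // 64 KB per iteration, never released
            Thread.sleep(50);                    // leaves time to attach jmap or send kill -3
        }
    }
}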

To Generate Heap Dump

1. The first and preferred option is to modify the JVM settings to include the following 2 options. Ensure that your production startup scripts have these options in place.
-XX:+HeapDumpOnOutOfMemoryError
-XX:+HeapDumpOnCtrlBreak
Type kill -3 <pid> to generate the hprof file in the bin folder. The two options listed above are non-intrusive: a heap dump is only generated either on an OutOfMemoryError or on demand. This approach is also considerably lighter on resources than jmap.

2. Use jmap - Java 5 (and later releases of JDK 1.4, starting with 1.4.2_18) ships with a tool called jmap. It attaches itself to the JVM and obtains heap layout information, class histograms and complete heap snapshots. Heap layout information is retrieved instantly, with no impact on the running application. However, taking histograms and heap snapshots takes longer and also affects the memory / CPU of the application, resulting in either slow response times or complete stalling of the application. So schedule this activity when the system load is low.

Commands to be used
ps -ef | grep <username> to get the list of PIDs for all the processes running on a system
jmap <pid> > out.txt - This command will print out the same information as pmap
jmap -heap <pid> >> out.txt - Prints out java heap summary
jmap -heap:format=b <pid> >> out.txt - Prints out java heap in hprof binary format (generally called heap.bin) in the current directory
jmap -histo <pid> >> out.txt - Prints out histogram of java object heap

Next run the generated heap dump past one of the following tools

1. Eclipse Memory Analyzer
Download MAT from http://www.eclipse.org/mat/downloads.php. Open the hprof dump generated using jmap or HeapDumpOnCtrlBreak. MAT parses the dump and generates a visual representation which gives a breakdown in terms of leak suspects, components and consumers.

On opening the heap dump, you see an overview page with a chart of the biggest objects, and in many cases you will notice a single huge object already here. Click on the "Leak Suspects" link of the Overview and drill down through the HTML report produced.

2. jhat, which ships with Java 6 (Mustang) - To run the tool, use the command
jhat -J-mx512m -stack false heap.bin

Axis 2 and Transfer-Encoding Chunked

We upgraded one of our clients from Axis 1.2 to Axis 2, and it was observed that calls to the downstream system started failing. We checked with the backend team and they confirmed that our IP address showed up in their access logs, so the handshake had succeeded, but the request XML was not received at their end.

The first step was to plug in TCPMon and trace the data sent over the wire. (TCPMon is a utility that allows the user to monitor the messages passed along in a TCP-based conversation.) If Eclipse is your preferred IDE, download the TCPMon Eclipse plugin, copy it to the ECLIPSE_HOME/plugins directory, then go to Window -> Show View -> Other -> TCP Monitor -> TCP Monitor to open the TCPMon view. Change the client's endpoint to the TCP listener endpoint and configure the actual downstream endpoint in TCPMon.

Logs indicated that Transfer-Encoding was being sent as "chunked". Now this was different from the Axis 1.2 logs which had Content-Length set in the HTTP header. Reading up the Axis2 documentation threw some more light on the topic.

By default Apache Axis 1.2 uses 'org.apache.axis.transport.http.HTTPSender', which in turn uses HTTP/1.0, hence the default behaviour is to set Content-Length. But with Axis 2, which uses HTTP/1.1, the default is Transfer-Encoding set to "chunked". When the data is sent in chunked format, some special characters are introduced in the payload (refer to the snippet below). If the downstream system is a legacy system which does not support HTTP 1.1 (as ours is) and is unable to parse the message due to these additional characters, it returns a SOAP fault exception.

The solution to this is to disable HTTP chunking as follows.
options.setProperty(org.apache.axis2.transport.http.HTTPConstants.CHUNKED, Boolean.FALSE);
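
For context, here is roughly where that property sits in a bare-bones Axis2 client. This is a sketch, not our actual client code; the endpoint URL and class name are placeholders:

import org.apache.axis2.addressing.EndpointReference;
import org.apache.axis2.client.Options;
import org.apache.axis2.client.ServiceClient;
import org.apache.axis2.transport.http.HTTPConstants;

public class DownstreamClientFactory {
    public static ServiceClient newClient() throws Exception {
        ServiceClient client = new ServiceClient();
        Options options = new Options();
        options.setTo(new EndpointReference("http://127.0.0.1:8086/test"));  // placeholder endpoint
        // Fall back to Content-Length so the legacy backend can parse the request
        options.setProperty(HTTPConstants.CHUNKED, Boolean.FALSE);
        client.setOptions(options);
        return client;
    }
}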

That said, if this issue is caught in the development stage, check if the team hosting the web service can support HTTP Chunking, since it is the preferred approach, especially when dealing with large messages like messages with attachments (MTOM).


HTTP Header using Axis 2
POST /test HTTP/1.1
Content-Type: text/xml; charset=UTF-8
SOAPAction: "http://test.com/wsdl/07/08/10#testDiagnosticRequest"
User-Agent: Axis2
Host: 127.0.0.1:8086
Transfer-Encoding: chunked
1aa
[chunked XML payload stripped from this snippet]
0

HTTP Header using Axis 1.2
Content-Type: text/xml; charset=utf-8
Accept: application/soap+xml, application/dime, multipart/related, text/*
User-Agent: Axis/1.2.1
Host: 127.0.0.1:8086
Cache-Control: no-cache
Pragma: no-cache
SOAPAction: "urn:GAT__Inbox/OpCreate"
Content-Length: 6698


...