JDK 5 was the default Java version in use, upgraded to JDK1.6.0_21-b06 today and Eclipse started tanking. Starts up fine but after a few minutes would fail with the error 'Internal plug-in action delegate error on creation. PermGen space'. Right, seemed to indicate that I wasn't allocating enough memory for permgen. But the same program works fine with the very same memory settings under JDK 5, so this was definitely an upgrade problem. So much for my support of auto updates :-|
The default setting in eclipse.ini was
--launcher.XXMaxPermSize
256m
-vmargs
-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
And 256m is generally sufficient for permgen, so what gives. Checked the configuration file to confim if the settings had taken effect (Click on Help -> About Eclipse -> Installation Details -> Configuration). Shows up the Xms and Xmx, but no MaxPermGen, hmmm... interesting.
eclipse.vmargs=-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
Googled a bit. The eclipse wiki indicated that it was a bug with Oracle/Sun JDK 1.6.0_21 (had to be my version, duhhh), being tracked as 319514. The bug link is a pretty interesting read.
Apparently, as part of Oracle's rebranding of Sun's products, the Company Name property of the java.exe file, the executable file containing Oracle's JRE for Windows, was updated from "Sun Microsystems" to "Oracle" in Java SE 6u21. Now on Windows, Sun VM is identified using the GetFileVersionInfo API, which reads the company name (present under version details) from jvm.dll or the java executable and compares it against the string "Sun Microsystems". Post update 21, the company name was "Oracle" and the launcher does not recognize this string, hence the -XX:MaxPermSize setting is not honoured.
The workarounds are
- Switch back to '1.6.0_20'
- Change the commandline for launching or add the following line after "-vmargs" to your Eclipse.ini file:-XX:MaxPermSize=256m
- For 32-bit Helios, download the fixed eclipse_1308.dll and place it into (eclipse_home)/plugins/org.eclipse.equinox.launcher.win32.win32.x86_1.1.0.v20100503
- Download and install any of the upgraded versions i.e. version 1.6.0_21-b07 or higher from the java site (alternative link is http://java.sun.com/javase/downloads/index.jsp). Make sure you have b07 or higher by running java -version.
Went with the last option and so far life's good. The MaxPermSize setting shows up under vm options
eclipse.vmargs=-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
-XX:MaxPermSize=256m
But got this bad feeling that when I move to JDK 7, there'll be a newer set of problems cropping up (From the Oracle site: In consideration to Eclipse and other potentially affected users, Oracle has restored the Windows Company Name property value to "Sun Microsystems" in further JDK 6 updates. This value will be changed back to "Oracle" in JDK 7.) Sigh, so much for compatibility.
Showing posts with label memory. Show all posts
Showing posts with label memory. Show all posts
Monday, December 27, 2010
Saturday, May 29, 2010
ConcurrentModeFailure on Full GCs - An analysis
Hit a problem on our production env. - summarized below.
We have 1 admin + 3 agent processes (all Java processes) running per box, across 16 boxes. Over the past couple of weeks, it has been observed that on the 6th or 7th day after appln restart, multiple full GCs occur (sometimes pushing 300+ in a day). This is accompanied by decreased throughput and increase in CPU usage. One point worth mentioning is that the framework we use has a heavy multi-threading mechanism.
Specs are
OS: SunOS Rel.5.10
Number of CPUs: 32
System: Sun Sparc
JDK Version: JDK 1.4.2_18
JVM settings in use are
A word about how the generations are structured before we progress.
The young generation is eden + 2 survivor spaces. An object when created is allocated in eden. At any given point of time, one SS is empty and serves as a destination of the next. During a minor collection, the living objects in eden are copied to the first SS. If all the living objects in eden do not fit in one SS, the remaining living objecs are promoted into the old generation. During the next minor collection, the living objects from eden and the first SS are copied to the second SS.
As more objects become tenured, the old object space begins to reach maximum occupancy. The GC algo in use is mark-compact collection. It scans all objects, marking all reachable objects, then compacts all remaining gaps of dead objects. The advantage this algo has over copy collect is that it requires less memory and eliminates memory fragmentation.
Also, the concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified value (by default 68%). It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.
The concurrent collector uses a background thread that runs concurrently with the application threads to enable both GC and object handling at the same time. The collector collects garbage in phases, two are stop-the-world phases, and four are concurrent i.e. they run with app threads. The phases are:
1. Initial-Mark-Phase (Stop-the-world) - Stop all Java threads, marking all the objects directly reachable from the roots, and restarting the Java threads.
2. Mark-Phase (Concurrent) - Start scanning from marked objects and transitively mark all objects reachable from the roots. The mutators are executing during the concurrent phases 2,3 and 5 below and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live.
3. Pre-cleaning Phase (Concurrent) - During Phase 2, mutators may be modifying objects. Any object that has been modified since the start of the concurrent marking phase (and which was not subsequently scanned during that phase) must be rescanned. This phase scans objects that have been modified concurrently. Due to continuing mutator activity, the scanning for modified cards may be done multiple times.
4. Final Checkpoint or Remark-Phase (Stop-the-world) - Also called Stop-the-world phase. With mutators stopped the final marking is done by scanning objects reachable from the roots and by scanning any modified objects. Note that after this phase there may be objects that have been marked but are no longer live. Such objects will survive the current collection but will be collected on the next collection.
5. Sweep-Phase (Concurrent) - Collects dead objects. The collection of ad dead object adds the space for the object to a free list for later allocation. Coalescing of dead objects may occur at this point. Note that live objects are not moved.
6. Reset-Phase (Concurrent) - Clears data structures in preparation for the next collection.
Normally the concurrent low pause collector does not copy or compact the live objects. In 1.4.2, if fragmentation in the tenured generation becomes a problem, a compaction of the tenured generation will be done although not concurrently.
Observations
1. Continuous CMS runs
The CMS logs indicated that CMS-initial-mark i.e. beginning of tenured generation collection triggers when the tenured generation space reaches 54% of capacity ( could this be triggered when PermGen space is full? – Update: Understanding is that when using CMS, perm gen collection is turned off by default and a serial Full FC collects the perm gen). The sweep is unable to clear tenured space, hence this cycle is repeated number of times.
2. Concurrent mode failure
Full GCs have resulted in concurrent mode failure (snippet from the log below). Here PermGen had touched max but during full GC, was cleared [CMS Perm : 131009K->27464K(131072K)]. However, the tenured space which was at 470672 has only fallen to 409399 : [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs]
Snippet from the log indicating 'Concurrent mode failure'
3. Promotion Failed
This shows that a ParNew collection was requested, but was not attempted. The reason is that it was estimated that there was not enough space in the CMS generation to promote the worst-case surviving young generation objects. This failure is termed a "full promotion guarantee failure".
As a result, the concurrent mode of CMS is interrupted and a full GC invoked.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp)
Snippet from the log indicating 'Promotion Failed'
Recommendations
1. As per the java sun forums, the concurrent mode failure could occur for 2 reasons (listed below).
• If the collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up – In our case, tenured generation available is 1048576 and it is less than half full
• If allocation cannot be satisfied with the available free space blocks in the tenured generation.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp )
If the concurrent collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. This suggests that even if the logs indicate concurrent mode failure, the collection should have completed. But in this case, the tenured space could not be cleared – indicates that all the objects in tenured space are live objects.
2. Try each of the options listed below in turn.
• Try a larger total heap and/or smaller young generation. But often it just delays the problem.
• Make the application do a full, compacting collection at a time which will not disturb users. If the application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night. That will compact the heap and we can hopefully go another day without hitting the fragmentation problem.
• If most of the data in the tenured generation is read in when the application first starts up and a System.gc() can be done after complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Not true in our case.
• Start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. Try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalese adjacent free blocks so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction= where NNN is the percentage of the tenured generation that is in use above which a concurrent collection is started. This will increase the overall time spent doing GC but may avoid the fragmentation problem. But his will be more effective with 5.0 because a single contiguous chunk of space is not required for promotions.
(Source: http://blogs.sun.com/jonthecollector/entry/when_the_sum_of_the)
This post will be updated based on results from further tests shortly.
We have 1 admin + 3 agent processes (all Java processes) running per box, across 16 boxes. Over the past couple of weeks, it has been observed that on the 6th or 7th day after appln restart, multiple full GCs occur (sometimes pushing 300+ in a day). This is accompanied by decreased throughput and increase in CPU usage. One point worth mentioning is that the framework we use has a heavy multi-threading mechanism.
Specs are
OS: SunOS Rel.5.10
Number of CPUs: 32
System: Sun Sparc
JDK Version: JDK 1.4.2_18
JVM settings in use are
Performance Options -Xmx1536m -Xms1536m Min and Max Heap Size - Excludes PermGen space. Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine. -XX:NewSize=512m -XX:MaxNewSize=512m Min and Max Size of New Generation -XX:PermSize=64m -XX:MaxPermSize=128m Min and Max Size of Permanent Generation which holds the JVM's class and method objects. It is recommended to set PermSize equal to MaxPermSize to avoid heap resizing when size of PermGen increases. This is one change we plan to do i.e. Set -XX:PermSize to 128m. -XX:SurvivorRatio=8 Young generation space is divided into eden and 2 SSs The size of each is calculated as Eden = NewSize - (NewSize/(SurvivorRatio + 2)) * 2) = 409.6MB From space = (NewSize - Eden) / 2 = 51.2MB To space = (NewSize - Eden) / 2 = 51.2MB Old gen space = 1536 - 512 = 1014MB -XX:TargetSurvivorRatio=90 Tenuring threshold that is used to copy the tenured objects in the young generation. In this case, allows 90% of the survivor spaces to be occupied instead of the default 50% -XX:ParallelGCThreads=4 Default value is equal to number of CPUs. Optimised in our case to 4 -XX:MaxTenuringThreshold=10 The number of times an oject is aged in the young generation Debugging Options -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime Behavioral Options -XX:+UseConcMarkSweepGC Use concurrent mark-sweep collection for the old generation -XX:+DisableExplicitGC Disable calls to System.gc() -XX:+UseParNewGC Use multiple threads for parallel young generation collection -XX:-UseAdaptiveSizePolicy Specifically disable UseAdaptiveSizePolicy since it should be used with UseParallelGC, not UseParNewGC -Xloggc:/log/file1.gc Log output file name
A word about how the generations are structured before we progress.
The young generation is eden + 2 survivor spaces. An object when created is allocated in eden. At any given point of time, one SS is empty and serves as a destination of the next. During a minor collection, the living objects in eden are copied to the first SS. If all the living objects in eden do not fit in one SS, the remaining living objecs are promoted into the old generation. During the next minor collection, the living objects from eden and the first SS are copied to the second SS.
As more objects become tenured, the old object space begins to reach maximum occupancy. The GC algo in use is mark-compact collection. It scans all objects, marking all reachable objects, then compacts all remaining gaps of dead objects. The advantage this algo has over copy collect is that it requires less memory and eliminates memory fragmentation.
Also, the concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified value (by default 68%). It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.
The concurrent collector uses a background thread that runs concurrently with the application threads to enable both GC and object handling at the same time. The collector collects garbage in phases, two are stop-the-world phases, and four are concurrent i.e. they run with app threads. The phases are:
1. Initial-Mark-Phase (Stop-the-world) - Stop all Java threads, marking all the objects directly reachable from the roots, and restarting the Java threads.
2. Mark-Phase (Concurrent) - Start scanning from marked objects and transitively mark all objects reachable from the roots. The mutators are executing during the concurrent phases 2,3 and 5 below and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live.
3. Pre-cleaning Phase (Concurrent) - During Phase 2, mutators may be modifying objects. Any object that has been modified since the start of the concurrent marking phase (and which was not subsequently scanned during that phase) must be rescanned. This phase scans objects that have been modified concurrently. Due to continuing mutator activity, the scanning for modified cards may be done multiple times.
4. Final Checkpoint or Remark-Phase (Stop-the-world) - Also called Stop-the-world phase. With mutators stopped the final marking is done by scanning objects reachable from the roots and by scanning any modified objects. Note that after this phase there may be objects that have been marked but are no longer live. Such objects will survive the current collection but will be collected on the next collection.
5. Sweep-Phase (Concurrent) - Collects dead objects. The collection of ad dead object adds the space for the object to a free list for later allocation. Coalescing of dead objects may occur at this point. Note that live objects are not moved.
6. Reset-Phase (Concurrent) - Clears data structures in preparation for the next collection.
Normally the concurrent low pause collector does not copy or compact the live objects. In 1.4.2, if fragmentation in the tenured generation becomes a problem, a compaction of the tenured generation will be done although not concurrently.
Observations
1. Continuous CMS runs
The CMS logs indicated that CMS-initial-mark i.e. beginning of tenured generation collection triggers when the tenured generation space reaches 54% of capacity (
2. Concurrent mode failure
Full GCs have resulted in concurrent mode failure (snippet from the log below). Here PermGen had touched max but during full GC, was cleared [CMS Perm : 131009K->27464K(131072K)]. However, the tenured space which was at 470672 has only fallen to 409399 : [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs]
Snippet from the log indicating 'Concurrent mode failure'
Application time: 21.5362235 seconds 198867.489: [Full GC 198867.489: [ParNew Desired survivor size 48306584 bytes, new threshold 10 (max 10) - age 1: 2517240 bytes, 2517240 total - age 2: 2457600 bytes, 4974840 total - age 3: 336264 bytes, 5311104 total - age 4: 418808 bytes, 5729912 total - age 5: 3270656 bytes, 9000568 total - age 6: 2665920 bytes, 11666488 total - age 7: 2297312 bytes, 13963800 total - age 8: 951136 bytes, 14914936 total - age 9: 1267616 bytes, 16182552 total - age 10: 469848 bytes, 16652400 total : 305816K->16323K(471872K), 0.1235504 secs]198867.613: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs] 775463K->409399K(1520448K), [CMS Perm : 131009K->27464K(131072K)], 26.9079389 secs] Total time for which application threads were stopped: 26.9141975 seconds
3. Promotion Failed
This shows that a ParNew collection was requested, but was not attempted. The reason is that it was estimated that there was not enough space in the CMS generation to promote the worst-case surviving young generation objects. This failure is termed a "full promotion guarantee failure".
As a result, the concurrent mode of CMS is interrupted and a full GC invoked.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp)
Snippet from the log indicating 'Promotion Failed'
803895.914: [GC 803895.914: [ParNew (promotion failed) Desired survivor size 48306584 bytes, new threshold 7 (max 10) - age 1: 18086800 bytes, 18086800 total - age 2: 1270960 bytes, 19357760 total - age 3: 4698768 bytes, 24056528 total - age 4: 19146288 bytes, 43202816 total - age 5: 2225224 bytes, 45428040 total - age 6: 2812752 bytes, 48240792 total - age 7: 1029632 bytes, 49270424 total - age 8: 805968 bytes, 50076392 total - age 9: 930080 bytes, 51006472 total - age 10: 1159576 bytes, 52166048 total : 462419K ->462419K(471872K), 0.5391725 secs]803896.453: [CMS803899.534: [CMS-concurrent-mark: 13.208/14.159 secs] (concurrent mode failure): 1030597K->1048536K(1048576K), 35.8992699 secs] 1491791K->1055004K(1520448K), 36.4391620 secs] Total time for which application threads were stopped: 36.4773697 seconds 803936.523: [Full GC 803936.523: [ParNew: 419456K->419456K(471872K), 0.0000468 secs]803936.523: [CMS803946.382: [CMS-concurrent-mark: 10.841/10.968 secs] (concurrent mode failure): 1048536K->1044407K(1048576K), 35.0106907 secs] 1467992K->1044407K(1520448K), [CMS Perm : 28343K->28337K(131072K)], 35.0114078 secs] Total time for which application threads were stopped: 35.1389167 seconds
Recommendations
1. As per the java sun forums, the concurrent mode failure could occur for 2 reasons (listed below).
• If the collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up – In our case, tenured generation available is 1048576 and it is less than half full
• If allocation cannot be satisfied with the available free space blocks in the tenured generation.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp )
If the concurrent collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. This suggests that even if the logs indicate concurrent mode failure, the collection should have completed. But in this case, the tenured space could not be cleared – indicates that all the objects in tenured space are live objects.
2. Try each of the options listed below in turn.
• Try a larger total heap and/or smaller young generation. But often it just delays the problem.
• Make the application do a full, compacting collection at a time which will not disturb users. If the application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night. That will compact the heap and we can hopefully go another day without hitting the fragmentation problem.
• If most of the data in the tenured generation is read in when the application first starts up and a System.gc() can be done after complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Not true in our case.
• Start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. Try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalese adjacent free blocks so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction= where NNN is the percentage of the tenured generation that is in use above which a concurrent collection is started. This will increase the overall time spent doing GC but may avoid the fragmentation problem. But his will be more effective with 5.0 because a single contiguous chunk of space is not required for promotions.
(Source: http://blogs.sun.com/jonthecollector/entry/when_the_sum_of_the)
This post will be updated based on results from further tests shortly.
Subscribe to:
Posts (Atom)