This one is for the record - an issue I had faced and meant to document, so here goes.
From our app (the Java version used is 1.4.2_19), we connect to a GSS load-balanced URL which operates in round-robin fashion. For example, if the URL is http://www.twitter.com, which maps to 128.242.240.148/.116/.20, the app should get a different IP each time, or at least most of the time. But it sticks to the first IP it discovers. The reason is Java's DNS cache.
Java caches DNS lookups after the app first resolves a URL, so that any subsequent operation uses the cached result - in our case, if www.twitter.com resolves to 128.242.240.20 the first time, that value is stored in the DNS cache and every time twitter is accessed, the IP it resolves to remains the same.
Sounds daft, right? Not quite. This cache is maintained in the InetAddress class - there is one each for successful and unsuccessful host name resolutions. Positive caching is there to guard against DNS spoofing attacks, while negative caching is used to improve performance. By default, the results of positive host name resolutions are cached forever, because there is no general rule to decide when it is safe to remove cache entries. The result of an unsuccessful host name resolution is cached for a very short period of time (10 seconds) to improve performance. Refer to the InetAddress javadoc for more on the above.
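To see the caching in isolation, here is a minimal sketch (the hostname is just the example from above) - with the default positive TTL of -1, every iteration prints the same address:

import java.net.InetAddress;

public class DnsCacheDemo {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 5; i++) {
            // The first lookup populates the InetAddress cache; with the
            // default TTL of -1 every subsequent lookup reuses that entry.
            InetAddress addr = InetAddress.getByName("www.twitter.com");
            System.out.println(addr.getHostAddress());
            Thread.sleep(2000);
        }
    }
}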
In cases like ours, where we don't want DNS entries cached forever, the solution is to set the TTL value for positive caching to a very low value. This much is well documented. There are three ways to set it.
- Set the value from the command line using -Dnetworkaddress.cache.ttl=x, where x is the number of seconds for which the value is to be cached.
- Use the Sun property setting. Works the same as above, only use -Dsun.net.inetaddr.ttl=x in lieu of -Dnetworkaddress.cache.ttl.
- Modify the java.security file (path is $JAVA_HOME/jre/lib/security/java.security) to set the value of networkaddress.cache.ttl to a low value. By default the setting is -1 (cache forever). Note that this would affect every application which uses this JDK.
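There is also a programmatic variant of the third option - networkaddress.cache.ttl is a security property, so it can be set through java.security.Security before the first lookup. A minimal sketch (the 60-second value is just an example):

import java.net.InetAddress;
import java.security.Security;

public class DnsTtlConfig {
    public static void main(String[] args) throws Exception {
        // Set before the first name lookup so the policy is picked up.
        Security.setProperty("networkaddress.cache.ttl", "60");
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
        System.out.println(InetAddress.getByName("www.twitter.com").getHostAddress());
    }
}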
Tried option 1 - added the -Dnetworkaddress.cache.ttl=60 setting to the WebLogic startup file, which should have expired cached entries after 60 seconds. But it did not have the intended effect. There is a related bug, 6247501, but that's Windows-specific and we were on Unix, so it shouldn't have affected us. Looked through the InetAddress caching mechanism, but could not quite figure out why it failed. Gave up after an hour.
Modified the JVM to use the sun setting -Dsun.net.inetaddr.ttl=60 and restarted the domain. Worked like a charm.
Can only conclude that -Dnetworkaddress.cache.ttl does not work from the command line but only from within the java.security file. If you do not wish to change the java.security setting, which would be the case if multiple domains reference the same JDK, a safer option is to use the -Dsun.net.inetaddr.ttl setting from the command line.
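For the record, the change on our side was a single line in the WebLogic startup script - a minimal sketch, assuming the stock startWebLogic.sh which passes JAVA_OPTIONS through to the java command:

JAVA_OPTIONS="${JAVA_OPTIONS} -Dsun.net.inetaddr.ttl=60"
export JAVA_OPTIONS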
Happy caching.
Monday, December 27, 2010
Eclipse Memory Issues
JDK 5 was the default Java version in use; upgraded to JDK 1.6.0_21-b06 today and Eclipse started tanking. It starts up fine but after a few minutes fails with the error 'Internal plug-in action delegate error on creation. PermGen space'. Right, that seemed to indicate I wasn't allocating enough memory for permgen. But the same setup works fine with the very same memory settings under JDK 5, so this was definitely an upgrade problem. So much for my support of auto updates :-|
The default setting in eclipse.ini was
--launcher.XXMaxPermSize
256m
-vmargs
-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
And 256m is generally sufficient for permgen, so what gives? Checked the configuration to confirm the settings had taken effect (Help -> About Eclipse -> Installation Details -> Configuration). It shows the Xms and Xmx, but no MaxPermSize, hmmm... interesting.
eclipse.vmargs=-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
Googled a bit. The Eclipse wiki indicated that it was a bug with the Oracle/Sun JDK 1.6.0_21 (had to be my version, duhhh), tracked as bug 319514. The bug link is a pretty interesting read.
Apparently, as part of Oracle's rebranding of Sun's products, the Company Name property of java.exe, the launcher executable in Oracle's JRE for Windows, was updated from "Sun Microsystems" to "Oracle" in Java SE 6u21. On Windows, the Eclipse launcher identifies a Sun VM using the GetFileVersionInfo API, which reads the company name (present under the version details) from jvm.dll or the java executable and compares it against the string "Sun Microsystems". Post update 21, the company name is "Oracle"; the launcher does not recognize this string, hence the -XX:MaxPermSize setting is not honoured.
The workarounds are
- Switch back to '1.6.0_20'
- Change the command line used for launching, or add the following line after "-vmargs" in your eclipse.ini file: -XX:MaxPermSize=256m
- For 32-bit Helios, download the fixed eclipse_1308.dll and place it into (eclipse_home)/plugins/org.eclipse.equinox.launcher.win32.win32.x86_1.1.0.v20100503
- Download and install one of the updated builds, i.e. version 1.6.0_21-b07 or higher, from the Java site (alternative link is http://java.sun.com/javase/downloads/index.jsp). Make sure you have b07 or higher by running java -version (or programmatically, as in the sketch after this list).
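For a programmatic double-check of the exact build the IDE (or any app) is running on, the standard system properties carry the build suffix - a small sketch:

public class JdkBuildCheck {
    public static void main(String[] args) {
        // java.runtime.version includes the build suffix, e.g. 1.6.0_21-b07
        System.out.println("Runtime: " + System.getProperty("java.runtime.version"));
        // java.vm.vendor reports who built the VM; note the Eclipse launcher
        // itself reads the Windows file properties, not this property.
        System.out.println("Vendor : " + System.getProperty("java.vm.vendor"));
    }
}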
Went with the last option and so far life's good. The MaxPermSize setting shows up under vm options
eclipse.vmargs=-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
-XX:MaxPermSize=256m
But got this bad feeling that when I move to JDK 7, there'll be a newer set of problems cropping up. (From the Oracle site: "In consideration to Eclipse and other potentially affected users, Oracle has restored the Windows Company Name property value to 'Sun Microsystems' in further JDK 6 updates. This value will be changed back to 'Oracle' in JDK 7.") Sigh, so much for compatibility.
Saturday, May 29, 2010
ConcurrentModeFailure on Full GCs - An analysis
Hit a problem on our production env. - summarized below.
We have 1 admin + 3 agent processes (all Java processes) running per box, across 16 boxes. Over the past couple of weeks, we have observed that on the 6th or 7th day after an application restart, multiple full GCs occur (sometimes pushing 300+ in a day), accompanied by decreased throughput and increased CPU usage. One point worth mentioning is that the framework we use is heavily multi-threaded.
Specs are
OS: SunOS Rel.5.10
Number of CPUs: 32
System: Sun Sparc
JDK Version: JDK 1.4.2_18
JVM settings in use are
Performance Options
-Xmx1536m -Xms1536m : Min and max heap size (excludes PermGen space). Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine.
-XX:NewSize=512m -XX:MaxNewSize=512m : Min and max size of the new generation.
-XX:PermSize=64m -XX:MaxPermSize=128m : Min and max size of the permanent generation, which holds the JVM's class and method objects. It is recommended to set PermSize equal to MaxPermSize to avoid heap resizing as PermGen grows. This is one change we plan to make, i.e. set -XX:PermSize to 128m.
-XX:SurvivorRatio=8 : The young generation space is divided into eden and 2 survivor spaces, sized as follows: Eden = NewSize - ((NewSize / (SurvivorRatio + 2)) * 2) = 409.6MB; From space = (NewSize - Eden) / 2 = 51.2MB; To space = (NewSize - Eden) / 2 = 51.2MB; Old gen space = 1536 - 512 = 1024MB.
-XX:TargetSurvivorRatio=90 : Survivor-space occupancy target used in tenuring decisions; in this case it allows 90% of the survivor spaces to be occupied instead of the default 50%.
-XX:ParallelGCThreads=4 : Default value is equal to the number of CPUs; tuned in our case to 4.
-XX:MaxTenuringThreshold=10 : The maximum number of times an object is aged in the young generation before promotion.
Debugging Options
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime
Behavioral Options
-XX:+UseConcMarkSweepGC : Use concurrent mark-sweep collection for the old generation.
-XX:+DisableExplicitGC : Disable calls to System.gc().
-XX:+UseParNewGC : Use multiple threads for parallel young generation collection.
-XX:-UseAdaptiveSizePolicy : Explicitly disable UseAdaptiveSizePolicy, since it is meant to be used with UseParallelGC, not UseParNewGC.
-Xloggc:/log/file1.gc : GC log output file name.
A word about how the generations are structured before we progress.
The young generation is eden plus 2 survivor spaces (SS). An object, when created, is allocated in eden. At any given point of time, one SS is empty and serves as the destination of the next copy. During a minor collection, the live objects in eden are copied to the first SS. If all the live objects in eden do not fit in one SS, the remaining live objects are promoted into the old generation. During the next minor collection, the live objects from eden and the first SS are copied to the second SS.
As more objects become tenured, the old object space begins to reach maximum occupancy. The default old-generation GC algorithm is mark-compact collection: it scans all objects, marking the reachable ones, then compacts away the gaps left by dead objects. The advantage this algorithm has over copy collection is that it requires less memory and eliminates memory fragmentation.
Also, the concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified value (by default 68%). It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.
The concurrent collector uses a background thread that runs concurrently with the application threads, so garbage collection and application work can proceed at the same time. Collection happens in six phases: two are stop-the-world and four are concurrent, i.e. they run alongside the application threads. The phases are:
1. Initial-Mark-Phase (Stop-the-world) - Stop all Java threads, mark all objects directly reachable from the roots, then restart the Java threads.
2. Mark-Phase (Concurrent) - Start scanning from marked objects and transitively mark all objects reachable from the roots. The mutators are executing during the concurrent phases 2,3 and 5 below and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live.
3. Pre-cleaning Phase (Concurrent) - During Phase 2, mutators may be modifying objects. Any object that has been modified since the start of the concurrent marking phase (and which was not subsequently scanned during that phase) must be rescanned. This phase scans objects that have been modified concurrently. Due to continuing mutator activity, the scanning for modified cards may be done multiple times.
4. Final Checkpoint or Remark-Phase (Stop-the-world) - With the mutators stopped, the final marking is done by scanning objects reachable from the roots and by scanning any modified objects. Note that after this phase there may be objects that have been marked but are no longer live. Such objects will survive the current collection but will be collected on the next collection.
5. Sweep-Phase (Concurrent) - Collects dead objects. The collection of a dead object adds its space to a free list for later allocation. Coalescing of dead objects may occur at this point. Note that live objects are not moved.
6. Reset-Phase (Concurrent) - Clears data structures in preparation for the next collection.
Normally the concurrent low pause collector does not copy or compact the live objects. In 1.4.2, if fragmentation in the tenured generation becomes a problem, a compaction of the tenured generation will be done although not concurrently.
Observations
1. Continuous CMS runs
The CMS logs indicated that CMS-initial-mark, i.e. the beginning of a tenured generation collection, triggers when the tenured generation reaches 54% of capacity (could this be triggered when PermGen space is full? Update: the understanding is that when using CMS, perm gen collection is turned off by default and a serial full GC collects the perm gen). The sweep is unable to clear the tenured space, hence this cycle is repeated a number of times.
2. Concurrent mode failure
Full GCs have resulted in concurrent mode failure (snippet from the log below). Here PermGen had touched its max but was cleared during the full GC: [CMS Perm : 131009K->27464K(131072K)]. However, the tenured space, which was at 470672K, has only fallen to 409399K: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs]
Snippet from the log indicating 'Concurrent mode failure'
Application time: 21.5362235 seconds 198867.489: [Full GC 198867.489: [ParNew Desired survivor size 48306584 bytes, new threshold 10 (max 10) - age 1: 2517240 bytes, 2517240 total - age 2: 2457600 bytes, 4974840 total - age 3: 336264 bytes, 5311104 total - age 4: 418808 bytes, 5729912 total - age 5: 3270656 bytes, 9000568 total - age 6: 2665920 bytes, 11666488 total - age 7: 2297312 bytes, 13963800 total - age 8: 951136 bytes, 14914936 total - age 9: 1267616 bytes, 16182552 total - age 10: 469848 bytes, 16652400 total : 305816K->16323K(471872K), 0.1235504 secs]198867.613: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs] 775463K->409399K(1520448K), [CMS Perm : 131009K->27464K(131072K)], 26.9079389 secs] Total time for which application threads were stopped: 26.9141975 seconds
3. Promotion Failed
This shows that a ParNew collection was requested, but was not attempted. The reason is that it was estimated that there was not enough space in the CMS generation to promote the worst-case surviving young generation objects. This failure is termed a "full promotion guarantee failure".
As a result, the concurrent mode of CMS is interrupted and a full GC invoked.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp)
Snippet from the log indicating 'Promotion Failed'
803895.914: [GC 803895.914: [ParNew (promotion failed) Desired survivor size 48306584 bytes, new threshold 7 (max 10) - age 1: 18086800 bytes, 18086800 total - age 2: 1270960 bytes, 19357760 total - age 3: 4698768 bytes, 24056528 total - age 4: 19146288 bytes, 43202816 total - age 5: 2225224 bytes, 45428040 total - age 6: 2812752 bytes, 48240792 total - age 7: 1029632 bytes, 49270424 total - age 8: 805968 bytes, 50076392 total - age 9: 930080 bytes, 51006472 total - age 10: 1159576 bytes, 52166048 total : 462419K ->462419K(471872K), 0.5391725 secs]803896.453: [CMS803899.534: [CMS-concurrent-mark: 13.208/14.159 secs] (concurrent mode failure): 1030597K->1048536K(1048576K), 35.8992699 secs] 1491791K->1055004K(1520448K), 36.4391620 secs] Total time for which application threads were stopped: 36.4773697 seconds 803936.523: [Full GC 803936.523: [ParNew: 419456K->419456K(471872K), 0.0000468 secs]803936.523: [CMS803946.382: [CMS-concurrent-mark: 10.841/10.968 secs] (concurrent mode failure): 1048536K->1044407K(1048576K), 35.0106907 secs] 1467992K->1044407K(1520448K), [CMS Perm : 28343K->28337K(131072K)], 35.0114078 secs] Total time for which application threads were stopped: 35.1389167 seconds
Recommendations
1. As per the Sun Java forums, a concurrent mode failure can occur for two reasons (listed below).
• If the collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up – in our case, the tenured generation capacity is 1048576K and it is less than half full
• If allocation cannot be satisfied with the available free space blocks in the tenured generation.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp )
If the concurrent collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. This suggests that even when the logs indicate a concurrent mode failure, the collection should still have completed. But in this case the tenured space could not be cleared, which indicates that the objects in the tenured space are live objects.
2. Try each of the options listed below in turn.
• Try a larger total heap and/or smaller young generation. But often it just delays the problem.
• Make the application do a full, compacting collection at a time which will not disturb users. If the application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night (a minimal scheduling sketch follows this list). That will compact the heap and we can hopefully go another day without hitting the fragmentation problem.
• If most of the data in the tenured generation is read in when the application first starts up and a System.gc() can be done after complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Not true in our case.
• Start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. Try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalesce adjacent free blocks, so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction=NNN, where NNN is the percentage of the tenured generation that is in use above which a concurrent collection is started. This will increase the overall time spent doing GC but may avoid the fragmentation problem. But this will be more effective with 5.0 because a single contiguous chunk of space is not required for promotions.
(Source: http://blogs.sun.com/jonthecollector/entry/when_the_sum_of_the)
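On the nightly System.gc() idea above, a minimal scheduling sketch for JDK 1.4 using java.util.Timer (the 3 AM time is illustrative, and the -XX:+DisableExplicitGC flag in our settings would have to be dropped for the call to have any effect):

import java.util.Calendar;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;

public class NightlyCompactingGc {
    public static void main(String[] args) {
        Calendar firstRun = Calendar.getInstance();
        firstRun.set(Calendar.HOUR_OF_DAY, 3); // a quiet hour for users
        firstRun.set(Calendar.MINUTE, 0);
        firstRun.set(Calendar.SECOND, 0);
        if (firstRun.getTime().before(new Date())) {
            firstRun.add(Calendar.DAY_OF_MONTH, 1); // already past 3 AM today
        }
        Timer timer = new Timer(true); // daemon thread, dies with the app
        timer.schedule(new TimerTask() {
            public void run() {
                // Forces a full, compacting collection of the tenured generation
                System.gc();
            }
        }, firstRun.getTime(), 24L * 60 * 60 * 1000);
    }
}

In practice this would be wired into the application's startup rather than run as a separate main.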
This post will be updated based on results from further tests shortly.
Analyze heap dumps
The first step towards analyzing a memory leak in Java is to pull out the heap dump when full GC is in progress and then run the heap dump past any memory analyzer to uncover the memory leaks. Will be covering reasons for continuous full GCs in the next post.
To Generate Heap Dump
1. The first and preferred option is to modify the JVM settings to include the following 2 options. Ensure that your production startup scripts have these options in place.
-XX:+HeapDumpOnOutOfMemoryError -XX:+HeapDumpOnCtrlBreak
Type kill -3 <pid> to generate the hprof file in the bin folder. The two options listed above are non-intrusive and a heap dump is only generated either during an OOME or on demand. It is also considerably lighter on resources as compared to jmap.
2. Use jmap - Java 5 (and later versions of JDK 1.4, starting with 1.4.2_18) ships with a tool called jmap. It attaches itself to the JVM and obtains heap layout information, class histograms and complete heap snapshots. Heap layout information is retrieved instantly, with no impact on the running application. However, taking histograms and heap snapshots takes longer and also affects the memory / CPU of the application, resulting in either slow response times or complete stalling of the application. So, schedule this activity for when the system load is low.
Commands to be used:
ps -ef | grep <username> - get the list of PIDs for all the processes running on a system
jmap <pid> > out.txt - prints out the same information as pmap
jmap -heap <pid> >> out.txt - prints out a Java heap summary
jmap -heap:format=b <pid> >> out.txt - prints out the Java heap in hprof binary format (generally called heap.bin) in the current directory
jmap -histo <pid> >> out.txt - prints out a histogram of the Java object heap
Next run the generated heap dump past one of the following tools
1. Eclipse Memory Analyzer
Download MAT from http://www.eclipse.org/mat/downloads.php. Open the hprof dump generated using jmap or HeapDumpOnCtrlBreak. MAT parses the dump and generates a visual representation which gives a breakdown in terms of leak suspects, components and consumers.
On opening the heap dump, you see an info page with a chart of the biggest objects, and in many cases you will notice a single huge object already there. Click on the "Leak Suspects" link of the Overview and drill down through the HTML report produced. Further reading
2. JHat which ships with Java 6 (Mustang) - To run the tool, use the command
jhat -J-mx512m -stack false heap.bin
Wednesday, August 13, 2008
Down in the dump with thread dumps
Had used samurai nearly a year ago, when we were faced with some stuck thread issues. Was talking to a friend about analyzing thread dumps, and I collected these links.
A summary on tools available
http://weblogs.java.net/blog/bhaktimehta/archive/2007/05/obtaining_and_a.html
Analysis
http://weblogic.sys-con.com/node/44027
Tool urls
Samurai - http://yusuke.homeip.net/samurai/?english - Download samurai.jar, launch it via Java Web Start (jws) and get going.
TDA - http://www.jroller.com/dumpster/entry/tda_thread_dump_analyzer_1
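For completeness, the dumps these tools consume can be captured without any extra tooling - the usual ways (the output file name is just an example):

kill -3 <pid> (the dump is written to the process's standard output)
jstack <pid> > threads.txt (Java 5 and later)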