Monday, December 27, 2010

Java DNS Cache

This one is for the record - an issue I had faced and meant to document, so here goes.

Our app (running on Java 1.4.2_19) connects to a GSS load-balanced URL that operates in round-robin fashion. For example, if the URL is http://www.twitter.com, which maps to 128.242.240.148/.116/.20, the app should get a different IP on each lookup, or at least most of the time. Instead it sticks to the first IP it discovers. The reason is Java's DNS cache.

Java caches DNS lookups after the app first resolves a host name, so any subsequent operation reuses the cached result - in our case, if www.twitter.com resolves to 128.242.240.20 the first time, that value is stored in the DNS cache and every time Twitter is accessed it resolves to the same IP.

Sounds daft, right? Not quite. This cache is maintained in the InetAddress class - there is one each for successful and unsuccessful host name resolutions. Positive caching guards against DNS spoofing attacks, while negative caching improves performance. By default, successful host name resolutions are cached forever, because there is no general rule for deciding when it is safe to remove cache entries. Unsuccessful host name resolutions are cached for a very short period of time (10 seconds) to improve performance. Refer to the InetAddress javadoc for more on the above.

In cases like ours, where we don't want DNS caching forever, the solution is to set the TTL value for positive caching to a very low value. This much is well documented. There are 3 ways to set this.
- Set the value from command line using the setting -Dnetworkaddress.cache.ttl=x where x is the number of seconds for which the value is to be cached.
- Use the sun property setting. Works the same as above, only use -Dsun.net.inetaddr.ttl=x in lieu of -Dnetworkaddress.cache.ttl
- Modify the java.security file (Path is $JAVA_HOME/jre/lib/security/java.security) to set the value of networkaddress.cache.ttl to a low value. By default the setting is -1 (cache forever). Note that this setting would affect any application which uses this JDK.
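
If you control the application code, the same TTLs can also be set programmatically at startup. A minimal sketch (class name hypothetical, values illustrative) - note these are security properties, so they go through java.security.Security rather than System.setProperty, and must be set before the first host name lookup:

import java.security.Security;

public class DnsCacheConfig {
    // Call once, early in application startup, before the first host name lookup.
    public static void configure() {
        // Cache successful lookups for 60 seconds instead of forever
        Security.setProperty("networkaddress.cache.ttl", "60");
        // Cache failed lookups for 10 seconds (the documented default)
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}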

Tried option 1 - added -Dnetworkaddress.cache.ttl=60 to the WebLogic startup file, which should have expired cached entries every 60 seconds. It did not have the intended effect. There is a related bug, 6247501, but that one is Windows specific and we were on Unix, so it shouldn't have affected us. Looked through the InetAddress caching mechanism, but could not quite figure out why it failed. Gave up after an hour.

Modified the JVM to use the sun setting -Dsun.net.inetaddr.ttl=60 and restarted the domain. Worked like a charm.
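
For reference, a sketch of the kind of change made to the startup script - the exact file and variable depend on the WebLogic version; JAVA_OPTIONS in startWebLogic.sh is assumed here:

# startWebLogic.sh (sketch) - re-resolve DNS every 60 seconds instead of caching forever
JAVA_OPTIONS="${JAVA_OPTIONS} -Dsun.net.inetaddr.ttl=60"
export JAVA_OPTIONS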

Can only conclude that -Dnetworkaddress.cache.ttl does not work from the command line but only from within the java.security file. If you do not wish to change the java.security setting, which would be the case if multiple domains reference the same JDK, a safer option is to use the -Dsun.net.inetaddr.ttl setting from the command line.

Happy caching.

Eclipse Memory Issues

JDK 5 was the default Java version in use; upgraded to JDK 1.6.0_21-b06 today and Eclipse started tanking. It starts up fine but after a few minutes fails with the error 'Internal plug-in action delegate error on creation. PermGen space'. Right, that seemed to indicate I wasn't allocating enough memory for permgen. But the same setup works fine with the very same memory settings under JDK 5, so this was definitely an upgrade problem. So much for my support of auto updates :-|

The default setting in eclipse.ini was
--launcher.XXMaxPermSize
256m
-vmargs
-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m

And 256m is generally sufficient for permgen, so what gives? Checked the configuration to confirm the settings had taken effect (click Help -> About Eclipse -> Installation Details -> Configuration). It shows Xms and Xmx, but no MaxPermSize - hmmm, interesting.
eclipse.vmargs=-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m

Googled a bit. The eclipse wiki indicated that it was a bug with Oracle/Sun JDK 1.6.0_21 (had to be my version, duhhh), being tracked as 319514. The bug link is a pretty interesting read.

Apparently, as part of Oracle's rebranding of Sun's products, the Company Name property of java.exe, the executable containing Oracle's JRE for Windows, was changed from "Sun Microsystems" to "Oracle" in Java SE 6u21. On Windows, the Eclipse launcher identifies a Sun VM using the GetFileVersionInfo API, which reads the company name (present under version details) from jvm.dll or the java executable and compares it against the string "Sun Microsystems". From update 21 onwards the company name is "Oracle", the launcher no longer recognises the VM as a Sun VM, and hence the launcher's MaxPermSize setting is not passed to the VM.

The workarounds are
- Switch back to '1.6.0_20'
- Change the command line for launching, or add the following line after "-vmargs" in your eclipse.ini file: -XX:MaxPermSize=256m
- For 32-bit Helios, download the fixed eclipse_1308.dll and place it into (eclipse_home)/plugins/org.eclipse.equinox.launcher.win32.win32.x86_1.1.0.v20100503
- Download and install any of the upgraded versions i.e. version 1.6.0_21-b07 or higher from the java site (alternative link is http://java.sun.com/javase/downloads/index.jsp). Make sure you have b07 or higher by running java -version.

Went with the last option and so far life's good. The MaxPermSize setting shows up under vm options
eclipse.vmargs=-Dosgi.requiredJavaVersion=1.5
-Xms40m
-Xmx256m
-XX:MaxPermSize=256m

But I've got this bad feeling that when I move to JDK 7, a new set of problems will crop up (from the Oracle site: In consideration to Eclipse and other potentially affected users, Oracle has restored the Windows Company Name property value to "Sun Microsystems" in further JDK 6 updates. This value will be changed back to "Oracle" in JDK 7). Sigh, so much for compatibility.

Monday, June 21, 2010

Google CL rocks

Refer to Isaac Truet's post Setup GoogleCL on WinXP if you plan to set up GoogleCL on Windows. Tried it on XP; yet to check on Windows 7.

This post is just to indulge my fantasy of posting a blog from the command line :-)

Command used to post
google blogger post --blog "TechRamblings" --title "Google CL rocks" --tags "python, googlecl, development" googlecl.html

Sunday, June 13, 2010

Eclipse Decompiler Plugins

The JADClipse plugin worked fine with Eclipse Galileo (3.5) on my old laptop, but fails on the Dell. The funny thing is JAD works fine from the command line, but cuts no ice with Eclipse. Tried pointing Eclipse to the JADClipse update URL - it downloads the jars, yes, but decompile it does not. Tried eclipse -clean, still no luck.

All I can assume is that JADClipse doesn't work on 64-bit machines - weird, yes, but true.

An option you can go with is the JD-Eclipse plugin. Use the Eclipse Update Manager to install JD and change the .class file association to the JD Class File Editor.

BTW, if interested, steps to follow for setting up JAD with Eclipse on win32 machines.

SyntaxHighlighter on Blogger

If you intend to post code snippets on your blog, SyntaxHighlighter is the best option available.

Steps to enable it in your blogpost
1. Download the latest version and unzip in a local folder.
2. Upload all the files to either your domain or to sites.google.com.
3. Edit your template (in Blogger, click on Design -> Edit HTML) and paste the required script and stylesheet references under the closing div tag (a sketch of this snippet follows after step 4).
4. And finally, at the start of your code snippet, add the <pre class="brush: js"> tag. Close the code snippet with the corresponding </pre> tag.
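
For step 3, a sketch of the kind of snippet that goes into the template - the file names assume the stock SyntaxHighlighter 3.x layout, the host URL is a placeholder for wherever you uploaded the files in step 2, and you add one shBrushXxx.js line per language you post (bloggerMode accounts for Blogger converting line breaks to <br/> tags):

<link href='http://your-host.example.com/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href='http://your-host.example.com/styles/shThemeDefault.css' rel='stylesheet' type='text/css'/>
<script src='http://your-host.example.com/scripts/shCore.js' type='text/javascript'></script>
<script src='http://your-host.example.com/scripts/shBrushJScript.js' type='text/javascript'></script>
<script type='text/javascript'>
  SyntaxHighlighter.config.bloggerMode = true;
  SyntaxHighlighter.all();
</script>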

Saturday, June 12, 2010

Eclipse Maven Integration - Settings file does not exist

If you're trying to integrate Maven 2.x with Eclipse and the following error pops up when installing a Maven project, read on. The error occurs because Eclipse isn't aware of the Maven setup on your machine, so we set M2_HOME and add it to the classpath variables made available to Eclipse. For good measure, we also set up a local repository for Maven (that's unrelated to the error below, just a good-to-have).

[ERROR] Error executing Maven.
[ERROR] The specified user settings file does not exist: C:\Users\xxxxxx\.m2\settings.xml

1. Navigate to the directory where Maven is installed - e.g. F:\apache-maven-2.2.1. Copy the absolute path. Create a new environment variable M2_HOME and modify the PATH variable
M2_HOME=F:\apache-maven-2.2.1
PATH=%M2_HOME%\bin;...

2. Create a new folder 'repository' under M2_HOME.

3. Edit the settings.xml file under the M2_HOME\conf directory. The localRepository element is commented out by default. Uncomment it and set its value to the absolute path of the repository folder you just created. (When a jar is referenced, Maven first looks for it in the local repository; if it's not found there, it downloads it from the online repository.)


<localRepository>F:\apache-maven-2.2.1\repository</localRepository>

4. Next, Eclipse needs to be told the path to the local Maven repository. Select Window -> Preferences. In the LHS pane of the popup, select Java -> Build Path -> Classpath Variables. Check whether the value of M2_REPO is the same as the localRepository defined in settings.xml. If not, from the LHS pane, select Maven -> User Settings and change the path to the settings.xml under M2_HOME, i.e. F:\apache-maven-2.2.1\conf\settings.xml

That's it.

Saturday, June 5, 2010

Buzz bookmarklet for Chrome

Got tired of copying URLs and posting them in Buzz, or of using TweetDeck or Reader as an intermediary, so I started checking out bookmarklets. This one was created using the Delicious bookmarklet as a base, and the testing has been cursory at best. Ensure that 'Always show Bookmarks bar' is checked, open the Bookmark Manager, create a new bookmark and paste the code below. That's it, good to go.

javascript:void(window.open('http://www.google.com/buzz/post?url='+encodeURIComponent(window.location.href)+'&title='+ encodeURIComponent(document.title),'buzzwindow','location=yes, links=no,scrollbars=no,toolbar=no,width=750,height=450'));

Friday, June 4, 2010

OpenID and OAuth - Alike yet different

If I hear one more person use the two terms interchangeably, I'll scream, ergo this post.

The similarities first - Both are open standards and co-exist in the security / identity space. Also both involve the consumer and provider sites communicating with each other using standard HTTP protocols.

I'll start off with a brief description of both, before highlighting the differences.

Let's take OpenID first

OpenID is an open, decentralized standard that defines a way for web-based applications to authenticate users with a single identity. So you do not have to maintain multiple username/password combinations; rather, you can use an existing account from one of the OpenID providers to sign in to multiple OpenID-enabled websites.

Chances are you already have an OpenID identity, if you have an account with Google or Yahoo, among others. Enable OpenID with them and get the OpenID identifier which comes in the form of a unique URL. Here Google/Yahoo is the OpenID provider - for a list of OpenID providers click here. You, as a user, can choose a certain OpenID provider today and later switch to another, if you so wish - a perfect decentralized setup.

Next come the consumers - that's simple: any application which is OpenID enabled qualifies. For example, Google Apps uses OpenID to achieve SSO - JanRain is one of the OpenID solutions available. A list of OpenID-supported sites is available here.

A simple example. Consider a scenario where you wish to comment on a blog using your OpenID identity. To enable OpenID commenting on this blog, I need to select 'Registered Users - includes OpenID' against the 'Who can comment' setting.

Next, follow the steps listed below - that in a nutshell is how OpenID works
1. Select your OpenID provider from the drop down menu next to the 'Sign-in using' option.
2. Next, enter the OpenID URI (Note - only the username is requested here).
3. When you click 'Publish your Comment', you will be redirected to your OpenID provider to authenticate your ID. Here you are prompted to enter your password.
4. When you submit the form, the OpenID provider authenticates your credentials and redirects you back to the comments page, where your comment is automatically posted. Your comment will appear with an OpenID icon to the left of it.

The risks associated with using OpenID are:
1. Not all sites support OpenID, but its adoption is expected to grow (I've been hearing that for the last couple of years :-))
2. Single point of failure - If your OpenID password is compromised by phishing, you risk compromising your identity + access to consumer sites.

Now for OAuth

OAuth lets you authorize one website (the consumer) to access your data on another website (the provider). For example, take the recent Seesmic integration of Buzz into their desktop and web apps - that's OAuth behind the scenes. If you want to authorize Seesmic to access your Buzz feeds, Seesmic redirects you to Buzz, which confirms with you before granting access. Note that if you are not logged into Buzz you will need to log in, which is fine - infinitely better than giving a third-party app my Buzz credentials. An alternative is to log in using OpenID, but I won't muddy the waters with that now :-)

Effectively, OAuth allows
- Consumers to interact with protected data, and
- Providers to give third-party apps access to stored data while protecting account credentials.

Detailed explanation here.

Now, as promised, the diffs -
1. OpenID is more of an authentication mechanism, whereas OAuth deals with open authorization. How so, you ask? Well, OpenID is about the provider site authenticating the user for consumer sites, whereas OAuth is about the provider site authorizing consumer sites to access its stored data. Simple, see :-)
2. The OpenID provider holds authentication information (read credentials) and a set of general information which it provides to consumer sites, e.g. registration details to prevent you from having to enter your address details every single time. Whereas in OAuth, the stored data held by the provider is shared with consumer sites.
3. Can they co-exist? Hell, yes. But the preferred method of communication right now is OAuth.

Saturday, May 29, 2010

ConcurrentModeFailure on Full GCs - An analysis

Hit a problem on our production env. - summarized below.

We have 1 admin + 3 agent processes (all Java processes) running per box, across 16 boxes. Over the past couple of weeks, it has been observed that on the 6th or 7th day after application restart, multiple full GCs occur (sometimes pushing 300+ in a day). This is accompanied by decreased throughput and increased CPU usage. One point worth mentioning is that the framework we use is heavily multi-threaded.

Specs are
OS: SunOS Rel.5.10
Number of CPUs: 32
System: Sun Sparc
JDK Version: JDK 1.4.2_18

JVM settings in use are
Performance Options

-Xmx1536m -Xms1536m
Min and Max Heap Size - Excludes PermGen space. Setting -Xms and -Xmx to the same value increases predictability by removing the most important sizing decision from the virtual machine.

-XX:NewSize=512m -XX:MaxNewSize=512m
Min and Max Size of New Generation

-XX:PermSize=64m -XX:MaxPermSize=128m
Min and Max Size of the Permanent Generation, which holds the JVM's class and method objects. It is recommended to set PermSize equal to MaxPermSize to avoid resizing pauses when the PermGen grows. This is one change we plan to make, i.e. set -XX:PermSize to 128m.

-XX:SurvivorRatio=8 
The young generation space is divided into eden and 2 survivor spaces (SSs).
The sizes work out as
Eden = NewSize - 2 * (NewSize / (SurvivorRatio + 2)) = 409.6MB
From space = (NewSize - Eden) / 2 = 51.2MB
To space = (NewSize - Eden) / 2 = 51.2MB
Old gen space = 1536 - 512 = 1024MB

-XX:TargetSurvivorRatio=90
The desired percentage of survivor space occupancy after a scavenge, used when computing the tenuring threshold. In this case it allows the survivor spaces to fill to 90% instead of the default 50%.

-XX:ParallelGCThreads=4
Default value is equal to number of CPUs. Optimised in our case to 4

-XX:MaxTenuringThreshold=10 
The maximum number of times an object is aged in the young generation before being promoted

Debugging Options

-XX:+PrintGCDetails    
-XX:+PrintGCTimeStamps 
-XX:+PrintTenuringDistribution 
-XX:+PrintGCApplicationConcurrentTime 
-XX:+PrintGCApplicationStoppedTime 

Behavioral Options

-XX:+UseConcMarkSweepGC
Use concurrent mark-sweep collection for the old generation

-XX:+DisableExplicitGC    
Disable calls to System.gc()

-XX:+UseParNewGC          
Use multiple threads for parallel young generation collection

-XX:-UseAdaptiveSizePolicy 
Specifically disable UseAdaptiveSizePolicy since it should be used with UseParallelGC, not UseParNewGC

-Xloggc:/log/file1.gc
Log output file name


A word about how the generations are structured before we progress.

The young generation is eden + 2 survivor spaces. An object, when created, is allocated in eden. At any given point in time, one survivor space is empty and serves as the destination of the next copy. During a minor collection, the live objects in eden are copied to the first survivor space. If all the live objects in eden do not fit in one survivor space, the remaining live objects are promoted into the old generation. During the next minor collection, the live objects from eden and the first survivor space are copied to the second survivor space.

As more objects become tenured, the old object space begins to reach maximum occupancy. The GC algorithm in use there is mark-compact collection: it scans all objects, marks all reachable objects, then compacts the gaps left by dead objects. The advantage this algorithm has over copy collection is that it requires less memory and eliminates memory fragmentation.

Also, the concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified value (by default 68%). It attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads.

The concurrent collector uses a background thread that runs concurrently with the application threads to enable both GC and object handling at the same time. The collector collects garbage in phases, two are stop-the-world phases, and four are concurrent i.e. they run with app threads. The phases are:

1. Initial-Mark-Phase (Stop-the-world) - Stop all Java threads, marking all the objects directly reachable from the roots, and restarting the Java threads.
2. Mark-Phase (Concurrent) - Start scanning from the marked objects and transitively mark all objects reachable from the roots. The mutators are executing during the concurrent phases 2, 3 and 5 below, and any objects allocated in the CMS generation during these phases (including promoted objects) are immediately marked as live.
3. Pre-cleaning Phase (Concurrent) - During Phase 2, mutators may be modifying objects. Any object that has been modified since the start of the concurrent marking phase (and which was not subsequently scanned during that phase) must be rescanned. This phase scans objects that have been modified concurrently. Due to continuing mutator activity, the scanning for modified cards may be done multiple times.
4. Final Checkpoint or Remark-Phase (Stop-the-world) - With mutators stopped, the final marking is done by scanning objects reachable from the roots and by rescanning any modified objects. Note that after this phase there may be objects that have been marked but are no longer live; such objects will survive the current collection but will be collected on the next collection.
5. Sweep-Phase (Concurrent) - Collects dead objects. The collection of a dead object adds the space for that object to a free list for later allocation. Coalescing of dead-object space may occur at this point. Note that live objects are not moved.
6. Reset-Phase (Concurrent) - Clears data structures in preparation for the next collection.

Normally the concurrent low pause collector does not copy or compact the live objects. In 1.4.2, if fragmentation in the tenured generation becomes a problem, a compaction of the tenured generation will be done although not concurrently.

Observations

1. Continuous CMS runs
The CMS logs indicate that CMS-initial-mark, i.e. the beginning of a tenured generation collection, triggers when the tenured generation reaches 54% of capacity (could this be triggered when PermGen is full? - Update: the understanding is that when using CMS, perm gen collection is turned off by default and a serial full GC collects the perm gen). The sweep is unable to clear the tenured space, hence this cycle is repeated a number of times.

2. Concurrent mode failure
Full GCs have resulted in concurrent mode failure (snippet from the log below). Here PermGen had touched its max but was cleared during the full GC [CMS Perm : 131009K->27464K(131072K)]. However, the tenured space, which was at 470672K, has only fallen to 409399K: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs]

Snippet from the log indicating 'Concurrent mode failure'
Application time: 21.5362235 seconds
198867.489: [Full GC 198867.489: [ParNew
Desired survivor size 48306584 bytes, new threshold 10 (max 10)
- age   1:    2517240 bytes,    2517240 total
- age   2:    2457600 bytes,    4974840 total
- age   3:     336264 bytes,    5311104 total
- age   4:     418808 bytes,    5729912 total
- age   5:    3270656 bytes,    9000568 total
- age   6:    2665920 bytes,   11666488 total
- age   7:    2297312 bytes,   13963800 total
- age   8:     951136 bytes,   14914936 total
- age   9:    1267616 bytes,   16182552 total
- age  10:     469848 bytes,   16652400 total
: 305816K->16323K(471872K), 0.1235504 secs]198867.613: [CMS (concurrent mode failure): 470672K->409399K(1048576K), 26.7835924 secs] 775463K->409399K(1520448K), [CMS Perm : 131009K->27464K(131072K)], 26.9079389 secs]
Total time for which application threads were stopped: 26.9141975 seconds

3. Promotion Failed
This shows that a ParNew collection was requested, but was not attempted. The reason is that it was estimated that there was not enough space in the CMS generation to promote the worst-case surviving young generation objects. This failure is termed a "full promotion guarantee failure".
As a result, the concurrent mode of CMS is interrupted and a full GC invoked.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp)

Snippet from the log indicating 'Promotion Failed'
803895.914: [GC 803895.914: [ParNew (promotion failed)
Desired survivor size 48306584 bytes, new threshold 7 (max 10)
- age   1:   18086800 bytes,   18086800 total
- age   2:    1270960 bytes,   19357760 total
- age   3:    4698768 bytes,   24056528 total
- age   4:   19146288 bytes,   43202816 total
- age   5:    2225224 bytes,   45428040 total
- age   6:    2812752 bytes,   48240792 total
- age   7:    1029632 bytes,   49270424 total
- age   8:     805968 bytes,   50076392 total
- age   9:     930080 bytes,   51006472 total
- age  10:    1159576 bytes,   52166048 total
: 462419K ->462419K(471872K), 0.5391725 secs]803896.453: [CMS803899.534: [CMS-concurrent-mark: 13.208/14.159 secs]
 (concurrent mode failure): 1030597K->1048536K(1048576K), 35.8992699 secs] 1491791K->1055004K(1520448K), 36.4391620 secs]
Total time for which application threads were stopped: 36.4773697 seconds

803936.523: [Full GC 803936.523: [ParNew: 419456K->419456K(471872K), 0.0000468 secs]803936.523: [CMS803946.382: [CMS-concurrent-mark: 10.841/10.968 secs]
 (concurrent mode failure): 1048536K->1044407K(1048576K), 35.0106907 secs] 1467992K->1044407K(1520448K), [CMS Perm : 28343K->28337K(131072K)], 35.0114078 secs]
Total time for which application threads were stopped: 35.1389167 seconds

Recommendations

1. As per the java sun forums, the concurrent mode failure could occur for 2 reasons (listed below).

• If the collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up - in our case, the tenured generation available is 1048576K and it is less than half full

• If allocation cannot be satisfied with the available free space blocks in the tenured generation.
(Source: http://www.sun.com/bigadmin/content/submitted/cms_gc_logs.jsp )

If the concurrent collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. This suggests that even when the logs indicate concurrent mode failure, the collection should still complete. But in this case the tenured space could not be cleared, which indicates that all the objects in the tenured space are live objects.

2. Try each of the options listed below in turn.

• Try a larger total heap and/or smaller young generation. But often it just delays the problem.

• Make the application do a full, compacting collection at a time which will not disturb users. If the application can go for a day without hitting a fragmentation problem, try a System.gc() in the middle of the night (a sketch of this follows after the list). That will compact the heap and we can hopefully go another day without hitting the fragmentation problem.

• If most of the data in the tenured generation is read in when the application first starts up and a System.gc() can be done after complete initialization, that might help by compacting all data into a single chunk leaving the rest of the tenured generation available for promotions. Not true in our case.

• Start the concurrent collections earlier. The low pause collector tries to start a concurrent collection just in time (with some safety factor) to collect the tenured generation before it is full. Try starting a concurrent collection sooner so that it finishes before the fragmentation becomes a problem. The concurrent collections don't do a compaction, but they do coalesce adjacent free blocks, so larger chunks of free space can result from a concurrent collection. One of the triggers for starting a concurrent collection is the amount of free space in the tenured generation. You can cause a concurrent collection to occur early by setting the option -XX:CMSInitiatingOccupancyFraction=NNN, where NNN is the percentage of the tenured generation in use above which a concurrent collection is started. This will increase the overall time spent doing GC but may avoid the fragmentation problem. But this will be more effective with 5.0, because a single contiguous chunk of space is not required for promotions.
(Source: http://blogs.sun.com/jonthecollector/entry/when_the_sum_of_the)
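
A minimal sketch of the nightly-compaction idea from the list above, assuming an off-peak window around 3 AM and that explicit GC is allowed (our current -XX:+DisableExplicitGC flag would have to be removed first; class and method names are hypothetical):

import java.util.Calendar;
import java.util.Timer;
import java.util.TimerTask;

public class NightlyCompaction {
    // Schedules a daily System.gc() at roughly 3 AM to force a full, compacting collection.
    public static void schedule() {
        Calendar next = Calendar.getInstance();
        next.set(Calendar.HOUR_OF_DAY, 3);
        next.set(Calendar.MINUTE, 0);
        next.set(Calendar.SECOND, 0);
        if (next.getTimeInMillis() <= System.currentTimeMillis()) {
            next.add(Calendar.DAY_OF_MONTH, 1); // already past 3 AM today, start tomorrow
        }
        Timer timer = new Timer(true); // daemon thread, won't keep the JVM alive
        timer.scheduleAtFixedRate(new TimerTask() {
            public void run() {
                System.gc(); // full, compacting collection (requires explicit GC to be enabled)
            }
        }, next.getTime(), 24L * 60 * 60 * 1000); // repeat daily
    }
}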

This post will be updated based on results from further tests shortly.

Analyze heap dumps

The first step towards analyzing a memory leak in Java is to pull out the heap dump when full GC is in progress and then run the heap dump past any memory analyzer to uncover the memory leaks. Will be covering reasons for continuous full GCs in the next post.

To Generate Heap Dump

1. The first and preferred option is to modify the JVM settings to include the following 2 options. Ensure that your production startup scripts have these options in place.
-XX:+HeapDumpOnOutOfMemoryError
-XX:+HeapDumpOnCtrlBreak
Type kill -3 <pid> to generate the hprof file in the bin folder. The 2 options listed above are non-intrusive - a heap dump is only generated either during an OOME or on demand. They are also considerably lighter on resources compared to jmap.

2. Use jmap - Java 5 (and later builds of JDK 1.4, starting with 1.4.2_18) ships with a tool called jmap. It attaches itself to the JVM and obtains heap layout information, class histograms and complete heap snapshots. Heap layout information is retrieved instantly, with no impact on the running application. However, taking histograms and heap snapshots takes longer and also affects the memory / CPU of the application, resulting in either slow response times or complete stalling of the application. So schedule this activity when the system load is low.

Commands to be used
ps -ef | grep <username> to get the list of PIDs for all the processes running on a system
jmap <pid> > out.txt - This command will print out the same information as pmap
jmap -heap <pid> >> out.txt - Prints out a Java heap summary
jmap -heap:format=b <pid> - Writes the Java heap in hprof binary format to a file named heap.bin in the current directory
jmap -histo <pid> >> out.txt - Prints out a histogram of the Java object heap

Next run the generated heap dump past one of the following tools

1. Eclipse Memory Analyzer
Download MAT from http://www.eclipse.org/mat/downloads.php. Open the hprof dump generated using jmap or HeapDumpOnCtrlBreak. MAT parses the dump and generates a visual representation which breaks things down in terms of leak suspects, components and consumers.

On opening the heap dump, you see an info page with a chart of the biggest objects, and in many cases you will notice a single huge object right there. Click on the "Leak Suspects" link on the Overview and drill down through the HTML report produced. Further reading

2. jhat, which ships with Java 6 (Mustang) - run the tool with the command below, then point a browser at http://localhost:7000 (jhat's default port)
jhat -J-mx512m -stack false heap.bin

Axis 2 and Transfer-Encoding Chunked

We upgraded one of our clients from Axis 1.2 to Axis 2 and it was observed that calls to the downstream system started failing. Checked with the backend guys and they confirmed that our IP address showed up in their access logs, so the handshake had succeeded. But the request xml was not received at their end.

The first step was to plug in TCPMon and trace the data sent over the wire (TCPMon is a utility that lets you monitor the messages passed along in a TCP-based conversation). If Eclipse is your preferred IDE, download the TCPMon Eclipse plugin, copy it to the ECLIPSE_HOME/plugins directory, then go to Window -> Show View -> Other -> TCP Monitor -> TCP Monitor to open the TCPMon view. Point your client at the TCPMon listener endpoint and configure the actual downstream endpoint in TCPMon.

Logs indicated that Transfer-Encoding was being sent as "chunked". Now this was different from the Axis 1.2 logs which had Content-Length set in the HTTP header. Reading up the Axis2 documentation threw some more light on the topic.

By default Apache Axis 1.2 uses 'org.apache.axis.transport.http.HTTPSender', which in turn uses HTTP/1.0, hence the default behaviour is to set a Content-Length header. Axis 2, however, uses HTTP/1.1, where the default is Transfer-Encoding set to "chunked". When the data is sent chunked, chunk-size markers are introduced into the payload (refer to the snippet below). If the downstream system is a legacy system which does not support HTTP 1.1 (as ours is) and cannot parse the message because of these extra characters, it returns a SOAP fault.

The solution to this is to disable HTTP chunking as follows.
options.setProperty(org.apache.axis2.transport.http.HTTPConstants.CHUNKED, Boolean.FALSE);
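
For context, a sketch of where that property is typically set when calling through a plain Axis2 ServiceClient (class name is hypothetical; generated stubs expose the same Options via _getServiceClient().getOptions()):

import org.apache.axis2.AxisFault;
import org.apache.axis2.client.Options;
import org.apache.axis2.client.ServiceClient;
import org.apache.axis2.transport.http.HTTPConstants;

public class ChunkingDisabledClient {
    public static ServiceClient create() throws AxisFault {
        ServiceClient client = new ServiceClient();
        Options options = client.getOptions();
        // Send a Content-Length header instead of Transfer-Encoding: chunked,
        // so HTTP/1.0-only backends can parse the request
        options.setProperty(HTTPConstants.CHUNKED, Boolean.FALSE);
        return client;
    }
}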

That said, if this issue is caught in the development stage, check if the team hosting the web service can support HTTP Chunking, since it is the preferred approach, especially when dealing with large messages like messages with attachments (MTOM).


HTTP Header using Axis 2
POST /test HTTP/1.1
Content-Type: text/xml; charset=UTF-8
SOAPAction: "http://test.com/wsdl/07/08/10#testDiagnosticRequest"
User-Agent: Axis2
Host: 127.0.0.1:8086
Transfer-Encoding: chunked
1aa
...(chunked SOAP envelope body)...
0

HTTP Header using Axis 1.2
Content-Type: text/xml; charset=utf-8
Accept: application/soap+xml, application/dime, multipart/related, text/*
User-Agent: Axis/1.2.1
Host: 127.0.0.1:8086
Cache-Control: no-cache
Pragma: no-cache
SOAPAction: "urn:GAT__Inbox/OpCreate"
Content-Length: 6698


...

Saturday, February 6, 2010

Hibernate, Tomcat 6.0 and mysql

My normal configuration stack is Spring, Hibernate and Oracle on WebLogic. Switched today to the lighter-weight MySQL and Tomcat combination - blame it on limited RAM.

Issue: The error thrown up in the logs is
java.lang.UnsupportedOperationException: Not supported by BasicDataSource
at org.apache.tomcat.dbcp.dbcp.BasicDataSource.getConnection(BasicDataSource.java:899)
...

The configuration files are listed below

server.xml located at %CATALINA_HOME%/conf


context.xml located at %CATALINA_HOME%/conf
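
For reference, a sketch of the kind of JNDI datasource definition that typically goes into context.xml for this setup - the attribute values are illustrative, but the resource name lines up with the java:comp/env/jdbc/products lookup in hibernate.cfg.xml below:

<Context>
  <Resource name="jdbc/products" auth="Container" type="javax.sql.DataSource"
            driverClassName="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/products"
            username="root" password="xxxxxx"
            maxActive="20" maxIdle="10" maxWait="-1"/>
</Context>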


hibernate.cfg.xml

<hibernate-configuration>
  <session-factory>
    <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
    <property name="hibernate.connection.datasource">java:comp/env/jdbc/products</property>
    <property name="hibernate.show_sql">true</property>
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
    <property name="hibernate.connection.username">root</property>
    <property name="hibernate.connection.password">xxxxxx</property>
  </session-factory>
</hibernate-configuration>

Reason: Hibernate's DatasourceConnectionProvider.getConnection() invokes Tomcat 6's BasicDataSource, passing the username and password from hibernate.cfg.xml.
But BasicDataSource's getConnection(String username, String password) method throws UnsupportedOperationException.

Solution: Remove the credentials from the hibernate.cfg.xml file and retain them only in the datasource definition. This results in the no-argument getConnection() method being invoked, which BasicDataSource does implement.
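
A sketch of how the session-factory section looks after the fix (only the connection-related properties are shown):

<hibernate-configuration>
  <session-factory>
    <property name="hibernate.connection.datasource">java:comp/env/jdbc/products</property>
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
    <property name="hibernate.show_sql">true</property>
    <!-- username and password removed: the Tomcat datasource now owns the credentials -->
  </session-factory>
</hibernate-configuration>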