Key: GLASSFISH-17504
Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Critical
Assignee: Mahesh Kannan
Reporter: lprimak
Votes: 2
Watchers: 8
glassfish

High Availability (HA) webapps slow, corrupted sessions, and java.util.concurrent.TimeoutException

Created: 27/Oct/11 07:32 PM   Updated: 05/Feb/12 06:12 AM   Resolved: 06/Jan/12 04:53 PM
Component/s: web_container
Affects Version/s: 3.1.1
Fix Version/s: 3.1.2_b17

Time Tracking:
Not Specified

File Attachments: 1. Zip Archive server_logs.zip (10 kB) 27/Oct/11 07:32 PM - lprimak

Environment:
  • Linux, 8GB RAM, 8-core Intel CPU, Cluster of 2 machines.
  • Both session loss and slowness coincide directly with the TimeoutException
  • The app used is our internal app; we are having trouble reproducing this with cluster.jsp directly
  • aside from the --distributable-- directive, there is no tuning in web.xml, there is no glassfish-web.xml at all
  • application is deployed from the Admin GUI, with no changes in any of the checkboxes, aside from 'availability'
  • availability is set at deployment time, not after
  • no relaxCacheVersionSemantics property
  • session loss occurs frequently but not on every request; whenever it occurs, there is a corresponding Shoal TimeoutException in the logs
  • session size is around 50k
  • cluster has 2 nodes, both are full (not virtual) machines
  • There is no traffic (test server) just sitting trying to use the app with one browser
  • The issue happens whether you use a load balancer or not, even when hitting the server directly, although it's much easier to reproduce with a sticky-session load balancer (apache/mod-proxy-ajp)


Tags: shoal availability replication sessions 3_1_2-approved
Participants: jjackb, Joe Di Pol, Joe Fialli, lprimak, Mahesh Kannan, michalkurtak, shreedhar_ganapathy and Tushar Patidar


Description

I set up a cluster, and deployed my JSP application onto it.
It works great until I turn on high-availability for this application via the Admin console.
Once I do that, it becomes very slow, and session state gets lost every 2 requests or so.
Disabling high-availability cures the problem.

I did run validate-multicast, GMS is running, cluster health is good, I followed the documentation,
and didn't do anything 'weird' or 'customized'.
I also have the <distributable/> element in my web application's web.xml.

There are no other errors in the log files, but once I turn on high availability I get this warning very frequently:
[#|2011-03-06T02:13:00.297-0500|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=27;_ThreadName=Thread-1;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]



shreedhar_ganapathy added a comment - 27/Oct/11 07:50 PM

A couple of questions:
1. Have you tried any of the supported LBs with sticky sessions enabled, i.e. Apache with mod_jk, or Oracle iPlanet Web Server/Apache/IIS with the GlassFish LB Plugin? (We don't support mod_proxy_ajp yet, although I suspect this is not contributing to the issue.)

2. Does your app employ Ajax calls? Ajax-based requests may cause the request version number within a session to be incremented before a given request has completed replication and returned. As a result, the session may not be found for that incremented version number, causing a new session to be created.
To work around this, place the relaxCacheVersionSemantics property in the glassfish-web.xml descriptor.

Here's a snippet

<session-config>
  <session-manager persistence-type="replicated">
    <manager-properties>
      <property name="relaxCacheVersionSemantics" value="true"/>
    </manager-properties>
  </session-manager>
</session-config>
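
To picture the version mismatch described in question 2, here is a minimal toy model in Java. It is not GlassFish or Shoal code: the class, its methods, and the exact acceptance rule (strict mode rejects a replica copy older than the requested version, relaxed mode accepts whatever copy is cached) are assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class VersionCheckSketch {
    // Toy replica cache: session ID -> highest version that finished replicating.
    private final Map<String, Long> replicatedVersion = new HashMap<>();

    /** Records that the given version of a session has reached the replica. */
    public void replicate(String sessionId, long version) {
        replicatedVersion.merge(sessionId, version, Long::max);
    }

    /**
     * Returns true if a load for (sessionId, requestedVersion) finds the session.
     * ASSUMPTION: strict mode rejects a copy older than the requested version;
     * relaxed mode accepts whatever copy is cached.
     */
    public boolean load(String sessionId, long requestedVersion, boolean relaxed) {
        Long cached = replicatedVersion.get(sessionId);
        if (cached == null) {
            return false; // nothing replicated for this session at all
        }
        return relaxed || cached >= requestedVersion;
    }

    public static void main(String[] args) {
        VersionCheckSketch cache = new VersionCheckSketch();
        cache.replicate("s1", 1);
        // A concurrent request already bumped the version to 2,
        // but version 2 has not finished replicating yet.
        System.out.println("strict:  " + cache.load("s1", 2, false)); // false
        System.out.println("relaxed: " + cache.load("s1", 2, true));  // true
    }
}
```

In this model the strict lookup misses, which corresponds to the "session not found, new session created" symptom, while the relaxed lookup returns the slightly older copy.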

Let us know whether any of the above resolves or reproduces the issue, with more information. On our end we are trying to reproduce it with our own apps, but so far we cannot.

If it's possible to share your app, that would also help.


lprimak added a comment - 28/Oct/11 05:38 PM

I tried setting relaxCacheVersionSemantics=true, but it did not help.

I did not try the load balancer, but that isn't even involved. When I hit Glassfish server directly
without any load balancer, the issue still exists.
I cannot share the application, because it is very database-driven and I can't give access to that for obvious reasons.


Mahesh Kannan added a comment - 01/Nov/11 04:39 AM

You said that there are two machines involved. When you use a browser, the cookies (for example JSESSIONID) will typically NOT be sent to the second machine. This could be the reason why session failover is not working for you. (However, this won't be an issue if you use an LB.)

I suggest you set up an LB and try your app. Alternatively, you can create a cluster of two instances that run on the same machine; in that case you can use a browser to jump from one instance to the other.

hope this helps.


lprimak added a comment - 01/Nov/11 06:31 AM

Believe me, the load balancer is not the issue. This is in no way related to the load balancer. I tried all kinds of setups, with and without a load balancer, with no results. Trying to isolate the problem is how I came up with the non-load-balanced test.

In my non-load-balanced tests I hit only one instance, so replication wasn't even used, but the session loss was still there and so was the timeout exception. The timeout exception is the key here and needs to be found and fixed.

This issue is going on in multiple environments and reported by multiple users. This is not an invalid issue. This is not an environmental issue. This is a bug in glassfish and shoal in particular. This is not an operator error. This has been going for more than a year with lots of people trying lots of things to fix it with no results.


Tushar Patidar added a comment - 15/Nov/11 10:53 AM

I observed the same behavior with logs indicating LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException. I have deployed a 2 instance cluster spanning two separate physical hosts. Instances keep on logging such WARNING messages. I have fronted the cluster with Apache mod_jk LB.


michalkurtak added a comment - 16/Nov/11 07:49 AM

Hello.
We are observing the same problem. We have a 2-node cluster with 4 instances (2 instances per node). The cluster is very, very slow. This is obvious when static content (e.g. images) is served in parallel from the GlassFish servers: 120B images are served in 3-4 seconds. We have haproxy as a sticky-session load balancer in front of the cluster, so requests arrive on the same instance, and yet the session is lost.

We have this message in logs:
[#|2011-11-15T16:12:37.526+0100|WARNING|glassfish3.1|org.shoal.ha.cache.command.load_request|_ThreadID=47;_ThreadName=Thread-2;|LoadRequestCommand timed out while waiting for result java.util.concurrent.TimeoutException|#]


Mahesh Kannan added a comment - 14/Dec/11 06:24 PM

The Shoal replication module currently doesn't handle the case where there are no replication partners: it attempts to replicate data even if no replica instances are running. I have a patch that fixes this. It will be available in the next Shoal promotion.

Regarding the TimeoutException: it is thrown only when a load request (to load a session from a replica) fails to complete within a reasonable time. It is not the root cause itself. What we need to identify is why the sessions are not found.

I had a discussion with the web container team, and it looks like there is a race condition when AJAX calls are involved. I am working on a fix for this as well.


lprimak added a comment - 14/Dec/11 06:41 PM

Can you elaborate a bit further on this?
I have two instances in the cluster, shouldn't they be the required replication partners?
Also, I have no AJAX calls, but the web site can trigger a race condition just by retrieving many URLs in parallel with the same session ID cookie.


Mahesh Kannan added a comment - 21/Dec/11 06:16 PM

<comment>

Can you elaborate a bit further on this?
I have two instances in the cluster, shouldn't they be the required replication partners?
Also, I have no AJAX calls, but the web site can trigger a race condition just by retrieving many URLs in parallel with the same session ID cookie.

</comment>

If you have two instances, then they are discovered and one will act as a replica for the other.

You mentioned that there could be parallel threads accessing the same session. This is exactly what AJAX-type applications do. In this case, the web container will issue a bunch of save (or updateTimeStamp) calls to the replication module in parallel for the same session ID. The web container, the replication module, or both need to handle concurrent saving of the same session properly.

This issue is exactly the same as 17344.


lprimak added a comment - 22/Dec/11 04:36 PM

Is there a Glassfish plugin or a version that I can test with now to see if this is really true? I remember that the Shoal team cannot reproduce the problem right now, and I would love to confirm that this indeed is the cause of the problem.
Thanks!


Mahesh Kannan added a comment - 27/Dec/11 09:02 PM

There is no plugin to test this. We are trying to provide a fix for this issue in 3.1.2. Can you please reproduce the issue using your app on 3.1.2 (using the latest nightly build)?

I am currently testing a patch that fixes this issue. Once the patch is ready and integrated, you can pick up the next available promoted build of 3.1.2 to test it.


Mahesh Kannan added a comment - 03/Jan/12 09:27 PM

I have the patch ready (using GlassFish 3.1.2 trunk).

If you have your tests set up on 3.1.2, I can post the patch.

I will check it in after code review.


lprimak added a comment - 04/Jan/12 08:43 PM

Can you post a binary release somewhere so I don't have to compile, apply patch and download? I never built from source and would like to avoid it if possible.
Thanks


jjackb added a comment - 05/Jan/12 04:00 PM

I am also interested in a binary release to test the patch because this might be the solution to this bug:
http://java.net/jira/browse/GLASSFISH-15575
-> reported in early 2011 during gf 3.1 beta-testing and describes the same problem and log entries
-> problem still exists with gf 3.1.1 in production environment


Mahesh Kannan added a comment - 06/Jan/12 01:46 AM

The promoted builds are available at:

http://dlc.sun.com.edgesuite.net/glassfish/3.1.2/promoted/

Please wait for the next promotion


Mahesh Kannan added a comment - 06/Jan/12 04:45 AM

You also need to update your sun-web.xml with the following content:

Note that the manager-properties contains: relaxCacheVersionSemantics

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application Server 9.0 Servlet 2.5//EN" "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
<sun-web-app error-url="">
  <context-root>/ctestservlet</context-root>
  <class-loader delegate="true"/>
  <session-config>
    <session-manager persistence-type="replicated">
      <manager-properties>
        <property name="persistenceFrequency" value="web-method"/>
        <property name="relaxCacheVersionSemantics" value="true"/>
      </manager-properties>
      <store-properties>
        <property name="persistenceScope" value="session"/>
      </store-properties>
    </session-manager>
    <session-properties/>
    <cookie-properties/>
  </session-config>
  <jsp-config>
    <property name="keepgenerated" value="true">
      <description>Keep a copy of the generated servlet class java code.</description>
    </property>
  </jsp-config>
</sun-web-app>


Mahesh Kannan added a comment - 06/Jan/12 05:01 AM

I am going to close this issue. If you update the sun-web.xml with the relaxCacheVersionSemantics, you should see a considerable increase in performance.

Without the relaxCacheVersionSemantics, there were too many load requests to load the session from replica instance causing a considerable delay in loading the page.

I will keep issue number 17344 open though (http://java.net/jira/browse/GLASSFISH-17344)


Joe Di Pol added a comment - 06/Jan/12 03:47 PM

From Mahesh:

  • What is the impact on the customer of the bug?
    Moderate.
  • How likely is it that a customer will see the bug and how serious is the bug?
    Customers who are using AJAX will face this issue.
  • Is it a regression? Does it meet other bug fix criteria (security, performance, etc.)?
    Yes. 2.x handled AJAX related calls well
  • What is the cost/risk of fixing the bug?
    Moderate

  • How risky is the fix? How much work is the fix? Is the fix complicated?
    Moderate. The fix is straightforward, but I had to touch 9 files, all in the failover/replication module.

  • Is there an impact on documentation or message strings?
    No changes to docs required, since the 2.x documentation already covers the AJAX-related settings that must be specified in glassfish-web.xml.
  • Which tests should QA (re)run to verify the fix did not destabilize GlassFish?
    SQE HA tests. These tests have been run with the patch and all passed.
  • Which is the targeted build of 3.1.2 for this fix?
    Next build

Mahesh Kannan added a comment - 06/Jan/12 04:53 PM

Tested with the ctestservlet mentioned in 15575. The real issue here is that the app uses multiple gifs/jpegs, which cause the browser to make concurrent requests to the server. Due to the absence of relaxCacheVersionSemantics in sun-web.xml, the web container makes approximately 7 load_requests to the replication layer for every page access!

Some of the load_requests were lost because we do batching (using a map) based on session ID.

I have fixed the loss of load_requests with a fix to Shoal (commit version 1732).
After adding relaxCacheVersionSemantics to the app, there was no session loss.
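
As an illustration of how batching in a map keyed on session ID can drop requests, here is a minimal Java sketch. It is not the Shoal code and not the fix in commit 1732; the class, the LoadRequest type, and the list-per-session alternative are all hypothetical and only show the general pitfall.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchingSketch {
    /** A pending load request; both fields are hypothetical names for this sketch. */
    public static final class LoadRequest {
        final String sessionId;
        final int requestId;

        public LoadRequest(String sessionId, int requestId) {
            this.sessionId = sessionId;
            this.requestId = requestId;
        }
    }

    /** One slot per session ID: a later request silently replaces an earlier one. */
    public static int queuedWithOneSlotPerSession(List<LoadRequest> requests) {
        Map<String, LoadRequest> batch = new HashMap<>();
        for (LoadRequest r : requests) {
            batch.put(r.sessionId, r); // overwrites any pending request for this session
        }
        return batch.size();
    }

    /** One list per session ID: every request for the same session is kept. */
    public static int queuedWithListPerSession(List<LoadRequest> requests) {
        Map<String, List<LoadRequest>> batch = new HashMap<>();
        for (LoadRequest r : requests) {
            batch.computeIfAbsent(r.sessionId, k -> new ArrayList<>()).add(r);
        }
        int total = 0;
        for (List<LoadRequest> perSession : batch.values()) {
            total += perSession.size();
        }
        return total;
    }

    /** Seven concurrent load requests for the same session, as for one page access. */
    public static List<LoadRequest> sevenRequestsForOnePage() {
        List<LoadRequest> requests = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            requests.add(new LoadRequest("session-A", i));
        }
        return requests;
    }

    public static void main(String[] args) {
        List<LoadRequest> requests = sevenRequestsForOnePage();
        System.out.println(queuedWithOneSlotPerSession(requests)); // 1
        System.out.println(queuedWithListPerSession(requests));    // 7
    }
}
```

With one slot per session ID, six of the seven requests never get a reply in this model, so their callers would be left waiting until they time out.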

<comment from submitter>
cluster has 2 nodes, both are full (not virtual) machines. There is no traffic (test server) just sitting trying to use the app with one browser
</comment from submitter>

I would like to add that if there are multiple physical machines, you have to use a load balancer; otherwise the JSESSIONID cookie will not be automatically sent by the browser to the other machine. This has nothing to do with replication or the web container. This is how browsers work.
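
The host scoping of cookies mentioned above can be observed with Java's own java.net.CookieManager, which applies a comparable rule to a browser for a cookie set without a Domain attribute. This is only a sketch: the host names node1.example.com and node2.example.com, the port, the /app path, and the cookie value are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CookieScopeSketch {
    /** Returns the Cookie header values the jar would send with a request to the given URL. */
    public static List<String> cookiesSentTo(CookieManager jar, String url) {
        try {
            Map<String, List<String>> headers =
                    jar.get(URI.create(url), Collections.<String, List<String>>emptyMap());
            List<String> cookies = headers.get("Cookie");
            return cookies == null ? Collections.<String>emptyList() : cookies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a jar that has received a session cookie from node1 only (hypothetical host). */
    public static CookieManager jarAfterVisitingNode1() {
        CookieManager jar = new CookieManager();
        try {
            jar.put(URI.create("http://node1.example.com:8080/app/"),
                    Collections.singletonMap("Set-Cookie",
                            Collections.singletonList("JSESSIONID=abc123; Path=/app")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return jar;
    }

    public static void main(String[] args) {
        CookieManager jar = jarAfterVisitingNode1();
        // Same host that set the cookie: the session cookie is sent back.
        System.out.println(cookiesSentTo(jar, "http://node1.example.com:8080/app/page.jsp"));
        // Different host: the cookie is not sent, so the server sees no session ID.
        System.out.println(cookiesSentTo(jar, "http://node2.example.com:8080/app/page.jsp"));
    }
}
```

With the default cookie policy, the first call should list the JSESSIONID cookie and the second should come back empty. A load balancer avoids this because the browser only ever talks to a single host name.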


Joe Fialli added a comment - 10/Jan/12 07:15 PM

Shoal 1.6.17 integrated into bg trunk as part of svn version 52009 on January 10, 2012.
Fix should be in next promoted build which is 4.0 b19.


lprimak added a comment - 04/Feb/12 05:26 AM

Looks like this is confirmed fixed now. Thanks a lot for your efforts.
I didn't even need to do this: <property name="relaxCacheVersionSemantics" value="true"/>
and it still works great!


lprimak added a comment - 05/Feb/12 06:12 AM

Looks like the replication problems are not fixed in 3.1.2b20.
Some session attributes are getting lost, seemingly being overwritten
by another node in the cluster with older data.
The TimeoutExceptions and slow performance are fixed though.

I opened another issue regarding this:
http://java.net/jira/browse/GLASSFISH-18322