Issue Details (XML | Word | Printable)

Key: GLASSFISH-15428
Type: Bug Bug
Status: Closed Closed
Resolution: Incomplete
Priority: Major Major
Assignee: Joe Fialli
Reporter: varunrupela
Votes: 0
Watchers: 1
Operations

If you were logged in you would be able to see more operations.
glassfish

[STRESS] NoClassDefFoundError from Shoal observed on an instance's log when another instance goes down

Created: 04/Jan/11 02:56 AM   Updated: 05/Jan/11 10:39 AM   Resolved: 05/Jan/11 10:39 AM
Component/s: group_management_service
Affects Version/s: 3.1_b35
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments: 1. Zip Archive jrockit_windows_gmssuspected_failure.zip (12 kB) 05/Jan/11 07:22 AM - Joe Fialli
2. Text File server.log (197 kB) 04/Jan/11 02:56 AM - varunrupela

Issue Links:
Dependency
 

Tags:
Participants: Joe Fialli and varunrupela


 Description  « Hide

Please see parent issue http://java.net/jira/browse/GLASSFISH-15425 for details of the scenario that shows this bug.

The following log messages were observed in instance 102 when instance103 went down:

**********
[#|2011-01-02T22:23:05.584+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: st-cluster : Members in view for IN_DOUBT_EVENT(before change analysis) are :
1: MemberId: instance101, MemberType: CORE, Address: 10.12.153.54:9195:228.9.37.37:31818:st-cluster:instance101
2: MemberId: instance102, MemberType: CORE, Address: 10.12.153.52:9147:228.9.37.37:31818:st-cluster:instance102
3: MemberId: instance103, MemberType: CORE, Address: 10.12.153.53:9110:228.9.37.37:31818:st-cluster:instance103
4: MemberId: server, MemberType: SPECTATOR, Address: 10.12.153.54:9103:228.9.37.37:31818:st-cluster:server

#]

[#|2011-01-02T22:23:05.584+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: IN_DOUBT_EVENT for member: instance103 of group: st-cluster|#]

[#|2011-01-02T22:23:05.585+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1007: Received FailureSuspectedEvent for member: instance103 of group: st-cluster|#]

[#|2011-01-02T22:23:05.586+0530|WARNING|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1090: handled exception processing event packet IN_DOUBT_EVENT from instance103|#]

[#|2011-01-02T22:23:05.586+0530|WARNING|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|stack trace
java.lang.NoClassDefFoundError:
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.addInDoubtMemberSignals(ViewWindowImpl.java:399)
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.analyzeViewChange(ViewWindowImpl.java:287)
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.newViewObserved(ViewWindowImpl.java:222)
at com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl.run(ViewWindowImpl.java:193)
at java.lang.Thread.run(Thread.java:662)

#]

***********

The server log showing this message is attached



varunrupela made changes - 04/Jan/11 02:57 AM
Field Original Value New Value
Link This issue depends on GLASSFISH-15425 [ GLASSFISH-15425 ]
varunrupela made changes - 04/Jan/11 02:58 AM
Link This issue depends on GLASSFISH-15425 [ GLASSFISH-15425 ]
varunrupela made changes - 04/Jan/11 02:58 AM
Link This issue blocks GLASSFISH-15425 [ GLASSFISH-15425 ]
Joe Fialli made changes - 04/Jan/11 07:44 AM
Status Open [ 1 ] In Progress [ 3 ]
Joe Fialli made changes - 04/Jan/11 07:44 AM
Status In Progress [ 3 ] Open [ 1 ]
Joe Fialli added a comment - 04/Jan/11 08:03 AM

Here is the source code line where the NoClassDefFoundError occurred.

signals.add(new FailureSuspectedSignalImpl(token, member.getGroupName(), member.getStartTime()));

So the class that was not found was FailureSuspectedSignalImpl which is defined in shoal-gms-impl.jar.
2975 Fri Dec 03 09:11:38 EST 2010 com/sun/enterprise/ee/cms/impl/common/FailureSuspectedSignalImpl.class

There was no class not found error when reporting Failure 2 seconds later based on following log event from server.log.
[#|2011-01-02T22:23:07.601+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: FAILURE_EVENT for member: instance103 of group: st-cluster|#]

[#|2011-01-02T22:23:07.603+0530|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1019: member: instance103 of group: st-cluster has failed|#]

So we can assume that FailureSignalImpl was found in the same shoal-gms-impl.jar. (unless the ClassNotFoundError is in next server.log file).

dhcp-burlington9-3rd-a-east-10-152-23-224:impl jf39279$ jar tvf target/shoal-gms-impl.jar | grep FailureNotificationSignalImpl
3798 Fri Dec 03 09:11:38 EST 2010 com/sun/enterprise/ee/cms/impl/common/FailureNotificationSignalImpl.class

This does not look like a bug in Shoal GMS but rather a higher level class loading issue.
Potentially, one could see this type of behavior when there was a previous OutOfMemoryException
thrown. Looking through the attached server.log, there was no previous history of OutOfMemory,
but there was evidence of other class loading issues. Specifically, the following log event occurred
5 times before the reported NoClassDefFoundError.

[#|2011-01-02T20:28:35.674+0530|SEVERE|glassfish3.1|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=22;_ThreadName=Thread-3;|Exception in thread "RMI RenewClean-[10.12.153.52:27688]" |#]

[#|2011-01-02T20:28:35.675+0530|SEVERE|glassfish3.1|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=22;_ThreadName=Thread-3;|java.lang.ClassFormatError: sun/reflect/GeneratedSerializationConstructorAccessor9015 : illegal JVM_CONSTANT_Methodref name: <init>
at sun.misc.Unsafe.defineClass(Native Method)
at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:45)
at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:381)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:377)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:95)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:313)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1327)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:437)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:413)
at java.io.ObjectStreamClass.lookup0(ObjectStreamClass.java:310)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java)

At this point, there is not enough info in this report to investigate this further.
It would be helpful to note if there were any previous out of memory exceptions occuring in this instance's server.logs before this occurred. (The other instance is failing presumably due to out of memory, but it is only this instance that is of interest)
Once there is an out of memory in server.log, all future failures are suspect and may be result of system dealing with out of memory issues.


Nazrul made changes - 04/Jan/11 01:25 PM
Summary NoClassDefFoundError from Shoal observed on an instance's log when another instance goes down [STRESS] NoClassDefFoundError from Shoal observed on an instance's log when another instance goes down
Joe Fialli added a comment - 04/Jan/11 01:37 PM

I researched if this could possibly be a missing osgi export.

Export-Package below is from shoal-gms-impl.jar manifest. Both referencing class ViewWindowImpl and
class not found FailureSuspectedSignalImp are in same jar: shoal-gms-impl.jar. The [1] rule below
represents the dependency of com.sun.enterprise.ee.cms.impl.base.ViewWindowImpl
on com.sun.enterprise.ee.cms.impl.common.FailureSuspectedSignalImpl.

From shoal-gms-impl.jar/META-INF/MANIFEST.MF
(added carriage returns below so each rule could be inspected.)

Export-Package: com.sun.enterprise.ee.cms.impl.common;uses:="com.sun.e
nterprise.ee.cms.core,com.sun.enterprise.ee.cms.spi,com.sun.enterpris
e.ee.cms.impl.client,com.sun.enterprise.ee.cms.logging,com.sun.enterp
rise.ee.cms.impl.base",

com.sun.enterprise.ee.cms.impl.client;uses:="c
om.sun.enterprise.ee.cms.core,com.sun.enterprise.ee.cms.logging",

<!-- begin [1] -->
com.sun.enterprise.ee.cms.impl.base;uses:="com.sun.enterprise.ee.cms.logg
ing,com.sun.enterprise.ee.cms.core,com.sun.enterprise.ee.cms.impl.com
mon,com.sun.enterprise.ee.cms.spi,com.sun.enterprise.mgmt,com.sun.ent
erprise.mgmt.transport,com.sun.enterprise.mgmt.transport.grizzly",
<!-- end [1] -->

com .sun.enterprise.gms.tools,com.sun.enterprise.ee.cms.logging;uses:="su
n.security.action",
com.sun.enterprise.mgmt;uses:="com.sun.enterprise.
ee.cms.impl.base,com.sun.enterprise.ee.cms.logging,com.sun.enterprise
.ee.cms.core,com.sun.enterprise.mgmt.transport,com.sun.enterprise.ee.
cms.impl.common,com.sun.enterprise.ee.cms.impl.client,com.sun.enterpr
ise.ee.cms.spi",
com.sun.enterprise.mgmt.transport;uses:="com.sun.ente
rprise.ee.cms.impl.base,com.sun.enterprise.ee.cms.logging,com.sun.ent
erprise.ee.cms.core,com.sun.enterprise.ee.cms.impl.common,com.sun.ent
erprise.ee.cms.spi",

com.sun.enterprise.mgmt.transport.grizzly;uses:="
com.sun.grizzly,com.sun.grizzly.util,com.sun.grizzly.connectioncache.
server,com.sun.enterprise.mgmt.transport,com.sun.grizzly.connectionca
che.spi.transport,com.sun.enterprise.mgmt,com.sun.enterprise.ee.cms.i
mpl.common,com.sun.enterprise.ee.cms.impl.base,com.sun.grizzly.connec
tioncache.client,com.sun.grizzly.async,com.sun.grizzly.filter"

Additional info is needed to continue any further investigation of this issue from group-management-service
point of view.


varunrupela added a comment - 05/Jan/11 04:26 AM

We have started a re-run of this scenario with the latest build to collect more information on issue http://java.net/jira/browse/GLASSFISH-15426. Will also check if this issue still exists


Joe Fialli added a comment - 05/Jan/11 05:11 AM

If the re-run has just restarted, it is possible to verify the Shoal GMS code path by doing the following.

Select one of the clustered instances to be killed using "kill -9".
Shoal GMS will generate the FailureSuspected notification 6 seconds later.
If this fails as reported at beginning of run, the failure is not stress related,
but would have something to do with running on Windows with Jrocket. (No GMS QE test
have been run with jrocket or on Windows at this time.)


Joe Fialli added a comment - 05/Jan/11 07:17 AM

I have verified running with jrockit-jdk1.6.0_20-R28.1.0-4.0.1 (from public website, do not know how to get the reported jrockit version of 1.6.0_22) on Windows XP 32 bit that the reported exception NoClassDefFoundError does not occur. I have attached complete server logs of the run.

We are also performing a GMS QE run with Jrockit on OEL.


Joe Fialli added a comment - 05/Jan/11 07:22 AM

DAS and instance logs running on windows xp with jrockit 1.6_0_20.
Illustrates that FailureSuspectedSignalImpl NoClassDefFound did not occur in simple
Shoal GMS run. Thus, reported failure is suspected to be stress run impact on class loading.


Joe Fialli made changes - 05/Jan/11 07:22 AM
Joe Fialli added a comment - 05/Jan/11 07:26 AM

Abbreviated server.log showing jrockit version and FailureSuspected being reported.

Jan 5, 2011 9:53:10 AM com.sun.enterprise.admin.launcher.GFLauncherLogger info
INFO: JVM invocation command line:
C:\jrockit-jdk1.6.0_20-R28.1.0-4.0.1\bin\java.exe

<deleted>
[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: myCluster : Members in view for IN_DOUBT_EVENT(before change analysis) are :
1: MemberId: instance01, MemberType: CORE, Address: 129.148.6.186:9121:228.9.1.3:2231:myCluster:instance01
2: MemberId: instance02, MemberType: CORE, Address: 129.148.6.186:9130:228.9.1.3:2231:myCluster:instance02
3: MemberId: server, MemberType: SPECTATOR, Address: 129.148.6.186:9173:228.9.1.3:2231:myCluster:server

#]

[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: IN_DOUBT_EVENT for member: instance01 of group: myCluster|#]

[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1007: Received FailureSuspectedEvent for member: instance01 of group: myCluster|#]

[#|2011-01-05T09:54:17.625-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=22;_ThreadName=Thread-3;|GMS1005: Sending FailureSuspectedSignals to registered Actions. member: instance01 ...|#]

[#|2011-01-05T09:54:21.547-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1092: GMS View Change Received for group: myCluster : Members in view for FAILURE_EVENT(before change analysis) are :
1: MemberId: instance02, MemberType: CORE, Address: 129.148.6.186:9130:228.9.1.3:2231:myCluster:instance02
2: MemberId: server, MemberType: SPECTATOR, Address: 129.148.6.186:9173:228.9.1.3:2231:myCluster:server

#]

[#|2011-01-05T09:54:21.547-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1016: Analyzing new membership snapshot received as part of event: FAILURE_EVENT for member: instance01 of group: myCluster|#]

[#|2011-01-05T09:54:21.547-0500|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-3;|GMS1019: member: instance01 of group: myCluster has failed|#]

The above sequence of log events included a NoClassDefFound for FailureSuspectedSignalImpl in this report.
So it is definitely working in non-stress case.


Joe Fialli added a comment - 05/Jan/11 10:39 AM

Unable to proceed that this is a group-management-service issue with current information.

Previous comments on this issue document my attempts to recreate the reported failure in a
simplified Shoal GMS test on windows xp platform (32 bit) using jrockit 1.6_0_20. The
simple test demonstrates that the package exports for Shoal GMS are correct and working.
The attached logs illustrate that for this simple run the class FailureSuspectedSignalImpl
was found and that NoClassDefFound did not occur. At this time, there is no evidence that
this is a a GMS issue but it looks more like a class loading issue caused by the stress test.
The issue may have been caused by previous out of memory exceptions or some other bad conditions.
But in simple terms, Shoal GMS SuspectedFailure notification has been verified with jrockit on
both windows and linux. We have successfully run all Shoal GMS QE test on OEL with jrockit 1.6_0_20
at least one run of each test. We have started 5 iterations of each scenario just to make sure.


Joe Fialli made changes - 05/Jan/11 10:39 AM
Resolution Incomplete [ 4 ]
Status Open [ 1 ] Closed [ 6 ]