[SAILFIN-1531] Instance hang during startup Created: 18/Dec/08  Updated: 13/Feb/09  Resolved: 13/Feb/09

Status: Resolved
Project: sailfin
Component/s: deployment
Affects Version/s: 1.0
Fix Version/s: milestone 1

Type: Bug Priority: Minor
Reporter: babbisx Assignee: Yamini K B
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: All


Attachments: Java Archive File 1531-fix-2.jar     Java Archive File 1531-fix.jar     Java Archive File debug-patch-jan22.jar     File FailStart15h05m.tgz     File FailStart_17h26m.tgz     Text File NA-server.log     Text File server2.log     Text File SI-server-FINEST.log     Text File SI-server.log    
Issuezilla Id: 1,531
Tags: sf10-na

 Description   

Hi,

we are running an SGCS (V05 P2 B04) cluster.

A very frequent error that we are facing is that the lifecycle module of SJSMQ
JMS Resource Adapter seems to prevent the startup of the service instance.
This should be due to a failure to create connection to localhost:37676 as seen
to log fragment :

[#|2008-12-10T16:07:59.592+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.resource.resourceadapter|_ThreadID=11;_ThreadName=main;|JMS
Service Connection URL is :mq://SC_2_2:37676/,mq://SC_2_1:37676/|#]

[#|2008-12-10T16:07:59.609+0100|INFO|sun-glassfish-comms-server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;|MQJMSRA_RA1101:
SJSMQ JMS Resource Adapter starting...|#]

[#|2008-12-10T16:07:59.656+0100|INFO|sun-glassfish-comms-server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;|MQJMSRA_LB1101:
Looking for Broker Running at:localhost:37676|#]

[#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;|[C4003]:
Error occurred on connection creation [localhost:37676]. - cause:
java.net.ConnectException: Connection refused|#]

As we tested, that port receives connections though....
Can this be a logical error or a race condition ?

BR

Babbis



 Comments   
Comment by babbisx [ 18/Dec/08 ]

Created an attachment (id=889)
Server log for the server instance

Comment by prasads [ 18/Dec/08 ]

RE-assign to Satish.

Comment by Satish Kumar [ 21/Dec/08 ]

Can you pls attach the MQ broker logs and the domain.xml for your setup? You
will also need to attach server.logs with the debug level for the JMS module
set to FINEST.

Reducing priority to P3 since it is not a stopper...

Comment by Satish Kumar [ 21/Dec/08 ]

Although there is a message in the log that seems to suggest an error in
connecting to the broker, this does not seem to be preventing the server from
starting up. Also, note the message is an 'INFO' message, not a 'SEVERE' error.

Comment by babbisx [ 02/Jan/09 ]

Hi,

This is a WARNING, not an INFO :

[#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;|[C4003]:
Error occurred on connection creation [localhost:37676]. - cause:
java.net.ConnectException: Connection refused|#]

The bad thing is that is one of our most frequent errors, that prevents our big
clusters to fully come up....
I have some trouble due to the holidays to aquirr full logs, but i will come
back to you with the logs you require asap.

BR

Babbis

Comment by prasads [ 09/Jan/09 ]

"Reclassifying as P4 because these issues are not must fix for SailFin 1.0 release.
This issue will be scrubbed after this release and will be given the right
priority for SailFin 2.0 release."

Comment by ocorbun [ 16/Jan/09 ]

I am working on the same department as Babbis.

On my system, I can see these "Connection refused" even when the server instance
is starting fine. So I am not sure this starting problem is related to that or JMS.

On the other hand, the difference between a normal start and a failing start is
that the hidden MEjbApp is correctly loaded or not. When it fails,
ejb_container_timer_app is correctly loaded but then a lot of GMS activities
start to happen. MEjbApp gets never loaded and out lifecycle module are never
kicked.

When it fails to start, our lifecycle modules are never called:
[#|2009-01-13T14:14:58.705+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=77830286-564f-4021-9b61-de37e82890d0;|EJB5108:Unable
to initialize EJB Timer Service. The likely cause is the database has not been
started or the timer database table has not been created.|#]
[#|2009-01-13T14:14:58.705+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|ThreadID=11;_ThreadName=main;_ejb_container_timer_app;|LDR5010:
All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#]
[#|2009-01-13T14:15:04.040+0100|INFO|sun-glassfish-comms-server1.5|ShoalLogger|_ThreadID=14;_ThreadName=ViewWindowThread;|GMS
View Change Received for group traffic : Members in view for (before change
analysis) are : ...

When it works, the lifecycle module are called after EJB initialization:
[#|2009-01-13T14:25:21.500+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=98b22ae2-e743-42d8-b16e-3aed6e710f95;|EJB5108:Unable
to initialize EJB Timer Service. The likely cause is the database has not been
started or the timer database table has not been created.|#]
[#|2009-01-13T14:25:21.501+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|ThreadID=11;_ThreadName=main;_ejb_container_timer_app;|LDR5010:
All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#]
[#|2009-01-13T14:25:22.205+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|_ThreadID=11;_ThreadName=main;MEjbApp;|LDR5010:
All ejb(s) of [MEjbApp] loaded successfully!|#]
[#|2009-01-13T14:25:22.212+0100|INFO|sun-glassfish-comms-server1.5|com.ericsson.mmas.lifecycle.CAF|_ThreadID=11;_ThreadName=main;|CAF:
Registering MBeanServer into JNDI... |#]

Comment by ocorbun [ 16/Jan/09 ]

Created an attachment (id=931)
Failing instance log

Comment by ocorbun [ 16/Jan/09 ]

Created an attachment (id=932)
NA log during failing instance

Comment by ocorbun [ 16/Jan/09 ]

Created an attachment (id=933)
Failing instance detailed log

Comment by ocorbun [ 16/Jan/09 ]

last log with FINEST on GMS, EJB and JMS

Comment by Satish Kumar [ 19/Jan/09 ]

The JMS Exception in the server.log is just a warning that can safely be
ignored. There is another message a few lines after this stating that the
connection to the broker has been successfully established -

[#|2009-01-16T13:11:13.832+0100|INFO|sun-glassfish-comms-
server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;

MQJMSRA_RA1101: SJSMQ JMSRA Started:LOCAL #]

However, there are quite a few config exceptions in the Node Agent logs that
need to be investigated.

Reassigning the issue to Yamini in the SailFin Admin team to investigate and
reassign to the right person.

Comment by Yamini K B [ 19/Jan/09 ]

I looked at all the logs. Following the sequence of events based on time stamps,
there appears to have been a synchronization failure while starting NA.

Time 1: Something happened while synchronizing the instance domain.xml The file
was read incorrectly (possible network error??), hence the SAX exception.

[#|2009-01-16T13:10:26.192+0100|WARNING|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=10;_ThreadName=main;|NAGT0002:An
exception has occurred during the sychronization of this node with the DAS.
com.sun.enterprise.config.ConfigException: Error refreshing
ConfigContext:/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_3/agent/config/domain.xml
cause: Failed to create the XML-DOM Document. Check your XML to make sure it is
correct.
Premature end of file.

Time 2: (probably a second attempt of start-node-agent)

Jan 16, 2009 1:10:42 PM com.sun.enterprise.admin.servermgmt.launch.ASLauncher
buildCommand
INFO:

Time 3: Synchronization succeeds and server instance start is initiated by NA

[#|2009-01-16T13:11:02.587+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP Connection(11)-192.168.0.254;traffic_instance_PL_2_3;|NodeAgent
synchronizing for instance|#]

[#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP
Connection(11)-192.168.0.254;traffic_instance_PL_2_3;SUCCEEDED!;1570;|NAGT0039:Synchronization
for traffic_instance_PL_2_3 - SUCCEEDED! and took 1570 milliseconds|#]

[#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP Connection(11)-192.168.0.254;|NA-DEBUG: Release Lock: traffic_instance_PL_2_3|#]

[#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP Connection(11)-192.168.0.254;|NA-DEBUG: Getting ready to have ProcessManger
start the instance: traffic_instance_PL_2_3|#]

However, there are some SEVERE messages from tomcat:

[#|2009-01-16T16:22:41.729+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI
TCP
Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983:
Coyote connector has not been started|#]

[#|2009-01-16T16:22:41.898+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=19;_ThreadName=RMI
TCP Connection(7)-192.168.0.254;|Sailfin_Clb_pipeline_created|#]

[#|2009-01-16T16:22:41.952+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.web|_ThreadID=19;_ThreadName=RMI
TCP Connection(7)-192.168.0.254;38080;|WEB0712: Starting
Sun-Java-System/Application-Server HTTP/1.1 on 38080|#]

[#|2009-01-16T16:22:42.475+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI
TCP
Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983:
Coyote connector has not been started|#]

Since there are no other exceptions/messages from tomcat, unable to guess what
could have gone wrong. Cc'ing Jan Luehe for his comments.

ocorbun, are the instance logs complete?

Jan, wouldn't the coyote connector error initiate a server shutdown (no server
shutdown messages seen)?

Comment by ocorbun [ 19/Jan/09 ]

This cluster is run on top of SAF. Since Node Agents and Server Instances are
started by the Availability Management Framework (AMF), if a server fails to
start or takes too long time (not answering first AMF healthcheck), the log is
saved (attached in this issue), the instance is killed and deleted. A new server
instance is created via asadmin and started.
This is maybe why you do not see any shutdown message in the attached log.
Otherwise, yes it is a complete server instance log from start to hanging start.

Olivier
ERICSSON ///

Comment by Yamini K B [ 19/Jan/09 ]

Thanks for the input Olivier. Some more questions:
1. Do you see the NA startup error (synchronization error) often?
2. How long does AMF wait before deciding that server startup failed?

Comment by ocorbun [ 19/Jan/09 ]

> 1. Do you see the NA startup error (synchronization error) often?
Not really. I will check if I can see it when it starts properly or if it occurs
every time it fails. I need to do more testing.

> 2. How long does AMF wait before deciding that server startup failed?
AMF waits 15 minutes in total including NA start/sync and SI start. We have a
lifecycle module in the SI that can answer AMF healthcheck. But when it fails
the lifecycle module never get the init event.

Comment by ocorbun [ 20/Jan/09 ]

> > 1. Do you see the NA startup error (synchronization error) often?
> Not really. I will check if I can see it when it starts properly
> or if it occurs every time it fails. I need to do more testing.

Yes the synchronization problem can occur even when the instance starts
normally. But it is rare. And we can also see that in case the instance fails to
start within 15 minutes, this problem is also rarely seen. So you can rule it
out for this particular issue.

Comment by jluehe [ 21/Jan/09 ]

This error that was seen in the log:

SEVERE|sun-glassfish-comms-server1.5|
org.apache.coyote.tomcat5.CoyoteConnector|
...|PWC3983: Coyote connector has not been started|#]

is printed by this code:

public void stop() throws LifecycleException {

// Validate and update our current state
if (!started)

{ log.error(sm.getString("coyoteConnector.notStarted")); return; }

inside

appserv-webtier/src/java/org/apache/coyote/tomcat5/CoyoteConnector#stop

For some reason, the HTTP listener is being asked to stop, before it has even
been started!

This normally indicates that an error has occurred during the instance startup,
which causes all containers (including the web container) to shut down. When the
web container shuts down, it shuts down all its HTTP listeners.

Would it be possible to prepare a patch that would dump the thread stack after
printing the error message, and before returning, like this:

public void stop() throws LifecycleException {

// Validate and update our current state
if (!started)

{ log.error(sm.getString("coyoteConnector.notStarted")); Thread.currentThread().dumpStack(); // <---- NEW return; }

That would tell us what's triggering the stop().

Comment by Yamini K B [ 21/Jan/09 ]

Thanks for your comments Jan, I think its the SAF notification thats initiating
the web container shutdown.

Olivier, I'm attaching a debug patch for CoyoteConnector, would appreciate if
you can try it out. We can confirm the reason for the SEVERE message. Please
include the jar in the classpath-prefix of the cluster config and ensure that
the jar is accessible by the instances.

Found something interesting in the log messages:

[#|2009-01-16T*13:11:21.552+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=17;_ThreadName=pool-2-thread-2;traffic;|CLBC0129:
CLB and GMS view for cluster traffic are in sync.|#]

[#|2009-01-16T*13:24:35.110+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=18;_ThreadName=RMI
TCP Connection(7)-192.168.0.254;server;|Start: Synchronizing CLB XML file from
DAS. DAS NAME: server|#]

Notice the time difference (13min) between the above messages. The second
instane log also shows similar difference.

Olivier, can you please turn on FINEST mode for clb and resend the instance log?

Comment by Yamini K B [ 21/Jan/09 ]

Created an attachment (id=942)
CoyoteConnector debug patch

Comment by Yamini K B [ 21/Jan/09 ]

One more suggestion, would it be possible to take stack dump (using jstack) on
the failing instance during the time gap?

Comment by Yamini K B [ 22/Jan/09 ]

I happened to notice that JVM is being tuned as follows:

-server
-XX:+UnlockDiagnosticVMOptions
-XX:MaxPermSize=192m
-Xmx512m
-XX:NewRatio=2
-Xms3000m
-Xmx3000m
-Xss128k
-XX:SurvivorRatio=128
-XX:+DisableExplicitGC
-XX:PermSize=64m
-XX:MaxPermSize=128m
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:LargePageSizeInBytes=2m
-XX:ParallelGCThreads=7
-Xmn128m
-XX:MaxTenuringThreshold=8
-XX:CMSInitiatingOccupancyFraction=25
-XX:+LogVMOutput
-XX:LogFile=/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_4/traffic_instance_PL_2_4/logs/jvm.log

Some thoughts:
1. Please remove the duplicate settings (-Xmx, MaxPermSize)
2. Generally young generation size (-Xmn) is tuned to 1/4 of total heap.
3. Can you try using the -verbosegc -XX:+PrintGCDetails to see if VM is causing
any pauses during the startup?

Comment by ocorbun [ 22/Jan/09 ]

Created an attachment (id=944)
Logs with patch prefixed

Comment by ocorbun [ 22/Jan/09 ]

Last log attached.

This includes:

  • patch prefixed (Coyote stack visible)
  • added VM args: -verbosegc -XX:+PrintGCDetails
  • removed duplicated VM args
  • FINEST on clb and ejb

jstack did not work on the hanging start process.

Comment by Yamini K B [ 23/Jan/09 ]

Changed the summary.

Analyzing the stack trace, there appears to be a deadlock happening in ORB's
FOLB code:

"pool-6-thread-2" prio=10 tid=0x00002aab799fd000 nid=0xb5a in
Object.wait() [0x0000000042524000..0x0000000042524b20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)

  • waiting on <0x00002aaab92081b0> (a
    com.sun.corba.ee.impl.folb.ServerGroupManager$1)
    at java.lang.Thread.join(Thread.java:1143)
  • locked <0x00002aaab92081b0> (a
    com.sun.corba.ee.impl.folb.ServerGroupManager$1)
    at java.lang.Thread.join(Thread.java:1196)
    at
    com.sun.corba.ee.impl.folb.ServerGroupManager.restartFactories(ServerGroupManager.java:494)
    at
    com.sun.corba.ee.impl.folb.ServerGroupManager.membershipChange(ServerGroupManager.java:391)
    at
    com.sun.corba.ee.spi.folb.GroupInfoServiceBase.notifyObservers(GroupInfoServiceBase.java:76)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.addMember(IiopFolbGmsClient.java:396)
  • locked <0x00002aaab90be688> (a java.util.HashMap)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.handleSignal(IiopFolbGmsClient.java:310)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.consumeSignal(IiopFolbGmsClient.java:178)
    at
    com.sun.enterprise.ee.cms.impl.common.Router$CallableAction.call(Router.java:509)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)

This issue will need to be assigned to the right owner.

Comment by Yamini K B [ 23/Jan/09 ]

The main thread which is waiting for the lock:

"main" prio=10 tid=0x0000000040118000 nid=0xa8b waiting for monitor
entry [0x0000000040027000..0x0000000040028f60]
java.lang.Thread.State: BLOCKED (on object monitor)
at
com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient$GIS.getClusterInstanceInfo(IiopFolbGmsClient.java:215)

  • waiting to lock <0x00002aaab90be688> (a java.util.HashMap)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.getClusterInstanceInfo(IiopFolbGmsClient.java:276)
    at
    com.sun.corba.ee.impl.folb.ServerGroupManager.establish_components(ServerGroupManager.java:242)
    at
    com.sun.corba.ee.impl.interceptors.InterceptorInvoker.objectAdapterCreated(InterceptorInvoker.java:145)
    at
    com.sun.corba.ee.impl.interceptors.PIHandlerImpl.objectAdapterCreated(PIHandlerImpl.java:297)
    at
    com.sun.corba.ee.spi.oa.ObjectAdapterBase.initializeTemplate(ObjectAdapterBase.java:139)
    at com.sun.corba.ee.impl.oa.poa.POAImpl.initialize(POAImpl.java:415)
    at com.sun.corba.ee.impl.oa.poa.POAImpl.create_POA(POAImpl.java:815)
    at
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl.unknown_adapter(ReferenceFactoryManagerImpl.java:132)
    at com.sun.corba.ee.impl.oa.poa.POAImpl.find_POA(POAImpl.java:928)
  • locked <0x00002aaab90abb58> (a
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl)
    at
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl.createReference(ReferenceFactoryManagerImpl.java:483)
    at
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryImpl.createReference(ReferenceFactoryImpl.java:58)
    at
    com.sun.enterprise.iiop.POARemoteReferenceFactory._createRef(POARemoteReferenceFactory.java:412)
    at
    com.sun.enterprise.iiop.POARemoteReferenceFactory.createRef(POARemoteReferenceFactory.java:383)
    at
    com.sun.enterprise.iiop.POARemoteReferenceFactory.createHomeReference(POARemoteReferenceFactory.java:374)
    at com.sun.ejb.containers.BaseContainer.initializeHome(BaseContainer.java:933)
    at
    com.sun.ejb.containers.StatelessSessionContainer.initializeHome(StatelessSessionContainer.java:232)
    at
    com.sun.ejb.containers.ContainerFactoryImpl.createContainer(ContainerFactoryImpl.java:345)
    at com.sun.enterprise.server.AbstractLoader.loadEjbs(AbstractLoader.java:527)
    at com.sun.enterprise.server.ApplicationLoader.doLoad(ApplicationLoader.java:191)
    at
    com.sun.enterprise.server.TomcatApplicationLoader.doLoad(TomcatApplicationLoader.java:126)
    at
    com.sun.enterprise.server.ExtendedApplicationLoader.doLoad(ExtendedApplicationLoader.java:134)
    at com.sun.enterprise.server.AbstractLoader.load(AbstractLoader.java:238)
    at
    com.sun.enterprise.server.AbstractManager.loadOneSystemApp(AbstractManager.java:407)
    at
    com.sun.enterprise.server.AbstractManager$SystemAppStarter.doRun(AbstractManager.java:664)
    at com.sun.appserv.management.util.misc.RunnableBase.runSync(RunnableBase.java:304)
    at com.sun.appserv.management.util.misc.RunnableBase._submit(RunnableBase.java:176)
    at com.sun.appserv.management.util.misc.RunnableBase.submit(RunnableBase.java:210)
    at com.sun.enterprise.server.AbstractManager.loadSystem(AbstractManager.java:331)
    at
    com.sun.enterprise.server.SystemAppLifecycle.loadSystemApps(SystemAppLifecycle.java:162)
    at
    com.sun.enterprise.server.SystemAppLifecycle.onStartup(SystemAppLifecycle.java:108)
    at
    com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:446)
    at
    com.sun.enterprise.server.ondemand.OnDemandServer.onStartup(OnDemandServer.java:134)
    at com.sun.enterprise.server.PEMain.run(PEMain.java:409)
    at com.sun.enterprise.server.PEMain.main(PEMain.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.sun.enterprise.server.PELaunch.main(PELaunch.java:415)
Comment by skgaju [ 24/Jan/09 ]

adding myself to cc list

Comment by Yamini K B [ 27/Jan/09 ]

Attaching the fix for hang, please include it in classpath-prefix of cluster
config and ensure that the jar is accessible by all instances.

Comment by Yamini K B [ 27/Jan/09 ]

Created an attachment (id=947)
Fix for hang

Comment by ocorbun [ 27/Jan/09 ]

Created an attachment (id=948)
It fails with the fix as well. Coyote error not in this log but we have seen it even with the fix. Here we can see again that MEjbApp is not loaded.

Comment by Yamini K B [ 27/Jan/09 ]

Olivier,

I'm sorry, I did a mistake in packaging the patch (though I did have the proper
patch in my workspace while running sanity tests). Uploading the proper patch,
please try with this.

Comment by Yamini K B [ 27/Jan/09 ]

Created an attachment (id=949)
Re-packaged the fix

Comment by ocorbun [ 27/Jan/09 ]

With the patch, I was not able to reproduce the problem on a 4 blades system.
We will try tomorrow on a 20 blades system.

Olivier
ERICSSON ///

Comment by Joe Fialli [ 28/Jan/09 ]

Adding myself to CC list as someone who could investigate the IiopFolbGmsClient
jstack stack traces attached to this issue. I would be interested in full jstack
output. Currently, only a summary is available in descriptions of this issue.

Comment by ocorbun [ 28/Jan/09 ]

After testing one day on a 20 blades system (plus 2 other smaller systems), we
can say that the problem is gone with this fix. Thanks!

Please, explain what this fix does.

Comment by Yamini K B [ 28/Jan/09 ]

The hang is due to a synchronization problem in IIopFolbGmsClient code. It is
holding a lock on currentMembers when adding a member and calls
notifyObservers() which lands in waiting for thread join to avoid another
deadlock. At the same time another request thread is calling
getClusterInstanceInfo() and waiting for the currentMembers lock. The fix is to
move the notifyObservers() call out of the synchronized block in both
addMember() and removeMember().

Comment by Yamini K B [ 28/Jan/09 ]

Fix checked in on FCS branch:

Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java;
/cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v
<-- IiopFolbGmsClient.java
new revision: 1.2.6.3.2.1; previous revision: 1.2.6.3
done

Comment by Yamini K B [ 28/Jan/09 ]

Fix checked in on truck:

Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java;
/cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v
<-- IiopFolbGmsClient.java
new revision: 1.2.6.4; previous revision: 1.2.6.3
done

Comment by adrian77 [ 13/Feb/09 ]

Fix works.

Generated at Sat Mar 28 13:02:00 UTC 2015 using JIRA 6.2.3#6260-sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.