Issue Details (XML | Word | Printable)

Key: SAILFIN-1531
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Minor Minor
Assignee: Yamini K B
Reporter: babbisx
Votes: 0
Watchers: 5
Operations

If you were logged in you would be able to see more operations.
sailfin

Instance hang during startup

Created: 18/Dec/08 04:13 PM   Updated: 13/Feb/09 05:19 AM   Resolved: 13/Feb/09 05:19 AM
Component/s: deployment
Affects Version/s: 1.0
Fix Version/s: milestone 1

Time Tracking:
Not Specified

File Attachments: 1. Java Archive File 1531-fix-2.jar (8 kB) 27/Jan/09 07:06 AM - Yamini K B
2. Java Archive File 1531-fix.jar (8 kB) 27/Jan/09 01:54 AM - Yamini K B
3. Java Archive File debug-patch-jan22.jar (11 kB) 21/Jan/09 08:11 PM - Yamini K B
4. File FailStart15h05m.tgz (290 kB) 27/Jan/09 06:23 AM - ocorbun
5. File FailStart_17h26m.tgz (158 kB) 22/Jan/09 08:54 AM - ocorbun
6. Text File NA-server.log (19 kB) 16/Jan/09 06:51 AM - ocorbun
7. Text File server2.log (113 kB) 18/Dec/08 04:14 PM - babbisx
8. Text File SI-server-FINEST.log (155 kB) 16/Jan/09 07:39 AM - ocorbun
9. Text File SI-server.log (42 kB) 16/Jan/09 06:50 AM - ocorbun

Environment:

Operating System: Linux
Platform: All


Issuezilla Id: 1,531
Tags: sf10-na
Participants: adrian77, babbisx, jluehe, Joe Fialli, ocorbun, prasads, Satish Kumar, skgaju and Yamini K B


 Description  « Hide

Hi,

we are running an SGCS (V05 P2 B04) cluster.

A very frequent error that we are facing is that the lifecycle module of SJSMQ
JMS Resource Adapter seems to prevent the startup of the service instance.
This should be due to a failure to create connection to localhost:37676 as seen
to log fragment :

[#|2008-12-10T16:07:59.592+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.resource.resourceadapter|_ThreadID=11;_ThreadName=main;|JMS
Service Connection URL is :mq://SC_2_2:37676/,mq://SC_2_1:37676/|#]

[#|2008-12-10T16:07:59.609+0100|INFO|sun-glassfish-comms-server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;|MQJMSRA_RA1101:
SJSMQ JMS Resource Adapter starting...|#]

[#|2008-12-10T16:07:59.656+0100|INFO|sun-glassfish-comms-server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;|MQJMSRA_LB1101:
Looking for Broker Running at:localhost:37676|#]

[#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;|[C4003]:
Error occurred on connection creation [localhost:37676]. - cause:
java.net.ConnectException: Connection refused|#]

As we tested, that port receives connections though....
Can this be a logical error or a race condition ?

BR

Babbis



babbisx added a comment - 18/Dec/08 04:14 PM

Created an attachment (id=889)
Server log for the server instance


prasads added a comment - 18/Dec/08 09:03 PM

RE-assign to Satish.


Satish Kumar added a comment - 21/Dec/08 09:28 AM

Can you pls attach the MQ broker logs and the domain.xml for your setup? You
will also need to attach server.logs with the debug level for the JMS module
set to FINEST.

Reducing priority to P3 since it is not a stopper...


Satish Kumar added a comment - 21/Dec/08 09:36 AM

Although there is a message in the log that seems to suggest an error in
connecting to the broker, this does not seem to be preventing the server from
starting up. Also, note the message is an 'INFO' message, not a 'SEVERE' error.


babbisx added a comment - 02/Jan/09 08:01 AM

Hi,

This is a WARNING, not an INFO :

[#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;|[C4003]:
Error occurred on connection creation [localhost:37676]. - cause:
java.net.ConnectException: Connection refused|#]

The bad thing is that is one of our most frequent errors, that prevents our big
clusters to fully come up....
I have some trouble due to the holidays to aquirr full logs, but i will come
back to you with the logs you require asap.

BR

Babbis


prasads added a comment - 09/Jan/09 12:54 AM

"Reclassifying as P4 because these issues are not must fix for SailFin 1.0 release.
This issue will be scrubbed after this release and will be given the right
priority for SailFin 2.0 release."


ocorbun added a comment - 16/Jan/09 06:23 AM

I am working on the same department as Babbis.

On my system, I can see these "Connection refused" even when the server instance
is starting fine. So I am not sure this starting problem is related to that or JMS.

On the other hand, the difference between a normal start and a failing start is
that the hidden MEjbApp is correctly loaded or not. When it fails,
ejb_container_timer_app is correctly loaded but then a lot of GMS activities
start to happen. MEjbApp gets never loaded and out lifecycle module are never
kicked.

When it fails to start, our lifecycle modules are never called:
[#|2009-01-13T14:14:58.705+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=77830286-564f-4021-9b61-de37e82890d0;|EJB5108:Unable
to initialize EJB Timer Service. The likely cause is the database has not been
started or the timer database table has not been created.|#]
[#|2009-01-13T14:14:58.705+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|ThreadID=11;_ThreadName=main;_ejb_container_timer_app;|LDR5010:
All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#]
[#|2009-01-13T14:15:04.040+0100|INFO|sun-glassfish-comms-server1.5|ShoalLogger|_ThreadID=14;_ThreadName=ViewWindowThread;|GMS
View Change Received for group traffic : Members in view for (before change
analysis) are : ...

When it works, the lifecycle module are called after EJB initialization:
[#|2009-01-13T14:25:21.500+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=98b22ae2-e743-42d8-b16e-3aed6e710f95;|EJB5108:Unable
to initialize EJB Timer Service. The likely cause is the database has not been
started or the timer database table has not been created.|#]
[#|2009-01-13T14:25:21.501+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|ThreadID=11;_ThreadName=main;_ejb_container_timer_app;|LDR5010:
All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#]
[#|2009-01-13T14:25:22.205+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|_ThreadID=11;_ThreadName=main;MEjbApp;|LDR5010:
All ejb(s) of [MEjbApp] loaded successfully!|#]
[#|2009-01-13T14:25:22.212+0100|INFO|sun-glassfish-comms-server1.5|com.ericsson.mmas.lifecycle.CAF|_ThreadID=11;_ThreadName=main;|CAF:
Registering MBeanServer into JNDI... |#]


ocorbun added a comment - 16/Jan/09 06:50 AM

Created an attachment (id=931)
Failing instance log


ocorbun added a comment - 16/Jan/09 06:51 AM

Created an attachment (id=932)
NA log during failing instance


ocorbun added a comment - 16/Jan/09 07:39 AM

Created an attachment (id=933)
Failing instance detailed log


ocorbun added a comment - 16/Jan/09 07:40 AM

last log with FINEST on GMS, EJB and JMS


Satish Kumar added a comment - 19/Jan/09 02:13 AM

The JMS Exception in the server.log is just a warning that can safely be
ignored. There is another message a few lines after this stating that the
connection to the broker has been successfully established -

[#|2009-01-16T13:11:13.832+0100|INFO|sun-glassfish-comms-
server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;

MQJMSRA_RA1101: SJSMQ JMSRA Started:LOCAL #]

However, there are quite a few config exceptions in the Node Agent logs that
need to be investigated.

Reassigning the issue to Yamini in the SailFin Admin team to investigate and
reassign to the right person.


Yamini K B added a comment - 19/Jan/09 03:41 AM

I looked at all the logs. Following the sequence of events based on time stamps,
there appears to have been a synchronization failure while starting NA.

Time 1: Something happened while synchronizing the instance domain.xml The file
was read incorrectly (possible network error??), hence the SAX exception.

[#|2009-01-16T13:10:26.192+0100|WARNING|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=10;_ThreadName=main;|NAGT0002:An
exception has occurred during the sychronization of this node with the DAS.
com.sun.enterprise.config.ConfigException: Error refreshing
ConfigContext:/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_3/agent/config/domain.xml
cause: Failed to create the XML-DOM Document. Check your XML to make sure it is
correct.
Premature end of file.

Time 2: (probably a second attempt of start-node-agent)

Jan 16, 2009 1:10:42 PM com.sun.enterprise.admin.servermgmt.launch.ASLauncher
buildCommand
INFO:

Time 3: Synchronization succeeds and server instance start is initiated by NA

[#|2009-01-16T13:11:02.587+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP Connection(11)-192.168.0.254;traffic_instance_PL_2_3;|NodeAgent
synchronizing for instance|#]

[#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP
Connection(11)-192.168.0.254;traffic_instance_PL_2_3;SUCCEEDED!;1570;|NAGT0039:Synchronization
for traffic_instance_PL_2_3 - SUCCEEDED! and took 1570 milliseconds|#]

[#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP Connection(11)-192.168.0.254;|NA-DEBUG: Release Lock: traffic_instance_PL_2_3|#]

[#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
TCP Connection(11)-192.168.0.254;|NA-DEBUG: Getting ready to have ProcessManger
start the instance: traffic_instance_PL_2_3|#]

However, there are some SEVERE messages from tomcat:

[#|2009-01-16T16:22:41.729+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI
TCP
Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983:
Coyote connector has not been started|#]

[#|2009-01-16T16:22:41.898+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=19;_ThreadName=RMI
TCP Connection(7)-192.168.0.254;|Sailfin_Clb_pipeline_created|#]

[#|2009-01-16T16:22:41.952+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.web|_ThreadID=19;_ThreadName=RMI
TCP Connection(7)-192.168.0.254;38080;|WEB0712: Starting
Sun-Java-System/Application-Server HTTP/1.1 on 38080|#]

[#|2009-01-16T16:22:42.475+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI
TCP
Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983:
Coyote connector has not been started|#]

Since there are no other exceptions/messages from tomcat, unable to guess what
could have gone wrong. Cc'ing Jan Luehe for his comments.

ocorbun, are the instance logs complete?

Jan, wouldn't the coyote connector error initiate a server shutdown (no server
shutdown messages seen)?


ocorbun added a comment - 19/Jan/09 04:12 AM

This cluster is run on top of SAF. Since Node Agents and Server Instances are
started by the Availability Management Framework (AMF), if a server fails to
start or takes too long time (not answering first AMF healthcheck), the log is
saved (attached in this issue), the instance is killed and deleted. A new server
instance is created via asadmin and started.
This is maybe why you do not see any shutdown message in the attached log.
Otherwise, yes it is a complete server instance log from start to hanging start.

Olivier
ERICSSON ///


Yamini K B added a comment - 19/Jan/09 06:30 AM

Thanks for the input Olivier. Some more questions:
1. Do you see the NA startup error (synchronization error) often?
2. How long does AMF wait before deciding that server startup failed?


ocorbun added a comment - 19/Jan/09 07:53 AM

> 1. Do you see the NA startup error (synchronization error) often?
Not really. I will check if I can see it when it starts properly or if it occurs
every time it fails. I need to do more testing.

> 2. How long does AMF wait before deciding that server startup failed?
AMF waits 15 minutes in total including NA start/sync and SI start. We have a
lifecycle module in the SI that can answer AMF healthcheck. But when it fails
the lifecycle module never get the init event.


ocorbun added a comment - 20/Jan/09 07:59 AM

> > 1. Do you see the NA startup error (synchronization error) often?
> Not really. I will check if I can see it when it starts properly
> or if it occurs every time it fails. I need to do more testing.

Yes the synchronization problem can occur even when the instance starts
normally. But it is rare. And we can also see that in case the instance fails to
start within 15 minutes, this problem is also rarely seen. So you can rule it
out for this particular issue.


jluehe added a comment - 21/Jan/09 09:25 AM

This error that was seen in the log:

SEVERE|sun-glassfish-comms-server1.5|
org.apache.coyote.tomcat5.CoyoteConnector|
...|PWC3983: Coyote connector has not been started|#]

is printed by this code:

public void stop() throws LifecycleException {

// Validate and update our current state
if (!started) { log.error(sm.getString("coyoteConnector.notStarted")); return; }

inside

appserv-webtier/src/java/org/apache/coyote/tomcat5/CoyoteConnector#stop

For some reason, the HTTP listener is being asked to stop, before it has even
been started!

This normally indicates that an error has occurred during the instance startup,
which causes all containers (including the web container) to shut down. When the
web container shuts down, it shuts down all its HTTP listeners.

Would it be possible to prepare a patch that would dump the thread stack after
printing the error message, and before returning, like this:

public void stop() throws LifecycleException {

// Validate and update our current state
if (!started) { log.error(sm.getString("coyoteConnector.notStarted")); Thread.currentThread().dumpStack(); // <---- NEW return; }

That would tell us what's triggering the stop().


Yamini K B added a comment - 21/Jan/09 08:10 PM

Thanks for your comments Jan, I think its the SAF notification thats initiating
the web container shutdown.

Olivier, I'm attaching a debug patch for CoyoteConnector, would appreciate if
you can try it out. We can confirm the reason for the SEVERE message. Please
include the jar in the classpath-prefix of the cluster config and ensure that
the jar is accessible by the instances.

Found something interesting in the log messages:

[#|2009-01-16T*13:11:21.552+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=17;_ThreadName=pool-2-thread-2;traffic;|CLBC0129:
CLB and GMS view for cluster traffic are in sync.|#]

[#|2009-01-16T*13:24:35.110+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=18;_ThreadName=RMI
TCP Connection(7)-192.168.0.254;server;|Start: Synchronizing CLB XML file from
DAS. DAS NAME: server|#]

Notice the time difference (13min) between the above messages. The second
instane log also shows similar difference.

Olivier, can you please turn on FINEST mode for clb and resend the instance log?


Yamini K B added a comment - 21/Jan/09 08:11 PM

Created an attachment (id=942)
CoyoteConnector debug patch


Yamini K B added a comment - 21/Jan/09 11:24 PM

One more suggestion, would it be possible to take stack dump (using jstack) on
the failing instance during the time gap?


Yamini K B added a comment - 22/Jan/09 12:51 AM

I happened to notice that JVM is being tuned as follows:

-server
-XX:+UnlockDiagnosticVMOptions
-XX:MaxPermSize=192m
-Xmx512m
-XX:NewRatio=2
-Xms3000m
-Xmx3000m
-Xss128k
-XX:SurvivorRatio=128
-XX:+DisableExplicitGC
-XX:PermSize=64m
-XX:MaxPermSize=128m
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:LargePageSizeInBytes=2m
-XX:ParallelGCThreads=7
-Xmn128m
-XX:MaxTenuringThreshold=8
-XX:CMSInitiatingOccupancyFraction=25
-XX:+LogVMOutput
-XX:LogFile=/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_4/traffic_instance_PL_2_4/logs/jvm.log

Some thoughts:
1. Please remove the duplicate settings (-Xmx, MaxPermSize)
2. Generally young generation size (-Xmn) is tuned to 1/4 of total heap.
3. Can you try using the -verbosegc -XX:+PrintGCDetails to see if VM is causing
any pauses during the startup?


ocorbun added a comment - 22/Jan/09 08:54 AM

Created an attachment (id=944)
Logs with patch prefixed


ocorbun added a comment - 22/Jan/09 08:58 AM

Last log attached.

This includes:

  • patch prefixed (Coyote stack visible)
  • added VM args: -verbosegc -XX:+PrintGCDetails
  • removed duplicated VM args
  • FINEST on clb and ejb

jstack did not work on the hanging start process.


Yamini K B added a comment - 23/Jan/09 12:05 AM

Changed the summary.

Analyzing the stack trace, there appears to be a deadlock happening in ORB's
FOLB code:

"pool-6-thread-2" prio=10 tid=0x00002aab799fd000 nid=0xb5a in
Object.wait() [0x0000000042524000..0x0000000042524b20]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)

  • waiting on <0x00002aaab92081b0> (a
    com.sun.corba.ee.impl.folb.ServerGroupManager$1)
    at java.lang.Thread.join(Thread.java:1143)
  • locked <0x00002aaab92081b0> (a
    com.sun.corba.ee.impl.folb.ServerGroupManager$1)
    at java.lang.Thread.join(Thread.java:1196)
    at
    com.sun.corba.ee.impl.folb.ServerGroupManager.restartFactories(ServerGroupManager.java:494)
    at
    com.sun.corba.ee.impl.folb.ServerGroupManager.membershipChange(ServerGroupManager.java:391)
    at
    com.sun.corba.ee.spi.folb.GroupInfoServiceBase.notifyObservers(GroupInfoServiceBase.java:76)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.addMember(IiopFolbGmsClient.java:396)
  • locked <0x00002aaab90be688> (a java.util.HashMap)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.handleSignal(IiopFolbGmsClient.java:310)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.consumeSignal(IiopFolbGmsClient.java:178)
    at
    com.sun.enterprise.ee.cms.impl.common.Router$CallableAction.call(Router.java:509)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
    at java.lang.Thread.run(Thread.java:619)

This issue will need to be assigned to the right owner.


Yamini K B added a comment - 23/Jan/09 12:33 AM

The main thread which is waiting for the lock:

"main" prio=10 tid=0x0000000040118000 nid=0xa8b waiting for monitor
entry [0x0000000040027000..0x0000000040028f60]
java.lang.Thread.State: BLOCKED (on object monitor)
at
com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient$GIS.getClusterInstanceInfo(IiopFolbGmsClient.java:215)

  • waiting to lock <0x00002aaab90be688> (a java.util.HashMap)
    at
    com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.getClusterInstanceInfo(IiopFolbGmsClient.java:276)
    at
    com.sun.corba.ee.impl.folb.ServerGroupManager.establish_components(ServerGroupManager.java:242)
    at
    com.sun.corba.ee.impl.interceptors.InterceptorInvoker.objectAdapterCreated(InterceptorInvoker.java:145)
    at
    com.sun.corba.ee.impl.interceptors.PIHandlerImpl.objectAdapterCreated(PIHandlerImpl.java:297)
    at
    com.sun.corba.ee.spi.oa.ObjectAdapterBase.initializeTemplate(ObjectAdapterBase.java:139)
    at com.sun.corba.ee.impl.oa.poa.POAImpl.initialize(POAImpl.java:415)
    at com.sun.corba.ee.impl.oa.poa.POAImpl.create_POA(POAImpl.java:815)
    at
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl.unknown_adapter(ReferenceFactoryManagerImpl.java:132)
    at com.sun.corba.ee.impl.oa.poa.POAImpl.find_POA(POAImpl.java:928)
  • locked <0x00002aaab90abb58> (a
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl)
    at
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl.createReference(ReferenceFactoryManagerImpl.java:483)
    at
    com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryImpl.createReference(ReferenceFactoryImpl.java:58)
    at
    com.sun.enterprise.iiop.POARemoteReferenceFactory._createRef(POARemoteReferenceFactory.java:412)
    at
    com.sun.enterprise.iiop.POARemoteReferenceFactory.createRef(POARemoteReferenceFactory.java:383)
    at
    com.sun.enterprise.iiop.POARemoteReferenceFactory.createHomeReference(POARemoteReferenceFactory.java:374)
    at com.sun.ejb.containers.BaseContainer.initializeHome(BaseContainer.java:933)
    at
    com.sun.ejb.containers.StatelessSessionContainer.initializeHome(StatelessSessionContainer.java:232)
    at
    com.sun.ejb.containers.ContainerFactoryImpl.createContainer(ContainerFactoryImpl.java:345)
    at com.sun.enterprise.server.AbstractLoader.loadEjbs(AbstractLoader.java:527)
    at com.sun.enterprise.server.ApplicationLoader.doLoad(ApplicationLoader.java:191)
    at
    com.sun.enterprise.server.TomcatApplicationLoader.doLoad(TomcatApplicationLoader.java:126)
    at
    com.sun.enterprise.server.ExtendedApplicationLoader.doLoad(ExtendedApplicationLoader.java:134)
    at com.sun.enterprise.server.AbstractLoader.load(AbstractLoader.java:238)
    at
    com.sun.enterprise.server.AbstractManager.loadOneSystemApp(AbstractManager.java:407)
    at
    com.sun.enterprise.server.AbstractManager$SystemAppStarter.doRun(AbstractManager.java:664)
    at com.sun.appserv.management.util.misc.RunnableBase.runSync(RunnableBase.java:304)
    at com.sun.appserv.management.util.misc.RunnableBase._submit(RunnableBase.java:176)
    at com.sun.appserv.management.util.misc.RunnableBase.submit(RunnableBase.java:210)
    at com.sun.enterprise.server.AbstractManager.loadSystem(AbstractManager.java:331)
    at
    com.sun.enterprise.server.SystemAppLifecycle.loadSystemApps(SystemAppLifecycle.java:162)
    at
    com.sun.enterprise.server.SystemAppLifecycle.onStartup(SystemAppLifecycle.java:108)
    at
    com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:446)
    at
    com.sun.enterprise.server.ondemand.OnDemandServer.onStartup(OnDemandServer.java:134)
    at com.sun.enterprise.server.PEMain.run(PEMain.java:409)
    at com.sun.enterprise.server.PEMain.main(PEMain.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.sun.enterprise.server.PELaunch.main(PELaunch.java:415)

skgaju added a comment - 24/Jan/09 06:30 AM

adding myself to cc list


Yamini K B added a comment - 27/Jan/09 01:53 AM

Attaching the fix for hang, please include it in classpath-prefix of cluster
config and ensure that the jar is accessible by all instances.


Yamini K B added a comment - 27/Jan/09 01:54 AM

Created an attachment (id=947)
Fix for hang


ocorbun added a comment - 27/Jan/09 06:23 AM

Created an attachment (id=948)
It fails with the fix as well. Coyote error not in this log but we have seen it even with the fix. Here we can see again that MEjbApp is not loaded.


Yamini K B added a comment - 27/Jan/09 07:04 AM

Olivier,

I'm sorry, I did a mistake in packaging the patch (though I did have the proper
patch in my workspace while running sanity tests). Uploading the proper patch,
please try with this.


Yamini K B added a comment - 27/Jan/09 07:06 AM

Created an attachment (id=949)
Re-packaged the fix


ocorbun added a comment - 27/Jan/09 08:30 AM

With the patch, I was not able to reproduce the problem on a 4 blades system.
We will try tomorrow on a 20 blades system.

Olivier
ERICSSON ///


Joe Fialli added a comment - 28/Jan/09 06:34 AM

Adding myself to CC list as someone who could investigate the IiopFolbGmsClient
jstack stack traces attached to this issue. I would be interested in full jstack
output. Currently, only a summary is available in descriptions of this issue.


ocorbun added a comment - 28/Jan/09 06:52 AM

After testing one day on a 20 blades system (plus 2 other smaller systems), we
can say that the problem is gone with this fix. Thanks!

Please, explain what this fix does.


Yamini K B added a comment - 28/Jan/09 08:17 PM

The hang is due to a synchronization problem in IIopFolbGmsClient code. It is
holding a lock on currentMembers when adding a member and calls
notifyObservers() which lands in waiting for thread join to avoid another
deadlock. At the same time another request thread is calling
getClusterInstanceInfo() and waiting for the currentMembers lock. The fix is to
move the notifyObservers() call out of the synchronized block in both
addMember() and removeMember().


Yamini K B added a comment - 28/Jan/09 08:32 PM

Fix checked in on FCS branch:

Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java;
/cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v
<-- IiopFolbGmsClient.java
new revision: 1.2.6.3.2.1; previous revision: 1.2.6.3
done


Yamini K B added a comment - 28/Jan/09 11:53 PM

Fix checked in on truck:

Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java;
/cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v
<-- IiopFolbGmsClient.java
new revision: 1.2.6.4; previous revision: 1.2.6.3
done


adrian77 added a comment - 13/Feb/09 05:19 AM

Fix works.