Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0
    • Fix Version/s: milestone 1
    • Component/s: deployment
    • Labels:
      None
    • Environment:

      Operating System: Linux
      Platform: All

    • Issuezilla Id:
      1,531
    • Tags:

      Description

      Hi,

      we are running an SGCS (V05 P2 B04) cluster.

      A very frequent error that we are facing is that the lifecycle module of SJSMQ
      JMS Resource Adapter seems to prevent the startup of the service instance.
      This should be due to a failure to create connection to localhost:37676 as seen
      to log fragment :

      [#|2008-12-10T16:07:59.592+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.resource.resourceadapter|_ThreadID=11;_ThreadName=main;|JMS
      Service Connection URL is :mq://SC_2_2:37676/,mq://SC_2_1:37676/|#]

      [#|2008-12-10T16:07:59.609+0100|INFO|sun-glassfish-comms-server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;|MQJMSRA_RA1101:
      SJSMQ JMS Resource Adapter starting...|#]

      [#|2008-12-10T16:07:59.656+0100|INFO|sun-glassfish-comms-server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;|MQJMSRA_LB1101:
      Looking for Broker Running at:localhost:37676|#]

      [#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;|[C4003]:
      Error occurred on connection creation [localhost:37676]. - cause:
      java.net.ConnectException: Connection refused|#]

      As we tested, that port receives connections though....
      Can this be a logical error or a race condition ?

      BR

      Babbis

      1. 1531-fix.jar
        8 kB
        Yamini K B
      2. 1531-fix-2.jar
        8 kB
        Yamini K B
      3. debug-patch-jan22.jar
        11 kB
        Yamini K B
      4. FailStart_17h26m.tgz
        158 kB
        ocorbun
      5. FailStart15h05m.tgz
        290 kB
        ocorbun
      6. NA-server.log
        19 kB
        ocorbun
      7. server2.log
        113 kB
        babbisx
      8. SI-server.log
        42 kB
        ocorbun
      9. SI-server-FINEST.log
        155 kB
        ocorbun

        Activity

        Hide
        babbisx added a comment -

        Created an attachment (id=889)
        Server log for the server instance

        Show
        babbisx added a comment - Created an attachment (id=889) Server log for the server instance
        Hide
        prasads added a comment -

        RE-assign to Satish.

        Show
        prasads added a comment - RE-assign to Satish.
        Hide
        Satish Kumar added a comment -

        Can you pls attach the MQ broker logs and the domain.xml for your setup? You
        will also need to attach server.logs with the debug level for the JMS module
        set to FINEST.

        Reducing priority to P3 since it is not a stopper...

        Show
        Satish Kumar added a comment - Can you pls attach the MQ broker logs and the domain.xml for your setup? You will also need to attach server.logs with the debug level for the JMS module set to FINEST. Reducing priority to P3 since it is not a stopper...
        Hide
        Satish Kumar added a comment -

        Although there is a message in the log that seems to suggest an error in
        connecting to the broker, this does not seem to be preventing the server from
        starting up. Also, note the message is an 'INFO' message, not a 'SEVERE' error.

        Show
        Satish Kumar added a comment - Although there is a message in the log that seems to suggest an error in connecting to the broker, this does not seem to be preventing the server from starting up. Also, note the message is an 'INFO' message, not a 'SEVERE' error.
        Hide
        babbisx added a comment -

        Hi,

        This is a WARNING, not an INFO :

        [#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;|[C4003]:
        Error occurred on connection creation [localhost:37676]. - cause:
        java.net.ConnectException: Connection refused|#]

        The bad thing is that is one of our most frequent errors, that prevents our big
        clusters to fully come up....
        I have some trouble due to the holidays to aquirr full logs, but i will come
        back to you with the logs you require asap.

        BR

        Babbis

        Show
        babbisx added a comment - Hi, This is a WARNING, not an INFO : [#|2008-12-10T16:07:59.719+0100|WARNING|sun-glassfish-comms-server1.5|javax.jms|_ThreadID=11;_ThreadName=main;_RequestID=d6b40d60-5d19-48a5-a87c-b58efbf24378;| [C4003] : Error occurred on connection creation [localhost:37676] . - cause: java.net.ConnectException: Connection refused|#] The bad thing is that is one of our most frequent errors, that prevents our big clusters to fully come up.... I have some trouble due to the holidays to aquirr full logs, but i will come back to you with the logs you require asap. BR Babbis
        Hide
        prasads added a comment -

        "Reclassifying as P4 because these issues are not must fix for SailFin 1.0 release.
        This issue will be scrubbed after this release and will be given the right
        priority for SailFin 2.0 release."

        Show
        prasads added a comment - "Reclassifying as P4 because these issues are not must fix for SailFin 1.0 release. This issue will be scrubbed after this release and will be given the right priority for SailFin 2.0 release."
        Hide
        ocorbun added a comment -

        I am working on the same department as Babbis.

        On my system, I can see these "Connection refused" even when the server instance
        is starting fine. So I am not sure this starting problem is related to that or JMS.

        On the other hand, the difference between a normal start and a failing start is
        that the hidden MEjbApp is correctly loaded or not. When it fails,
        ejb_container_timer_app is correctly loaded but then a lot of GMS activities
        start to happen. MEjbApp gets never loaded and out lifecycle module are never
        kicked.

        When it fails to start, our lifecycle modules are never called:
        [#|2009-01-13T14:14:58.705+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=77830286-564f-4021-9b61-de37e82890d0;|EJB5108:Unable
        to initialize EJB Timer Service. The likely cause is the database has not been
        started or the timer database table has not been created.|#]
        [#|2009-01-13T14:14:58.705+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|ThreadID=11;_ThreadName=main;_ejb_container_timer_app;|LDR5010:
        All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#]
        [#|2009-01-13T14:15:04.040+0100|INFO|sun-glassfish-comms-server1.5|ShoalLogger|_ThreadID=14;_ThreadName=ViewWindowThread;|GMS
        View Change Received for group traffic : Members in view for (before change
        analysis) are : ...

        When it works, the lifecycle module are called after EJB initialization:
        [#|2009-01-13T14:25:21.500+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=98b22ae2-e743-42d8-b16e-3aed6e710f95;|EJB5108:Unable
        to initialize EJB Timer Service. The likely cause is the database has not been
        started or the timer database table has not been created.|#]
        [#|2009-01-13T14:25:21.501+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|ThreadID=11;_ThreadName=main;_ejb_container_timer_app;|LDR5010:
        All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#]
        [#|2009-01-13T14:25:22.205+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|_ThreadID=11;_ThreadName=main;MEjbApp;|LDR5010:
        All ejb(s) of [MEjbApp] loaded successfully!|#]
        [#|2009-01-13T14:25:22.212+0100|INFO|sun-glassfish-comms-server1.5|com.ericsson.mmas.lifecycle.CAF|_ThreadID=11;_ThreadName=main;|CAF:
        Registering MBeanServer into JNDI... |#]

        Show
        ocorbun added a comment - I am working on the same department as Babbis. On my system, I can see these "Connection refused" even when the server instance is starting fine. So I am not sure this starting problem is related to that or JMS. On the other hand, the difference between a normal start and a failing start is that the hidden MEjbApp is correctly loaded or not. When it fails, ejb_container_timer_app is correctly loaded but then a lot of GMS activities start to happen. MEjbApp gets never loaded and out lifecycle module are never kicked. When it fails to start, our lifecycle modules are never called: [#|2009-01-13T14:14:58.705+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=77830286-564f-4021-9b61-de37e82890d0;|EJB5108:Unable to initialize EJB Timer Service. The likely cause is the database has not been started or the timer database table has not been created.|#] [#|2009-01-13T14:14:58.705+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading| ThreadID=11;_ThreadName=main; _ejb_container_timer_app;|LDR5010: All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#] [#|2009-01-13T14:15:04.040+0100|INFO|sun-glassfish-comms-server1.5|ShoalLogger|_ThreadID=14;_ThreadName=ViewWindowThread;|GMS View Change Received for group traffic : Members in view for (before change analysis) are : ... When it works, the lifecycle module are called after EJB initialization: [#|2009-01-13T14:25:21.500+0100|WARNING|sun-glassfish-comms-server1.5|javax.enterprise.system.container.ejb|_ThreadID=11;_ThreadName=main;TimerBean;_RequestID=98b22ae2-e743-42d8-b16e-3aed6e710f95;|EJB5108:Unable to initialize EJB Timer Service. The likely cause is the database has not been started or the timer database table has not been created.|#] [#|2009-01-13T14:25:21.501+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading| ThreadID=11;_ThreadName=main; _ejb_container_timer_app;|LDR5010: All ejb(s) of [__ejb_container_timer_app] loaded successfully!|#] [#|2009-01-13T14:25:22.205+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.core.classloading|_ThreadID=11;_ThreadName=main;MEjbApp;|LDR5010: All ejb(s) of [MEjbApp] loaded successfully!|#] [#|2009-01-13T14:25:22.212+0100|INFO|sun-glassfish-comms-server1.5|com.ericsson.mmas.lifecycle.CAF|_ThreadID=11;_ThreadName=main;|CAF: Registering MBeanServer into JNDI... |#]
        Hide
        ocorbun added a comment -

        Created an attachment (id=931)
        Failing instance log

        Show
        ocorbun added a comment - Created an attachment (id=931) Failing instance log
        Hide
        ocorbun added a comment -

        Created an attachment (id=932)
        NA log during failing instance

        Show
        ocorbun added a comment - Created an attachment (id=932) NA log during failing instance
        Hide
        ocorbun added a comment -

        Created an attachment (id=933)
        Failing instance detailed log

        Show
        ocorbun added a comment - Created an attachment (id=933) Failing instance detailed log
        Hide
        ocorbun added a comment -

        last log with FINEST on GMS, EJB and JMS

        Show
        ocorbun added a comment - last log with FINEST on GMS, EJB and JMS
        Hide
        Satish Kumar added a comment -

        The JMS Exception in the server.log is just a warning that can safely be
        ignored. There is another message a few lines after this stating that the
        connection to the broker has been successfully established -

        [#|2009-01-16T13:11:13.832+0100|INFO|sun-glassfish-comms-
        server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main;

        MQJMSRA_RA1101: SJSMQ JMSRA Started:LOCAL #]

        However, there are quite a few config exceptions in the Node Agent logs that
        need to be investigated.

        Reassigning the issue to Yamini in the SailFin Admin team to investigate and
        reassign to the right person.

        Show
        Satish Kumar added a comment - The JMS Exception in the server.log is just a warning that can safely be ignored. There is another message a few lines after this stating that the connection to the broker has been successfully established - [#|2009-01-16T13:11:13.832+0100|INFO|sun-glassfish-comms- server1.5|javax.resourceadapter.mqjmsra.lifecycle|_ThreadID=11;_ThreadName=main; MQJMSRA_RA1101: SJSMQ JMSRA Started:LOCAL #] However, there are quite a few config exceptions in the Node Agent logs that need to be investigated. Reassigning the issue to Yamini in the SailFin Admin team to investigate and reassign to the right person.
        Hide
        Yamini K B added a comment -

        I looked at all the logs. Following the sequence of events based on time stamps,
        there appears to have been a synchronization failure while starting NA.

        Time 1: Something happened while synchronizing the instance domain.xml The file
        was read incorrectly (possible network error??), hence the SAX exception.

        [#|2009-01-16T13:10:26.192+0100|WARNING|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=10;_ThreadName=main;|NAGT0002:An
        exception has occurred during the sychronization of this node with the DAS.
        com.sun.enterprise.config.ConfigException: Error refreshing
        ConfigContext:/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_3/agent/config/domain.xml
        cause: Failed to create the XML-DOM Document. Check your XML to make sure it is
        correct.
        Premature end of file.

        Time 2: (probably a second attempt of start-node-agent)

        Jan 16, 2009 1:10:42 PM com.sun.enterprise.admin.servermgmt.launch.ASLauncher
        buildCommand
        INFO:

        Time 3: Synchronization succeeds and server instance start is initiated by NA

        [#|2009-01-16T13:11:02.587+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
        TCP Connection(11)-192.168.0.254;traffic_instance_PL_2_3;|NodeAgent
        synchronizing for instance|#]

        [#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
        TCP
        Connection(11)-192.168.0.254;traffic_instance_PL_2_3;SUCCEEDED!;1570;|NAGT0039:Synchronization
        for traffic_instance_PL_2_3 - SUCCEEDED! and took 1570 milliseconds|#]

        [#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
        TCP Connection(11)-192.168.0.254;|NA-DEBUG: Release Lock: traffic_instance_PL_2_3|#]

        [#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI
        TCP Connection(11)-192.168.0.254;|NA-DEBUG: Getting ready to have ProcessManger
        start the instance: traffic_instance_PL_2_3|#]

        However, there are some SEVERE messages from tomcat:

        [#|2009-01-16T16:22:41.729+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI
        TCP
        Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983:
        Coyote connector has not been started|#]

        [#|2009-01-16T16:22:41.898+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=19;_ThreadName=RMI
        TCP Connection(7)-192.168.0.254;|Sailfin_Clb_pipeline_created|#]

        [#|2009-01-16T16:22:41.952+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.web|_ThreadID=19;_ThreadName=RMI
        TCP Connection(7)-192.168.0.254;38080;|WEB0712: Starting
        Sun-Java-System/Application-Server HTTP/1.1 on 38080|#]

        [#|2009-01-16T16:22:42.475+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI
        TCP
        Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983:
        Coyote connector has not been started|#]

        Since there are no other exceptions/messages from tomcat, unable to guess what
        could have gone wrong. Cc'ing Jan Luehe for his comments.

        ocorbun, are the instance logs complete?

        Jan, wouldn't the coyote connector error initiate a server shutdown (no server
        shutdown messages seen)?

        Show
        Yamini K B added a comment - I looked at all the logs. Following the sequence of events based on time stamps, there appears to have been a synchronization failure while starting NA. Time 1: Something happened while synchronizing the instance domain.xml The file was read incorrectly (possible network error??), hence the SAX exception. [#|2009-01-16T13:10:26.192+0100|WARNING|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=10;_ThreadName=main;|NAGT0002:An exception has occurred during the sychronization of this node with the DAS. com.sun.enterprise.config.ConfigException: Error refreshing ConfigContext:/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_3/agent/config/domain.xml cause: Failed to create the XML-DOM Document. Check your XML to make sure it is correct. Premature end of file. Time 2: (probably a second attempt of start-node-agent) Jan 16, 2009 1:10:42 PM com.sun.enterprise.admin.servermgmt.launch.ASLauncher buildCommand INFO: Time 3: Synchronization succeeds and server instance start is initiated by NA [#|2009-01-16T13:11:02.587+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI TCP Connection(11)-192.168.0.254;traffic_instance_PL_2_3;|NodeAgent synchronizing for instance|#] [#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI TCP Connection(11)-192.168.0.254;traffic_instance_PL_2_3;SUCCEEDED!;1570;|NAGT0039:Synchronization for traffic_instance_PL_2_3 - SUCCEEDED! and took 1570 milliseconds|#] [#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI TCP Connection(11)-192.168.0.254;|NA-DEBUG: Release Lock: traffic_instance_PL_2_3|#] [#|2009-01-16T13:11:04.158+0100|INFO|sun-glassfish-comms-server1.5|javax.ee.enterprise.system.nodeagent|_ThreadID=11;_ThreadName=RMI TCP Connection(11)-192.168.0.254;|NA-DEBUG: Getting ready to have ProcessManger start the instance: traffic_instance_PL_2_3|#] However, there are some SEVERE messages from tomcat: [#|2009-01-16T16:22:41.729+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI TCP Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983: Coyote connector has not been started|#] [#|2009-01-16T16:22:41.898+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=19;_ThreadName=RMI TCP Connection(7)-192.168.0.254;|Sailfin_Clb_pipeline_created|#] [#|2009-01-16T16:22:41.952+0100|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.web|_ThreadID=19;_ThreadName=RMI TCP Connection(7)-192.168.0.254;38080;|WEB0712: Starting Sun-Java-System/Application-Server HTTP/1.1 on 38080|#] [#|2009-01-16T16:22:42.475+0100|SEVERE|sun-glassfish-comms-server1.5|org.apache.coyote.tomcat5.CoyoteConnector|_ThreadID=19;_ThreadName=RMI TCP Connection(7)-192.168.0.254;_RequestID=9472aa4d-7411-4e5e-9db1-39d433a8af4a;|PWC3983: Coyote connector has not been started|#] Since there are no other exceptions/messages from tomcat, unable to guess what could have gone wrong. Cc'ing Jan Luehe for his comments. ocorbun, are the instance logs complete? Jan, wouldn't the coyote connector error initiate a server shutdown (no server shutdown messages seen)?
        Hide
        ocorbun added a comment -

        This cluster is run on top of SAF. Since Node Agents and Server Instances are
        started by the Availability Management Framework (AMF), if a server fails to
        start or takes too long time (not answering first AMF healthcheck), the log is
        saved (attached in this issue), the instance is killed and deleted. A new server
        instance is created via asadmin and started.
        This is maybe why you do not see any shutdown message in the attached log.
        Otherwise, yes it is a complete server instance log from start to hanging start.

        Olivier
        ERICSSON ///

        Show
        ocorbun added a comment - This cluster is run on top of SAF. Since Node Agents and Server Instances are started by the Availability Management Framework (AMF), if a server fails to start or takes too long time (not answering first AMF healthcheck), the log is saved (attached in this issue), the instance is killed and deleted. A new server instance is created via asadmin and started. This is maybe why you do not see any shutdown message in the attached log. Otherwise, yes it is a complete server instance log from start to hanging start. Olivier ERICSSON ///
        Hide
        Yamini K B added a comment -

        Thanks for the input Olivier. Some more questions:
        1. Do you see the NA startup error (synchronization error) often?
        2. How long does AMF wait before deciding that server startup failed?

        Show
        Yamini K B added a comment - Thanks for the input Olivier. Some more questions: 1. Do you see the NA startup error (synchronization error) often? 2. How long does AMF wait before deciding that server startup failed?
        Hide
        ocorbun added a comment -

        > 1. Do you see the NA startup error (synchronization error) often?
        Not really. I will check if I can see it when it starts properly or if it occurs
        every time it fails. I need to do more testing.

        > 2. How long does AMF wait before deciding that server startup failed?
        AMF waits 15 minutes in total including NA start/sync and SI start. We have a
        lifecycle module in the SI that can answer AMF healthcheck. But when it fails
        the lifecycle module never get the init event.

        Show
        ocorbun added a comment - > 1. Do you see the NA startup error (synchronization error) often? Not really. I will check if I can see it when it starts properly or if it occurs every time it fails. I need to do more testing. > 2. How long does AMF wait before deciding that server startup failed? AMF waits 15 minutes in total including NA start/sync and SI start. We have a lifecycle module in the SI that can answer AMF healthcheck. But when it fails the lifecycle module never get the init event.
        Hide
        ocorbun added a comment -

        > > 1. Do you see the NA startup error (synchronization error) often?
        > Not really. I will check if I can see it when it starts properly
        > or if it occurs every time it fails. I need to do more testing.

        Yes the synchronization problem can occur even when the instance starts
        normally. But it is rare. And we can also see that in case the instance fails to
        start within 15 minutes, this problem is also rarely seen. So you can rule it
        out for this particular issue.

        Show
        ocorbun added a comment - > > 1. Do you see the NA startup error (synchronization error) often? > Not really. I will check if I can see it when it starts properly > or if it occurs every time it fails. I need to do more testing. Yes the synchronization problem can occur even when the instance starts normally. But it is rare. And we can also see that in case the instance fails to start within 15 minutes, this problem is also rarely seen. So you can rule it out for this particular issue.
        Hide
        jluehe added a comment -

        This error that was seen in the log:

        SEVERE|sun-glassfish-comms-server1.5|
        org.apache.coyote.tomcat5.CoyoteConnector|
        ...|PWC3983: Coyote connector has not been started|#]

        is printed by this code:

        public void stop() throws LifecycleException {

        // Validate and update our current state
        if (!started)

        { log.error(sm.getString("coyoteConnector.notStarted")); return; }

        inside

        appserv-webtier/src/java/org/apache/coyote/tomcat5/CoyoteConnector#stop

        For some reason, the HTTP listener is being asked to stop, before it has even
        been started!

        This normally indicates that an error has occurred during the instance startup,
        which causes all containers (including the web container) to shut down. When the
        web container shuts down, it shuts down all its HTTP listeners.

        Would it be possible to prepare a patch that would dump the thread stack after
        printing the error message, and before returning, like this:

        public void stop() throws LifecycleException {

        // Validate and update our current state
        if (!started)

        { log.error(sm.getString("coyoteConnector.notStarted")); Thread.currentThread().dumpStack(); // <---- NEW return; }

        That would tell us what's triggering the stop().

        Show
        jluehe added a comment - This error that was seen in the log: SEVERE|sun-glassfish-comms-server1.5| org.apache.coyote.tomcat5.CoyoteConnector| ...|PWC3983: Coyote connector has not been started|#] is printed by this code: public void stop() throws LifecycleException { // Validate and update our current state if (!started) { log.error(sm.getString("coyoteConnector.notStarted")); return; } inside appserv-webtier/src/java/org/apache/coyote/tomcat5/CoyoteConnector#stop For some reason, the HTTP listener is being asked to stop, before it has even been started! This normally indicates that an error has occurred during the instance startup, which causes all containers (including the web container) to shut down. When the web container shuts down, it shuts down all its HTTP listeners. Would it be possible to prepare a patch that would dump the thread stack after printing the error message, and before returning, like this: public void stop() throws LifecycleException { // Validate and update our current state if (!started) { log.error(sm.getString("coyoteConnector.notStarted")); Thread.currentThread().dumpStack(); // <---- NEW return; } That would tell us what's triggering the stop().
        Hide
        Yamini K B added a comment -

        Thanks for your comments Jan, I think its the SAF notification thats initiating
        the web container shutdown.

        Olivier, I'm attaching a debug patch for CoyoteConnector, would appreciate if
        you can try it out. We can confirm the reason for the SEVERE message. Please
        include the jar in the classpath-prefix of the cluster config and ensure that
        the jar is accessible by the instances.

        Found something interesting in the log messages:

        [#|2009-01-16T*13:11:21.552+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=17;_ThreadName=pool-2-thread-2;traffic;|CLBC0129:
        CLB and GMS view for cluster traffic are in sync.|#]

        [#|2009-01-16T*13:24:35.110+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=18;_ThreadName=RMI
        TCP Connection(7)-192.168.0.254;server;|Start: Synchronizing CLB XML file from
        DAS. DAS NAME: server|#]

        Notice the time difference (13min) between the above messages. The second
        instane log also shows similar difference.

        Olivier, can you please turn on FINEST mode for clb and resend the instance log?

        Show
        Yamini K B added a comment - Thanks for your comments Jan, I think its the SAF notification thats initiating the web container shutdown. Olivier, I'm attaching a debug patch for CoyoteConnector, would appreciate if you can try it out. We can confirm the reason for the SEVERE message. Please include the jar in the classpath-prefix of the cluster config and ensure that the jar is accessible by the instances. Found something interesting in the log messages: [#|2009-01-16T*13:11:21.552+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=17;_ThreadName=pool-2-thread-2;traffic;|CLBC0129: CLB and GMS view for cluster traffic are in sync.|#] [#|2009-01-16T*13:24:35.110+0100*|INFO|sun-glassfish-comms-server1.5|javax.enterprise.system.container.clb|_ThreadID=18;_ThreadName=RMI TCP Connection(7)-192.168.0.254;server;|Start: Synchronizing CLB XML file from DAS. DAS NAME: server|#] Notice the time difference (13min) between the above messages. The second instane log also shows similar difference. Olivier, can you please turn on FINEST mode for clb and resend the instance log?
        Hide
        Yamini K B added a comment -

        Created an attachment (id=942)
        CoyoteConnector debug patch

        Show
        Yamini K B added a comment - Created an attachment (id=942) CoyoteConnector debug patch
        Hide
        Yamini K B added a comment -

        One more suggestion, would it be possible to take stack dump (using jstack) on
        the failing instance during the time gap?

        Show
        Yamini K B added a comment - One more suggestion, would it be possible to take stack dump (using jstack) on the failing instance during the time gap?
        Hide
        Yamini K B added a comment -

        I happened to notice that JVM is being tuned as follows:

        -server
        -XX:+UnlockDiagnosticVMOptions
        -XX:MaxPermSize=192m
        -Xmx512m
        -XX:NewRatio=2
        -Xms3000m
        -Xmx3000m
        -Xss128k
        -XX:SurvivorRatio=128
        -XX:+DisableExplicitGC
        -XX:PermSize=64m
        -XX:MaxPermSize=128m
        -XX:+UseParNewGC
        -XX:+UseConcMarkSweepGC
        -XX:LargePageSizeInBytes=2m
        -XX:ParallelGCThreads=7
        -Xmn128m
        -XX:MaxTenuringThreshold=8
        -XX:CMSInitiatingOccupancyFraction=25
        -XX:+LogVMOutput
        -XX:LogFile=/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_4/traffic_instance_PL_2_4/logs/jvm.log

        Some thoughts:
        1. Please remove the duplicate settings (-Xmx, MaxPermSize)
        2. Generally young generation size (-Xmn) is tuned to 1/4 of total heap.
        3. Can you try using the -verbosegc -XX:+PrintGCDetails to see if VM is causing
        any pauses during the startup?

        Show
        Yamini K B added a comment - I happened to notice that JVM is being tuned as follows: -server -XX:+UnlockDiagnosticVMOptions -XX:MaxPermSize=192m -Xmx512m -XX:NewRatio=2 -Xms3000m -Xmx3000m -Xss128k -XX:SurvivorRatio=128 -XX:+DisableExplicitGC -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:LargePageSizeInBytes=2m -XX:ParallelGCThreads=7 -Xmn128m -XX:MaxTenuringThreshold=8 -XX:CMSInitiatingOccupancyFraction=25 -XX:+LogVMOutput -XX:LogFile=/opt/sailfin-v1-b60g/nodedata/nodeagents/PL_2_4/traffic_instance_PL_2_4/logs/jvm.log Some thoughts: 1. Please remove the duplicate settings (-Xmx, MaxPermSize) 2. Generally young generation size (-Xmn) is tuned to 1/4 of total heap. 3. Can you try using the -verbosegc -XX:+PrintGCDetails to see if VM is causing any pauses during the startup?
        Hide
        ocorbun added a comment -

        Created an attachment (id=944)
        Logs with patch prefixed

        Show
        ocorbun added a comment - Created an attachment (id=944) Logs with patch prefixed
        Hide
        ocorbun added a comment -

        Last log attached.

        This includes:

        • patch prefixed (Coyote stack visible)
        • added VM args: -verbosegc -XX:+PrintGCDetails
        • removed duplicated VM args
        • FINEST on clb and ejb

        jstack did not work on the hanging start process.

        Show
        ocorbun added a comment - Last log attached. This includes: patch prefixed (Coyote stack visible) added VM args: -verbosegc -XX:+PrintGCDetails removed duplicated VM args FINEST on clb and ejb jstack did not work on the hanging start process.
        Hide
        Yamini K B added a comment -

        Changed the summary.

        Analyzing the stack trace, there appears to be a deadlock happening in ORB's
        FOLB code:

        "pool-6-thread-2" prio=10 tid=0x00002aab799fd000 nid=0xb5a in
        Object.wait() [0x0000000042524000..0x0000000042524b20]
        java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)

        • waiting on <0x00002aaab92081b0> (a
          com.sun.corba.ee.impl.folb.ServerGroupManager$1)
          at java.lang.Thread.join(Thread.java:1143)
        • locked <0x00002aaab92081b0> (a
          com.sun.corba.ee.impl.folb.ServerGroupManager$1)
          at java.lang.Thread.join(Thread.java:1196)
          at
          com.sun.corba.ee.impl.folb.ServerGroupManager.restartFactories(ServerGroupManager.java:494)
          at
          com.sun.corba.ee.impl.folb.ServerGroupManager.membershipChange(ServerGroupManager.java:391)
          at
          com.sun.corba.ee.spi.folb.GroupInfoServiceBase.notifyObservers(GroupInfoServiceBase.java:76)
          at
          com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.addMember(IiopFolbGmsClient.java:396)
        • locked <0x00002aaab90be688> (a java.util.HashMap)
          at
          com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.handleSignal(IiopFolbGmsClient.java:310)
          at
          com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.consumeSignal(IiopFolbGmsClient.java:178)
          at
          com.sun.enterprise.ee.cms.impl.common.Router$CallableAction.call(Router.java:509)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at
          java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
          at java.lang.Thread.run(Thread.java:619)

        This issue will need to be assigned to the right owner.

        Show
        Yamini K B added a comment - Changed the summary. Analyzing the stack trace, there appears to be a deadlock happening in ORB's FOLB code: "pool-6-thread-2" prio=10 tid=0x00002aab799fd000 nid=0xb5a in Object.wait() [0x0000000042524000..0x0000000042524b20] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) waiting on <0x00002aaab92081b0> (a com.sun.corba.ee.impl.folb.ServerGroupManager$1) at java.lang.Thread.join(Thread.java:1143) locked <0x00002aaab92081b0> (a com.sun.corba.ee.impl.folb.ServerGroupManager$1) at java.lang.Thread.join(Thread.java:1196) at com.sun.corba.ee.impl.folb.ServerGroupManager.restartFactories(ServerGroupManager.java:494) at com.sun.corba.ee.impl.folb.ServerGroupManager.membershipChange(ServerGroupManager.java:391) at com.sun.corba.ee.spi.folb.GroupInfoServiceBase.notifyObservers(GroupInfoServiceBase.java:76) at com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.addMember(IiopFolbGmsClient.java:396) locked <0x00002aaab90be688> (a java.util.HashMap) at com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.handleSignal(IiopFolbGmsClient.java:310) at com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.consumeSignal(IiopFolbGmsClient.java:178) at com.sun.enterprise.ee.cms.impl.common.Router$CallableAction.call(Router.java:509) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) This issue will need to be assigned to the right owner.
        Hide
        Yamini K B added a comment -

        The main thread which is waiting for the lock:

        "main" prio=10 tid=0x0000000040118000 nid=0xa8b waiting for monitor
        entry [0x0000000040027000..0x0000000040028f60]
        java.lang.Thread.State: BLOCKED (on object monitor)
        at
        com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient$GIS.getClusterInstanceInfo(IiopFolbGmsClient.java:215)

        • waiting to lock <0x00002aaab90be688> (a java.util.HashMap)
          at
          com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.getClusterInstanceInfo(IiopFolbGmsClient.java:276)
          at
          com.sun.corba.ee.impl.folb.ServerGroupManager.establish_components(ServerGroupManager.java:242)
          at
          com.sun.corba.ee.impl.interceptors.InterceptorInvoker.objectAdapterCreated(InterceptorInvoker.java:145)
          at
          com.sun.corba.ee.impl.interceptors.PIHandlerImpl.objectAdapterCreated(PIHandlerImpl.java:297)
          at
          com.sun.corba.ee.spi.oa.ObjectAdapterBase.initializeTemplate(ObjectAdapterBase.java:139)
          at com.sun.corba.ee.impl.oa.poa.POAImpl.initialize(POAImpl.java:415)
          at com.sun.corba.ee.impl.oa.poa.POAImpl.create_POA(POAImpl.java:815)
          at
          com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl.unknown_adapter(ReferenceFactoryManagerImpl.java:132)
          at com.sun.corba.ee.impl.oa.poa.POAImpl.find_POA(POAImpl.java:928)
        • locked <0x00002aaab90abb58> (a
          com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl)
          at
          com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl.createReference(ReferenceFactoryManagerImpl.java:483)
          at
          com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryImpl.createReference(ReferenceFactoryImpl.java:58)
          at
          com.sun.enterprise.iiop.POARemoteReferenceFactory._createRef(POARemoteReferenceFactory.java:412)
          at
          com.sun.enterprise.iiop.POARemoteReferenceFactory.createRef(POARemoteReferenceFactory.java:383)
          at
          com.sun.enterprise.iiop.POARemoteReferenceFactory.createHomeReference(POARemoteReferenceFactory.java:374)
          at com.sun.ejb.containers.BaseContainer.initializeHome(BaseContainer.java:933)
          at
          com.sun.ejb.containers.StatelessSessionContainer.initializeHome(StatelessSessionContainer.java:232)
          at
          com.sun.ejb.containers.ContainerFactoryImpl.createContainer(ContainerFactoryImpl.java:345)
          at com.sun.enterprise.server.AbstractLoader.loadEjbs(AbstractLoader.java:527)
          at com.sun.enterprise.server.ApplicationLoader.doLoad(ApplicationLoader.java:191)
          at
          com.sun.enterprise.server.TomcatApplicationLoader.doLoad(TomcatApplicationLoader.java:126)
          at
          com.sun.enterprise.server.ExtendedApplicationLoader.doLoad(ExtendedApplicationLoader.java:134)
          at com.sun.enterprise.server.AbstractLoader.load(AbstractLoader.java:238)
          at
          com.sun.enterprise.server.AbstractManager.loadOneSystemApp(AbstractManager.java:407)
          at
          com.sun.enterprise.server.AbstractManager$SystemAppStarter.doRun(AbstractManager.java:664)
          at com.sun.appserv.management.util.misc.RunnableBase.runSync(RunnableBase.java:304)
          at com.sun.appserv.management.util.misc.RunnableBase._submit(RunnableBase.java:176)
          at com.sun.appserv.management.util.misc.RunnableBase.submit(RunnableBase.java:210)
          at com.sun.enterprise.server.AbstractManager.loadSystem(AbstractManager.java:331)
          at
          com.sun.enterprise.server.SystemAppLifecycle.loadSystemApps(SystemAppLifecycle.java:162)
          at
          com.sun.enterprise.server.SystemAppLifecycle.onStartup(SystemAppLifecycle.java:108)
          at
          com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:446)
          at
          com.sun.enterprise.server.ondemand.OnDemandServer.onStartup(OnDemandServer.java:134)
          at com.sun.enterprise.server.PEMain.run(PEMain.java:409)
          at com.sun.enterprise.server.PEMain.main(PEMain.java:336)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at
          sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at com.sun.enterprise.server.PELaunch.main(PELaunch.java:415)
        Show
        Yamini K B added a comment - The main thread which is waiting for the lock: "main" prio=10 tid=0x0000000040118000 nid=0xa8b waiting for monitor entry [0x0000000040027000..0x0000000040028f60] java.lang.Thread.State: BLOCKED (on object monitor) at com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient$GIS.getClusterInstanceInfo(IiopFolbGmsClient.java:215) waiting to lock <0x00002aaab90be688> (a java.util.HashMap) at com.sun.enterprise.ee.ejb.iiop.IiopFolbGmsClient.getClusterInstanceInfo(IiopFolbGmsClient.java:276) at com.sun.corba.ee.impl.folb.ServerGroupManager.establish_components(ServerGroupManager.java:242) at com.sun.corba.ee.impl.interceptors.InterceptorInvoker.objectAdapterCreated(InterceptorInvoker.java:145) at com.sun.corba.ee.impl.interceptors.PIHandlerImpl.objectAdapterCreated(PIHandlerImpl.java:297) at com.sun.corba.ee.spi.oa.ObjectAdapterBase.initializeTemplate(ObjectAdapterBase.java:139) at com.sun.corba.ee.impl.oa.poa.POAImpl.initialize(POAImpl.java:415) at com.sun.corba.ee.impl.oa.poa.POAImpl.create_POA(POAImpl.java:815) at com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl.unknown_adapter(ReferenceFactoryManagerImpl.java:132) at com.sun.corba.ee.impl.oa.poa.POAImpl.find_POA(POAImpl.java:928) locked <0x00002aaab90abb58> (a com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl$AdapterActivatorImpl) at com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl.createReference(ReferenceFactoryManagerImpl.java:483) at com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryImpl.createReference(ReferenceFactoryImpl.java:58) at com.sun.enterprise.iiop.POARemoteReferenceFactory._createRef(POARemoteReferenceFactory.java:412) at com.sun.enterprise.iiop.POARemoteReferenceFactory.createRef(POARemoteReferenceFactory.java:383) at com.sun.enterprise.iiop.POARemoteReferenceFactory.createHomeReference(POARemoteReferenceFactory.java:374) at com.sun.ejb.containers.BaseContainer.initializeHome(BaseContainer.java:933) at com.sun.ejb.containers.StatelessSessionContainer.initializeHome(StatelessSessionContainer.java:232) at com.sun.ejb.containers.ContainerFactoryImpl.createContainer(ContainerFactoryImpl.java:345) at com.sun.enterprise.server.AbstractLoader.loadEjbs(AbstractLoader.java:527) at com.sun.enterprise.server.ApplicationLoader.doLoad(ApplicationLoader.java:191) at com.sun.enterprise.server.TomcatApplicationLoader.doLoad(TomcatApplicationLoader.java:126) at com.sun.enterprise.server.ExtendedApplicationLoader.doLoad(ExtendedApplicationLoader.java:134) at com.sun.enterprise.server.AbstractLoader.load(AbstractLoader.java:238) at com.sun.enterprise.server.AbstractManager.loadOneSystemApp(AbstractManager.java:407) at com.sun.enterprise.server.AbstractManager$SystemAppStarter.doRun(AbstractManager.java:664) at com.sun.appserv.management.util.misc.RunnableBase.runSync(RunnableBase.java:304) at com.sun.appserv.management.util.misc.RunnableBase._submit(RunnableBase.java:176) at com.sun.appserv.management.util.misc.RunnableBase.submit(RunnableBase.java:210) at com.sun.enterprise.server.AbstractManager.loadSystem(AbstractManager.java:331) at com.sun.enterprise.server.SystemAppLifecycle.loadSystemApps(SystemAppLifecycle.java:162) at com.sun.enterprise.server.SystemAppLifecycle.onStartup(SystemAppLifecycle.java:108) at com.sun.enterprise.server.ApplicationServer.onStartup(ApplicationServer.java:446) at com.sun.enterprise.server.ondemand.OnDemandServer.onStartup(OnDemandServer.java:134) at com.sun.enterprise.server.PEMain.run(PEMain.java:409) at com.sun.enterprise.server.PEMain.main(PEMain.java:336) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.enterprise.server.PELaunch.main(PELaunch.java:415)
        Hide
        skgaju added a comment -

        adding myself to cc list

        Show
        skgaju added a comment - adding myself to cc list
        Hide
        Yamini K B added a comment -

        Attaching the fix for hang, please include it in classpath-prefix of cluster
        config and ensure that the jar is accessible by all instances.

        Show
        Yamini K B added a comment - Attaching the fix for hang, please include it in classpath-prefix of cluster config and ensure that the jar is accessible by all instances.
        Hide
        Yamini K B added a comment -

        Created an attachment (id=947)
        Fix for hang

        Show
        Yamini K B added a comment - Created an attachment (id=947) Fix for hang
        Hide
        ocorbun added a comment -

        Created an attachment (id=948)
        It fails with the fix as well. Coyote error not in this log but we have seen it even with the fix. Here we can see again that MEjbApp is not loaded.

        Show
        ocorbun added a comment - Created an attachment (id=948) It fails with the fix as well. Coyote error not in this log but we have seen it even with the fix. Here we can see again that MEjbApp is not loaded.
        Hide
        Yamini K B added a comment -

        Olivier,

        I'm sorry, I did a mistake in packaging the patch (though I did have the proper
        patch in my workspace while running sanity tests). Uploading the proper patch,
        please try with this.

        Show
        Yamini K B added a comment - Olivier, I'm sorry, I did a mistake in packaging the patch (though I did have the proper patch in my workspace while running sanity tests). Uploading the proper patch, please try with this.
        Hide
        Yamini K B added a comment -

        Created an attachment (id=949)
        Re-packaged the fix

        Show
        Yamini K B added a comment - Created an attachment (id=949) Re-packaged the fix
        Hide
        ocorbun added a comment -

        With the patch, I was not able to reproduce the problem on a 4 blades system.
        We will try tomorrow on a 20 blades system.

        Olivier
        ERICSSON ///

        Show
        ocorbun added a comment - With the patch, I was not able to reproduce the problem on a 4 blades system. We will try tomorrow on a 20 blades system. Olivier ERICSSON ///
        Hide
        Joe Fialli added a comment -

        Adding myself to CC list as someone who could investigate the IiopFolbGmsClient
        jstack stack traces attached to this issue. I would be interested in full jstack
        output. Currently, only a summary is available in descriptions of this issue.

        Show
        Joe Fialli added a comment - Adding myself to CC list as someone who could investigate the IiopFolbGmsClient jstack stack traces attached to this issue. I would be interested in full jstack output. Currently, only a summary is available in descriptions of this issue.
        Hide
        ocorbun added a comment -

        After testing one day on a 20 blades system (plus 2 other smaller systems), we
        can say that the problem is gone with this fix. Thanks!

        Please, explain what this fix does.

        Show
        ocorbun added a comment - After testing one day on a 20 blades system (plus 2 other smaller systems), we can say that the problem is gone with this fix. Thanks! Please, explain what this fix does.
        Hide
        Yamini K B added a comment -

        The hang is due to a synchronization problem in IIopFolbGmsClient code. It is
        holding a lock on currentMembers when adding a member and calls
        notifyObservers() which lands in waiting for thread join to avoid another
        deadlock. At the same time another request thread is calling
        getClusterInstanceInfo() and waiting for the currentMembers lock. The fix is to
        move the notifyObservers() call out of the synchronized block in both
        addMember() and removeMember().

        Show
        Yamini K B added a comment - The hang is due to a synchronization problem in IIopFolbGmsClient code. It is holding a lock on currentMembers when adding a member and calls notifyObservers() which lands in waiting for thread join to avoid another deadlock. At the same time another request thread is calling getClusterInstanceInfo() and waiting for the currentMembers lock. The fix is to move the notifyObservers() call out of the synchronized block in both addMember() and removeMember().
        Hide
        Yamini K B added a comment -

        Fix checked in on FCS branch:

        Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java;
        /cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v
        <-- IiopFolbGmsClient.java
        new revision: 1.2.6.3.2.1; previous revision: 1.2.6.3
        done

        Show
        Yamini K B added a comment - Fix checked in on FCS branch: Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java; /cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v <-- IiopFolbGmsClient.java new revision: 1.2.6.3.2.1; previous revision: 1.2.6.3 done
        Hide
        Yamini K B added a comment -

        Fix checked in on truck:

        Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java;
        /cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v
        <-- IiopFolbGmsClient.java
        new revision: 1.2.6.4; previous revision: 1.2.6.3
        done

        Show
        Yamini K B added a comment - Fix checked in on truck: Checking in ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java; /cvs/glassfish/appserv-core-ee/ejb/src/java/com/sun/ejb/ee/iiop/IiopFolbGmsClient.java,v <-- IiopFolbGmsClient.java new revision: 1.2.6.4; previous revision: 1.2.6.3 done
        Hide
        adrian77 added a comment -

        Fix works.

        Show
        adrian77 added a comment - Fix works.

          People

          • Assignee:
            Yamini K B
            Reporter:
            babbisx
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: