glassfish / GLASSFISH-15717

"Very Intermittent: Drop of Planned Shutdown notification of DAS (a spectator) to one of the clustered instances".

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Cannot Reproduce
    • Affects Version/s: 3.1_b38
    • Fix Version/s: 3.1.2_b06
    • Labels:
      None
    • Environment:

      linux

      Description

      This is a very intermittent drop of the DAS planned shutdown notification, seen in scenarios 10 and 11.
      http://aras2.us.oracle.com:8080/logs/gf31/gms/set_01_21_11_t_08_03_25/final_Fri_Jan_21_14_44_57_PST_2011.html

      http://aras2.us.oracle.com:8080/logs/gf31/gms//set_01_11_11_t_13_45_23/scenario_0010_Tue_Jan_11_23_55_27_PST_2011.html

      The failed constraint was that the Planned Shutdown notification for the DAS was not received by one of the clustered instances.
      (Scenario 10 explicitly stops the DAS in the middle of the scenario to verify the GroupLeadership change.)
      This failure happened in only one out of 32 runs and for only one instance in the cluster, so it is definitely quite intermittent.

      There is a strong possibility that this was a dropped UDP message. While I have fixed dropped UDP broadcast messages
      in this release, this is unfortunately a boundary case that I cannot address with the current design. The rebroadcast of the missed event
      cannot take place because the last event the DAS broadcast was its own shutdown. By the time the clustered instance notices
      that it missed an event, the member it would ask to rebroadcast that event no longer exists, so the dropped UDP packet
      cannot be resent. This would be nontrivial
      to fix, and attempting it is not advised at this late stage of the release.
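
      The boundary case described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Shoal GMS code: the `Master` and `Instance` classes, the per-event sequence numbers, and the `recover` method are all hypothetical names introduced here to show why a rebroadcast request fails when the only member holding the event history has already shut down.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Master:
    """Hypothetical master: numbers its broadcasts and keeps them for rebroadcast."""
    alive: bool = True
    next_seq: int = 1
    history: Dict[int, str] = field(default_factory=dict)

    def broadcast(self, event: str) -> Tuple[int, str]:
        seq = self.next_seq
        self.history[seq] = event
        self.next_seq += 1
        return seq, event

    def shutdown(self) -> Tuple[int, str]:
        # The planned shutdown is the last event this master ever broadcasts.
        message = self.broadcast("PLANNED_SHUTDOWN")
        self.alive = False
        return message

    def rebroadcast(self, seq: int) -> Optional[str]:
        # Assumption: only the master can rebroadcast, and only while it is alive.
        return self.history.get(seq) if self.alive else None


@dataclass
class Instance:
    """Hypothetical clustered instance that tracks the last sequence number it saw."""
    last_seq: int = 0
    received: List[str] = field(default_factory=list)

    def deliver(self, seq: int, event: str) -> None:
        self.received.append(event)
        self.last_seq = seq

    def recover(self, master: Master, latest_seq: int) -> List[int]:
        """Ask the master for every missed event; return the sequence numbers that stay lost."""
        lost = []
        for missing in range(self.last_seq + 1, latest_seq + 1):
            event = master.rebroadcast(missing)
            if event is None:
                lost.append(missing)
            else:
                self.deliver(missing, event)
        return lost
```

      In this model, an ordinary dropped event is recovered because the master is still alive to answer, while a dropped `PLANNED_SHUTDOWN` stays lost because the master stops answering immediately after sending it.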

      Luckily, the DAS does not take part in replicating data, so this missed PlannedShutdown of a SPECTATOR member would not impact HA.
      I am not aware of any application that depends on the planned shutdown notification of the SPECTATOR DAS. Everything else is okay in the logs.
      The instance was notified of a new GroupLeader to replace the shut-down DAS, and the list of current alive and ready members is correct
      (it reflects that the DAS "server" is no longer part of the cluster).

      Extracted from http://aras2.us.oracle.com:8080/logs/gf31/gms///set_01_11_11_t_13_45_23/scenario_0010_Tue_Jan_11_23_55_27_PST_2011/easqezorro8_n1c1m7.log

      [#|2011-01-12T07:56:38.260+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1093: adding GroupLeadershipNotification signal leadermember: n1c1m1 of group: clusterz1|#]

      [#|2011-01-12T07:56:38.260+0000|INFO|glassfish3.1|ShoalLogger|_ThreadID=16;_ThreadName=Thread-1;|GMS1092: GMS View Change Received for group: clusterz1 : Members in view for MASTER_CHANGE_EVENT(before change analysis) are :
      1: MemberId: n1c1m1, MemberType: CORE, Address: 10.133.184.208:9132:228.9.53.86:31524:clusterz1:n1c1m1
      2: MemberId: n1c1m2, MemberType: CORE, Address: 10.133.184.209:9154:228.9.53.86:31524:clusterz1:n1c1m2
      3: MemberId: n1c1m3, MemberType: CORE, Address: 10.133.184.211:9140:228.9.53.86:31524:clusterz1:n1c1m3
      4: MemberId: n1c1m4, MemberType: CORE, Address: 10.133.184.213:9196:228.9.53.86:31524:clusterz1:n1c1m4
      5: MemberId: n1c1m5, MemberType: CORE, Address: 10.133.184.214:9147:228.9.53.86:31524:clusterz1:n1c1m5
      6: MemberId: n1c1m6, MemberType: CORE, Address: 10.133.184.137:9195:228.9.53.86:31524:clusterz1:n1c1m6
      7: MemberId: n1c1m7, MemberType: CORE, Address: 10.133.184.138:9121:228.9.53.86:31524:clusterz1:n1c1m7
      8: MemberId: n1c1m8, MemberType: CORE, Address: 10.133.184.139:9194:228.9.53.86:31524:clusterz1:n1c1m8
      9: MemberId: n1c1m9, MemberType: CORE, Address: 10.133.184.140:9191:228.9.53.86:31524:clusterz1:n1c1m9

      #]

        Activity

        Joe Fialli added a comment -

        I confirmed that there were UDP drops on the machine that has the missing PlannedShutDown notification.

        % netstat -su

        Udp:
        19870588 packets received
        97130 packets to unknown port received.
        1 packet receive errors
        506777 packets sent

        I checked another machine and it had two UDP receive errors.
        I verified that /etc/sysctl.conf had appropriate settings for the
        receive buffer. (So the OEL machines are configured as we have requested in the
        past.)
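
        The check above can be automated by parsing the `Udp:` section of the `netstat -su` text. The following is a minimal sketch that assumes the Linux output layout shown in this comment; `parse_udp_stats` and `has_udp_drops` are hypothetical helper names, and other platforms or netstat versions may label the counters differently.

```python
import re
from typing import Dict

SAMPLE = """Udp:
    19870588 packets received
    97130 packets to unknown port received.
    1 packet receive errors
    506777 packets sent
"""


def parse_udp_stats(netstat_output: str) -> Dict[str, int]:
    """Map each counter label in the 'Udp:' section to its integer value."""
    stats: Dict[str, int] = {}
    in_udp = False
    for raw in netstat_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A line ending in ':' starts a new section; only 'Udp:' is of interest.
            in_udp = line == "Udp:"
            continue
        if not in_udp:
            continue
        match = re.match(r"(\d+)\s+(.*?)\.?$", line)
        if match:
            stats[match.group(2)] = int(match.group(1))
    return stats


def has_udp_drops(stats: Dict[str, int]) -> bool:
    """True when the receive-error counter is present and non-zero."""
    return stats.get("packet receive errors", 0) > 0
```

        Comparing this counter before and after a test run gives a quick indication of whether packets were dropped during that run.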

        This failure can only happen in Shoal GMS QE Scenario 10 or 11,
        and it has only ever happened on the machine running n1c1m7 (easqezorro8).

        The recreation rate at the time this issue was submitted was twice in 104 runs.

        It is quite possible to tune away the UDP drops by increasing the UDP receive and write buffer sizes
        slightly above their current values. If increasing these values makes the failure go away and we no longer observe
        UDP packet receive errors in "netstat -su", then we will have confirmed the hypothesis that the missed notification is
        due to a UDP drop. As I mentioned in my previous attached email, there is a boundary condition in the current design
        that does not allow for rebroadcast of a dropped planned shutdown, since the rebroadcast logic resides solely
        in the master, which has shut down in this case.

        The following document describes how to check and set UDP buffer sizes for various operating systems.
        http://www.29west.com/docs/THPM/udp-buffer-sizing.html

        An unconfirmed workaround for this issue is to increase the system's current UDP buffer
        sizes. It would be helpful if we could validate, using the existing GMS QE scenario 10 and
        11 tests, whether this workaround addresses the reported failure.

        Given that the current UDP read/write buffer size is 512 * 1024, we could increase it to 756 * 1024 to see whether that
        makes the issue go away on the easqezorro8 machine.
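
        For reference, the two sizes mentioned above work out to 524288 and 774144 bytes. The sketch below shows one way to see what receive buffer the OS actually grants to a UDP socket when a given size is requested. It is an illustration only: `granted_receive_buffer` is a hypothetical helper, not part of GlassFish or Shoal, and it inspects a throwaway socket rather than the sockets GMS itself opens.

```python
import socket

CURRENT_BYTES = 512 * 1024    # 524288, the current buffer size quoted above
PROPOSED_BYTES = 756 * 1024   # 774144, the proposed larger size


def granted_receive_buffer(requested: int) -> int:
    """Request a UDP receive buffer size and return what the OS reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    for requested in (CURRENT_BYTES, PROPOSED_BYTES):
        print(requested, "requested ->", granted_receive_buffer(requested), "reported")
```

        On Linux the reported value can differ from the request: the kernel caps it at the system-wide maximum (net.core.rmem_max) and reports twice the value it accepted, so an unchanged result for the larger request suggests the system limit needs raising first.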

        Joe Fialli added a comment -

        At a minimum, I will investigate whether the proposed workaround mitigates this very intermittent issue.

        Joe Fialli added a comment -

        This failure has not been reported in recent GlassFish GMS QE test runs, so I am closing it as "Cannot Reproduce" for the time being.

        zorro added a comment -

        Confirming that this issue is not being seen in b4 and b5 of version 3.1.2.


          People

          • Assignee:
            Joe Fialli
            Reporter:
            zorro
          • Votes:
            0
            Watchers:
            0