sailfin / SAILFIN-1967

[RN] NFS Port 2049 should not be used as heartbeat port - failover fails

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0
    • Fix Version/s: milestone 1
    • Component/s: doc
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • Issuezilla Id:
      1967
    • Status Whiteboard:
      RN

      Description

      While testing the new MMAS and the new LOTC together, we found that MMAS
      uses the NFS port (2049) as its heartbeat port in the cluster, which
      prevents controller failover from the failed controller to the standby.

      ----------------------------

      When testing the following combination of applications, the failover
      functionality of the controllers does not work properly. Failure of the
      active SC can result in a complete cluster reboot.

      LOTC R3A02, MMAS R12A03

      DESCRIPTION:

      While the failover is taking place, the following errors appear in the
      /var/log/messages file on the standby SC:
      Sep 4 11:31:54 SC_2_2 logger: LOTC alarm cache lost (supposing restart).
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Starting ConfD vsn: 2.8
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file confd_cfg.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file confd.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file config.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file netconf.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file netconf_notification.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file netconf_actions.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file netconf_transactions.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file netconf_partial_lock.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file netconf_forward.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file tailf_netconf.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file /home/tspsaf/etc/confd.conf
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Consulting daemon configuration file
      /home/tspsaf/etc/confd.conf
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/SAF_OAM_x86_64-CXP9013625_3-P2D97_saf_fm_main.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/JAVA_CAF_x86_64-CXP9013050_2-R7B01_JavaCaf_Classes.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/CONFIGPKG-CXP9013822_1-R12A03_MMAS_Model_PM_Classes.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/CONFIGPKG-CXP9013822_1-
      R12A03_MMAS_Model_MMAS_Classes.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/CONFIGPKG-CXP9013822_1-
      R12A03_MMAS_Model_Licensing_Classes.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/SAF_SWM_x86_64-CXP9013626_3-P2B25_swm.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/SAF_OAM_x86_64-CXP9013625_3-P2D97_saf_top_main.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/SAF_OAM_x86_64-CXP9013625_3-P2D97_aaa.fxs
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Loading file
      /home/tspsaf/var/lib/cm/SAF_OAM_x86_64-CXP9013625_3-P2D97_netconf_rpc.cnc
      Sep 4 11:31:54 SC_2_2 confd[11764]: - CDB load: processing file:
      /opt/saf_oam/cm/confd-cdb/aaa_init.xml
      Sep 4 11:31:54 SC_2_2 confd[11764]: - CDB: Operational DB re-initialized
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Starting to listen for NETCONF SSH on
      0.0.0.0:2022
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Starting to listen for NETCONF TCP on
      192.168.0.249:2023
      Sep 4 11:31:54 SC_2_2 confd[11764]: - Starting to listen for CLI SSH on
      0.0.0.0:2024
      Sep 4 11:31:54 SC_2_2 confd[11764]: - ConfD started
      Sep 4 11:31:55 SC_2_2 kernel: TIPC: Lost link <1.1.2:bond0-1.1.1:bond0> on
      network plane A
      Sep 4 11:31:55 SC_2_2 kernel: TIPC: Lost contact with <1.1.1>
      Sep 4 11:31:56 SC_2_2 fmexec: NO : snmpTargetAddress changed to
      '134.138.126.54:5000'.
      Sep 4 11:31:56 SC_2_2 fmexec: NO : Changed state to Active.
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: Secondary/Primary --> Secondary/Secondary
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: meta connection shut down by peer.
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: drbd0_asender [4718]: cstate Connected -->
      NetworkFailure
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: asender terminated
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: sock was shut down by peer
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: drbd0_receiver [4717]: cstate
      NetworkFailure --> BrokenPipe
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: short read expecting header on sock: r=0
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: worker terminated
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: drbd0_receiver [4717]: cstate BrokenPipe
      --> Unconnected
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: Connection lost.
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: drbd0_receiver [4717]: cstate Unconnected
      --> WFConnection
      Sep 4 11:31:59 SC_2_2 failoverd: DRBD: Connected, Secondary/Primary, Consistent
      -> WFConnection, Secondary/Unknown, Consistent
      Sep 4 11:31:59 SC_2_2 failoverd: Switching Secondary -> Primary
      Sep 4 11:31:59 SC_2_2 failoverd: Task started (/sbin/drbdadm primary drbd0)
      Sep 4 11:31:59 SC_2_2 kernel: drbd0: Secondary/Unknown --> Primary/Unknown
      Sep 4 11:31:59 SC_2_2 failoverd: Task completed (/sbin/drbdadm primary drbd0)
      Sep 4 11:31:59 SC_2_2 failoverd: DRBD: WFConnection, Secondary/Unknown,
      Consistent -> WFConnection, Primary/Unknown, Consistent
      Sep 4 11:31:59 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:31:59 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:31:59 SC_2_2 kernel: kjournald starting. Commit interval 5 seconds
      Sep 4 11:31:59 SC_2_2 kernel: EXT3 FS on drbd0, internal journal
      Sep 4 11:31:59 SC_2_2 kernel: EXT3-fs: mounted filesystem with journal data
      mode.
      Sep 4 11:31:59 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:31:59 SC_2_2 kernel: Installing knfsd (copyright (C) 1996
      okir@monad.swb.de).
      Sep 4 11:31:59 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:31:59 SC_2_2 primary: Starting NFS server
      Sep 4 11:31:59 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:31:59 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:31:59 SC_2_2 nfsd[12821]: nfssvc: Address already in use
      Sep 4 11:31:59 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:31:59 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:31:59 SC_2_2 failoverd: Panic situation emerging (1/10)
      Sep 4 11:32:00 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:00 SC_2_2 syslog-ng[4592]: Connection broken to
      AF_INET(192.168.0.1:514), reopening in 10 seconds
      Sep 4 11:32:00 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:00 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:00 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:00 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:00 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:00 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:00 SC_2_2 nfsd[12835]: nfssvc: Address already in use
      Sep 4 11:32:00 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:00 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:00 SC_2_2 failoverd: Panic situation emerging (2/10)
      Sep 4 11:32:01 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:01 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:01 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:01 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:01 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:01 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:01 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:01 SC_2_2 nfsd[12848]: nfssvc: Address already in use
      Sep 4 11:32:01 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:01 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:01 SC_2_2 failoverd: Panic situation emerging (3/10)
      Sep 4 11:32:02 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:02 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:02 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:02 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:02 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:02 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:02 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:02 SC_2_2 nfsd[12862]: nfssvc: Address already in use
      Sep 4 11:32:02 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:02 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:02 SC_2_2 failoverd: Panic situation emerging (4/10)
      Sep 4 11:32:02 SC_2_2 kernel: nfs: server 192.168.0.103 not responding, still
      trying
      Sep 4 11:32:03 SC_2_2 kernel: nfs: server 192.168.0.103 not responding, still
      trying
      Sep 4 11:32:03 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:03 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:03 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:03 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:03 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:03 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:03 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:03 SC_2_2 nfsd[12874]: nfssvc: Address already in use
      Sep 4 11:32:03 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:03 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:03 SC_2_2 failoverd: Panic situation emerging (5/10)
      Sep 4 11:32:04 SC_2_2 coordinatordwrapper: Dispatching callback
      Sep 4 11:32:04 SC_2_2 coordinatordwrapper: I am in amfHealthcheckCallback for
      safComp=CompT_MMAS_COORDINATOR,safSu=SuT_MMAS_COORDINATOR,safNode=SC_2_2!
      Sep 4 11:32:04 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:04 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:04 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:04 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:04 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:04 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:04 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:04 SC_2_2 nfsd[12907]: nfssvc: Address already in use
      Sep 4 11:32:04 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:04 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:04 SC_2_2 failoverd: Panic situation emerging (6/10)
      Sep 4 11:32:04 SC_2_2 coordinatordwrapper_init: wrapper_health
      Sep 4 11:32:04 SC_2_2 coordinatordwrapper: Health check OK
      Sep 4 11:32:05 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:05 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:05 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:05 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:05 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:05 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:05 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:05 SC_2_2 nfsd[12929]: nfssvc: Address already in use
      Sep 4 11:32:05 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:05 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:05 SC_2_2 failoverd: Panic situation emerging (7/10)
      Sep 4 11:32:06 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:06 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:06 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:06 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:06 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:06 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:06 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:06 SC_2_2 nfsd[12941]: nfssvc: Address already in use
      Sep 4 11:32:06 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:06 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:06 SC_2_2 failoverd: Panic situation emerging (8/10)
      Sep 4 11:32:07 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:07 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:07 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:07 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:07 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:07 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:07 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:07 SC_2_2 nfsd[12953]: nfssvc: Address already in use
      Sep 4 11:32:07 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:07 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:07 SC_2_2 failoverd: Panic situation emerging (9/10)
      Sep 4 11:32:08 SC_2_2 failoverd: Task started (/etc/init.d/primary start)
      Sep 4 11:32:08 SC_2_2 primary: Mounting DRBD filesystem
      Sep 4 11:32:08 SC_2_2 primary: Mounting NFS server filesystem
      Sep 4 11:32:08 SC_2_2 primary: Exporting NFS filesystem
      Sep 4 11:32:08 SC_2_2 primary: Starting NFS server
      Sep 4 11:32:08 SC_2_2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4
      state recovery directory
      Sep 4 11:32:08 SC_2_2 kernel: NFSD: starting 90-second grace period
      Sep 4 11:32:08 SC_2_2 nfsd[12969]: nfssvc: Address already in use
      Sep 4 11:32:08 SC_2_2 primary: Failed to start NFS daemon
      Sep 4 11:32:08 SC_2_2 failoverd: Task terminated (/etc/init.d/primary start),
      exit code 1
      Sep 4 11:32:08 SC_2_2 failoverd: Panic situation emerging (10/10)
      Sep 4 11:32:09 SC_2_2 failoverd: PANIC!
      Sep 4 11:32:09 SC_2_2 failoverd: SYSTEM REBOOTING IN 5 SECONDS
      Sep 4 11:32:10 SC_2_2 failoverd: SYSTEM REBOOTING IN 4 SECONDS
      Sep 4 11:32:11 SC_2_2 failoverd: SYSTEM REBOOTING IN 3 SECONDS
      Sep 4 11:32:12 SC_2_2 agentwrapper: Dispatching callback
      Sep 4 11:32:12 SC_2_2 agentwrapper: I am in amfHealthcheckCallback for
      safComp=CompT_MMAS_AGENT,safSu=SuT_MMAS_OAM_PAYLOAD,safNode=SC_2_2!
      Sep 4 11:32:12 SC_2_2 agentwrapper_init: wrapper_health
      Sep 4 11:32:12 SC_2_2 agentwrapper: Health check OK
      Sep 4 11:32:12 SC_2_2 failoverd: SYSTEM REBOOTING IN 2 SECONDS
      Sep 4 11:32:13 SC_2_2 rsyncdwrapper: Dispatching callback
      Sep 4 11:32:13 SC_2_2 rsyncdwrapper: I am in amfHealthcheckCallback for
      safComp=CompT_MMAS_RSYNCD,safSu=SuT_MMAS_RSYNCD,safNode=SC_2_2!
      Sep 4 11:32:13 SC_2_2 rsyncdwrapper_init: wrapper_health
      Sep 4 11:32:13 SC_2_2 rsyncdwrapper: Health check OK
      Sep 4 11:32:13 SC_2_2 failoverd: SYSTEM REBOOTING IN 1 SECONDS
      Sep 4 11:32:14 SC_2_2 failoverd: SYSTEM REBOOTING NOW
      Sep 4 11:32:16 SC_2_2 kernel: md: stopping all md devices.

      Then the whole cluster restarted.

      Running the netstat command shows that it is MMAS that is mistakenly
      bound to the NFS port:
      SC_2_1:~ # netstat -tulp | grep nfs
      udp        0      0 *:nfs        *:*          12383/java.orig
      udp        0      0 *:nfs        *:*          10129/java.orig

      SC_2_1:~ # ps -ef | grep java
      mmas 1624 12383 21 10:53 ? 00:04:12
      /opt/jdk1.6.0_12/jre/../bin/java.orig -Djava.net.preferIPv4Stack=true
      Dcom.sun.aas.instanceRoot=/opt/sailfin-v2
      b23/nodedata/nodeagents/SC_2_1/oam_instance_SC_2_1
      -DHTTP_LISTENER_PORT=28080 -DHTTP_SSL_LISTENER_PORT=28181
      -DIIOP_LISTENER_PORT=33700 -DIIOP_SSL_LISTENER_PORT=33821
      -DIIOP_SSL_MUTUALAUTH_PORT=33920 -DINTERNAL_IP=192.168.0.1
      -DJMS_PROVIDER_PORT=37676 -DJMX_SYSTEM_CONNECTOR_PORT=38686
      -DLifecycleModuleService.submitType=sync -DNET_DEVICE=134.138.83.13
      -DSIP_PORT=25060 -DSIP_SSL_PORT=25061 -DSIP_SS_PORT=25062
      -Dcom.sun.aas.ClassPathPrefix=/opt/sailfin-v2-b23/lib/comms-appserv-rt.jar
      -Dcom.sun.aas.ClassPathSuffix= -Dcom.sun.aas.ServerClassPath=
      -Dcom.sun.aas.classloader.appserverChainJars.ee=
      Dcom.sun.aas.classloader.appserverChainJars=admin-cli.jar,admin-cli
      ee.jar,j2ee-svc.jar
      Dcom.sun.aas.classloader.excludesList=admin-cli.jar,appserv-upgrade.jar,sun
      appserv-ant.jar
      -Dcom.sun.aas.classloader.optionalOverrideableChain.ee=
      Dcom.sun.aas.classloader.optionalOverrideableChain=webservices
      rt.jar,webservices-tools.jar
      Dcom.sun.aas.classloader.serverClassPath.ee=/lib/hadbjdbc4.jar,/opt/sailfin-v2
      b23/lib/SUNWjdmk/5.1/lib/jdmkrt.jar,/lib/dbstate.jar,/lib/hadbm.jar,/lib/hadbmgt
      .jar,/opt/sun/mfwk/share/lib/mfwk_instrum_tk.jar
      Dcom.sun.aas.classloader.serverClassPath=/opt/sailfin-v2
      b23/lib/install/applications/jmsra/imqjmsra.jar,/opt/sailfin-v2-
      b23/imq/lib/jaxm-api.jar,/opt/sailfin-v2-b23/imq/lib/fscontext.jar,/opt/sailfin-
      v2-b23/imq/lib/imqbroker.jar,/opt/sailfin-v2-
      b23/imq/lib/imqjmx.jar,/opt/sailfin-v2-b23/lib/ant/lib/ant.jar,/opt/sailfin-v2-
      b23/lib/SUNWjdmk/5.1/lib/jdmkrt.jar
      Dcom.sun.aas.classloader.sharedChainJars.ee=appserv-se.jar,appserv
      ee.jar,jesmf-plugin.jar,/lib/dbstate.jar,/lib/hadbjdbc4.jar,jgroups-
      all.jar,/opt/sun/mfwk/share/lib/mfwk_instrum_tk.jar
      -
      Dcom.sun.aas.classloader.sharedChainJars=javaee.jar,/opt/jdk1.6.0_12/jre/../lib/
      tools.jar,install/applications/jmsra/imqjmsra.jar,com-sun-commons-
      launcher.jar,com-sun-commons-logging.jar,/opt/sailfin-v2-b23/imq/lib/jaxm-
      api.jar,/opt/sailfin-v2-b23/imq/lib/fscontext.jar,/opt/sailfin-v2-
      b23/imq/lib/imqbroker.jar,/opt/sailfin-v2-b23/imq/lib/imqjmx.jar,/opt/sailfin-
      v2-b23/imq/lib/imqxm.jar,webservices-rt.jar,webservices-
      tools.jar,mail.jar,appserv-jstl.jar,jmxremote_optional.jar,/opt/sailfin-v2-
      b23/lib/SUNWjdmk/5.1/lib/jdmkrt.jar,activation.jar,appserv-rt.jar,appserv-
      admin.jar,appserv-cmp.jar,/opt/sailfin-v2-
      b23/updatecenter/lib/updatecenter.jar,/opt/sailfin-v2-
      b23/jbi/lib/jbi.jar,/opt/sailfin-v2-b23/imq/lib/imqjmx.jar,/opt/sailfin-v2-
      b23/lib/ant/lib/ant.jar,dbschema.jar
      -Dcom.sun.aas.configName=oam-config
      -Dcom.sun.aas.configRoot=/opt/sailfin-v2-b23/config
      Dcom.sun.aas.defaultLogFile=/opt/sailfin-v2
      b23/nodedata/nodeagents/SC_2_1/oam_instance_SC_2_1/logs/server.log
      -Dcom.sun.aas.domainName=domain1 -Dcom.sun.aas.installRoot=/opt/sailfin-v2-b23
      -Dcom.sun.aas.instanceName=oam_instance_SC_2_1 -Dcom.sun.aas.processLauncher=SE
      -Dcom.sun.aas.promptForIdentity=true
      -
      Dcom.sun.appserv.pluggable.extensions.amx=org.jvnet.glassfish.comms.admin.manage
      ment.extensions.SIPAMXSupport
      -
      Dcom.sun.appserv.pluggable.features=org.jvnet.glassfish.comms.server.pluggable.e
      xtensions.sip.SipEEPluggableFeatureImpl
      -
      Dcom.sun.enterprise.config.config_environment_factory_class=com.sun.enterprise.c
      onfig.serverbeans.AppserverConfigEnvironmentFactory
      -Dcom.sun.enterprise.overrideablejavaxpackages=javax.help,javax.portlet
      -Dcom.sun.enterprise.server.logging.max_history_files=10
      -Dcom.sun.enterprise.taglibs=appserv-jstl.jar,jsf-impl.jar
      -Dcom.sun.enterprise.taglisteners=jsf-impl.jar
      -Dcom.sun.updatecenter.home=/opt/sailfin-v2-b23/updatecenter
      -Ddomain.name=domain1 -Djava.endorsed.dirs=/opt/sailfin-v2-b23/lib/endorsed
      -
      Djava.ext.dirs=/opt/jdk1.6.0_12/jre/../lib/ext:/opt/jdk1.6.0_12/jre/../jre/lib/e
      xt:/opt/sailfin-v2-
      b23/nodedata/nodeagents/SC_2_1/oam_instance_SC_2_1/lib/ext:/opt/sailfin-v2-
      b23/javadb/lib:/opt/sailfin-v2-b23/lib/jdbcdrivers
      Djava.library.path=/opt/sailfin-v2-b23/lib:/opt/sailfin-v2
      b23/lib:/opt/sailfin-v2-b23/lib
      -Djava.security.auth.login.config=/opt/sailfin-v2-b23/nodedata/nod

      mmas 1869 1624 0 10:53 ? 00:00:00 /bin/sh
      /opt/sailfin-v2-b23/imq/bin/imqbrokerd -javahome /opt/jdk1.6.0_12
      -Dimq.log.file.rolloverbytes=2000000
      -Dimq.cluster.masterbroker=mq://SC_2_2:37676/
      -Dimq.cluster.brokerlist=mq://SC_2_1:37676/,mq://SC_2_2:37676/
      -Dimq.cluster.nowaitForMasterBroker=true -varhome
      /opt/sailfin-v2-b23/nodedata/nodeagents/SC_2_1/oam_instance_SC_2_1/imq
      -startRmiRegistry -rmiRegistryPort 37776 -Dimq.imqcmd.user=admin -passfile
      /tmp/asmq4893742025607423753.tmp -save -name oamoaminstanceSC21 -port 37676
      -bgnd -silent

      mmas 1889 1869 0 10:53 ? 00:00:01 /opt/jdk1.6.0_12/bin/java.orig
      -Djava.net.preferIPv4Stack=true -cp
      /opt/sailfin-v2-b23/imq/bin/../lib/imqbroker.jar:/opt/sailfin-v2-
      b23/imq/bin/../lib/imqutil.jar:/opt/sailfin-v2-
      b23/imq/bin/../lib/jsse.jar:/opt/sailfin-v2-
      b23/imq/bin/../lib/jnet.jar:/opt/sailfin-v2-
      b23/imq/bin/../lib/jcert.jar:/usr/lib/audit/Audit.jar:/opt/sun/mfwk/share/lib/jd
      mkrt.jar:/opt/sun/mfwk/share/lib/mfwk_instrum_tk.jar:/opt/SUNWhadb/4/lib/hadbjdb
      c4.jar:/opt/SUNWjavadb/derby.jar:/usr/share/java/postgresql.jar:/opt/sailfin-v2-
      b23/imq/bin/../lib/ext:/opt/sailfin-v2-b23/imq/bin/../lib/ext
      -Xms192m -Xmx192m -Xss128k -XX:MaxGCPauseMillis=5000
      -Dimq.home=/opt/sailfin-v2-b23/imq/bin/..
      Dimq.varhome=/opt/sailfin-v2
      b23/nodedata/nodeagents/SC_2_1/oam_instance_SC_2_1/imq
      -Dimq.etchome=/opt/sailfin-v2-b23/imq/bin/../etc
      -Dimq.libhome=/opt/sailfin-v2-b23/imq/bin/../lib
      com.sun.messaging.jmq.jmsserver.Broker -javahome /opt/jdk1.6.0_12
      -Dimq.log.file.rolloverbytes=2000000
      -Dimq.cluster.masterbroker=mq://SC_2_2:37676/
      -Dimq.cluster.brokerlist=mq://SC_2_1:37676/,mq://SC_2_2:37676/
      -Dimq.cluster.nowaitForMasterBroker=true -varhome
      /opt/sailfin-v2-b23/nodedata/nodeagents/SC_2_1/oam_instance_SC_2_1/imq
      -startRmiRegistry -rmiRegistryPort 37776 -Dimq.imqcmd.user=admin -passfile
      /tmp/asmq4893742025607423753.tmp -save -name oamoaminstanceSC21 -port 37676
      -bgnd -silent

      mmas 10129 1 11 10:50 ? 00:02:41
      /opt/jdk1.6.0_12/jre/../bin/java.orig -Djava.net.preferIPv4Stack=true
      -Dcom.sun.aas.instanceRoot=/cluster/home/mmas/nodes/DAS/domains/domain1
      -DLifecycleModuleService.submitType=sync
      -Dcom.sun.aas.ClassPathPrefix=/opt/sailfin-v2-b23/lib/comms-appserv-rt.jar
      -Dcom.sun.aas.ClassPathSuffix= -Dcom.sun.aas.ServerClassPath=
      -Dcom.sun.aas.classloader.appserverChainJars.ee=
      Dcom.sun.aas.classloader.appserverChainJars=admin-cli.jar,admin-cli
      ee.jar,j2ee-svc.jar
      Dcom.sun.aas.classloader.excludesList=admin-cli.jar,appserv-upgrade.jar,sun
      appserv-ant.jar
      -Dcom.sun.aas.classloader.optionalOverrideableChain.ee=
      Dcom.sun.aas.classloader.optionalOverrideableChain=webservices
      rt.jar,webservices-tools.jar
      Dcom.sun.aas.classloader.serverClassPath.ee=/lib/hadbjdbc4.jar,/opt/sailfin-v2
      b23/lib/SUNWjdmk/5.1/lib/jdmkrt.jar,/lib/dbstate.jar,/lib/hadbm.jar,/lib/hadbmgt
      .jar,/opt/sun/mfwk/share/lib/mfwk_instrum_tk.jar
      Dcom.sun.aas.classloader.serverClassPath=/opt/sailfin-v2
      b23/lib/install/applications/jmsra/imqjmsra.jar,/opt/sailfin-v2-
      b23/imq/lib/jaxm-api.jar,/opt/sailfin-v2-b23/imq/lib/fscontext.jar,/opt/sailfin-
      v2-b23/imq/lib/imqbroker.jar,/opt/sailfin-v2-
      b23/imq/lib/imqjmx.jar,/opt/sailfin-v2-b23/lib/ant/lib/ant.jar,/opt/sailfin-v2-
      b23/lib/SUNWjdmk/5.1/lib/jdmkrt.jar
      Dcom.sun.aas.classloader.sharedChainJars.ee=appserv-se.jar,appserv
      ee.jar,jesmf-plugin.jar,/lib/dbstate.jar,/lib/hadbjdbc4.jar,jgroups-
      all.jar,/opt/sun/mfwk/share/lib/mfwk_instrum_tk.jar
      -
      Dcom.sun.aas.classloader.sharedChainJars=javaee.jar,/opt/jdk1.6.0_12/jre/../lib/
      tools.jar,install/applications/jmsra/imqjmsra.jar,com-sun-commons-
      launcher.jar,com-sun-commons-logging.jar,/opt/sailfin-v2-b23/imq/lib/jaxm-
      api.jar,/opt/sailfin-v2-b23/imq/lib/fscontext.jar,/opt/sailfin-v2-
      b23/imq/lib/imqbroker.jar,/opt/sailfin-v2-b23/imq/lib/imqjmx.jar,/opt/sailfin-
      v2-b23/imq/lib/imqxm.jar,webservices-rt.jar,webservices-
      tools.jar,mail.jar,appserv-jstl.jar,jmxremote_optional.jar,/opt/sailfin-v2-
      b23/lib/SUNWjdmk/5.1/lib/jdmkrt.jar,activation.jar,appserv-rt.jar,appserv-
      admin.jar,appserv-cmp.jar,/opt/sailfin-v2-
      b23/updatecenter/lib/updatecenter.jar,/opt/sailfin-v2-
      b23/jbi/lib/jbi.jar,/opt/sailfin-v2-b23/imq/lib/imqjmx.jar,/opt/sailfin-v2-
      b23/lib/ant/lib/ant.jar,dbschema.jar
      -Dcom.sun.aas.configName=server-config
      -Dcom.sun.aas.configRoot=/opt/sailfin-v2-b23/config
      -
      Dcom.sun.aas.defaultLogFile=/cluster/home/mmas/nodes/DAS/domains/domain1/logs/se
      rver.log
      -Dcom.sun.aas.domainName=domain1 -Dcom.sun.aas.installRoot=/opt/sailfin-v2-b23
      -Dcom.sun.aas.instanceName=server -Dcom.sun.aas.processLauncher=SE
      -Dcom.sun.aas.promptForIdentity=true
      -
      Dcom.sun.appserv.pluggable.extensions.amx=org.jvnet.glassfish.comms.admin.manage
      ment.extensions.SIPAMXSupport
      -
      Dcom.sun.appserv.pluggable.features=org.jvnet.glassfish.comms.server.pluggable.e
      xtensions.sip.SipEEPluggableFeatureImpl
      -
      Dcom.sun.enterprise.config.config_environment_factory_class=com.sun.enterprise.c
      onfig.serverbeans.AppserverConfigEnvironmentFactory
      -Dcom.sun.enterprise.overrideablejavaxpackages=javax.help,javax.portlet
      -Dcom.sun.enterprise.taglibs=appserv-jstl.jar,jsf-impl.jar
      -Dcom.sun.enterprise.taglisteners=jsf-impl.jar
      -Dcom.sun.updatecenter.home=/opt/sailfin-v2-b23/updatecenter
      -Ddomain.name=domain1 -Djava.endorsed.dirs=/opt/sailfin-v2-b23/lib/endorsed
      -
      Djava.ext.dirs=/opt/jdk1.6.0_12/jre/../lib/ext:/opt/jdk1.6.0_12/jre/../jre/lib/e
      xt:/cluster/home/mmas/nodes/DAS/domains/domain1/lib/ext:/opt/sailfin-v2-
      b23/javadb/lib:/opt/sailfin-v2-b23/lib/jdbcdrivers
      Djava.library.path=/opt/sailfin-v2-b23/lib:/opt/sailfin-v2
      b23/lib:/opt/sailfin-v2-b23/lib
      -
      Djava.security.auth.login.config=/cluster/home/mmas/nodes/DAS/domains/domain1/co
      nfig/login.conf
      -
      Djava.security.policy=/cluster/home/mmas/nodes/DAS/domains/domain1/config/server
      .policy
      -Djava.util.logging.manager=com.sun.enterprise.server.logging.ServerLogManager
      -
      Djavax.management.builder.initial=com.sun.enterprise.ee.admin.AppServerMBeanServ
      erBuilder
      -
      Djavax.net.ssl.keyStore=/cluster/home/mmas/nodes/DAS/domains/domain1/config/keys
      tore.jks
      -Djavax.net.ssl.trustStore=/cluster/home/mmas/nodes/DAS/domains/domain1/c

      root 12005 31121 0 11:13 pts/0 00:00:00 grep java
      mmas 12383 1 1 10:50 ? 00:00:17
      /opt/jdk1.6.0_12/jre/../bin/java.orig -Djava.net.preferIPv4Stack=true
      -Dcom.sun.aas.instanceRoot=/opt/sailfin-v2-b23/nodedata/nodeagents/SC_2_1/agent
      -Dcom.sun.aas.configRoot=/opt/sailfin-v2-b23/config
      Dcom.sun.aas.defaultLogFile=/opt/sailfin-v2
      b23/nodedata/nodeagents/SC_2_1/agent/logs/server.log
      -Dcom.sun.aas.instanceName=SC_2_1 -Dcom.sun.aas.isNodeAgent=true
      -Dcom.sun.aas.promptForIdentity=true
      -
      Dcom.sun.appserv.admin.pluggable.features=com.sun.enterprise.ee.admin.pluggable.
      EEClientPluggableFeatureImpl
      Dcom.sun.appserv.nss.db=/opt/sailfin-v2
      b23/nodedata/nodeagents/SC_2_1/agent/config
      -
      Dcom.sun.appserv.pluggable.features=com.sun.enterprise.ee.server.pluggable.EEPlu
      ggableFeatureImpl
      -Djava.endorsed.dirs=/opt/sailfin-v2-b23/lib/endorsed
      Djava.library.path=/opt/sailfin-v2-b23/lib:/opt/sailfin-v2
      b23/lib:/opt/sailfin-v2-b23/lib
      Djava.security.auth.login.config=/opt/sailfin-v2
      b23/nodedata/nodeagents/SC_2_1/agent/config/login.conf
      -Djava.util.logging.manager=com.sun.enterprise.server.logging.ServerLogManager
      -Djmx.invoke.getters=true -XX:+UnlockDiagnosticVMOptions
      -XX:LogFile=/opt/sailfin-v2-b23/nodedata/nodeagents/SC_2_1/agent/logs/jvm.log
      -XX:+LogVMOutput -cp
      /opt/sailfin-v2-b23/lib/comms-appserv-rt.jar:/opt/sailfin-v2-b23/lib/appserv-
      launch.jar:/opt/sailfin-v2-b23/lib/appserv-se.jar:/opt/sailfin-v2-
      b23/lib/appserv-admin.jar:/opt/sailfin-v2-b23/lib/javaee.jar:/opt/sailfin-v2-
      b23/lib/appserv-rt.jar:/opt/sailfin-v2-b23/lib/appserv-ext.jar:/opt/sailfin-v2-
      b23/lib/shoal-gms.jar:/opt/sailfin-v2-b23/lib/jxta.jar:/opt/sailfin-v2-
      b23/lib/appserv-ee.jar
      com.sun.enterprise.ee.nodeagent.NodeAgentMain start
      startInstancesOverride=false syncInstances=true monitorInterval=5
      restartInstances=true
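      A conflict like this can be detected with a simple pre-flight check of
      the configured heartbeat port before failover is ever exercised. A
      minimal sketch (not from the original report; `port_in_use` is a
      hypothetical helper):

```python
import socket

def port_in_use(port: int) -> bool:
    """Return True if another process already holds this UDP port on any
    local address. UDP is checked because both the cluster heartbeat and
    the NFS service in this report use UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        try:
            # A wildcard bind fails with EADDRINUSE if any other socket
            # (e.g. the MMAS heartbeat) is bound to the same port.
            s.bind(("0.0.0.0", port))
        except OSError:
            return True
        return False
```

      On the affected node such a check would have flagged port 2049 while
      the MMAS instances held it.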

      MEASURES:

      We found that in our domain.xml, in the cluster section, port 2049 is
      assigned as the heartbeat-port of the traffic cluster:

      <cluster config-ref="traffic-config" heartbeat-address="228.8.22.9"
      heartbeat-enabled="true" heartbeat-port="2049" name="traffic">
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_7"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_10"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_9"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_6"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_8"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_5"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_4"/>
      <server-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="true" ref="traffic_instance_PL_2_3"/>
      <resource-ref enabled="true" ref="jdbc/__CallFlowPool"/>
      <resource-ref enabled="true" ref="jca/Licensing"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="__ejb_container_timer_app"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="__JWSappclients"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="MEjbApp"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="WSTXServices"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="JBIFramework"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="WSTCPConnectorLCModule"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="SipContainerLifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="cluster-traffic-CAF-lifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="cluster-traffic-MMAS-Logging-lifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="cluster-traffic-MMAS-Monitoring-lifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="cluster-traffic-MMAS-PM-lifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="cluster-traffic-MMAS-DNS-lifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="cluster-traffic-MMAS-Statistics-lifecycle"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="MMASLMF"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="InviteServlet-ear-1.0"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="OptionsServlet-ear-1.0"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="PmServlet-ear-1.0"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="PresenceServlet-ear-1.0"/>
      <application-ref disable-timeout-in-minutes="30" enabled="true"
      lb-enabled="false" ref="SimpleHTTPServlet-ear-1.0"/>
      </cluster>
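The conflicting port can be spotted in domain.xml before the cluster is ever started. A minimal sketch of such a check, assuming a typical domain.xml location (the default path below is an example, not a fixed value):

```shell
#!/bin/sh
# Sketch: read the heartbeat-port attribute of a cluster out of domain.xml
# and warn if that port is registered to a known service in /etc/services.
# The DOMAIN_XML default is an assumption; adjust it for your installation.
DOMAIN_XML="${DOMAIN_XML:-/opt/SUNWappserver/domains/domain1/config/domain.xml}"

heartbeat_port() {
    # Extract the first heartbeat-port="NNNN" attribute from the given file
    grep -o 'heartbeat-port="[0-9]*"' "$1" | head -n 1 | grep -o '[0-9]*'
}

if [ -f "$DOMAIN_XML" ]; then
    port=$(heartbeat_port "$DOMAIN_XML")
    if [ -n "$port" ] && grep -q "[^0-9]$port/" /etc/services 2>/dev/null; then
        echo "WARNING: heartbeat port $port is registered in /etc/services"
    fi
fi
```

Run against the domain.xml shown above, this would flag 2049, which /etc/services lists for NFS.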

      This is likely the cause of the problem.

      However, this configuration was not set manually by anyone on the MMAS teams;
      it appears that SGCS chose this port when creating the cluster. A patch from
      Sun is therefore needed so that SGCS avoids using reserved ports of this kind
      in the future.
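As a stop-gap until such a patch exists, a cluster-creation script could screen candidate heartbeat ports itself. A minimal sketch, assuming an illustrative blocklist (not an authoritative registry of reserved ports):

```shell
#!/bin/sh
# Sketch: reject heartbeat ports that are privileged (< 1024) or that are
# known to belong to another service, such as NFS on 2049.
# The case list is an illustrative assumption, not an exhaustive registry.
is_safe_heartbeat_port() {
    port=$1
    # Ports below 1024 are reserved for privileged system services
    [ "$port" -ge 1024 ] || return 1
    case "$port" in
        2049) return 1 ;;            # NFS, the conflict reported here
        3306|5432|8080) return 1 ;;  # other commonly occupied ports
    esac
    return 0
}

is_safe_heartbeat_port 2049 || echo "2049 is not a safe heartbeat port"
```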

        Activity

        Joe Fialli added a comment -

        Additional info to release note:

        The automated heartbeat-port selection should be assumed to be potentially
        invalid; part of starting up the cluster should be to set the heartbeat-port
        (using the asadmin CLI or the admin GUI) to a port that is known to be free
        on the system.

        Here is how to do it with the asadmin CLI for a cluster named
        "application-create-cluster":

        ${SF_HOME}/bin/asadmin set application-create-cluster.heartbeat-port=48991

        prasads added a comment -

        Adding to a release note

        prasads added a comment -

        Temporarily marking these issues as P4, for the Sailfin 2.0 release. The
        priority will be restored for the next release.

        chinmayee_srivathsa added a comment -

        Release noted as follows:

        Communications Server does not detect conflicts with the heartbeat port of a
        cluster (Issue number 1967)

        Description
        When a cluster is created, Communications Server randomly assigns a heartbeat
        port between 1026 and 45556. For default-cluster, the default cluster created
        by a Communications Server installation, a random number between 0 and 45556
        is selected. The cluster creation process does not accurately detect whether
        the heartbeat port is already in use by another service.

        Solution
        If the automated cluster-creation configuration selects a heartbeat port that
        conflicts with a port another service is already using, change the cluster's
        heartbeat port to one that is not in use on the system. To change the
        heartbeat port of a cluster, use the following asadmin command:

        asadmin set <cluster-name>.heartbeat-port=<newportnumber>

        Joe Fialli added a comment -

        The fix is to allocate the cluster heartbeat port at random from the
        IANA-unassigned port range 34380-34961
        (see http://www.iana.org/assignments/port-numbers).

        Just awaiting code review.
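The allocation described in this comment can be sketched as follows; this is a hedged approximation of the behavior, not the actual SailFin code:

```shell
#!/bin/sh
# Sketch: choose a heartbeat port uniformly at random from the
# IANA-unassigned range 34380-34961 mentioned in the fix.
random_heartbeat_port() {
    awk 'BEGIN { srand(); lo = 34380; hi = 34961;
                 print lo + int(rand() * (hi - lo + 1)) }'
}

random_heartbeat_port
```

Because the whole range is unassigned by IANA, any port drawn from it avoids collisions with registered services such as NFS on 2049.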


          People

          • Assignee:
            Joe Fialli
            Reporter:
            jimdumont