GLASSFISH-18208

Probable ssh connection leak: ssh connection between DAS and agent machines sometimes dies while running HA test on Windows 2008

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.1.2_b17
    • Fix Version/s: 3.1.2_b19, 4.0
    • Component/s: admin
    • Labels:
      None
    • Environment:

      windows 2008

      Description

      build: v3.1.2 promoted build17
      OS: windows 2008
      Has this passed before: Yes, with GF 3.1.

      While running the HA/MQ test on Windows 2008, the ssh connection sometimes randomly dies, and the following error is displayed when starting the cluster/instances:

      "Command execution failed. There was a problem while connecting to bigapp-x2250-3.us.oracle.com:22:Connection reset
      [testng]
      [testng] Please verify you have SSH configured correctly on your system with the proper attributes set on node agent2. You may use update-node-ssh to modify these attributes. See the DAS log file for more information.
      [testng]
      [testng]
      [testng] [testng] To complete this operation run the following command locally on host bigapp-x2250-3.us.oracle.com from the GlassFish install location C:/ha/glassfish3:
      [testng]
      [testng] bin/asadmin start-local-instance --node agent2 --sync normal instance107
      [testng]
      [testng] instance102: Could not start instance instance102 on node agent3 (ha-qe-2.us.oracle.com).
      [testng]
      [testng]
      [testng]
      [testng] This command requires connecting to host ha-qe-2.us.oracle.com using SSH to complete its operation, but it failed to connect:
      [testng]
      [testng] Command execution failed. There was a problem while connecting to ha-qe-2.us.oracle.com:22:Connection reset
      [testng]
      [testng] Please verify you have SSH configured correctly on your system with the proper attributes set on node agent3. You may use update-node-ssh to modify these attributes. See the DAS log file for more information.
      [testng]
      "

      I manually tried to ssh from the DAS machine jed-asqe-23 to the agent machines ha-qe-2 and bigapp-x2250-3; all attempts failed. See below:

      $ ssh ha-qe-2 uname -a
      ssh_exchange_identification: Connection closed by remote host

      $ ssh bigapp-x2250-3.us.oracle.com uname -a
      ssh_exchange_identification: Connection closed by remote host

      **After I restarted the domain, the ssh connection was re-established.**

      I collected a jmap histogram of the domain process before and after the ssh connections stopped working, using the command jmap -histo:live domain-pid > jmap.file (where domain-pid is the process id of the domain process).
      I also collected a jstack dump of the domain process after the ssh connections died, using the command jstack -l domain-pid > jstack.file.

      The files jmap.file.before (the jmap histogram taken before the ssh connections stopped), jmap.file.after (taken after they stopped), and jstack.file.after have been attached to this bug. Please let me know if you want access to the HA test systems for more details.
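
      For reference, a minimal sketch of capturing the same snapshots programmatically (illustrative only; the class name is made up, and it assumes the JDK's jmap and jstack tools are on the PATH and the domain pid is passed as the first argument):

      import java.io.File;
      import java.io.IOException;

      // Illustrative sketch: captures the jmap histogram and jstack dump described above.
      public class CaptureSnapshots {
          public static void main(String[] args) throws IOException, InterruptedException {
              String domainPid = args[0];
              capture(new File("jmap.file"), "jmap", "-histo:live", domainPid);
              capture(new File("jstack.file"), "jstack", "-l", domainPid);
          }

          private static void capture(File outputFile, String... command)
                  throws IOException, InterruptedException {
              new ProcessBuilder(command)
                      .redirectOutput(outputFile)     // equivalent of "> file" in the shell
                      .redirectErrorStream(true)      // fold stderr into the same file
                      .start()
                      .waitFor();
          }
      }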

      1. jmap.file.after
        285 kB
        sonialiu
      2. jmap.file.before
        286 kB
        sonialiu
      3. jstack.file.after
        235 kB
        sonialiu

        Issue Links

          Activity

          varunrupela added a comment -

          Updated summary to indicate an ssh connection leak.
          jmap.file.before shows 24 instances of com.trilead.ssh2.Connection
          jmap.file.after shows 104 instances of com.trilead.ssh2.Connection
          jstack shows ~150 threads in waiting state.

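          For reference, a minimal sketch of how such instance counts can be read out of the attached histogram files (illustrative only; the class name is made up, and it assumes the usual "num: #instances #bytes class name" column layout of jmap -histo output):

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Illustrative sketch: counts instances of a class in a "jmap -histo" dump
          // such as the attached jmap.file.before / jmap.file.after.
          public class CountInstances {
              public static void main(String[] args) throws IOException {
                  String histoFile = args[0];                    // e.g. jmap.file.after
                  String className = "com.trilead.ssh2.Connection";

                  long instances = Files.lines(Paths.get(histoFile))
                          .map(String::trim)
                          .filter(line -> line.endsWith(" " + className))
                          .mapToLong(line -> Long.parseLong(line.split("\\s+")[1]))   // #instances column
                          .sum();

                  System.out.println(className + ": " + instances + " instance(s)");
              }
          }
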
          Joe Di Pol added a comment -

          An example stack trace from jstack.file.after:

          "Thread-293" daemon prio=6 tid=0x0000000022203000 nid=0x1500 runnable [0x000000002542e000]
          java.lang.Thread.State: RUNNABLE
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.read(SocketInputStream.java:150)
          at java.net.SocketInputStream.read(SocketInputStream.java:121)
          at com.trilead.ssh2.crypto.cipher.CipherInputStream.fill_buffer(CipherInputStream.java:41)
          at com.trilead.ssh2.crypto.cipher.CipherInputStream.internal_read(CipherInputStream.java:52)
          at com.trilead.ssh2.crypto.cipher.CipherInputStream.getBlock(CipherInputStream.java:79)
          at com.trilead.ssh2.crypto.cipher.CipherInputStream.read(CipherInputStream.java:108)
          at com.trilead.ssh2.transport.TransportConnection.receiveMessage(TransportConnection.java:232)
          at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:672)
          at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:470)
          at java.lang.Thread.run(Thread.java:722)

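          That receive-loop thread is started when a com.trilead.ssh2.Connection is opened and appears to exit only when the connection is closed, so every leaked Connection pins one such thread and one session on the remote sshd. A minimal sketch of the leak pattern (illustrative only; host, credentials and class name are placeholders):

          import java.io.IOException;

          import com.trilead.ssh2.Connection;

          // Illustrative sketch: each Connection that is opened but never closed leaves its
          // transport receive thread blocked in SocketInputStream.socketRead0, as in the
          // stack trace above.
          public class ConnectionLeakDemo {
              public static void main(String[] args) throws IOException, InterruptedException {
                  for (int i = 0; i < 5; i++) {
                      Connection conn = new Connection("agent-host.example.com");
                      conn.connect();                                   // starts the receive-loop thread
                      conn.authenticateWithPassword("admin", "changeit");
                      // ... run a remote command, then return WITHOUT calling conn.close()
                  }
                  // Inspect with "jstack <pid>": one leaked reader thread per iteration.
                  Thread.sleep(Long.MAX_VALUE);
              }
          }
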
          Joe Di Pol added a comment -

          The stack trace is very similar to those in GLASSFISH-16910, which, in theory, was fixed in 3.1.2.

          Yamini K B added a comment -

          It appears this bug is not limited to Windows. Looking at the code and the test logs, the ssh connection leak is happening in stop-instance/stop-cluster. I checked the Solaris HA setup as well, and the problem is seen there too. I then reproduced the problem on my own setup and have verified a fix. I will check it in after I get approval.

          Yamini K B added a comment -
          • What is the impact on the customer of the bug?

          The problem is very easily reproducible; ssh connections are being leaked (one per invocation of stop-instance or stop-cluster). This can become serious when resource limits are reached: as mentioned in the description, ssh connections will start failing.

          • What is the cost/risk of fixing the bug?

          Fix is simple and low risk since it doesn't impact any other module.

          • Is there an impact on documentation or message strings?

          No.

          • Which tests should QA (re)run to verify the fix did not destabilize GlassFish?

          HA tests (which continuously stop and start clusters)

          • Which is the targeted build of 3.1.2 for this fix?

          B19

          Yamini K B added a comment -

          Fixed the connection leaks in setup-ssh (remote), install-node, uninstall-node, stop-instance
          3.1.2: Rev52249
          Trunk: Rev52254

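          The general shape of such a fix is to release the SSH session and connection in a finally block. A minimal sketch of that pattern (illustrative only; host, user, command and class name are placeholders, not the actual GlassFish code):

          import java.io.IOException;

          import com.trilead.ssh2.Connection;
          import com.trilead.ssh2.Session;

          // Illustrative sketch: close-in-finally with trilead-ssh2 so that neither the
          // channel nor the underlying connection (and its receive thread) is leaked,
          // even when authentication or the remote command fails.
          public final class SshCommandRunner {
              public static void runRemote(String host, String user, String password, String command)
                      throws IOException {
                  Connection conn = new Connection(host);
                  Session session = null;
                  try {
                      conn.connect();
                      if (!conn.authenticateWithPassword(user, password)) {
                          throw new IOException("SSH authentication failed for " + user + "@" + host);
                      }
                      session = conn.openSession();
                      session.execCommand(command);
                  } finally {
                      if (session != null) {
                          session.close();   // release the channel
                      }
                      conn.close();          // without this the connection and its thread leak
                  }
              }
          }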

            People

            • Assignee:
              Yamini K B
              Reporter:
              sonialiu
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: