Issue Details (XML | Word | Printable)

Key: GLASSFISH-15537
Type: Bug Bug
Status: Open Open
Priority: Major Major
Assignee: Byron Nevins
Reporter: Joe Di Pol
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
glassfish

restart-instance takes a long time on Solaris 11 (sun.security.pkcs11.SunPKCS11)

Created: 11/Jan/11 05:30 PM   Updated: 18/Feb/13 06:04 PM  Due: 18/Apr/11
Component/s: admin
Affects Version/s: 3.1
Fix Version/s: future release

Time Tracking:
Not Specified

File Attachments: 1. File rpc.c (15 kB) 18/Jan/11 05:51 PM - Joe Di Pol

Environment:

Oracle Solaris 11 Express snv_151a X86
java version "1.6.0_23"
Private build based on r44408


Tags: 3_1-exclude 3_1-next_release-note-added 3_1-next_release-notes 3_1_1-scrubbed 3_1_x-exclude
Participants: Byron Nevins, Chris Kasso, Joe Di Pol, Paul Davies and Scott Fordin


 Description  « Hide

restart-instance often takes minutes to execute on Solaris 11. To reproduce:

  1. Start the DAS. On the DAS machine:
    asadmin create-local-instance i1
    asadmin start-local-instance i1
    asadmin stop-local-instance i1
    asadmin start-local-instance i1
  1. So far, so good. This all works great. I can stop/start
  2. many times and I don't see the problem. Then do:

asadmin restart-instance i1

Sometimes this returns promptly. But if I run restart repeatedly
often it takes a while to return (one time 9 minutes).
The instance does start eventually. If I do a jstack of
the instance process while it is "hung" I see consistently this
thread which appears to be blocked in a pkcs11 call (see stack
trace below).

I have not been able to reproduce this on Linux.

"FelixStartLevel" daemon prio=3 tid=0x089a5400 nid=0xe runnable [0xcd17e000]
java.lang.Thread.State: RUNNABLE
at sun.security.pkcs11.wrapper.PKCS11.C_Initialize(Native Method)
at sun.security.pkcs11.wrapper.PKCS11.getInstance(PKCS11.java:160)

  • locked <0xcebc0c58> (a java.lang.Class for sun.security.pkcs11.wrapper.PKCS11)
    at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:281)
    at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:86)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at sun.security.jca.ProviderConfig$3.run(ProviderConfig.java:243)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.jca.ProviderConfig.doLoadProvider(ProviderConfig.java:225)
    at sun.security.jca.ProviderConfig.getProvider(ProviderConfig.java:205)
  • locked <0xf1e8f838> (a sun.misc.Launcher$AppClassLoader)
    at sun.security.jca.ProviderList.getProvider(ProviderList.java:215)
    at sun.security.jca.ProviderList.getService(ProviderList.java:313)
    at sun.security.jca.GetInstance.getInstance(GetInstance.java:140)
    at java.security.Security.getImpl(Security.java:659)
    at java.security.MessageDigest.getInstance(MessageDigest.java:129)
    at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1759)
    at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:52)
    at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:205)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:202)
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:558)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1583)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1732)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at java.util.HashMap.readObject(HashMap.java:1030)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1849)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
    at org.jvnet.hk2.osgiadapter.OSGiModulesRegistryImpl.loadCachedData(OSGiModulesRegistryImpl.java:195)
    at org.jvnet.hk2.osgiadapter.OSGiModulesRegistryImpl.<init>(OSGiModulesRegistryImpl.java:91)
    at org.jvnet.hk2.osgiadapter.OSGiFactoryImpl.createModulesRegistry(OSGiFactoryImpl.java:70)
    at org.jvnet.hk2.osgiadapter.OSGiFactoryImpl.createModulesRegistry(OSGiFactoryImpl.java:52)
    at org.jvnet.hk2.osgiadapter.HK2Main.createModulesRegistry(HK2Main.java:185)
    at org.jvnet.hk2.osgiadapter.HK2Main.start(HK2Main.java:142)
    at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:629)
    at org.apache.felix.framework.Felix.activateBundle(Felix.java:1827)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:1744)
    at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:922)
    at org.glassfish.kernel.GlassFishActivator.startBundle(GlassFishActivator.java:239)
    at org.glassfish.kernel.GlassFishActivator.startBundles(GlassFishActivator.java:194)
    at org.glassfish.kernel.GlassFishActivator.start(GlassFishActivator.java:80)
    at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:629)
    at org.apache.felix.framework.Felix.activateBundle(Felix.java:1827)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:1744)
    at org.apache.felix.framework.BundleImpl.start(BundleImpl.java:922)
    at org.jvnet.hk2.osgimain.Main.start(Main.java:154)
    at org.apache.felix.framework.util.SecureAction.startActivator(SecureAction.java:629)
    at org.apache.felix.framework.Felix.activateBundle(Felix.java:1827)
    at org.apache.felix.framework.Felix.startBundle(Felix.java:1744)
    at org.apache.felix.framework.Felix.setActiveStartLevel(Felix.java:1148)
    at org.apache.felix.framework.StartLevelImpl.run(StartLevelImpl.java:264)
    at java.lang.Thread.run(Thread.java:619)

The instance log file has nothing of interest. The message before the "hang" is:

Jan 11, 2011 5:21:30 PM com.sun.enterprise.admin.launcher.GFLauncherLogger info
INFO: Successfully launched in 39 msec.

and the first message after the "hang" is:

Jan 11, 2011 5:25:02 PM null
INFO: Running GlassFish Version: GlassFish Server Open Source Edition 3.1-SNAPSHOT (build dipol-private)
[#|2011-01-11T17:25:02.483-0800|INFO|glassfish3.1|null|_ThreadID=12;_ThreadName=Thread-1;|Grizzly Framework 1.9.28 started in: 89ms - bound to [0.0.0.0:28080]|#]



Chris Kasso added a comment - 12/Jan/11 10:30 AM

I'm able to reproduce this as well on S11. It occurred once in seven restarts (it took six minutes to restart):

ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 0m19.39s
user 0m16.59s
sys 0m0.83s
ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 0m32.63s
user 0m19.04s
sys 0m0.83s
ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 6m41.79s
user 0m21.63s
sys 0m1.22s
ouch: cd v3
ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 0m32.51s
user 0m19.70s
sys 0m1.26s
ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 0m31.14s
user 0m18.31s
sys 0m0.99s
ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 0m22.61s
user 0m20.44s
sys 0m0.85s
ouch: time ./asadmin restart-instance i1
i1 was restarted.
Command restart-instance executed successfully.

real 0m56.26s
user 0m19.77s
sys 0m0.81s


Byron Nevins added a comment - 12/Jan/11 04:21 PM

Solaris 11 machine I don't have.

Also – the hung thread is in Felix. Hint.

Joe/Chris - Please try again using no OSGi and then Equinox.


Joe Di Pol added a comment - 12/Jan/11 05:11 PM - edited

I set

GlassFish_Platform=Static

and could still reproduce the problem. The thread trace is now different. It looks like the first call that causes the PKCS11 security provider to be initialized triggers the problem. The weird thing is that this only seems to happen on re-start. I haven't seen this when starting an instance.

The stack trace for when Felix is out of the loop:

"main" prio=3 tid=0x08072c00 nid=0x2 runnable [0xfe2fd000]
java.lang.Thread.State: RUNNABLE
at sun.security.pkcs11.wrapper.PKCS11.C_Initialize(Native Method)
at sun.security.pkcs11.wrapper.PKCS11.getInstance(PKCS11.java:160)

  • locked <0xcef09e80> (a java.lang.Class for sun.security.pkcs11.wrapper.PKCS11)
    at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:281)
    at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:86)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at sun.security.jca.ProviderConfig$3.run(ProviderConfig.java:243)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.jca.ProviderConfig.doLoadProvider(ProviderConfig.java:225)
    at sun.security.jca.ProviderConfig.getProvider(ProviderConfig.java:205)
  • locked <0xda6007f8> (a sun.misc.Launcher$AppClassLoader)
    at sun.security.jca.ProviderList.getProvider(ProviderList.java:215)
    at sun.security.jca.ProviderList$3.get(ProviderList.java:130)
    at sun.security.jca.ProviderList$3.get(ProviderList.java:125)
    at java.util.AbstractList$Itr.next(AbstractList.java:345)
    at java.security.SecureRandom.getPrngAlgorithm(SecureRandom.java:522)
    at java.security.SecureRandom.getDefaultPRNG(SecureRandom.java:165)
    at java.security.SecureRandom.<init>(SecureRandom.java:133)
    at

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
com.sun.enterprise.v3.admin.LocalPasswordImpl.postConstruct(LocalPasswordImpl.java:87)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
at com.sun.hk2.component.AbstractCreatorImpl.inject(AbstractCreatorImpl.java:128)
at com.sun.hk2.component.ConstructorCreator.initialize(ConstructorCreator.java:88)
at com.sun.hk2.component.AbstractCreatorImpl.get(AbstractCreatorImpl.java:79)
at com.sun.hk2.component.SingletonInhabitant.get(SingletonInhabitant.java:64)

  • locked <0xf6fe29b8> (a com.sun.hk2.component.SingletonInhabitant)
    at com.sun.hk2.component.EventPublishingInhabitant.get(EventPublishingInhabitant.java:136)
    at com.sun.hk2.component.AbstractInhabitantImpl.get(AbstractInhabitantImpl.java:73)
    at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:219)
    at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:135)
  • locked <0xfa648f10> (a com.sun.enterprise.v3.server.AppServerStartup)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
  • locked <0xfa648ef0> (a com.sun.enterprise.glassfish.bootstrap.StaticGlassFishRuntime$1)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)
    at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)

Byron Nevins added a comment - 12/Jan/11 07:47 PM

From the internet:

No it hasn't anything to do with the classpath I've used...

I've investigated the problem a little bit further by profiling my
simple jdbc test class and found out that most of the time is spent in
this class:

sun.security.pkcs11.wrapper.PKCS11.C_Initialize

98% of the time is spent in that function with 3694 calls (well it
will last forever if I don't kill the process)!

So I guess it's a problem with Sun/Solaris security mixed with nfs issue...

Christian

On Feb 16, 2008 4:09 PM, Dave Cramer <pg(at)fastcrypt(dot)com> wrote:
>
> On 16-Feb-08, at 2:45 PM, Christian Bourque wrote:
>
> > Hi Tom,
> >
> > I think you were right, there is something in the home directory that
> > is conflicting!
> >
> > I don't understand why jdbc would use something in the user's home
> > directory though...
> >
> > Anyway the workaround I found was to not map the share directly under
> > the user's home account and now it works!
> >
> My guess is your classpath ?
>
> Dave
> > Thanks for your help!
> >
> > Christian
> >
> > On Feb 16, 2008 1:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> "Christian Bourque" <christian(dot)bourque(at)gmail(dot)com> writes:
> >>> The pg server is running on server B (openSUSE) and I'm connecting
> >>> to
> >>> it from server C (Solaris 10) using a user account with an nfs
> >>> mounted
> >>> directory from server A (openSUSE).
> >>
> >>> If I unmount the nfs share it works! But with the share it doesn't,
> >>> the connection seems to reach the server but it gets stuck there, no
> >>> error and it doesn't timeout either...
> >>
> >> It's way too hard to believe that NFS per se is interfering.
> >>
> >> What I could believe is that there is some configuration-type file in
> >> your home directory that is causing a problem, and after the unmount
> >> it's not visible so no problem. There are obvious possibilities for
> >> this such as ~/.psqlrc if you're using psql/libpq, but I dunno enough
> >> about the JDBC environment to guess whether it has equivalents.
> >>
> >> If all else fails, you could try strace'ing the application (or
> >> whatever
> >> Solaris' equivalent to strace is) in both cases and comparing
> >> results.
> >> That would at least make it clearer where it's hanging up ...
> >>
> >> regards, tom lane
> >>
> >


Byron Nevins added a comment - 12/Jan/11 09:29 PM

My Analysis:

Facts:

  • PKCS native code has an issue in Solaris 11
  • The problem never happens Solaris < 11 or any other OS
  • The problem never happens with start-domain – only with restart
  • It is intermittent
  • It has nothing to do with LocalPasswordImpl – it hangs with the first of many calls from GF
    to PKCS code. LocalPasswordImpl happens to be the first when running w/o OSGi

Conclusions:

  • It's a race condition inside the native code. The restarting server is coming up too fast for the native code to completely finish. Sometimes.

Fix:
restart-server should wait a bit longer before starting – at least in this one platform.


Byron Nevins added a comment - 13/Jan/11 10:09 AM

======== Here is how restart works in this bug's case

1) Inside the server a new JVM process is spawned to run start-domain
2) The server continues dying
3) Meantime the new JVM is running start-domain from AsadminMain
4) The new JVM waits for its parent process to die
5) start-domain then proceeds as normal

step (4) is difficult. Impossible without using native calls. Here is how it does it now:

(1) It waits for a read call on System.in to return -1.
(2) It positively checks all the ports the server was using

When all the ports are free – we define the server as dead.
==================================================================

THe next step would be to absolutely positively verify that the process itself is gone. Luckily we have ProcessUtils available!


Byron Nevins added a comment - 13/Jan/11 10:44 AM

PROBLEM:

In Solaris 11 restart-domain intermittently hangs. This is caused by Security SPI native code from sun that is included in the JDK.

It has only been observed on Solaris 11 systems at a ~~ 10-15% frequency

How bad is its impact? (Severity)
When it hits – the server hangs and now you can never restart it again. I assume it requires a forcedful kill.

How often does it happen?
10-15% on Sol 11

Will many users see this problem? (Frequency)
The propertion of users using any kind of Solaris is small, I assume. The proportion of them that are using Sol 11 will be small but growing over time.

How much effort is required to fix it? (Cost)
2-3 days. A lot of time-consuming testing, getting remote machines configured, etc. It's intermittent so tests have to be run over and over and over...

What is the risk of fixing it and how will the risk be mitigated? (Risk)
The risk is that restart-server stops working on other platforms
Mitigated by being very careful & conservative. I will probably end up with special
code that only runs on Sol 11 machines.


Joe Di Pol added a comment - 13/Jan/11 03:26 PM - edited

Here is some output from pstack on a "hung" instance. I also ran pfiles, but that didn't show much. I'm not sure what is going on, but some observations:

  • The trace has sendTCSDPacket() on it
  • tcsd(8) is a daemon that acts as the portal to the TPM (Trusted Platform Module) device driver.
  • I think the TPM device driver is how hardware crypto devices are accessed.
  • tcsd(8) is not running on my system because there is no /dev/tpm (maybe because I have no crypto chip on my system?).

Since start-instance works OK the problem is not just that tcsd(8) is not running, but it would be interesting to know if the problem happens on an S11 system where tcsd(8) is running.

----------------- lwp# 2 / thread# 2 --------------------
fee91b55 connect (100, fe2ed1c0, 10, 1)
fed9ca03 connect (100, fe2ed1c0, 10, ccfaf417) + 23
ccfaf480 send_init (9c81718, 0, 9b5e1a0, ccfaf16a) + b8
ccfaf22d sendTCSDPacket (9c81718, 0, cd170a40, ccfaafb2) + d1
ccfaafe0 RPC_OpenContext_TP (9c81718, fe2ed278, fe2ed27c, fe2ed274) + 3c
ccfab12d RPC_OpenContext (c0000004, 81bf330, 1, ccfaa93e) + 4d
ccfaa9d1 Tspi_Context_Connect (c0000004, 0, fe2ed31c, cd01de0a) + a1
cd01de34 open_tss_context (fe2ed35c, fee920f5, cd033978, cd01deae) + 38
cd01dec0 token_specific_init (0, 0, fe2ed35c, cd014267) + 20
cd0144b9 ST_Initialize (cd036748) + 33d
cd0093dc C_Initialize (927a610, cd1d8968, fe2ed3e8, cd1d6615) + 158
cd1d6830 pkcs11_slot_mapping (9c32008, 927a610, fe2ed498, cd1d1587) + 22c
cd1d15b3 C_Initialize (927a610) + fb
cd21629a Java_sun_security_pkcs11_wrapper_PKCS11_C_1Initialize (8072d18, fe2ed518, fe2ed514, 100, f7214d18, fe2ed4e4) + 4e
faa0a152 ???????? (0, faa080f9, f71d0ab0, f7214d18, 1, cef09e88)
faa0308d ???????? (0, f7214d18, 0, f71d0ab0, ceef2570, f714c578)
faa02f27 ???????? (0, 0, 0, 0, 0, 0)
faa0308d ???????? (da9e03d8, f70f4640, 8073438, 1fa0, febc8000, 2)
faa00348 ???????? (fe2ed650, fe2ed95c, a, ceee1e98, faa090e0, fe2ed8d0)
fe50fe95 _1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThreadv (fe2ed958, fe2ed728, fe2ed8cc, 8072c00) + 1c9
fe510137 _1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThreadv2468_v (fe50fccc, fe2ed958, fe2ed728, fe2ed8cc, 8072c00) + 27
fe51016f _1cJJavaCallsEcall6FpnJJavaValue_nMmethodHandle_pnRJavaCallArguments_pnGThreadv (fe2ed958, 8073438, fe2ed8cc, 8072c00) + 2f
fe9b776a _1cKReflectionGinvoke6FnTinstanceKlassHandle_nMmethodHandle_nGHandle_bnOobjArrayHandle_nJBasicType_4bpnGThreadpnHoopDesc_ (8073434, 8073438, 8073440, 0, 8073430, e) + c3e
fe562060 _1cKReflectionSinvoke_constructor6FpnHoopDesc_nOobjArrayHandle_pnGThread2 (f70e29f8, 807342c, 8072c00) + 200
fe561c50 JVM_NewInstanceFromConstructor (8072d18, fe2edbac, fe2edba8) + 118
fe21ccbf Java_sun_reflect_NativeConstructorAccessorImpl_newInstance0 (8072d18, fe2edba0, fe2edbac, fe2edba8, f70e2a60, f70e2a48) + 23
faa0a152 ???????? (ce6d1280, faa08116, f70e2a38, f70e29f8, fe2edbb0, ce6d0f88)
faa02f27 ???????? (0, f70e2a38, f70e2a48, fe2edbe4, ce6d167d, fe2edc10)
faa02f27 ???????? (f70e2a38, f70e2a60, fe2edc14, ce646024, fe2edc44, ce6cfaf0)
faa03403 ???????? (0, f70e2a38, f70e29f8, fe2edc48, ce90c3b7, fe2edc80)
faa02f27 ???????? (f70e29f8, 0, ceee5210, da6007f8, f70338b8, 8072c00)
faa00348 ???????? (fe2edcd0, fe2eded0, a, ce90c490, faa090e0, fe2ededc)
fe50fe95 _1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThreadv (fe2edecc, fe2edda8, fe2eded8, 8072c00) + 1c9
fe510137 _1cCosUos_exception_wrapper6FpFpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThreadv2468_v (fe50fccc, fe2edecc, fe2edda8, fe2eded8, 8072c00) + 27
fe51016f _1cJJavaCallsEcall6FpnJJavaValue_nMmethodHandle_pnRJavaCallArguments_pnGThreadv (fe2edecc, 8073428, fe2eded8, 8072c00) + 2f
fe54eaca JVM_DoPrivileged (8072d18, fe2ee0b8, fe2ee0c0, 0, 0) + 34a
fe219f98 Java_java_security_AccessController_doPrivileged__Ljava_security_PrivilegedAction_2 (8072d18, fe2ee0b8, fe2ee0c0, f70338b8, fe2ee090, 0) + 28
faa0a152 ???????? (ce685ce8, faa08116, f70338b8, fe2ee0c4, ce8cde48, fe2ee0f0)
faa02f27 ???????? (0, da9e03a0, 6fb666b1, da6007f8, fe2ee0f4, ce8cdd75)
faa02f27 ???????? (0, 0, 0, da6007f8, 0, da9e03a0)
faa02f27 ???????? (0, 0, da9d1170, fe2ee170, ce8d382d, fe2ee19c)
faa02f27 ???????? (0, da9e0478, fe2ee1a0, ce8d38ba, fe2ee1cc, ce8d3c08)
faa02f27 ???????? (0, da9e0478, f70338a0, 807344c, ce8c9000, fe2ee1fc)
faa9bcb8 ???????? (fec0fd68, 10, 28, ce689750, ce68e148, ce6064d8)
ce600150 ???????? ()


Byron Nevins added a comment - 14/Jan/11 05:54 PM

Nothing has yet worked.


Byron Nevins added a comment - 14/Jan/11 09:04 PM - edited

This fixes the problem

-Dsun.security.pkcs11.enable-solaris=false

maybe not. Look into it further...


Byron Nevins added a comment - 18/Jan/11 02:10 PM

The bug is caused by native code, Security SPI, provided by the JDK. It is too risky to resolve for 3.1

Game Plan:
We need to resolve this within 12 weeks of Solaris 11 introduction.
For initial 3.1 release we document it.


Joe Di Pol added a comment - 18/Jan/11 04:21 PM - edited

More data:

I ran "truss -f -t connect -v connect -p 13064", where 13064 was the pid of the instance I was going to restart. Here is the interesting part:

When things work fine I see:

13076/2: connect(50, 0xFE12D2A0, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
13076/2: AF_INET name = 127.0.0.1 port = 30003
13076/2: connect(50, 0xFE12D220, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
13076/2: AF_INET name = 127.0.0.1 port = 30003
13076/2: connect(50, 0xFE12D2C0, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
13076/2: AF_INET name = 127.0.0.1 port = 30003

When I get the hang I see:

13078/14: connect(280, 0xCD1906A0, 16, SOV_DEFAULT) (sleeping...)
13078/14: AF_INET name = 127.0.0.1 port = 30003
13076/2: Received signal #13, SIGPIPE [caught]
. . .
13076/2: Received signal #13, SIGPIPE [caught]
13078/14: connect(280, 0xCD1906A0, 16, SOV_DEFAULT) Err#145 ETIMEDOUT
13078/14: AF_INET name = 127.0.0.1 port = 30003
13076/2: Received signal #13, SIGPIPE [caught]
13078/14: connect(280, 0xCD190620, 16, SOV_DEFAULT) (sleeping...)
13078/14: AF_INET name = 127.0.0.1 port = 30003
13076/2: Received signal #13, SIGPIPE [caught]
13076/2: Received signal #13, SIGPIPE [caught]
13078/14: connect(280, 0xCD190620, 16, SOV_DEFAULT) Err#145 ETIMEDOUT
13078/14: AF_INET name = 127.0.0.1 port = 30003
13078/14: connect(280, 0xCD1906C0, 16, SOV_DEFAULT) Err#146 ECONNREFUSED
13078/14: AF_INET name = 127.0.0.1 port = 30003

Observations:

  • The connect is trying to connect to port 30003. That's the default tcsd(8) port.
  • When connect is hanging it fails with ETIMEDOUT instead of failing immediately with ECONNREFUSED.
  • It looks like the connect in question is called 3 times.
  • During a "hang" netstat -a shows:
    localhost.59145 localhost.30003 0 0 131072 0 SYN_SENT
  • Netstat shows nothing listening on port 30003

Joe Di Pol added a comment - 18/Jan/11 05:51 PM - edited

sendTCSDPacket() and send_init() are in the Trouser 0.3.4 client code (0.3.4 looks to be the version on my Solaris 11 box). The Trouser 0.3.4 source can be found at: http://sourceforge.net/projects/trousers/files/trousers/0.3.4/. I've attached rpc.c that contains sendTCSDPacket() and send_init(). Nothing seems obviously wrong. Note in particular that send_init() does close the socket on an error.


Byron Nevins added a comment - 24/Jan/11 09:44 AM

In 3.1 this will be documented.
It should be fixed for 3.2


Paul Davies added a comment - 16/May/11 01:08 PM

The Release Notes should just mention the problem of intermittent slow start on Solaris 11. There is no workaround (except, perhaps, to wait).


Scott Fordin added a comment - 16/May/11 06:03 PM

Added issue to 3.1.1 Release Notes.