[GLASSFISH-18763] EJB bundle hangs on stopping when the bundle is updated Created: 24/May/12  Updated: 19/Jun/12

Status: Open
Project: glassfish
Component/s: OSGi-JavaEE
Affects Version/s: 3.1.2
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: pbakker
Assignee: Sanjeeb Sahoo
Resolution: Unresolved
Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: agenda.api.jar (Java Archive File), agenda.service.simple.jar (Java Archive File), agenda.storage.ejb.jar (Java Archive File), threaddump.txt (Text File)

 Description   

Bundles that contain EJBs deadlock on "stopping" when the bundle is restarted for an update. Glassfish has to be killed (shutdown doesn't work) and the bundle cache cleaned to resolve the issue.
Probably not related, but the update is triggered by DeploymentAdmin (Apache ACE). Other bundles (non-ejb) work without problem.

The attached bundle is an example of a bundle that always hangs when updated.



 Comments   
Comment by pbakker [ 24/May/12 ]

I did a thread dump of the Glassfish process and found the following that might be related. I also attached the full thread dump.

"FelixFrameworkWiring" daemon prio=5 tid=0000000001f00000 nid=0xb4e1d000 waiting for monitor entry [00000000b4e1c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.glassfish.osgijavaeebase.OSGiContainer.isDeployed(OSGiContainer.java:218)
    - waiting to lock <00000000142e6890> (a org.glassfish.osgijavaeebase.OSGiContainer)
    at org.glassfish.osgijavaeebase.JavaEEExtender$HybridBundleTrackerCustomizer.removedBundle(JavaEEExtender.java:186)
    at org.osgi.util.tracker.BundleTracker$Tracked.customizerRemoved(BundleTracker.java:508)
    at org.osgi.util.tracker.BundleTracker$Tracked.customizerRemoved(BundleTracker.java:424)
    at org.osgi.util.tracker.AbstractTracked.untrack(AbstractTracked.java:352)
    at org.osgi.util.tracker.BundleTracker$Tracked.bundleChanged(BundleTracker.java:464)
    at org.apache.felix.framework.util.EventDispatcher.invokeBundleListenerCallback(EventDispatcher.java:868)
    at org.apache.felix.framework.util.EventDispatcher.fireEventImmediately(EventDispatcher.java:789)
    at org.apache.felix.framework.util.EventDispatcher.fireBundleEvent(EventDispatcher.java:514)
    at org.apache.felix.framework.Felix.fireBundleEvent(Felix.java:4244)
    at org.apache.felix.framework.Felix.stopBundle(Felix.java:2351)
    at org.apache.felix.framework.Felix$RefreshHelper.stop(Felix.java:4629)
    at org.apache.felix.framework.Felix.refreshPackages(Felix.java:3951)
    at org.apache.felix.framework.FrameworkWiringImpl.run(FrameworkWiringImpl.java:172)
    at java.lang.Thread.run(Thread.java:680)

   Locked ownable synchronizers:
    - None

Comment by Sanjeeb Sahoo [ 25/May/12 ]

This may be related to GLASSFISH-18159. Can you try the following:

1. download http://search.maven.org/remotecontent?filepath=org/glassfish/fighterfish/osgi-javaee-base/1.0.2/osgi-javaee-base-1.0.2.jar
2. copy it over glassfish/modules/osgi-javaee-base.jar (you can back up the original file first if you like).
3. retry the operation that causes the hang.

Thanks much for reporting,
Sahoo

Comment by pbakker [ 25/May/12 ]

After updating the osgi-javaee-base bundle the problem is slightly different.

The EJB bundle restarts correctly now, but instead a plain OSGi bundle that uses the service published by the EJB bundle now deadlocks on stopping with the exact same thread dump.
I also attached the bundle that now deadlocks.

Comment by Sanjeeb Sahoo [ 25/May/12 ]

Please list the exact steps needed to reproduce the problem using a GlassFish installation. Thanks much.

Comment by pbakker [ 25/May/12 ]

We had a look at the code that deadlocks and found some major issues that can easily lead to deadlocks.
It's the org.glassfish.osgijavaeebase.OSGiContainer class.

The problem is that the deploy/undeploy methods are synchronized, so the monitor is held even while services are registered and looked up. The OSGi spec also warns against this.

Chapter 4.7.3:

"Synchronization Pitfalls
Generally, a bundle that calls a listener should not hold any Java monitors. This means that neither the Framework nor the originator of a synchronous event should be in a monitor when a callback is initiated.
The purpose of a Java monitor is to protect the update of data structures. This should be a small region of code that does not call any code the effect of which cannot be overseen. Calling the OSGi Framework from synchronized code can cause unexpected side effects. One of these side effects might be deadlock. A deadlock is the situation where two threads are blocked because they are waiting for each other.
Time-outs can be used to break deadlocks, but Java monitors do not have time-outs. Therefore, the code will hang forever until the system is reset (Java has deprecated all methods that can stop a thread). This type of deadlock is prevented by not calling the Framework (or other code that might cause callbacks) in a synchronized block.
If locks are necessary when calling other code, use the Java monitor to create semaphores that can time-out and thus provide an opportunity to escape a deadlocked situation."
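
The last point of the quoted passage, a lock that can time out, might be sketched as follows. This is only an illustration using `java.util.concurrent.locks.ReentrantLock` instead of a hand-built semaphore; the class `TimedGuard` and the method `runWithTimeout` are hypothetical names and are not taken from OSGiContainer.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical guard; not the actual OSGiContainer code. */
class TimedGuard {
    private final ReentrantLock lock = new ReentrantLock();

    /**
     * Runs the action while holding the lock, but gives up if the lock
     * cannot be acquired within the timeout instead of blocking forever.
     * Returns true only if the action ran.
     */
    boolean runWithTimeout(Runnable action, long timeout, TimeUnit unit) {
        boolean acquired;
        try {
            acquired = lock.tryLock(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
        if (!acquired) {
            return false; // the caller can log, retry, or abort
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

A caller that gets `false` back knows the lock was contended for longer than the timeout and can report that, rather than leaving a thread BLOCKED indefinitely as in the thread dump above.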

Comment by Sanjeeb Sahoo [ 25/May/12 ]

Yes, I know the OSGiContainer methods are synchronized, but I didn't expect them to deadlock during the normal course of action. It's a different matter if there are multiple management agents managing the life cycle of bundles.
BTW, I am unable to deploy your bundles because you have not attached the bundle that provides the agenda.api package. Please provide the complete set of bundles that I need and the instructions to reproduce. Thanks.

Comment by marrs [ 30/May/12 ]

Calling services or the OSGi framework itself whilst holding locks is a very bad idea in general, because you do not know the exact consequences of such calls and I have seen many examples where this caused deadlocks. In this case it's particularly bad as it hangs the whole framework. This has nothing to do with having multiple management agents, if you ask me, and I would advise you to refactor the code so it no longer holds any locks while calling the framework. In this case, I don't think a test is that important as in general unit and integration tests are very bad at spotting concurrency issues anyway. These are best resolved by code reviews (at least that's my opinion). However, if you insist I'm sure Paul can come up with a working test case.
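
The kind of refactoring suggested here could look roughly like the sketch below: update the bookkeeping inside a short critical section, then call out to the framework with no monitor held. `DeploymentRegistry`, `markDeployed`, and the `frameworkCleanup` callback are hypothetical names used for illustration; the real OSGiContainer is structured differently.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical stand-in for a container that tracks deployed bundles. */
class DeploymentRegistry {
    private final Set<Long> deployed = new HashSet<>();

    void markDeployed(long bundleId) {
        synchronized (deployed) {
            deployed.add(bundleId);
        }
    }

    boolean isDeployed(long bundleId) {
        synchronized (deployed) {
            return deployed.contains(bundleId);
        }
    }

    /**
     * Removes the bundle from the bookkeeping under the lock, then runs the
     * framework-facing cleanup (for example, unregistering services) after
     * the lock has been released.
     */
    boolean undeploy(long bundleId, Consumer<Long> frameworkCleanup) {
        boolean wasDeployed;
        synchronized (deployed) {
            wasDeployed = deployed.remove(bundleId); // short critical section
        }
        if (wasDeployed) {
            frameworkCleanup.accept(bundleId); // may trigger listeners; no monitor held
        }
        return wasDeployed;
    }
}
```

With this shape, a listener callback that re-enters `isDeployed` from another thread only ever waits for the brief set update, never for a framework call in progress.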

Comment by Sanjeeb Sahoo [ 31/May/12 ]

Yes, please provide a complete test case as earlier requested. Thanks.

Comment by pbakker [ 31/May/12 ]

I have spent a few hours creating a test case that is as simple as possible. The problem is that I keep seeing different results, and it often doesn't break (but sometimes it does). Although it is not the simplest way, the most effective way to test this is to use Apache ACE to deploy bundles.

1) Download and install ACE (ace.apache.org).
2) Add two system properties to GlassFish (e.g. in the domain.xml)
<system-property name="discovery" value="http://localhost:8080"></system-property>
<system-property name="identification" value="glassfish"></system-property>
The discovery url is the url where ACE is running.
3) Upload an EJB bundle to ACE and deploy it to GF
4) Upload a new version of the EJB bundle (just change the filename and version in the manifest)
5) "Save" the new version in the ACE UI so it will be pushed to GF
6) The EJB bundle is now deadlocked in "stopping" in most cases; if not, just update again.

Sorry the test case isn't that easy to execute, but as marrs said it is often hard to spot concurrency issues from automated tests because timings are very different.
You can use the ACE management agent without ACE too by using a file url in the discovery property, but this test fails a lot less often (on my machine). It may be hard to reproduce the issue that way. The file url should specify a directory where you "install" bundles. Start with one EJB bundle and let it deploy, then add a second bundle to the directory with a higher version in the manifest.
<system-property name="discovery" value="file:///Users/paul/Desktop/glassfish/"></system-property>

Comment by Sanjeeb Sahoo [ 31/May/12 ]

Thanks for these instructions. I will try them out. Although I have many osgi/ejb bundles with me, I would rather use the ones you used to reproduce the problem than my own. I had mentioned in my comment on 25 May that the set of bundles you have attached to this issue does not include some required bundles, and I had asked for the missing bundles to be attached. Could you do that? Thanks again.

Comment by pbakker [ 31/May/12 ]

I have attached the agenda.api bundle, you should now be able to deploy them. Thanks for looking into it!

Comment by Sanjeeb Sahoo [ 12/Jun/12 ]

Please accept my apology for not investigating yet. I am busy with another higher-priority task and hope to get back to this early next week. Thanks for your patience.

Comment by Sanjeeb Sahoo [ 19/Jun/12 ]

I am using osgi-javaee-base 1.0.2, which includes a deadlock fix. I have tried deploying and updating your ejb bundle several times and I could not reproduce the hang. Given the timing issues involved, I am not surprised.

I am afraid I am not very inclined to change the locking model unless I know exactly what is going on. The lack of any framework API to lock a bundle does not make things easier. As per the requirements of the WAB and other enterprise application specs, when an enterprise bundle is stopped, the extender must undeploy synchronously upon receiving the Bundle.STOPPING event, which means some amount of locking has to happen in a synchronous listener. On the other hand, the spec allows bundles to be deployed asynchronously. As a result, we had seen some deadlocks when bundles were started and stopped in quick succession. Those deadlocks were (successfully) broken by introducing a timeout and cancellation facility in our undeployer thread. We did that in osgi-javaee-base:1.0.2 when we fixed GLASSFISH-18159. The original thread dump you reported should not occur after our fix.

You also mentioned that after upgrading to osgi-javaee-base 1.0.2, the behavior changed slightly. Could you tell me how the behavior changed? What is the new thread dump? Part of the reason I am reluctant to make drastic changes is that other people are using it and they seem to be fine. So, I would like to get to the root of the current problem before doing something of that sort.
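
For readers unfamiliar with the idea, a timeout and cancellation facility of the general kind described above might be sketched as below. This is not the osgi-javaee-base 1.0.2 implementation, which is not shown in this issue; `BoundedUndeployer` and `undeployWithin` are hypothetical names, and the sketch only shows a bounded wait on a worker task followed by cancellation.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; the real undeployer thread may work differently. */
class BoundedUndeployer {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    /**
     * Runs the task on the worker thread and waits at most the given time.
     * Returns true if the task finished in time; otherwise cancels it
     * (interrupting the worker) and returns false.
     */
    boolean undeployWithin(Runnable undeployTask, long timeout, TimeUnit unit) {
        Future<?> pending = worker.submit(undeployTask);
        try {
            pending.get(timeout, unit);
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupts the worker if the task is running
            return false;
        } catch (InterruptedException e) {
            pending.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } catch (ExecutionException e) {
            return false; // the task itself threw an exception
        }
    }

    void shutdown() {
        worker.shutdownNow();
    }
}
```

Cancellation through interruption only helps if the task responds to interrupts; a thread blocked on a plain `synchronized` monitor does not.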

Generated at Tue Sep 27 20:27:11 UTC 2016 using JIRA 6.2.3#6260-sha1:63ef1d6dac3f4f4d7db4c1effd405ba38ccdc558.