[GLASSFISH-18987] [PERF] ServiceHandle iterator has huge performance regression compared to injection Created: 08/Aug/12  Updated: 04/Mar/13  Resolved: 04/Mar/13

Status: Closed
Project: glassfish
Component/s: hk2
Affects Version/s: 4.0_b48
Fix Version/s: 4.0

Type: Bug Priority: Major
Reporter: Scott Oaks Assignee: Scott Oaks
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Tags: PSRBUG

 Description   

In our EJB benchmarks, we are presently spending about 15% of our time in InvocationManagerImpl.preInvoke and postInvoke. Virtually all of this time is spent waiting for a lock in org.jvnet.hk2.internal.ServiceLocatorImpl.getAllServiceHandles as it iterates through the component invocation handlers.

Presumably the service handles don't change frequently, so this lock could be changed to a read-write lock? Unless they don't change at all, in which case we could just initialize once...
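For reference, a read-write lock lets many concurrent readers proceed without blocking one another while still serializing the (rare) writers. A minimal sketch of the idea, using hypothetical class and field names rather than the actual ServiceLocatorImpl code:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: guard a rarely-changing handle list with a
// read-write lock so concurrent lookups do not serialize on one monitor.
public class HandleRegistry<T> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private List<T> handles = Collections.emptyList();

    public List<T> getAllHandles() {
        lock.readLock().lock();          // many readers may hold this at once
        try {
            return handles;
        } finally {
            lock.readLock().unlock();
        }
    }

    public void replaceHandles(List<T> newHandles) {
        lock.writeLock().lock();         // writers (registry changes) are rare
        try {
            handles = Collections.unmodifiableList(newHandles);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}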



 Comments   
Comment by jwells [ 09/Aug/12 ]

I put in a fix that should help reduce (but not eliminate) lock contention. The next time HK2 is promoted it would be good to see how my fix affects the use case.

Actually, it'd be good to see how it affects the use case even before that, by setting the hk2 version to 2.1.20-SNAPSHOT. Would doing that be possible?

Comment by Scott Oaks [ 09/Aug/12 ]

We aren't set up to build the code, so I can't set the pom version (if that is what you are suggesting). I could drop in the new binary, but I don't see any binary downloads on the hk2.java.net page. Is there somewhere to get the snapshot jars?

Comment by jwells [ 09/Aug/12 ]

Unfortunately there is no place to get the hk2 SNAPSHOT versions. I guess it can wait until the next promotion, which I think should be sometime tomorrow.

Comment by Scott Oaks [ 11/Sep/12 ]

Testing now on build 53; the same root performance problem exists: we are spending 3-5% of total processing time in ServiceLocatorImpl.getAllServiceHandles (as called from InvocationManagerImpl.preInvoke and .postInvoke). In previous versions, the same iteration call didn't show up in the profiles.

However, the underlying issues have been clarified somewhat – probably because I am using a different profiler. About 60% of that time is from ServiceLocatorImpl.getDescriptors, and most of that time is spinning on the lock there (and I don't actually see a change in this code from 2.1.17 to 2.1.20).

The other 40% of that time is spent in ServiceLocatorImpl.narrow, all around calculating (re-calculating, I'm guessing, since again we aren't adding services at this point) the hash code for the annotation types and manipulating the hash sets.

Comment by Mahesh Kannan [ 11/Sep/12 ]

Assigning to John

Comment by Scott Oaks [ 28/Sep/12 ]

We're testing the latest nightlies with the new lock now; the overhead of iterating through the service handles is still quite high. Profiles are reporting lots of time in the LRUCacheImpl.

I am unsure if the cache is working as expected. I traced through the HK2 2.1.32 source, and the ServiceLocatorImpl.internalGetAllServiceHandles method never seems to put anything in the cache: the cache.put logic seems backwards, as it will only put something in the cache if it was first retrieved from the cache.
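To illustrate the pattern being described (a standalone sketch with hypothetical names, not the actual HK2 source): a put that only runs on a cache hit never stores freshly computed results, whereas the intended shape puts on a miss.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch contrasting the backwards put with the intended one.
public class CachePutSketch {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    // Backwards: the put only runs on a cache hit, so a freshly
    // computed result never makes it into the cache.
    List<String> lookupBackwards(String key) {
        List<String> cached = cache.get(key);
        if (cached != null) {
            cache.put(key, cached);      // re-puts what was already there
            return cached;
        }
        return compute(key);             // result is lost to future lookups
    }

    // Intended: cache the computed result on a miss so later lookups hit.
    List<String> lookupFixed(String key) {
        List<String> cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        List<String> computed = compute(key);
        cache.put(key, computed);
        return computed;
    }

    private List<String> compute(String key) {
        return List.of(key + "-handle"); // stand-in for the real lookup work
    }
}
{code}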

Comment by jwells [ 28/Sep/12 ]

Yes, I fixed that yesterday. So it should be a little better now.

Comment by Scott Oaks [ 28/Sep/12 ]

Well, that's good, because there was actually a huge new regression with the 2.1.32 code.

So I grabbed the 2.1.33 code and tested with that. Now we are back to a 12-15% regression, even with the cache. All of the regression comes from initializing the service iterator. Now the problem just seems to be that it is executing a lot of code: calculating filters, manipulating the cache, validating the returned list, and so on. That code execution path (IterableProviderImpl.iterator()) accounts for 12% of execution time, with internalGetAllServiceHandles taking 9% of execution time (the remaining time goes to initializing the list, copying the array, and so on). Within internalGetAllServiceHandles, time is pretty evenly distributed among a number of other methods (including that method itself), though of course since we are sampling, it becomes difficult to track it down into the leaves exactly.

Caching is likely the right answer here, but maybe we need something easier and/or at a different level. Just to check, I put the list returned from internalGetAllServiceHandles into a map keyed by the contract type and returned that when possible; 90% of the regression went away (and the remaining regression is because the iterator at later points still calls internalGetServiceHandle, which has the same performance issues as internalGetAllServiceHandles). I understand that isn't semantically valid for all cases; my purpose was just to prove to myself what the limits here are. Even if we have to specifically optimize the iterator path, we should be able to make up all the regression from when we moved from going through the injectable array to using the new iterator.
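A rough sketch of that experiment (hypothetical names; as noted above it is not semantically valid in general, e.g. for PerLookup services or registry changes, and only bounds the possible speedup):

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: memoize the handle list per contract type and reuse
// it on later lookups instead of re-running the full lookup each time.
public class ContractHandleCache {
    private final Map<Class<?>, List<Object>> byContract = new ConcurrentHashMap<>();

    public List<Object> getAllHandles(Class<?> contract) {
        return byContract.computeIfAbsent(contract, this::lookupHandles);
    }

    private List<Object> lookupHandles(Class<?> contract) {
        // stand-in for the expensive internalGetAllServiceHandles path
        return Collections.emptyList();
    }
}
{code}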

Comment by jwells [ 28/Sep/12 ]

Right, a large part of the issue here is the validation. The real question is whether or not we truly need to do it on each and every lookup, or if there is some way to change that. Currently we validate every lookup operation, with the idea that the validating code (i.e. security) can start returning a different answer at any time. Perhaps we need to loop back with the security team and see if we can somehow relax this requirement.

Comment by Scott Oaks [ 02/Oct/12 ]

It's not clear from the profiles that a large part of the issue is the validation; it would be good to reduce the cache lookup to just the validate call if that is necessary. Also, I have to think that security cannot arbitrarily start returning a different answer, but that the different answer must be in response to some external event (an administrative event, or a timer expiry, or something). So even if the validation is the root of the issue, I should think that we ought to be able to optimize the common case and return to the performance we had where the list was simply injected. [And in that case, there was no validation on each call -- so I guess this validation is a new requirement?]

Comment by jwells [ 02/Oct/12 ]

Well, the security validation result could change based on the time of day. Like, after 5:00PM you have no more access to this service. That sort of thing. The other issue is that the ServiceHandles passed out must be new every time, since things like PerLookup scope (or potentially any user scope) need to create fresh instances on every lookup (that's the meaning of "PerLookup"!). That being said, there still may be something we can do; for example, it is very often the case that people are doing getAllServices rather than getAllServiceHandles, in which case we may be able to save some of the time creating the intermediate service handles. (Actually, which case is yours? Is the code we are measuring doing a "getAllServiceHandles" or a "getAllServices" at the top?)
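For context, the two lookup styles look roughly like this (a sketch from memory of the HK2 2.x ServiceLocator API; treat the exact signatures and generics as approximate):

{code:java}
import java.util.List;
import org.glassfish.hk2.api.ServiceHandle;
import org.glassfish.hk2.api.ServiceLocator;

// Sketch: getAllServices returns the instances directly, while
// getAllServiceHandles builds an intermediate handle per matching service
// that the caller then dereferences (handles are created fresh each call so
// that scopes like PerLookup can hand out new instances).
public class LookupStyles {
    void illustrate(ServiceLocator locator) {
        // Direct form: the locator could potentially skip the handle objects.
        List<Runnable> services = locator.getAllServices(Runnable.class);
        for (Runnable service : services) {
            service.run();
        }

        // Handle form: one ServiceHandle per matching service.
        List<? extends ServiceHandle<?>> handles =
                locator.getAllServiceHandles(Runnable.class);
        for (ServiceHandle<?> handle : handles) {
            Runnable service = (Runnable) handle.getService();
            service.run();
        }
    }
}
{code}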

Also, based on your comment above, I noticed that in the case where we are getting only one answer there was no need to keep going through the list once we got the answer we were looking for. So the loop in "internalGetServiceHandle" should now NOT be looping and should therefore show some improvement. Can you check to see if that is so? That got published into glassfish yesterday.

I was going to do a little refactoring of the code for the getAllServices case to see if it helps.

Comment by jwells [ 02/Oct/12 ]

I have played a little dirty pool and at least reduced the use of the IterableProvider in InvocationManagerImpl. I still want to fix the underlying cause, but I think doing this should at least reduce some of the regression...

Comment by Scott Oaks [ 03/Oct/12 ]

Updating the subject since we now understand that the regression is in the iteration itself and not just contention.

I've run some tests with the new InvocationManagerImpl, and as we would expect, it has brought down the regression by about 50% since we make only half as many calls to the iterator.

On some newer tests, we started with a 42% regression from this; it will be down to 21% when we test the promoted build with this fix.

Comment by jwells [ 03/Oct/12 ]

So I looked at the really old version of InvocationManagerImpl and realized that in the olden days HK2 used to inject a static array that would not change based on changes to the registry. Hence, before we moved to the new HK2, this code NEVER actually did an HK2 operation (and so would never have picked up dynamic changes, but hey, that's the way the code used to work...)

So I am going to make yet another change to InvocationManagerImpl to work more like the old way it worked, which should remove that entire regression (at least any part of it that comes from hk2).

However, since you renamed the bug, and since this use case is still somewhat slower than with the old hk2, I guess we can leave this bug open...

Comment by Scott Oaks [ 14/Nov/12 ]

We are looking at some different test cases, which are showing us essentially the same thing but in a different context: ServiceLocatorImpl.getService() is proving to be far slower than Habitat.getComponent().

For example, if we create and remove stateful beans in 4.0, there is about a 5-10% regression vs. 3.1.2. This all tracks back to extra time spent in getService() vs getComponent(). In this particular use case, the calls to getService come from four locations: two calls in com.sun.enterprise.naming.impl.SerialContext.<init> (the first to retrieve the ProcessEnvironment, the second to retrieve the common class loader), and two calls in com.sun.ejb.containers.BaseContainer – one in createEjbInstanceAndContext and one in injectEjbInstance, each of which is retrieving the JCDI service.

Part of the regression is that there is still some lock contention on the service LRU cache. The remaining lock contention could be somewhat lessened if the CacheKey were constructed outside the lock, and if the constructor for the CacheKey generated the hashcode and stored it in an instance variable to be returned by the hashCode() method. I did that and shaved the regression down by 33% (to 6.6%). [Though to be clear, this is a pretty focused micro-benchmark, and lock contention points are greatly exaggerated by it.]
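A minimal sketch of that CacheKey shape (hypothetical fields, not the actual ServiceLocatorImpl code): compute the hash once in the constructor so that repeated hashing by the cache is just a field read, and build the key before taking the lock so only the map access happens while holding it.

{code:java}
// Hypothetical sketch of a cache key with a precomputed hash code.
public final class CacheKey {
    private final Class<?> contract;
    private final String name;
    private final int hash;                 // computed once at construction

    public CacheKey(Class<?> contract, String name) {
        this.contract = contract;
        this.name = name;
        int h = contract.hashCode();
        if (name != null) {
            h = 31 * h + name.hashCode();
        }
        this.hash = h;
    }

    @Override
    public int hashCode() {
        return hash;                        // no re-hashing on every lookup
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) {
            return false;
        }
        CacheKey other = (CacheKey) o;
        return contract.equals(other.contract)
                && (name == null ? other.name == null : name.equals(other.name));
    }
}
{code}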

I changed the glassfish code for the four calls in question so that the first call saved the service as a static and subsequent calls used that; that eliminated the remaining regression for this test case. I'm still not quite clear on the semantics here; I guess that since these are not injected it is not similar to the InvocationManagerImpl case, and possibly in this case it isn't legal to cache the service retrieval like this. Still, the root of the iterator case seems to be this as well: the new, somewhat slower calls to the service locator. [Though if I'm wrong and it should be a different bug, let me know and I'll open another one.]
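The experiment was essentially the following shape (hypothetical names; as said above, caching the lookup like this may not be semantically legal if the locator can legitimately return a different service on different calls):

{code:java}
import org.glassfish.hk2.api.ServiceLocator;

// Hypothetical sketch: pay the ServiceLocator.getService() cost once and
// reuse the cached reference on subsequent calls.
public final class CachedServiceHolder {
    private static volatile Object cached;

    static Object getOnce(ServiceLocator locator, Class<?> serviceClass) {
        Object local = cached;
        if (local == null) {
            local = locator.getService(serviceClass);  // first call is the slow one
            cached = local;
        }
        return local;
    }
}
{code}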

Comment by jwells [ 16/Nov/12 ]

I've done two of the three things suggested above.

1. Pre-calculated the hashCode and constructed the CacheKey outside the lock
2. Modified BaseContainer to only look up the service in the constructor

The last thing (modifying SerialContext to use a static cache) does not seem right to me, as it is possible these lookups happen in a different context in that case, and hence the services may be different on different calls.

Item #1 has not yet been merged into GlassFish. #2 is a GlassFish change. I'd like to see how the numbers look after #1 has been merged back into GlassFish. I'll work on getting that merged back in today.

Comment by jwells [ 04/Mar/13 ]

I am closing this because we have fixed the specific issue described in this bug. If there are other instances of the old hk2 using an [] as an injection point that we can fix, let's open other bugs about the specific issues.
