
Push replication pattern high cpu usage

15 posts
Replies: 14 - Last Post: April 20, 2014 16:21
by: pantic
showing 1 - 15 of 15
Posted: December 25, 2013 09:07 by pantic
Hi there,

I've configured the Push Replication Pattern (using the 11.2.0 release) to replicate cache events between two clusters bidirectionally (active-active). I've noticed that CPU usage grows substantially over time (> 50% on a sustained basis) due to this replication (I figured this out because, when I suspended distribution from the EventChannels using JMX, CPU usage returned to normal, i.e. < 5%).
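(As an aside, the general mechanism of suspending something via a JMX operation, as I did from a console, can be sketched with a self-contained toy MBean. The ObjectName, interface, and operation names below are purely illustrative, not the Push Replication Pattern's actual MBean API.)

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// a toy MBean standing in for an event-channel controller
public class JmxSuspendSketch
{
    public interface ChannelControlMBean
    {
        void suspend();
        boolean isSuspended();
    }

    public static class ChannelControl implements ChannelControlMBean
    {
        private volatile boolean suspended = false;
        public void suspend()        { suspended = true; }
        public boolean isSuspended() { return suspended; }
    }

    public static void main(String[] args) throws Exception
    {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName  name   = new ObjectName("Example:type=ChannelControl,id=channel-1");

        server.registerMBean(new ChannelControl(), name);

        // invoke the operation through JMX, as a console such as JConsole would
        server.invoke(name, "suspend", null, null);

        System.out.println("Suspended: " + server.getAttribute(name, "Suspended"));
    }
}
```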
After digging into the problem, I found that the preempt method in the AbstractEventChannelController class was being called too often. Every time this method is called, a new Task is created and then scheduled by the ScheduledExecutorService associated with the class. As a result of this behaviour, the number of scheduled tasks, and the CPU required to manage them, keeps growing and gets worse over time.
It appears that the reason for calling this preempt method is the listener on the Subscriptions cache (from the Messaging Pattern) reporting updates. Since I'm not updating this cache, I decided to comment out the following code in the onCacheEntryLifecycleEvent method of the CoherenceEventChannelSubscription class:
        else if (mapEvent.getId() == MapEvent.ENTRY_UPDATED)
        {
            // when a subscription is updated we assume we need to schedule distribution
            final EventChannelController controller = manager.getEventChannelController(distributorIdentifier,
                                                                                        controllerIdentifier);

            if (logger.isLoggable(Level.FINER))
            {
                logger.log(Level.FINER,
                           "Scheduling the EventChannelController for {0} to distribute available events.",
                           new Object[] {this});
            }

            if (hasVisibleMessages())
            {
                controller.preempt();
            }
        }

With this workaround I'm happy, because CPU usage is acceptable and event distribution is being performed as desired, but I'm worried about the code I commented out, since I suppose it is intended to do something useful. I would like to know why the listener is reporting update events, and why it is necessary to schedule a new distribution when these update events are reported.
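For illustration, the task build-up I'm describing can be reproduced in isolation: if a preempt-style method always submits a fresh task to a ScheduledExecutorService, without checking whether one is already pending, every event adds another task to the queue. (This is a minimal standalone sketch, not Incubator code.)

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PreemptPileUp
{
    public static void main(String[] args)
    {
        ScheduledThreadPoolExecutor executor =
            (ScheduledThreadPoolExecutor) Executors.newScheduledThreadPool(1);

        // simulate a preempt() that always schedules a new distribution task,
        // even when one is already queued (delay long enough that none run yet)
        for (int i = 0; i < 1000; i++)
        {
            executor.schedule(() -> { /* distribute events */ }, 60, TimeUnit.SECONDS);
        }

        // every call added a task: the queue now holds all 1000 of them
        System.out.println("Pending tasks: " + executor.getQueue().size());

        executor.shutdownNow();
    }
}
```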

Thanks for your time and patience.
Posted: January 23, 2014 08:18 by brianoliver
That's an interesting observation.

The reason the update listener is there is to ensure that, should an EventChannelController be "asleep" (ie: waiting), it will be "woken up" should a new event arrive.

It's possible that, given the way you've configured the EventChannelControllers, this may never happen (ie: they are routinely woken up or never sleep), in which case this logic would cause additional and unnecessary CPU usage.

Perhaps we should investigate an optimization here whereby, when an EventChannelController is already scheduled to publish, we don't preempt it. This would avoid the unnecessary scheduling of work.
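One way such a guard could look (a minimal sketch, not the actual fix, with all names hypothetical) is an AtomicBoolean flag so that preempt() only schedules a distribution task when none is currently pending:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class GuardedController
{
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    // true while a distribution task is scheduled or running
    private final AtomicBoolean scheduled = new AtomicBoolean(false);

    private int tasksScheduled = 0;

    public void preempt()
    {
        // only schedule a new task if none is currently pending
        if (scheduled.compareAndSet(false, true))
        {
            tasksScheduled++;
            executor.schedule(() ->
            {
                try
                {
                    // distribute available events here
                }
                finally
                {
                    // allow the next preempt() to schedule again
                    scheduled.set(false);
                }
            }, 60, TimeUnit.SECONDS);
        }
    }

    public int getTasksScheduled() { return tasksScheduled; }

    public static void main(String[] args)
    {
        GuardedController controller = new GuardedController();

        // a burst of update events triggers many preempt() calls...
        for (int i = 0; i < 1000; i++)
        {
            controller.preempt();
        }

        // ...but only one task is actually scheduled for the whole burst
        System.out.println("Tasks scheduled: " + controller.getTasksScheduled());
        controller.executor.shutdownNow();
    }
}
```

Compared to cancelling and rescheduling on every update, this keeps the executor queue bounded at one pending task per controller.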

-- Brian

Posted: January 28, 2014 20:51 by brianoliver

As a quick heads up, we've now raised the following issue to track progress:

https://java.net/jira/browse/COHINC-83

Regards

-- Brian

Posted: January 30, 2014 15:42 by brianoliver

Great news.

This has been resolved in the latest Coherence Incubator develop-11 and develop-12 branches.

Details are available here:

http://coherence-community.github.io/coherence-incubator/11.2.1-SNAPSHOT/

http://coherence-community.github.io/coherence-incubator/12.1.1-SNAPSHOT/

-- Brian

Posted: February 06, 2014 05:21 by pantic
Indeed, this is fantastic news! I'm looking forward to trying this code ASAP.

Nice job!

Thank you so much!
Posted: February 06, 2014 14:12 by pantic
Hi Brian,

I forgot to mention previously: is it possible to know when the next production-ready version will be released?

Thanks
Posted: March 13, 2014 21:18 by brianoliver
After much log watching and profiling, I've made a few slight changes to the earlier solution that demonstrate even greater improvements.

See: https://java.net/jira/browse/COHINC-93

This is now in Incubator 11 (snapshot) (ie: develop-11 branch). I'll also make the same changes in Incubator 12.

-- Brian

PS: I've seen at-rest CPU drop from 25% to 0% on my test machines. I've also seen the build and functional test time drop by 50%.
Posted: March 14, 2014 11:43 by pantic
Of course, I appreciate your effort, and every improvement in this area is welcome, but we really need to know something about your plans for the estimated date of the next production-ready release.

Thanks.
Posted: March 14, 2014 12:25 by brianoliver
For the most part, we release when we've received feedback that an issue is resolved. The development process is pretty simple.

1. We try to work closely with developers when they identify issues.

2. If we're lucky, sometimes developers provide us with fixes, sometimes they don't. If they do, we review the changes carefully before integrating them, or alternatively, rewrite them (with the developer) to ensure that nothing else breaks.

3. Often we have to invest a lot of time trying to isolate where an issue lies. Is it in Incubator code? Is it in Coherence? Is it in Java? Is it in the operating system? Is it how it's deployed? Is it a configuration issue? Is it in application code? Is it due to a customization / extension of the Incubator code?

4. Often we have very little information. Sometimes we don't even have a stack trace, so we do the best we can, especially if we can't reproduce an issue. In a lot of circumstances the problem is not in the Incubator, but elsewhere. Regardless, we work to resolve the issue, even if it sometimes means writing new application code (or, worst case, sending people on site; it's pretty rare, but we've done that too).

5. When we make a fix, we make it publicly available. It's at this point in time we ask the developers in the community to test that the solution resolves the issue. This is especially important as we don't have the "production" or "test" environments where developers run their applications. ie: we often don't have access to their code base, their servers, their networks etc. For something like Push Replication it's even more challenging as we don't have access to their multiple data-centers!

6. Once developers are happy with the fix (they give us feedback), we push out a release and make an announcement. Even with all of our new automation, doing a release takes about half a day. That doesn't include the time to ensure our multiple branches (develop-11 and develop-12) are in sync.

By the time we run all of the builds (across multiple platforms), sign the artifacts, push the resources to maven.java.net, promote the builds from staging to production and then have the builds pushed out to Maven Central, there's not much time left in a day. This is in addition to the other administrative tasks we do, like letting the community (and Oracle) know. So given how long this takes, and that there's "no going back" or "undoing" a release, it's important that releases are as good as possible.

Hence we ask for feedback. To us, this is an important part of the community process. If we don't get any feedback but there's a developer request to make an official release, we'll do a release (we'll usually request that a developer confirms it's ok first). Alternatively, we'll push out a release in case of an emergency, but again, that's usually after it's been tested by developers outside of our own testing process. Luckily we rarely have emergencies, as we take our time to prepare (and test) releases.

The great things about the new licensing model (CDDL) and hosting the source code on GitHub are that everyone can see what is happening, and everyone has full access to all of the fixes as they are being made. Furthermore, anyone can make an official release themselves, into their own corporate repositories, without our permission, as and when required. Of course, some choose to wait for us to make official releases.

For this issue we're currently testing the solution with multiple companies. We're hoping to get feedback in the next few days so that we can make official releases (for Incubator 11 and 12).

Hope this helps.

-- Brian
Posted: March 14, 2014 14:22 by pantic
Brilliant! We're going to test this new version in our testing environment and provide feedback ASAP.

Thanks a lot.
Posted: March 14, 2014 17:41 by pantic
Hi Brian,

unfortunately, as you can see, I have no good news. :(

Cluster CPU load: [http://i.imgur.com/sDGq9EI.png]

Live Threads CPU usage: [http://i.imgur.com/KJIfxM8.png]

And these are the results we get with the workaround mentioned in the first post applied:

Cluster CPU load:
[http://i.imgur.com/dZ0NGBU.png]

As you can see, the difference is remarkable.

Thanks
Posted: March 17, 2014 20:42 by brianoliver
While those graphs don't reconcile with what we're seeing, it's clear something is going on in your environment that we can help out with. To do so, we need a far greater amount of information.

Can you upload the following to the related Oracle Support issue?

1. A complete description of your topology: the types of physical server(s), the type(s) of operating systems (including patch levels), the version(s) of Java, the version(s) of Coherence, and the version(s) of the Coherence Incubator libraries you're using.

2. Every configuration file (for every node), not just Coherence configuration files.

3. All scripts used to start/stop your application and Coherence.

4. All environment variables and system properties that are used in your applications.

5. All code that you've customized from the Incubator.

6. If you have a test-case demonstrating this, that would also be helpful.
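For items 1 and 4, a quick way to capture the JVM's view of its own configuration on each node is to dump its system properties and environment variables (a generic sketch, not an Oracle Support tool):

```java
// prints the running JVM's system properties (java.version, os.name, etc.)
// and environment variables, e.g. for attaching to a support request
public class EnvDump
{
    public static void main(String[] args)
    {
        System.out.println("-- System properties --");
        System.getProperties().forEach((k, v) -> System.out.println(k + "=" + v));

        System.out.println("-- Environment variables --");
        System.getenv().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```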

After this we'll schedule a call to walk through the information.

Thanks.

-- Brian
Posted: March 18, 2014 15:36 by pantic
Hi Brian,

we are gathering the information you requested in order to update the SR.

Thanks
Posted: April 14, 2014 15:28 by brianoliver
Hi,

Over the weekend we released Incubator 11.3.0 and 12.2.0, which so far demonstrate a significant reduction in CPU and memory utilization for both small and large clusters (over 40 servers).

Please take a look at the new implementation and let us know if it solves the issue you're seeing. We want to make sure this is resolved.

Thanks

-- Brian
Posted: April 20, 2014 16:21 by pantic
Hi Brian,

we tested this release and it appears that the issue has been solved. When we start both clusters fresh, we preload the cache from the database and the events are pushed out to the other Coherence cluster. This is (we think) the reason for the high CPU usage in both clusters at the beginning. Once the pushing of the "preloading events" has completed, CPU usage returns to normal/acceptable figures. So we think the problem is finally solved, and we are asking our customer for permission to close the SR.

Here are the results :)

[http://i.imgur.com/WBiGjoy.jpg]

Thanks a lot

