
Key: GLASSFISH-16311
Type: Improvement
Status: Open
Priority: Critical
Assignee: Byron Nevins
Reporter: Tom Mueller
Votes: 2
Watchers: 5
glassfish

Improve operating system (OS) service integration

Created: 04/Apr/11 09:26 AM   Updated: 17/Oct/12 08:19 PM
Component/s: admin
Affects Version/s: 3.1
Fix Version/s: future release

Time Tracking:
Not Specified

Environment:

Windows, Linux, Solaris


Tags: ee7ri_cleanup_deferred
Participants: Bill Shannon, Byron Nevins, jclingan, mkarg and Tom Mueller


Description

This RFE is for improving the operating system (OS) service integration for GlassFish. Here are the requirements:

1. Expose the asadmin delete-service interface as a public interface (rather than being hidden as it is currently).

2. Modify the service implementation so that it acts as a monitor on all operating systems. By monitor, this means that if the service is started, then if the GlassFish process exits, it should be restarted automatically. The Windows service currently doesn't work this way - I'm not sure for Linux and Solaris.

3. Modify the service implementation and/or the various start/stop commands so that they interact correctly. This includes:

3a. If the OS service is started, and the user runs stop-local-instance or stop-domain, then the OS service should be stopped too.

3b. If the server is started using start-domain or start-local-instance, and the OS service is not started, the OS service should not be started. However, if the user then later starts the OS service, the OS service must recognize that the server is already running and monitor the already running server (See req. #2).

4. Any sequence of OS service commands and asadmin start/stop-domain or start/stop-local-instance commands must not cause a failure of the command. For example, currently, if you start the Windows OS service for a domain, and then run stop-domain, and then try to stop the Windows OS service, you get an error message from Windows. Also on Windows, if you run start-domain, and then start the OS service, you get an error message saying that "The process terminated unexpectedly." and the service isn't started.

5. The service should be able to be deleted using either the operating system command for deleting services or the asadmin delete-service command. So the following sequences should work:

asadmin create-service
OS delete service command
asadmin create-service
asadmin delete-service

Currently, the 2nd create-service in this sequence may require a --force option. Ideally, it shouldn't.

6. These requirements apply to the officially supported operating systems for GlassFish for Windows, Linux, and Solaris.
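
Requirement 4 can be read as asking for idempotent start and stop operations: repeating a request that has already taken effect succeeds quietly instead of failing. The following is only a sketch of that reading, assuming a hypothetical in-memory model; `ServerControl` is an illustrative name, not a GlassFish class.

```java
// Hypothetical model, not GlassFish code: start and stop are idempotent,
// so no ordering of OS service and asadmin commands produces an error.
final class ServerControl {
    enum State { STOPPED, RUNNING }

    private State state = State.STOPPED;

    /** Starts the server if needed; returns true only if this call changed the state. */
    synchronized boolean start() {
        if (state == State.RUNNING) {
            return false; // already running: succeed quietly instead of failing
        }
        state = State.RUNNING;
        return true;
    }

    /** Stops the server if needed; returns true only if this call changed the state. */
    synchronized boolean stop() {
        if (state == State.STOPPED) {
            return false; // already stopped: succeed quietly instead of failing
        }
        state = State.STOPPED;
        return true;
    }

    synchronized State state() {
        return state;
    }
}
```

With this model, "start the service, run stop-domain, then stop the service" ends with a second stop that is a no-op rather than an error.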



jclingan added a comment - 04/Apr/11 12:43 PM

Good RFE write-up. One additional bullet:

6) Currently we have a feature gap on Linux, where no "watchdog" or "monitor" role is offered by our Linux service template. Upstart is available on RHEL 6 and OL 6 (timeline for SuSE Enterprise Linux is TBD). We should investigate whether an upstart job definition can be created to fulfill the watchdog role on supported OSs. As a heads-up, it looks like Fedora may move to systemd in the future, so the watchdog approach may change.


Byron Nevins added a comment - 04/Apr/11 01:26 PM

Problems with #2
=================== problem #1 ================
Mainly the problem of getting into an infinite loop trying to start an unstable server. Windows lets you configure what to do: each of the 3 failure occurrences below can be assigned any of the 4 allowed responses.

Occurrences:
1. First Failure
2. Second Failure
3. Subsequent Failures

Responses:
1. restart the service
2. reboot
3. run some other program
4. ignore it

As you can see, just figuring out how to allow users to configure these options and then implementing them on 3 main platforms that are totally different is quite a task!

Of course – why is the server crashing? I have V2 running at home now for several years. It never crashes. Do we really want to do all of this complicated and expensive work for something that should be exceedingly rare?

======= problem #2 =========== (This caused many support problems with V2)
User kills the server forcefully (e.g. using "kill" or "taskkill.exe")
A moment later it is running again.
He scratches his head and kills it again
A moment later it is running again.
"Hello. Customer support? ...."
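
The occurrence/response scheme from problem #1 can be modeled as a small lookup. This is a sketch only: `FailurePolicy` and its `Action` names are hypothetical and not part of GlassFish or of any platform API. It also shows one way to avoid the infinite restart loop, by choosing "ignore" for subsequent failures.

```java
// Hypothetical sketch of the "3 occurrences x 4 responses" scheme.
final class FailurePolicy {
    enum Action { RESTART_SERVICE, REBOOT, RUN_PROGRAM, IGNORE }

    private final Action first;
    private final Action second;
    private final Action subsequent;

    FailurePolicy(Action first, Action second, Action subsequent) {
        this.first = first;
        this.second = second;
        this.subsequent = subsequent;
    }

    /** Returns the configured response for the n-th failure (counting from 1). */
    Action actionFor(int failureCount) {
        if (failureCount < 1) {
            throw new IllegalArgumentException("failureCount must be >= 1");
        }
        if (failureCount == 1) {
            return first;
        }
        if (failureCount == 2) {
            return second;
        }
        return subsequent; // third and every later failure
    }
}
```

For example, `new FailurePolicy(Action.RESTART_SERVICE, Action.RESTART_SERVICE, Action.IGNORE)` restarts at most twice and then gives up, so an unstable server cannot be restarted forever.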


Byron Nevins added a comment - 04/Apr/11 01:29 PM

Comment from Bill Shannon about #3B

I agree about 3b. If you start the server without using the OS service
mechanism, you're probably not going to be able to get the service mechanism
to monitor that service later. Mostly this should be as expected. The key
is to get stopping and restarting to integrate properly with the service
mechanism. Possibly also start-instance. I'm not sure we want to make
start-local-instance or start-domain just be front-ends to the service
mechanism (if the server is configured to be handled by the service mechanism).


Bill Shannon added a comment - 04/Apr/11 01:36 PM

Ok, apparently we're going to be using Jira as a discussion forum...

For problem #1, we should just pick some particular combination of
options and allow users to customize it using the OS-specific mechanisms.

For problem #2, this is the normal behavior for any service, right?
If the user chooses to have the service mechanism manage his server,
this is what he should expect, and what he would get with any other
server managed by the service mechanism, right? For that matter,
this is what he would get with the old node agent - kill the server
instance and the node agent would restart it because it "crashed",
right?


Tom Mueller added a comment - 05/Apr/11 12:40 PM

The changes to OS service integration should also implement those suggested in issue 11692.


Tom Mueller added a comment - 05/Apr/11 12:47 PM

This issue should resolve issue 16140 also.


Byron Nevins added a comment - 15/Apr/11 02:35 PM

Byron Nevins added a comment - 15/Apr/11 02:53 PM

Pasted the description here. Lines that start with **** or with a bullet are my comments.

1. Expose the asadmin delete-service interface as a public interface (rather than being hidden as it is currently).

        • Yes - Also add start|stop|list

2. Modify the service implementation so that it acts as a monitor on all operating systems. By monitor, this means that if the service is started, then if the GlassFish process exits, it should be restarted automatically. The Windows service currently doesn't work this way - I'm not sure for Linux and Solaris.

        • Yes. I will choose reasonable defaults for each platform – as long as the platform supports it.

3. Modify the service implementation and/or the various start/stop commands so that they interact correctly. This includes:

3a. If the OS service is started, and the user runs stop-local-instance or stop-domain, then the OS service should be stopped too.

        • Misunderstanding of what a Service is. GF-instance and the "service" are the same thing, sort of. Once the service has started - when you stop the server you have stopped the "service". Compare to an SMTP server. When you stop the SMTP server process you have also stopped the SMTP "service".

3b. If the server is started using start-domain or start-local-instance, and the OS service is not started, the OS service should not be started.
****It can't be started. GF won't allow the same server to be started twice.

However, if the user then later starts the OS service, the OS service must recognize that the server is already running and monitor the already running server (See req. #2).

        • Impossible - will not do.

4. Any sequence of OS service commands and asadmin start/stop-domain or start/stop-local-instance commands must not cause a failure of the command. For example, currently, if you start the Windows OS service for a domain, and then run stop-domain, and then try to stop the Windows OS service, you get an error message from Windows. Also on Windows, if you run start-domain, and then start the OS service, you get an error message saying that "The process terminated unexpectedly." and the service isn't started.

        • This is all expected and what we want!

5. The service should be able to be deleted using either the operating system command for deleting services or the asadmin delete-service command. So the following sequences should work:

asadmin create-service
OS delete service command
asadmin create-service
asadmin delete-service

Currently, the 2nd create-service in this sequence may require a --force option. Ideally, it shouldn't.

      • will do

6. These requirements apply to the officially supported operating systems for GlassFish for Windows, Linux, and Solaris.

      • indeed.

Byron Nevins added a comment - 15/Apr/11 02:57 PM

Just thinking out loud here.

1) services actually call

asadmin start-XXX --verbose

2) if the server crashes – asadmin knows about it and is capable of restarting it itself without the platform doing anything at all!

3) if the server is stopped in an orderly way, asadmin knows this also and it can tell the difference from a crash.

==============
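
The idea above, where the launcher itself restarts a crashed server but respects an orderly stop, could look roughly like the following. This is a sketch under stated assumptions: `RestartLoop` is a hypothetical name, the server launch is abstracted as a function that blocks and returns an exit code, and exit code 0 is assumed to mean an orderly stop (the thread does not say how asadmin actually tells the two cases apart).

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: relaunch after a crash, stop after an orderly shutdown.
final class RestartLoop {
    /** Assumption for this sketch only: exit code 0 signals an orderly stop. */
    static final int ORDERLY_EXIT = 0;

    /**
     * Runs launchAndWait until it reports an orderly exit or maxLaunches is
     * reached, and returns how many times the server was launched.
     */
    static int run(IntSupplier launchAndWait, int maxLaunches) {
        if (maxLaunches < 1) {
            throw new IllegalArgumentException("maxLaunches must be >= 1");
        }
        int launches = 0;
        while (launches < maxLaunches) {
            launches++;
            int exitCode = launchAndWait.getAsInt(); // blocks until the server exits
            if (exitCode == ORDERLY_EXIT) {
                break; // stopped on purpose: do not restart
            }
            // any other exit code is treated as a crash: loop and relaunch
        }
        return launches;
    }
}
```

The `maxLaunches` bound is a design choice added here so that an unstable server cannot trigger an endless restart loop.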


Byron Nevins added a comment - 27/Apr/11 11:02 AM

Please see the One Pager:

http://wikis.sun.com/display/GlassFish/3.2PlatformServices

Note that I have added a feature – we will support services on all GlassFish-supported Platforms.


Byron Nevins added a comment - 02/May/11 11:26 AM

THIS IS THE UMBRELLA ISSUE FOR IMPROVED PLATFORM SERVICES for 3.2


mkarg added a comment - 28/Jul/11 01:44 PM

I want to recommend not having two different JVMs involved or using scripts at all on Windows to implement services.

See, on Windows, a real service is implementing an API defined by Microsoft, which lets Windows monitor the service on its own - there is no need for an additional Watchdog, as Windows is a service watchdog (it even comes with configurable rules what to do when the service fails and can restart it etc)! This is the most clean solution and it could be done very easily by just a few lines of JNA code within the GlassFish kernel.

See http://msdn.microsoft.com/en-us/library/ms685141(v=vs.85).aspx


mkarg added a comment - 28/Jul/11 01:56 PM

I want to suggest that asadmin create-service provides a slightly changed configuration:

  • Since Windows 2008 there are special account types for services, for security reasons. I want to suggest that the service is not created to run as the local SYSTEM account (= with the highest possible access rights); instead the installer should create a local service type account and register only the most essential access rights with it. In a production system it is not appropriate to run as the local SYSTEM account, and the administrator doesn't know what access rights GF will need, so he cannot change it.
  • Windows has a built-in watchdog facility. The configuration should be set up so that it automatically restarts the service after the first failure, runs some kind of domain repair at the second failure (if there is anything at all that GF can repair), and restarts the host at the third failure. The failure counter should be reset after one week. This stuff is already there, so please use it.

Windows is Windows, not UNIX plus a GUI.
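
The recovery settings proposed above (restart, then run a repair program, then reboot, with the failure counter reset after one week) correspond to what the Windows `sc failure` command configures. A minimal sketch that only builds such a command line is shown below; the class name, service name, and repair command are hypothetical placeholders, and no GlassFish "domain repair" program is confirmed by this thread.

```java
import java.util.List;

// Hypothetical sketch: builds an "sc failure" command line, does not run it.
final class WindowsRecoveryConfig {
    /** Failure counter reset period of one week, in seconds. */
    static final long ONE_WEEK_SECONDS = 7L * 24 * 60 * 60;

    static List<String> scFailureCommand(String serviceName,
                                         String repairCommand,
                                         long delayMillis) {
        // 1st failure: restart, 2nd: run a program, 3rd and later: reboot the host
        String actions = "restart/" + delayMillis
                + "/run/" + delayMillis
                + "/reboot/" + delayMillis;
        return List.of(
                "sc", "failure", serviceName,
                "reset=", Long.toString(ONE_WEEK_SECONDS),
                "command=", repairCommand, // placeholder repair program
                "actions=", actions);
    }
}
```

The returned list could be handed to `ProcessBuilder` on a Windows host by an administrator account; that step is left out here because it changes system configuration.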


Byron Nevins added a comment - 28/Jul/11 07:57 PM

"it could be done very easily by just a few lines of JNA code"

Please provide these very easy few lines of code.


Bill Shannon added a comment - 28/Jul/11 09:29 PM

Markus, this seems to be important to you, and you seem to know more about it
than most of us. Perhaps you'd be interested in implementing it and contributing
it to the GlassFish community? While we'd all like to see these sorts of
improvements, I doubt that we would do as good a job as you would.


mkarg added a comment - 29/Jul/11 06:12 AM

Bill,

I would be happy to provide an implementation, but I have no strong GlassFish internals background, so I will need help with that. What I can provide is the complete JNA or C++ code for a "good and complete" "real" Windows service, but what I need is someone who will tell me (a) how to build GlassFish from scratch and (b) the few Java lines needed to issue a GF startup / shutdown / status-query. If we can organize this then I will be glad to provide the complete Windows part.

Regards
Markus


Bill Shannon added a comment - 29/Jul/11 06:11 PM

Ideally you would just issue the "asadmin start-domain" command to
start the app server. Is there some reason you need to start it
"in process"? If so, you'll need to duplicate the environment setup
from asadmin.bat and then start a JVM using the same arguments that
asadmin.bat does.

In any event, Byron is the expert on starting GlassFish.

Also, hopefully, you wouldn't need to build GlassFish in order to do
this, but if you did there are build instructions on the wiki.


Byron Nevins added a comment - 29/Jul/11 06:20 PM

I'm not sure what you mean by "the few Java lines needed to issue a GF startup / shutdown / status-query". Perhaps you mean this?

How to start, stop, check status of domain

java -jar "%GF_HOME%\modules\admin-cli.jar" start-domain
java -jar "%GF_HOME%\modules\admin-cli.jar" stop-domain
java -jar "%GF_HOME%\modules\admin-cli.jar" list-domains

How to start, stop, check status of instance

java -jar "%GF_HOME%\modules\admin-cli.jar" start-local-instance instance1
java -jar "%GF_HOME%\modules\admin-cli.jar" stop-local-instance instance1
java -jar "%GF_HOME%\modules\admin-cli.jar" list-instances --long

------------------
If you mean the source lines that run from the above calls – they are definitely not "few". There are thousands of lines required. They are located in admin/launcher, core/kernel, core/bootstrap, cluster/cli, cluster/admin and common/common-util (off the top of my head)
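
For a service wrapper written in Java, the command lines above can be issued out of process with `ProcessBuilder`. The sketch below assumes a hypothetical helper class `AdminCli` and a caller-supplied GlassFish home directory. It simply returns the process exit code, since the meaning of that code is not described in this thread.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs admin-cli.jar subcommands as a child process.
final class AdminCli {
    private final Path gfHome;

    AdminCli(Path gfHome) {
        this.gfHome = gfHome;
    }

    /** Builds "java -jar <gfHome>/modules/admin-cli.jar <subcommand> <args...>". */
    List<String> command(String subcommand, String... args) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(gfHome.resolve("modules").resolve("admin-cli.jar").toString());
        cmd.add(subcommand);
        cmd.addAll(List.of(args));
        return cmd;
    }

    /** Runs the subcommand, forwarding its output, and returns the exit code. */
    int run(String subcommand, String... args)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(subcommand, args))
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Usage would be along the lines of `new AdminCli(Path.of(System.getenv("GF_HOME"))).run("start-local-instance", "instance1")`, assuming the `GF_HOME` environment variable is set as in the commands above.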


mkarg added a comment - 30/Jul/11 07:08 AM

My idea is based on running in-process because that is the natural way to implement services on Windows, and it reduces the complexity considerably, as no watchdog asadmin is needed any more. GlassFish will just feel and behave as a native service, so no more scripts are involved. Windows admins don't like scripts, as Windows has a completely different architecture compared to UNIX. UNIX does everything in scripts, Windows does virtually nothing in scripts. So the target is to get rid of scripts.

Thank you Byron for the hint. Actually I looked for the single entry points in Java source code that make the following happen (in pseudo code):

  • GlassFish.start!
  • GlassFish.stop!
  • (opt.) GlassFish.pause!
  • (opt.) GlassFish.resume!
  • GlassFish.state?

If that does not exist, I will inspect what the scripts do and repeat that in pure Java (hence my question about building from scratch; I have meanwhile managed that using svn and mvn, BTW). My target is to provide Java source that uses / implements the native Windows API and directly executes these commands in-process.
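
To make the pseudo code above concrete, the sketch below maps Windows service control requests onto a lifecycle interface. Everything GlassFish-related here is hypothetical: `ServerLifecycle`, `InMemoryServer`, and `ServiceControlDispatcher` are illustrative names, and the thread does not confirm that such entry points exist. The numeric control codes are the `SERVICE_CONTROL_*` values from `winsvc.h`; the native registration of the handler (for example via JNA) is deliberately left out.

```java
// Hypothetical sketch: dispatches Windows service control codes to a
// lifecycle interface. No real GlassFish API is used or implied.
final class ServiceControlDispatcher {
    // Control codes as defined in winsvc.h.
    static final int SERVICE_CONTROL_STOP = 1;
    static final int SERVICE_CONTROL_PAUSE = 2;
    static final int SERVICE_CONTROL_CONTINUE = 3;
    static final int SERVICE_CONTROL_INTERROGATE = 4;

    enum State { STOPPED, RUNNING, PAUSED }

    /** Hypothetical entry points matching the pseudo code in the comment. */
    interface ServerLifecycle {
        void start();
        void stop();
        void pause();
        void resume();
        State state();
    }

    /** Stand-in server used only to exercise the dispatcher. */
    static final class InMemoryServer implements ServerLifecycle {
        private State state = State.STOPPED;
        public void start() { state = State.RUNNING; }
        public void stop() { state = State.STOPPED; }
        public void pause() { if (state == State.RUNNING) state = State.PAUSED; }
        public void resume() { if (state == State.PAUSED) state = State.RUNNING; }
        public State state() { return state; }
    }

    private final ServerLifecycle server;

    ServiceControlDispatcher(ServerLifecycle server) {
        this.server = server;
    }

    /** Applies one control request and returns the resulting server state. */
    State handle(int controlCode) {
        switch (controlCode) {
            case SERVICE_CONTROL_STOP:
                server.stop();
                break;
            case SERVICE_CONTROL_PAUSE:
                server.pause();
                break;
            case SERVICE_CONTROL_CONTINUE:
                server.resume();
                break;
            case SERVICE_CONTROL_INTERROGATE:
                break; // status query only: no state change
            default:
                break; // unknown codes are ignored in this sketch
        }
        return server.state();
    }
}
```

In a real Windows service the returned state would be reported back to the service control manager; that reporting step is outside the scope of this sketch.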


Tom Mueller added a comment - 17/Oct/12 08:19 PM

Marking the fix version field as "future-release". This is based on an evaluation by John, Michael, and Tom WRT the PRD for the Java EE 7 RI/SDK. This issue was deemed not to be a P1 for that release. If this is in error or there are other reasons why this RFE should be targeted for the Java EE 7 RI/SDK release, then change the fix version field back to an appropriate build.