Bug 5370

Summary: Spec is unclear whether JobOperator methods may/must execute synchronously or not (with TCK implications)
Product: jbatch
Reporter: ScottKurz
Component: SPEC
Assignee: ScottKurz
Status: NEW
Severity: normal
CC: BrentDouglas, cf126330, issues, mminella, radcortez
Priority: P5
Version: 1
Target Milestone: ---
Hardware: All
OS: All
Whiteboard: 1.0_mr_pending

Description ScottKurz 2013-09-09 15:27:09 UTC
Opening this as discussed on our ML with subject "JobOperator synchronous execution and the TCK".

We could decide to leave the spec alone in which case I think we would then only need a change to the TCK.

However, I'd propose another round of discussion and then making the results clear in the spec, along with a TCK fix.
Comment 1 mminella 2013-09-16 17:05:52 UTC
I'd advocate for an implementation specific option for configuring asynchronous JobOperator behavior for start/restart and make the default synchronous.  This allows the simple use case (kick off a job and wait for it to finish) as the default option with each implementation providing an appropriate way to configure asynchronous options (Spring would use a TaskExecutor, other implementations may use a ThreadPoolExecutor/etc).
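The pluggable-executor idea above can be sketched in plain Java. This is a minimal, self-contained illustration with hypothetical names (it is not the Spring Batch or JSR-352 API): the operator delegates to a `java.util.concurrent.Executor`, so the default stays synchronous and async becomes an opt-in swap.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the spec API: a JobOperator-like class whose
// threading behavior is determined by a pluggable Executor.
public class ConfigurableJobOperator {
    private final Executor executor;

    // Default: run the job on the caller's thread (synchronous).
    public ConfigurableJobOperator() {
        this(Runnable::run);
    }

    // Inject an async Executor (e.g. a thread pool) to opt in to async.
    public ConfigurableJobOperator(Executor executor) {
        this.executor = executor;
    }

    public String start(Runnable job, AtomicReference<String> status) {
        status.set("STARTING");
        executor.execute(() -> {
            status.set("STARTED");
            job.run();
            status.set("COMPLETED");
        });
        // With the synchronous default, the job has already finished here.
        return status.get();
    }
}
```

With the default constructor, `start` returns only after the job completes; passing a thread-pool `Executor` instead makes the same call return while the job is still in `STARTING`/`STARTED`.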
Comment 2 ScottKurz 2013-10-28 21:53:41 UTC
I propose we deal with this by clarifying that start/restart are asynch from the spec point of view.  

The question of whether an API executes in a blocking or non-blocking manner is a fundamental issue in specifying a portable programming model, especially in an environment like an EE server where threads are carefully managed resources.  If the spec needs both synch and asynch capabilities, we should add them in a future release, not leave it to an implementation-specific mechanism to decide which to use.

The closest we came to nailing this down was in bug 4110.   Unfortunately, Chris, as spec lead, thought only JobOperator#stop() needed clarification explicitly declaring it an asynchronous method, and did not update the start/restart methods accordingly.

All I can really point to in the spec is this in JobOperator#start

* Creates a new job instance and starts the first execution of that
* instance. 

though I think this does suggest asynch over synch.   

The TCK was built on the assumption of asynchronous execution: even in its early forms, it assumed an asynchronous start/restart.  This is why we originally were going to require the implementation to provide a callback mechanism.  With Cheng's help, we reworked this through the public ML into the JobExecutionWaiter SPI to allow for polling of job completion.

It's not that the TCK couldn't be re-engineered to accommodate synchronous start/restart... again, it's that I think the behavior should be the same across implementations.
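The polling approach the JobExecutionWaiter SPI enables can be sketched as follows. The helper below is a hypothetical stand-in, not the actual TCK SPI: it polls the execution's batch status until a terminal state or a timeout is reached.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical polling helper, not the actual JobExecutionWaiter SPI.
public class PollingWaiter {
    // Terminal batch statuses per the JSR-352 state model.
    private static final Set<String> TERMINAL =
            Set.of("COMPLETED", "FAILED", "STOPPED", "ABANDONED");

    // Poll the supplied batch status until the job ends or time runs out.
    public static String waitForEnd(Supplier<String> batchStatus,
                                    long timeoutMs, long sleepMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            String status = batchStatus.get();
            if (TERMINAL.contains(status)) {
                return status;
            }
            Thread.sleep(sleepMs);
        }
        throw new IllegalStateException("Job did not finish in time");
    }
}
```

In a real TCK run, the status supplier would wrap a `JobOperator#getJobExecution` lookup for the execution id returned by start/restart.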

Perhaps SpringBatch could do something like:  run the TCK in "asynch" mode to achieve compliance, and then factor "synch" mode as an extension?
Comment 3 mminella 2013-10-28 22:28:13 UTC
Scott,

We have a few objections with defaulting to async for the JobOperator's start/restart behavior.  Specifically:

1.  There isn't any use for the full JobExecution being returned by the JobOperator if the execution is async.  It will always be in the "STARTING" state and essentially useless.  If the behavior had been intended to be async from the start, I would have advocated for returning just the JobExecution id instead.
2.  When you consider most API calls, async as a default is not a normal use case.  Servlets, EJB calls, JDBC, etc. are all sync by default.  I don't see a precedent within Java where this type of behavior should be async by default.  SB has taken the approach that the JobOperator is configured by default to execute the job synchronously (we believe the more intuitive approach from a consumer's perspective) and allows the user to configure it to be async by overriding our TaskExecutor via Spring injection.
3.  You mention "run the TCK in "asynch" mode to achieve compliance, and then factor "synch" mode as an extension?"  This brings up a separate question about the TCK.  How does an implementation specify impl specific options for executing the TCK (beyond tweaking the properties the TCK already uses)?  I have been running under the assumption that we couldn't add anything for the TCK's execution (only defaults would be used).  If this is not the case, it should be documented what is and is not allowed from a customization perspective for an impl to do, yet still be considered passing.
Comment 4 cf126330 2013-10-29 03:17:55 UTC
At the beginning of the spec, it says "Batch processing is typified by bulk-oriented, non-interactive, background execution." By calling it background execution, although not a hard requirement, I guess it sets the direction for the entire spec. I automatically assumed the async model, though in hindsight I can see synchronous execution also has its value.

If we look at the key verbs in batch, "start" and "stop", Java developers can easily relate them to the Java thread lifecycle and execution, which implies asynchrony. Given the fundamental difference from other Java EE applications, I guess an async model won't necessarily be a surprise.

I prefer to have the async model as the default, and in future releases add portable mechanisms for the client to switch to synchronous execution.  For example, add a method:

long start(String jobName, Properties jobParams, boolean synchronousExecution);

Both the start and restart methods currently return a job execution id of type long.

As for which modifications are allowed in TCK testing, just to share my experience: users should only be allowed to modify jsr352-tck.properties (to change jsr352.impl.classes, various sleep times, and jvm.options). jvm.options can be used to pass various options to the target batch runtime.  The installation of a batch implementation typically also includes some config files to customize its behavior, for instance bin/jberet.properties.
Comment 5 ScottKurz 2013-11-06 19:43:42 UTC
(In reply to mminella from comment #3)

Michael, I'm still thinking about this issue, but let me address a couple of your points in the meantime:

> We have a few objections with defaulting to async for the JobOperator's
> start/restart behavior.  Specifically:
> 
> 1.  There isn't any use of the full JobExecution being returned by the
> JobOperator if the execution is async.  It will always be in the "STARTING"
> state and essentially useless.  If the behavior would have been intended to
> be async from the start, I would have advocated for just returning just the
> JobExecution id instead.

The return type from start/restart is indeed 'long'.  It is only the TCK's wrapper that returns a JobExecution.

The purpose of the wrapper is to reduce boilerplate code waiting for an execution to complete, given that the TCK assumed execution was asynchronous.
A further purpose is to allow an implementor to introduce their own mechanism to signify the job is done executing (i.e. COMPLETED, FAILED, STOPPED) in case a means other than the TCK's built-in polling mechanism is preferred (e.g. a callback).

So, indirectly, I think we're agreeing that the use of a 'long' return type rather than JobExecution suggests asynch execution.
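The callback alternative mentioned above, which the waiter SPI was meant to leave room for, might look like this minimal sketch (a hypothetical shape, not the actual TCK interface): the runtime signals completion and the test blocks on a latch instead of polling.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical callback-style waiter, not the actual TCK SPI.
public class CallbackWaiter {
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile String finalStatus;

    // The batch runtime would invoke this when the job reaches a
    // terminal state (COMPLETED, FAILED, STOPPED, ...).
    public void jobEnded(String batchStatus) {
        finalStatus = batchStatus;
        done.countDown();
    }

    // The test blocks here until the callback fires or the timeout expires.
    public String awaitEnd(long timeoutMs) throws InterruptedException {
        if (!done.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("Timed out waiting for job");
        }
        return finalStatus;
    }
}
```

Compared with polling, this avoids repeated repository lookups but requires the implementation to expose a completion hook.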


> 2.  When you consider most API calls, async as a default is not a normal use
> case.  Servlets, EJB calls, JDBC, etc.  All of these are sync by default.  I
> don't see a precedent within java where this type of behavior should be
> async by default.  SB has taken the approach that the JobOperator is
> configured by default to execute the job synchronously (we believe the more
> intuitive approach from a consumer's perspective) and allow the user to
> configure it to be async by overriding our TaskExecutor via Spring injection.

Agreed, synch is the default in Java in general.  The assumption that async as a default was "obvious" in a batch context has clearly been proved incorrect.

> 3.  You mention "run the TCK in "asynch" mode to achieve compliance, and
> then factor "synch" mode as an extension?"  This brings up a separate
> question about the TCK.  How does an implementation specify impl specific
> options for executing the TCK (beyond tweaking the properties the TCK
> already uses)?  I have been running under the assumption that we couldn't
> add anything for the TCK's execution (only defaults would be used).  If this
> is not the case, it should be documented what is and is not allowed from a
> customization perspective for an impl to do, yet still be considered passing.

That suggestion seems confusing and I think it's incomplete.   

What I was trying to get at here was simply the idea that we all decide to agree asynch is the standard behavior and an implementation will run the TCK against it (with asynch behavior, accordingly).

I didn't mean to get into the details of "modes" of TCK execution for a given implementation.  I was just suggesting:  provide the synch behavior however you want, just as long as we recognize it's non-standard.

However, even better, I think, would be to look ahead and consider how we could provide a portable synch behavior alongside a portable asynch behavior.

Would you care to share a proposal then, based on your experience?
Comment 6 ScottKurz 2013-11-06 19:47:30 UTC
(In reply to cf126330 from comment #4)
> At the beginning of the spec, it says "Batch processing is typified by
> bulk-oriented, non-interactive, background execution." By calling it
> background execution, although not a hard requirement, I guess it sets the
> direction for the entire spec. I automatically assumed async model, though
> by hindsight, I can see synchronous execution also has its value.

Yes, we assumed the same, and I agree this is the closest the spec comes to stating this explicitly.... though not explicitly enough.

> As for which modifications are allowed in TCK testing, just to share my
> experience.  Users should only be allowed to modify jsr352-tck.properties
> (to change jsr352.impl.classes, various sleep time, and jvm.options).
> jvm.options can be used to pass in varios options to the target batch
> runtime.  The installation of an batch implementation typically also include
> some config files to customize its behavior, for instance,
> bin/jberet.properties.

Yes, I think Cheng's answer is more or less complete: you're allowed to change the values in 'jsr352-tck.properties' (though I'd like to reserve the right to tweak my answer slightly later in case of an unforeseen situation).

But if you have any questions at all about what can and cannot be modified in order to pass the TCK, please ask in a separate email.  We can discuss as well as collect the question on one of the TCK-related Wiki pages.

Thanks,
Comment 7 mminella 2013-11-14 22:02:12 UTC
"Would you care to share a proposal then, based on your experience?"

When I think of APIs that provide both sync/async options or even thread safe vs not thread safe options, I can't think of many scenarios where you just add a parameter and you get one version or another.  There isn't a PreparedStatement#executeUpdate(boolean async) method for example that runs your query asynchronously if you pass in true.  You typically are dealing with a completely different class or group of classes when you move from sync to async.  

The way the JobOperator currently works with Spring Batch is that we use a TaskExecutor to execute the job.  By default, we use a SyncTaskExecutor but allow the user to inject an async implementation if desired (using standard Spring DI).  Obviously this is a very Spring-specific example, but it brings me to my point, which is that providing a pluggable way to switch implementations won't be easy, and I wouldn't advocate for it as part of the 1.0 spec.  The barriers to this are twofold:

  1. Finding an abstraction that works in all environments for an Executor/WorkManager/etc. type of feature.
  2. Determining how to apply that abstraction to the JobOperator in a consistent manner.  The fact that the factory that provides the JobOperator (BatchRuntime) is a class instead of an interface prevents any implementation-specific customization to handle something like choosing which implementation to provide.

I'm not saying that these barriers are insurmountable, but I do think that coming up with a standardized way to handle the switching between the two may be too big of a change for 1.0.

One additional point I'd like to bring up with regard to the sync vs. async default debate is the orchestration of jobs.  While SB and the spec have both (correctly) decided not to address orchestration directly (this provides much more flexibility to the end user and is a good thing IMHO), one of the key aspects of orchestrating jobs is knowing when one ends so the next can begin.  Using an async JobOperator, the only practical way to identify that a job has completed is via custom code that polls the JobRepository (which is what the TCK/RI do by default).  I think we'd agree that this isn't a practical solution in a large batch environment.  Sync processing allows the executor to know that a job is complete when the method returns.
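To illustrate the orchestration point, here is a minimal sketch (hypothetical names, not any spec or Spring Batch API) of why a synchronous start makes chaining trivial: each call returns only when the job ends, so the next job can be launched, or the chain aborted, immediately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical orchestrator: each "job" is a Supplier that runs to
// completion and reports its final batch status.
public class SyncOrchestrator {
    public static List<String> runInSequence(List<Supplier<String>> jobs) {
        List<String> results = new ArrayList<>();
        for (Supplier<String> job : jobs) {
            String status = job.get(); // returns only when the job ends
            results.add(status);
            if (!"COMPLETED".equals(status)) {
                break; // stop the chain on FAILED/STOPPED
            }
        }
        return results;
    }
}
```

With an async start, the loop body would instead have to poll the repository (or register a callback) before deciding whether to launch the next job.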
Comment 8 ScottKurz 2013-11-19 17:54:52 UTC
Michael,

Thanks for writing up your thoughts.   

For the 1.0 timeframe, though, I still think we need to enforce a hard requirement on an asynch start/restart API.

It's a shame we hadn't realized earlier that this was still subject to interpretation.   Within IBM we had never considered that start/restart would be anything other than asynch, but I don't think we neglected to communicate our understanding either.

We had a public ML discussion on the subject of the mechanism used by the TCK to wait for completion (thereby assuming asynch), starting here: 

 https://java.net/projects/jbatch/lists/public/archive/2013-03/message/88

Later in this thread Cheng suggested we switch the TCK to allow for polling on job termination rather than forcing the batch implementation to provide a callback mechanism as we'd been requiring up until that point.

 https://java.net/projects/jbatch/lists/public/archive/2013-03/message/104

We refactored the TCK SPI and added a polling implementation as the built-in default in response.

In this same thread, you even mentioned the SB JobLauncher behavior, and I replied (perhaps not firmly enough) that we were only going to support asynch execution in this message:

 https://java.net/projects/jbatch/lists/public/archive/2013-03/message/107

---

You objected to a de facto polling requirement, but there are ways of avoiding this, e.g. a callback mechanism like the one the RI had been using.

I hadn't wanted to rush into standardizing a synch vs. asynch approach but hadn't wanted to hold us back either. 

I was offering that mainly in case you wanted to be able to present a synch API with the idea that it wasn't yet in the standard but that it was likely to be the future direction in 1.1.

---

Long post and long thread, so just to summarize:  our position is start()/restart() must be asynch, the TCK won't change, and the spec will be clarified via errata in the next maintenance release.
Comment 9 BrentDouglas 2013-11-19 23:15:39 UTC
This sounds settled but I thought I would put my thoughts down anyhow.

I think this issue goes hand in hand with the one regarding JobExecution immutability. It seems to me that the reason the TCK assumed the result of `JobOperator#getJobExecution` would reflect the current state of the running execution is that there is currently no other way to determine when the job has completed than polling the resultant value's batch status.

I think you could solve these two issues together as follows:

- The returned value of `JobOperator#getJobExecution` should be required to reflect the current state of the JobExecution when called and nothing more. Whether the returned value of JobExecution#getBatchStatus will change should be implementation-dependent and not required by the spec or TCK.

- `JobOperator#start` and `JobOperator#restart` should return something more like:

    public interface JobOperation extends Future<JobExecution> {

        long getJobExecutionId();
    }

Where the result of `JobOperation#get` should return the state of the execution once the batch runtime has finished processing the job, and `JobOperation#cancel` should be analogous to `JobOperator#stop`.

The result of `JobOperation#get` should not be required to change based on any subsequent calls to `JobOperator#restart`, `JobOperator#abandon` or the job repository being modified externally. If you have called `JobOperation#get` and there is a call to JobOperator#stop for that job before the call completes, `JobOperation#get` should throw a cancellation exception in the same way that it would for a call to `JobOperation#cancel`.

This will allow the caller to start the job asynchronously or synchronously as required and has the advantage of being familiar for people used to asynchronous programming.

It retains the ability to see the current state of the job during execution through `JobOperator#getJobExecution`, so you can still poll on either its result or `JobOperation#isDone`. Alternatively, you can get the result of the job when it is complete in a blocking manner through `JobOperation#get`.
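As a rough illustration of how the proposed Future-based JobOperation could be consumed, here is a self-contained sketch. `JobExecutionStub` and `startStubJob` are illustrative stand-ins, not part of any spec draft; only the `JobOperation` shape comes from the proposal above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class JobOperationDemo {

    // Stand-in for the spec's JobExecution (illustrative only).
    static final class JobExecutionStub {
        final long id;
        final String batchStatus;
        JobExecutionStub(long id, String batchStatus) {
            this.id = id;
            this.batchStatus = batchStatus;
        }
    }

    // The proposed return type for start()/restart().
    interface JobOperation extends Future<JobExecutionStub> {
        long getJobExecutionId();
    }

    // A toy "runtime" that completes the job on a background thread.
    static JobOperation startStubJob(long id) {
        CompletableFuture<JobExecutionStub> f = CompletableFuture.supplyAsync(
                () -> new JobExecutionStub(id, "COMPLETED"));
        return new JobOperation() {
            public long getJobExecutionId() { return id; }
            public boolean cancel(boolean mayInterrupt) { return f.cancel(mayInterrupt); }
            public boolean isCancelled() { return f.isCancelled(); }
            public boolean isDone() { return f.isDone(); }
            public JobExecutionStub get()
                    throws InterruptedException, ExecutionException {
                return f.get();
            }
            public JobExecutionStub get(long timeout, TimeUnit unit)
                    throws InterruptedException, ExecutionException, TimeoutException {
                return f.get(timeout, unit);
            }
        };
    }
}
```

The caller chooses the style: block on `get()` for synchronous-looking code, or poll `isDone()` (or ignore the handle entirely) for asynchronous use.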
Comment 10 ScottKurz 2014-01-07 21:53:02 UTC
Changed Javadoc for JobOperator's start and restart methods (I think the delta will be clear enough):

start:

    /**
     * Creates a new job instance and starts the first execution of that
     * instance, which executes asynchronously.
     ...

restart:

    /**
     * Restarts a failed or stopped job instance,
     * which executes asynchronously.
     ...
Comment 11 ScottKurz 2014-01-09 15:06:11 UTC
Brent,

Sorry, I only just noticed your post as I was making my comment wrapping this up (I missed it before).

Thanks for the contribution.

Your proposal seems like a fine starting point were we to try to standardize a blocking execution API.    Though I think a change like that fits better with a possible 1.1 update than treatment as a "spec errata", and the latter is what I'm trying to fit this bug into.

I'd suggest opening a new bug for it if you're interested.

I'd like to hear what other feedback we get on this issue going forward, to help us make this decision.