Bugzilla – Bug 4171
ChunkContext is missing
Last modified: 2013-02-05 20:09:24 UTC
There are places where a ChunkContext would be useful. Spring Batch does have this context.
Why is step context insufficient?
chunk context was not added
After reviewing the current version of the spec, I'd like to reopen this issue. A ChunkContext would be very valuable in the context of a partitioned step. Since there is a single step involved, I would expect there to be one StepContext. That can lead to partitions stepping on each other during its use. Introducing a ChunkContext would make sense in at least this scenario.
In the spec it does say, for each of JobContext/StepContext, that each thread gets its own copy of the context, e.g.:
"For a partitioned step, there is one StepContext for the parent step/thread; there is a distinct StepContext for each sub-thread."
Does that address your concern?
And to add to Scott's comment: each partition has access to a cloned copy of the job and step contexts, so there are no synchronization concerns. The partitions are free to write into their copy of the context, but those changes are not coalesced in any form into the original contexts. The developer can use the PartitionCollector to pass data from each partition to the PartitionAnalyzer, which in turn can coalesce and store that data in the original contexts. The collector runs on a partition's thread; the analyzer runs on the step's main thread.
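To illustrate the data flow Chris describes, here is a pure-JDK sketch (not the actual `javax.batch` API) of partition-local context clones plus a collector-to-analyzer channel. The class and variable names are illustrative; a plain `Map` stands in for the step context's user data, and a concurrent queue stands in for the PartitionCollector/PartitionAnalyzer hand-off.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PartitionContextDemo {
    public static void main(String[] args) throws InterruptedException {
        // "Master" step context user data, owned by the step's main thread.
        Map<String, Integer> masterContext = new HashMap<>();

        // Channel standing in for the PartitionCollector -> PartitionAnalyzer flow.
        ConcurrentLinkedQueue<Map.Entry<String, Integer>> collected =
                new ConcurrentLinkedQueue<>();

        List<Thread> partitions = new java.util.ArrayList<>();
        for (int p = 0; p < 3; p++) {
            final int id = p;
            // Each partition works against its own clone of the context.
            Map<String, Integer> clone = new HashMap<>(masterContext);
            Thread t = new Thread(() -> {
                clone.put("count", id * 10);  // partition-local write; never merged back
                collected.add(Map.entry("count-" + id, id * 10)); // explicitly collected
            });
            partitions.add(t);
            t.start();
        }
        for (Thread t : partitions) t.join();

        // The "analyzer" runs on the step's main thread and coalesces
        // only the explicitly collected data into the master context.
        for (Map.Entry<String, Integer> e : collected) {
            masterContext.put(e.getKey(), e.getValue());
        }

        System.out.println(masterContext.containsKey("count")); // false: clone writes lost
        System.out.println(masterContext.size());               // 3: one entry per partition
    }
}
```

Note that the direct write to the clone never reaches the master map; only data routed through the collector channel does, which is exactly the behavior being debated in this thread.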
To Comment #4: This might satisfy the concern, but I'm realizing I didn't quite understand the relationship of step execution contexts to partitioning. For example, in Spring Batch (SB) you can see that we do create separate step execution contexts for each partitioned step (see http://static.springsource.org/spring-batch/reference/html-single/index.html#partitioning). I'm realizing that this probably corresponds to a partition plan plus the cloned step context, but how we inject the key and name into the context when translating this to JSR 352 isn't quite clear yet.
Chris's clarification of Scott's comment is actually my concern. If I am processing within a chunk and I make a change to the StepContext, I would expect that change to be available to everyone within the step. Yet according to Chris's clarification it would not be, because I'd only be updating a clone. In SB, having a separate ChunkContext keeps us from running into this confusing scenario. The cloned-context approach Chris describes allows no shared state across partitions.
(In reply to comment #6)
To what key and name do you refer?
> To Comment#4: This might satisfy the concern but I'm realizing I didn't quite
> understand the relationship of stepExecutionContexts to partitioning. For
> example, in SB you can see that we do create separate step execution contexts
> for each partitioned step (see
> I'm realizing that this probably corresponds to a partition plan + the cloned
> step context but how we inject the key and name into the context in translating
> this to jsr352 isn't quite clear yet.
(In reply to comment #7)
Questions of my own:
1) what is in a chunk context?
2) what is the advertised purpose of the chunk context?
3) what is the cardinal relationship of a chunk context with a partition?
4) what is the sharing scope of a chunk context?
5) are chunk contexts available to non partitioned steps?
The following must be said about the partition model currently in the spec to continue this discussion:
1) It is designed so the programming model at the reader/writer/processor level is no different for partitioned vs non-partitioned execution. So StepContext is available whether partitioned or not.
Note that not all chunk-type steps write to their step context. Remember, the checkpointing model specified by the JSR persists checkpoint data via an explicit contract between the runtime and the reader/writer, not by persisting a context.
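The checkpoint contract mentioned above can be sketched in pure JDK terms (this is a simulation, not the real `javax.batch.api.chunk.ItemReader`; the `ListReader` class and its method names merely mirror that contract). The runtime asks the reader for its checkpoint data and hands the same data back on restart:

```java
import java.io.Serializable;
import java.util.List;

public class CheckpointDemo {
    // Stand-in for the reader side of the checkpoint contract.
    static class ListReader {
        private final List<String> items;
        private int cursor;

        ListReader(List<String> items) { this.items = items; }

        void open(Serializable checkpoint) {            // mirrors ItemReader.open(Serializable)
            cursor = (checkpoint == null) ? 0 : (Integer) checkpoint;
        }
        String readItem() {                             // mirrors ItemReader.readItem()
            return cursor < items.size() ? items.get(cursor++) : null;
        }
        Serializable checkpointInfo() {                 // mirrors ItemReader.checkpointInfo()
            return cursor;
        }
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "c", "d");

        ListReader reader = new ListReader(data);
        reader.open(null);
        reader.readItem();                              // "a"
        reader.readItem();                              // "b"
        // The runtime persists this value itself -- no context is involved.
        Serializable checkpoint = reader.checkpointInfo();

        ListReader restarted = new ListReader(data);    // simulated restart
        restarted.open(checkpoint);
        System.out.println(restarted.readItem());       // resumes at "c"
    }
}
```

The point is that restart position flows through an explicit `Serializable` exchanged between runtime and reader, independent of any StepContext state.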
Now, that is not to say the step context does not have a persistent field. It does, and batch artifacts may optionally use it. It is stored at each checkpoint or at end of step, whichever comes first. However, this persistent field is not part of the checkpoint/restart contract of the reader/writer. It is for other data the step implementation may wish to persist and get back upon restart, e.g. user-defined metrics, interim calculations, and the like.
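A minimal sketch of that persistent field's store/restore cycle, using plain Java serialization in place of the runtime's repository (the metrics map and its key are hypothetical examples of user data a step might place in the persistent field):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class PersistentUserDataDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical user-defined metrics a step might keep in the
        // step context's persistent field; the runtime stores the field
        // at each checkpoint and returns it on restart.
        HashMap<String, Long> metrics = new HashMap<>();
        metrics.put("rowsSkipped", 42L);

        // Simulate the runtime's store step with plain serialization.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        oos.writeObject(metrics);
        oos.flush();

        // Simulate restart: the runtime hands the deserialized data back.
        @SuppressWarnings("unchecked")
        HashMap<String, Long> restored = (HashMap<String, Long>)
            new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray())).readObject();

        System.out.println(restored.get("rowsSkipped")); // survives the "restart"
    }
}
```

This is why the field must hold a `Serializable` value: the runtime round-trips it through its job repository, entirely apart from reader/writer checkpoint data.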
2) Partitions can contribute information to the overall step outcome.
Partitioned steps may push data from each partition to a step-level coalescing point using the PartitionCollector/PartitionAnalyzer artifacts. This makes it possible to take data from each partition's step context copy and send it to a consistent point for merging into the master copy of the step context.
This allows for sharing in a thread-safe manner without the user writing thread-safe code with synchronization, which would violate principle #1 above.
Moreover, the PartitionCollector/PartitionAnalyzer artifacts can be added when and if needed without disturbing the core reader/processor/writer logic. You would need them only if the step artifacts were writing data to the step context that needed to be collected together for the entire step.
> Chris's clarification of Scott's comment is actually my concern. If I am
> processing within a chunk and I make a change to the StepContext, I would
> expect that to be available to everyone within the step. Yet according to
> Chris's clarification, it would not be because I'd only be updating a clone.
> In SB, having a separate ChunkContext allows us to not run into this confusing
> scenario. Taking the cloned context approach as Chris proposes, this allows
> for no shared state across partitions.
Let me summarize my position on this subject:
Chunk context will not be added.
Each partition has a step context clone. The scope of this clone is partition-local. The reason is so the partitioned vs. non-partitioned programming model is the same.
A step that needs to coalesce partition-local values into the master step context may do so by using the PartitionCollector/PartitionAnalyzer pair.
Didn't mean to reopen this one.