Bugzilla – Bug 4269
buffer-size should be a boolean
Last modified: 2013-01-16 15:19:52 UTC
In section 5.2.1 of the spec, buffer-size is an integer used to indicate how many items to buffer prior to writing. However, this buffer size should be a function of the chunk size and not independent of that. Otherwise, you can end up with a scenario where the chunk size is 10 but a buffer size of 15. How is that handled?
This attribute should be a boolean indicating if the input should be buffered at all or not and let the implementation handle how large the buffer is.
I do not agree. The buffer-size is an integer on purpose so you can buffer and commit with different criteria - especially for the following scenario:
That means commit every 30 seconds, buffer 100 items before writing.
The case you are concerned with is
That means commit every 10 items, buffer 15 items before writing. The runtime will invoke itemWriter when the commit-interval is reached, regardless of the buffer-size specification.
I updated the documentation of buffer-size to include this statement:
When the checkpoint policy commit-interval is reached, the buffer is emptied to the ItemWriter immediately before the checkpoint is committed whether or not the buffer is full.
I'm a bit confused. Are checkpoints not tied to commits? The configuration you propose here:
Are you saying that you could have multiple buffer flushes per commit (something that is not allowed in Spring Batch) if you process, say, 200 items in that 30 second interval? This adds a level of complexity that I'm not seeing the benefit of. What is the use case where this is needed?
Yes, checkpoints are tied to commits insofar as when a checkpoint is taken, a commit occurs.
The reason for setting a buffer-size is so time-based checkpoint intervals don't get OOM. Time-based checkpoints are helpful if you're trying to avoid transaction timeouts.
BTW, default buffer-size is commit-interval when policy is items, which is also default. So if you're doing Spring style checkpointing, all you have to specify is commit-interval.
Multiple flushes is not a problem for transactional readers. But it is a problem for non-transactional readers.
Maybe a better approach is to control the commit interval with an item count and optional time limit. Buffering would then be a boolean choice. The count would cause a commit every N (N>0) items. The time limit would cause a commit every S (S>0) seconds. If buffering, the buffer would never be larger than N. The result would be commit every N items or S seconds, which ever comes first.
I don't see why OOM issues are tied only to time based checkpoints. They can happen in either case. If you are concerned with time based ones...either switch to item based ones or reduce your time interval.
I think this complexity is making this feature less useful than it could be. In Spring Batch, the framework supports two ways of identifying commit intervals: number of items and the result of a call to the CompletionPolicy.isComplete. This is setup this way because item counting is a fundimental piece of information when processing an item based step (so we include that as a commit interval metrics) and provide a point of extension for the developer to provide any other method they wish. This eliminates the confusing scenarios we are getting into (where timeouts and item counts work in special ways, etc). With it implemented this way, along with tying checkpoints to commit interval, the commit interval becomes a boolean and simplifies the entire thing.
My thoughts ...
1st. Well, sure OOM is not exclusive to time-based checkpoints. I will suggest, however, time-based is more vulnerable. I will also suggest item-based is more vulnerable to transaction timeouts than is time-based.
2nd. I agree the currently specified chunk externals for checkpoint configuration are confusing and need to be changed.
3rd. I agree that buffer-size should be a boolean (e.g. buffer-items=true|false)
4th. I do not agree to have an item-based checkpoint criterion only.
5th. I offer an analysis and proposal below that I invite you to consider.
It would be useful to take a step back and consider what requirements and issues I think are driving the spec's approach toward specifying a step's checkpoint policy:
1. Requirement: batch developers must have a simple declarative way to specify checkpoint policy.
2. Requirement: declarative checkpoint policy must enable reliable, predictable checkpoint behavior.
3. Requirement: batch developers should have a way to provide custom checkpoint criteria.
1. Issue: A strictly item based checkpoint policy is vulnerable to transaction timeouts.
Why? Because you can't reliably predict how long it takes to process N items and you are not assured of specifying a transaction timeout in your job that will guarantee success in your production environment.
Observation: specifying time as a checkpoint criterion facilitates avoiding transaction timeouts by making the duration of the checkpoint interval more predictable.
2. Issue: A stricly time based checkpoint policy (with buffering enabled) is vulnerable to OOM.
Why? Because you can't reliably predict how many items the system can buffer in T amount of time and you are not assured your job has access to sufficient memory to guarantee success in your production environment.
Observation: specifying a buffer limit (i.e. item-count) as a checkpoint criterion facilitates avoiding OOMs by making the checkpoint working set more predictable.
3. Issue: it is insufficient to overcome issues 1 & 2 above in a production environment by instructing operators (or developers) to simply modify their checkpoint criteria.
Why? Because post-production tuning is disruptive to production itself and needs to be avoided. The need for post-production tuning can be reduced by tuning in a pre-production environment. However, accurate tuning depends on homologous environments, which itself cannot be assured.
I submit the following proposal to address the aforementioned requirements and address the stated issues in a straight-forward way.
1. Make item based checkpoint criteria default and simple.
2. Allow for both count and time as checkpoint criteria.
3. Allow for for custom checkpoint criteria.
4. Accomodate item buffering as an orthogonal capability.
Externals (chunk attributes)
The default is item. Specifying custom means use a checkpoint-algorithm (already in spec), which is specified separately in the job. Reminder: a checkpoint-algorithm is an application-provided boolean-based commit interval, allowing for user-defined checkpoint criteria.
2. item-count=<number of items before checkpoint>
Default=10. After the specified number of items is processed, a checkpoint is taken.
3. time-limit=<time before checkpoint>
Default=0, which means no time limit. After the specified amount of time transpires, a checkpoint is taken.
The default is true. Specifies whether items are buffered until checkpoint interval reached. When checkpoint interval reached, itemWriter is called once with a list of the buffered items.
Note: if you combine item-count and time-limit, a checkpoint occurs after item-count items or after time-limit time, whichever happens first.
I am going to put the proposal described in comment #5 into the spec and close this bug.