Bugzilla – Bug 4712
TCK ParallelExecutionTests Issues
Last modified: 2013-09-06 19:19:42 UTC
1. testInvokeJobWithOnePartitionedStep, testStopRunningPartitionedStep, testStopRestartRunningPartitionedStep - How do these test anything other than that a successful run happened? They do not validate that any partitioning or parallel execution of any kind actually occurred.
2. testStopRestartRunningPartitionedStep - Doesn't validate that the state of the job is maintained between runs.
3. testInvokeJobSimpleSplit - job_split_batchlet_4steps has an invalid next element in it, so the job shouldn't even start. This test also does not validate that anything executes in parallel.
4. testPartitionedCollectorAnalyzerReducerChunkRestartItemCount10 - Does not validate that any parallelization occurs. Does not validate which partitions failed vs. which didn't. Also does not validate that the job failed for the expected reason.
I realize this is not exactly a timely response. Still, since you took the trouble to post, I wanted to follow up with a complete review.
There are no spec issues, just arguments that the tests could have been more thorough.
The first point I'll address is the suggestion that we should have tested that parallel execution really occurs (I assume you mean, e.g., looking for interleaved timestamps).
Maybe we didn't publicize this enough, but we decided to stick to the letter of the spec, which doesn't prevent an implementation from executing each of a step's partitions serially on a single thread.
(Though, as we coincidentally discussed just yesterday, we do assume you have at least two threads, in that the job executes on a different thread than the JobOperator client.)
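For what it's worth, here is a rough sketch of what a "did partitions really run concurrently" check could look like. This is not part of the TCK and deliberately avoids the batch APIs: it's plain Java with a made-up `ParallelismCheck` class, where an executor stands in for a partitioned step and each "partition" records the thread it ran on. A latch forces all partitions to be in flight at once, so a genuinely concurrent setup must use more than one thread:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelismCheck {
    public static void main(String[] args) throws Exception {
        int partitions = 3;
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        CountDownLatch allRunning = new CountDownLatch(partitions);

        for (int i = 0; i < partitions; i++) {
            pool.submit(() -> {
                // Each "partition" records the thread it ran on.
                threadNames.add(Thread.currentThread().getName());
                allRunning.countDown();
                try {
                    // Block until every partition is running at once,
                    // which can only succeed with concurrent execution.
                    allRunning.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        System.out.println("distinct threads: " + threadNames.size());
    }
}
```

Of course, since the spec permits a serial implementation, a check like this would be an implementation-quality probe, not a compliance test.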
Replying to the rest of your points:
1. Agreed, these tests perform only very minimal checks. Maybe stop is a bit interesting in that, assuming the implementation is partitioning, we force you to confirm that you view the overall job as stopped. But even for this one, and for the others, there is no check that we are indeed running partitions.
2. I'm not sure what you meant by the "state of the job" not being maintained. I'll observe that it would have been much better to test that the persistent user data was maintained across the restart. This is non-trivial, since each partition is supposed to get its own copy of the persistent user data. Too bad we didn't include that here; I don't see it in the later tests either.
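To illustrate what I mean by "each partition gets its own copy", here's a toy sketch (again not TCK code; the `PartitionPersistenceSketch` class and its in-memory map are made up, standing in for the job repository a real runtime would use). Each partition persists distinct data before a simulated failure, and on restart each one must get its own data back rather than a shared value:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionPersistenceSketch {
    // Stand-in for the job repository: persistent user data keyed
    // by partition number.
    static final Map<Integer, Integer> repository = new HashMap<>();

    // First execution: each partition persists its own counter
    // before the job "fails".
    static void firstRun(int partitions) {
        for (int p = 0; p < partitions; p++) {
            repository.put(p, p * 10); // distinct data per partition
        }
    }

    // Restart: each partition restores its own copy, not a shared one.
    static int restore(int partition) {
        return repository.get(partition);
    }

    public static void main(String[] args) {
        firstRun(3);
        for (int p = 0; p < 3; p++) {
            System.out.println("partition " + p + " restored " + restore(p));
        }
    }
}
```

A TCK test along these lines would set distinct persistent data on each partition in the first execution, fail the job, restart it, and assert that every partition sees exactly the data it persisted.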
Another miss: we probably should have forced the issue by confirming that setting the exit status on a partition thread has no bearing on the job's exit status. The JobContext is just a thread-local copy. You noticed this behavior in splits and were a bit surprised by it, so it's clearly non-trivial and would have been nice to cover. (I don't see us covering this elsewhere.)
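The "thread-local copy" behavior can be mimicked in plain Java, which may make it clearer. This is only an analogy, not the batch runtime's actual mechanism: the made-up `ThreadLocalContextSketch` class uses a `ThreadLocal` so that a partition thread mutating "its" exit status leaves the job-level value untouched:

```java
public class ThreadLocalContextSketch {
    // Stand-in for JobContext: each thread gets its own copy of the
    // mutable exit-status holder.
    static final ThreadLocal<String[]> exitStatus =
            ThreadLocal.withInitial(() -> new String[] { null });

    public static void main(String[] args) throws Exception {
        // Set on the "top-level job" thread.
        exitStatus.get()[0] = "JOB_EXIT";

        Thread partition = new Thread(() -> {
            // The partition thread mutates only its own copy...
            exitStatus.get()[0] = "PARTITION_EXIT";
        });
        partition.start();
        partition.join();

        // ...so the job-level value is unchanged.
        System.out.println("job exit status: " + exitStatus.get()[0]);
    }
}
```

A test of the real behavior would call `JobContext.setExitStatus(...)` from a partition's batchlet or listener and then assert that the completed job execution's exit status is unaffected.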
3. The invalid @next attribute has been removed.
4. I see your point that the test could indeed pass if any two partitions were to resume and then each execute a single chunk (really, only partitions #1 and #2 should restart, and never partition #0).
Still, this made the test easier to write, and I'd guess most implementations that pass will have implemented this correctly, though it would have been even better to verify it, as you mentioned.
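The stricter check you're asking for could be sketched like this. Again this is plain Java with invented names, not TCK code: the completion map below simulates the state after the first failed execution (partition #0 completed, #1 and #2 did not), and the assertion is that only the incomplete partitions run again on restart:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RestartPartitionsSketch {
    public static void main(String[] args) {
        // Simulated completion state after the first (failed) execution:
        // partition 0 completed; partitions 1 and 2 did not.
        Map<Integer, Boolean> completed = Map.of(0, true, 1, false, 2, false);

        // On restart, only the incomplete partitions should execute again.
        List<Integer> restarted = new ArrayList<>();
        for (int p = 0; p < 3; p++) {
            if (!completed.get(p)) {
                restarted.add(p);
            }
        }
        System.out.println("restarted partitions: " + restarted);
    }
}
```

A real test would record which partition numbers actually ran in the restart execution (e.g., via a collector) and fail if partition #0 appears or if either of #1/#2 is missing, rather than just counting chunks.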