Bug 4702 - TCK ChunkTests issues
Status: RESOLVED FIXED
Product: jbatch
Classification: Unclassified
Component: source
Version: 1
Hardware: All
OS: All
Importance: P5 normal
Target Milestone: ---
Assigned To: ScottKurz
Depends on:
Blocks:
Reported: 2013-02-20 21:02 UTC by mminella
Modified: 2013-08-29 14:46 UTC
CC: 3 users

Description mminella 2013-02-20 21:02:59 UTC
Reviewing the test cases in the ChunkTests test within the TCK raises the following issues:

1.  testChunkRestartItemCount7, testChunkRestartItemCount10, testChunkRestartChunk5 - How do these methods validate that the jobs failed at the correct point, restarted at the correct point and processed the correct number of records in each attempt?  The only assertions that are made are the status of the job and if it was restarted (via an instance id check).  Also, besides the number of records to be read in/processed, all three of these methods seem to be the same test (redundant).


2.  testChunkRestartCustomChunkDefaults - This test doesn't actually verify the chunk size directly.  Since time can have an impact on chunk size, while this test will probably work, it's not the most direct way of verifying chunk size.  


3.  testChunkRestartCustomCheckpoint - No validation that the number of items processed is correct, no validation that it was a true restart (same issue as all of the tests above), no validation that checkpointing itself was used to maintain the required state (it just requires that the state be maintained between executions which could probably be accomplished by reusing the same instances of the batch artifacts).

4.  testChunkTimeBasedDefaultCheckpoint - Error message in MyTimeCheckpointListener#beforeChunk is incorrect.  It implies that the chunk committed sooner when it could have also committed later.  Also, I didn't think that there was a default timeout for chunk processing...

5.  testChunkRestartTimeBasedCheckpoint - Does not validate that checkpointing is actually being used for this.  Also does not validate the number of items processed each run or total number of items processed.

6.  testChunkRestartTimeBasedDefaultCheckpoint - Does not validate that checkpointing is being used for this.  Does not validate the number of items processed per run or total.

7.  testChunkSkipRead, testChunkSkipProcess - All of the statuses set by the listeners are the same so if any of the listeners are triggered, the test would pass.  Statuses should be more specific.

8.  testChunkSkipWrite - Tests against the default exit status so it would pass if no items were skipped.

9.  testChunkSkipReadExceedSkip, testChunkSkipProcessExceedSkip, testChunkSkipWriteExceedSkip - All of the statuses set by the listeners are the same so if any of the listeners are triggered, the test passes.  No validation that the job failed due to the skip limit being hit (just that a skip occurred and that the job failed).

10.  testChunkSkipNoReadSkipChildEx - All of the statuses set by the listeners are the same so if any of them are triggered via a skip, the test passes.  No validation that the failure of the job was due to the skip limit being hit.  The only validation that occurs is that an item was skipped and that the job failed.

11.  testChunkRetryRead - Does not use/validate retryable-exceptions.  Does not validate that any retry occurred, just that the job was completed successfully.

12.  testChunkItemListeners - All three runs validate against the read listener's constant (instead of the appropriate listener's constant per each run).  All of the constants also are set to the same value so the validation would be inconclusive.  Also the error methods on the listeners are not tested.
Comment 1 ScottKurz 2013-02-20 23:15:58 UTC
(In reply to comment #0)

> 1.  testChunkRestartItemCount7, testChunkRestartItemCount10,
> testChunkRestartChunk5 - How do these methods validate that the jobs failed at
> the correct point, restarted at the correct point and processed the correct
> number of records in each attempt?  The only assertions that are made are the
> status of the job and if it was restarted (via an instance id check).  Also,
> besides the number of records to be read in/processed, all three of these
> methods seem to be the same test (redundant).

Michael, thanks for the review.

We're still going through your list, but I wanted to start by giving an answer to the first since it will show a pattern we use often in these TCK tests.

You're not seeing more assertions (than the checking of batch status) in the testng/junit test method, since the validation is really happening in the batch artifact.   

The batch artifact, in this case the writer (DoSomethingArrayItemWriterImpl), contains logic to determine if the chunk writes are occurring at the correct point upon restart.  It can distinguish between the original run vs. the restart case by the value of the checkpoint data passed into the open() method.

If the checkpoint data is not in sync with the known write point, the writer artifact throws an Exception, which fails the restart job and puts the BatchStatus into the FAILED state.

The artifact is written to be reused across more than one test by injecting the list of expected values via JSL as BatchProperty(s). 

The three methods use different job XMLs in which the item-count values are set to different values: less than the default value (10), greater than the default value, and exactly equal to the default value, to ensure these values are being honored over the default.
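A minimal sketch of this validation pattern (illustrative names only, not the actual TCK classes; in the real tests the expected write points arrive via @BatchProperty injection from the JSL, and open() receives a Serializable):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for the self-validating writer artifact described
// above. A real ItemWriter would receive the checkpoint via open(Serializable).
public class CheckpointValidatingWriterSketch {

    // In the TCK these expected write points would be injected via JSL as
    // @BatchProperty values; hard-coded here for the sketch.
    private final List<Integer> expectedWritePoints = Arrays.asList(7, 14, 21);

    // A null checkpoint means an original execution; non-null means a restart.
    public void open(Integer checkpoint) {
        if (checkpoint != null && !expectedWritePoints.contains(checkpoint)) {
            // Throwing here fails the restart job, putting its BatchStatus
            // into FAILED -- which is what the test method then asserts on.
            throw new IllegalStateException(
                "checkpoint " + checkpoint + " is not a known write point");
        }
    }

    public static void main(String[] args) {
        CheckpointValidatingWriterSketch w = new CheckpointValidatingWriterSketch();
        w.open(null); // original execution: OK
        w.open(14);   // restart at a valid write point: OK
        try {
            w.open(5); // restart out of sync: job would go to FAILED
        } catch (IllegalStateException e) {
            System.out.println("restart job FAILED: " + e.getMessage());
        }
    }
}
```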

Having said all that, I don't see a further need to "validate that the jobs failed at the correct point," as you mentioned.

------------------------------------------------------------------------------

Though we can appreciate the value of keeping all the assertions in one place (e.g. the JUnit/TestNG test method), that approach seemed to result in unnecessarily passing data back and forth when simply "blowing up" would do the trick.

Will review the others tomorrow...
Comment 2 mminella 2013-02-21 14:45:12 UTC
Scott,

I have to disagree that the validation you are proposing goes far enough.  From what you are saying, the assumption is being made that the state is being reset correctly on the restart in order to validate that the job picks up where it left off.  However, that cannot be a valid assumption: the test should verify that the state was re-established correctly, and therefore that the job is picking up at the correct spot.  For example, if my runtime does not pass in an Externalizable on restart, then (since you do not pass in a record-fail parameter on the restart) the ItemReader assumes it is a new run, resets the index, and reruns the job to completion, so the test would pass.  Is my understanding incorrect?
Comment 3 ScottKurz 2013-02-21 21:25:56 UTC
(In reply to comment #2)
> Scott,
> I have to disagree that the validation you are proposing goes far enough.  From
> what you are saying, the assumption is being made that the state is being reset
> correctly on the restart in order to validate that the job is picking up where
> it left off.  However, that cannot be a valid assumption.  The test should
> verify that the state was re-established correctly, and therefore is picking up
> at the correct spot.  For example, if my runtime does not pass in an
> Externalizable on restart, since you do not pass in a record fail parameter on
> the restart, the ItemReader assumes that it is a new run and therefore resets
> the index and would rerun the job to completion and the test would therefore
> pass.  Is my understanding incorrect?

Michael, 

It doesn't sound like you have an issue with the general idea of validation done by the artifact but you're saying we have a gap in this test (or tests) where we would incorrectly pass a wrong impl.

I think you make a good point.

As it turns out, the test would indeed fail (as it should), but maybe in too roundabout a way.  The writer (DoSomethingSimpleArrayWriter, not DoSomethingArrayItemWriterImpl as I'd said earlier) uses a fixed int[30] array to write to.  If only the reader failed to checkpoint and started from the beginning, while the writer steadily advanced its index, the writer would overflow the array by the time the original execution + restart were finished.  (If the writer failed to checkpoint, either alone or together with the reader, we get a more direct failure.)

I think we could close this gap by more directly checking the checkpoint value passed into the reader open.   
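A rough sketch of that more direct check (illustrative stand-ins only; in the TCK the restart flag would come from the persistent step context and the expected value from a @BatchProperty):

```java
// Illustrative sketch of the "more direct" check: on restart, the reader's
// open() asserts the checkpoint it is handed equals the read count recorded
// at failure, instead of silently resetting when nothing is passed in.
public class ReaderCheckpointCheckSketch {

    private int index;

    // isRestart and expectedCheckpoint are illustrative parameters standing
    // in for persistent step context data and an injected batch property.
    public void open(Integer checkpoint, boolean isRestart, int expectedCheckpoint) {
        if (isRestart) {
            if (checkpoint == null || checkpoint != expectedCheckpoint) {
                // A runtime that fails to pass the checkpoint back is caught
                // immediately, rather than via downstream array overflow.
                throw new IllegalStateException(
                    "expected restart checkpoint " + expectedCheckpoint
                        + " but got " + checkpoint);
            }
            index = checkpoint;
        } else {
            index = 0; // original execution starts from the beginning
        }
    }

    public int getIndex() { return index; }

    public static void main(String[] args) {
        ReaderCheckpointCheckSketch r = new ReaderCheckpointCheckSketch();
        r.open(12, true, 12); // correct restart: resumes at item 12
        System.out.println("resumed at index " + r.getIndex());
    }
}
```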

On a side note, I'll mention that we're using the persistent step context at the moment to provide the variable the artifact(s) use to distinguish whether this is the restart or the original execution.   For some of these tests, we were at one point using restart override parameters to make this same distinction.  Restart parms were removed from the spec, then re-added... so you might see some leftover artifacts of this churn.

Another note, which gets to your point 2.).... these tests were written before the spec moved to allow item or time-based checkpointing, whichever came first... so they made some assumptions which might need some rethinking.   

I'm not sure how much we've thought about this... so I owe you an answer.   Also, in looking at this, I think we might have an RI bug in how we calculate elapsed time... which could additionally confuse the execution of these tests.
Comment 4 ScottKurz 2013-02-22 23:21:08 UTC
Finally getting around to responding in detail (there's still a place or two I need to look at more closely, but I didn't want to end the week without responding).

Thanks again for your comments.

(In reply to comment #0)
> Reviewing the test cases in the ChunkTests test within the TCK raises the
> following issues:
> 1.  testChunkRestartItemCount7, testChunkRestartItemCount10,
> testChunkRestartChunk5 - How do these methods validate that the jobs failed at
> the correct point, restarted at the correct point and processed the correct
> number of records in each attempt?  The only assertions that are made are the
> status of the job and if it was restarted (via an instance id check).  Also,
> besides the number of records to be read in/processed, all three of these
> methods seem to be the same test (redundant).

Building on my earlier comments...let me put this into my own words and see if you agree. 

"We are not directly checking that the reader/writer open() methods are passed 
the current checkpoint values on original execution and restart.   
We are relying too much on downstream behavior in the writer."

Now, I'd say the logic in the writer is a bit complicated in that it relies on an interplay between persistent user data, @BatchProperty(s), and a couple array indices... but I believe we are effectively testing that the correct number of records are being written in each chunk, in each execution.   

> 2.  testChunkRestartCustomChunkDefaults - This test doesn't actually verify the
> chunk size directly.  Since time can have an impact on chunk size, while this
> test will probably work, it's not the most direct way of verifying chunk size.  

This doesn't seem to be doing anything important...will look into it and possibly remove.

> 3.  testChunkRestartCustomCheckpoint - No validation that the number of items
> processed is correct, no validation that it was a true restart (same issue as
> all of the tests above), no validation that checkpointing itself was used to
> maintain the required state (it just requires that the state be maintained
> between executions which could probably be accomplished by reusing the same
> instances of the batch artifacts).

As in 1), I think the number of items processed is effectively tested, as is
the writer checkpoint (I mentioned this in my comment #3).   As in 1), the
reader checkpointing could be tested more directly.   

Regarding your point, "no validation that it was a true restart": I'm not sure I'm understanding... have I already addressed that, or are you suggesting something further?  I could imagine some additional assertions, but I'm not sure they'd add much value, so I'd like to hear your thoughts.

> 4.  testChunkTimeBasedDefaultCheckpoint - Error message in
> MyTimeCheckpointListener#beforeChunk is incorrect.  It implies that the chunk
> committed sooner when it could have also committed later.  Also, I didn't think
> that there was a default timeout for chunk processing...

Thanks for catching that... our code is assuming a default of 10 seconds. The default should now be '0', i.e. don't checkpoint due to time.  Also the error message should be improved.
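For reference, the attribute in question is the chunk's time-limit in the JSL.  With the corrected default of 0, a job that wants time-based checkpointing has to ask for it explicitly.  A hypothetical fragment (artifact refs are made up):

```xml
<!-- Illustrative JSL fragment: time-limit="0" (the corrected default) means
     never checkpoint due to elapsed time; a positive value such as
     time-limit="5" forces a checkpoint after at most 5 seconds, even if
     fewer than item-count items have been read. -->
<chunk checkpoint-policy="item" item-count="10" time-limit="5">
    <reader ref="myReader"/>
    <writer ref="myWriter"/>
</chunk>
```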

> 5.  testChunkRestartTimeBasedCheckpoint - Does not validate that checkpointing
> is actually being used for this.  Also does not validate the number of items
> processed each run or total number of items processed.
> 6.  testChunkRestartTimeBasedDefaultCheckpoint - Does not validate that
> checkpointing is being used for this.  Does not validate the number of items
> processed per run or total.

I'm assuming the earlier comments already do (or do not) address the points here
as well.  Correct?

> 7.  testChunkSkipRead, testChunkSkipProcess - All of the statuses set by the
> listeners are the same so if any of the listeners are triggered, the test would
> pass.  Statuses should be more specific.
> 8.  testChunkSkipWrite - Tests against the default exit status so it would pass
> if no items were skipped.

Agreed.  We'll make that change (to a reader/processor/writer-specific exit status).
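The fix can be sketched as follows (an illustrative stand-in, not the actual TCK listener classes): each listener sets a status unique to its artifact, so a skipped write can no longer satisfy a read-skip assertion.

```java
// Illustrative stand-in for artifact-specific skip statuses. In the TCK these
// would be set as the job's exit status from Skip{Read,Process,Write}Listener
// implementations; plain constants and a holder class are used here.
public class SkipStatusSketch {
    static final String READ_SKIP    = "BAD_EXIT_STATUS: READ_SKIP";
    static final String PROCESS_SKIP = "BAD_EXIT_STATUS: PROCESS_SKIP";
    static final String WRITE_SKIP   = "BAD_EXIT_STATUS: WRITE_SKIP";

    private String exitStatus;

    // Each method stands in for the matching listener callback,
    // e.g. SkipReadListener#onSkipReadItem(Exception).
    void onSkipReadItem()    { exitStatus = READ_SKIP; }
    void onSkipProcessItem() { exitStatus = PROCESS_SKIP; }
    void onSkipWriteItem()   { exitStatus = WRITE_SKIP; }

    String getExitStatus() { return exitStatus; }

    public static void main(String[] args) {
        SkipStatusSketch ctx = new SkipStatusSketch();
        ctx.onSkipWriteItem();
        // testChunkSkipWrite would now assert WRITE_SKIP specifically, so a
        // read skip (or no skip at all) can no longer make the test pass.
        System.out.println(ctx.getExitStatus());
    }
}
```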

> 9.  testChunkSkipReadExceedSkip, testChunkSkipProcessExceedSkip,
> testChunkSkipWriteExceedSkip - All of the statuses set by the listeners are the
> same so if any of the listeners are triggered, the test passes.  No validation
> that the job failed due to the skip limit being hit (just that a skip occurred
> and that the job failed).

OK, we can make a reader/processor/writer-specific exit status... but I'm not seeing an obvious way to test that "the job failed due to the skip limit being hit".   Is that really worth going out of our way for, given that normally the job will complete successfully, in contrast?

> 10.  testChunkSkipNoReadSkipChildEx - All of the statuses set by the listeners
> are the same so if any of them are triggered via a skip, the test passes.  No
> validation that the failure of the job was due to the skip limit being hit. 
> The only validation that occurs is that an item was skipped and that the job
> failed.

We will confirm that the one skip read listener is called exactly twice, with the included exception each time (not any other exception).

(Also looks like the @test_Strategy comment got botched).

> 11.  testChunkRetryRead - Does not use/validate retryable-exceptions.  Does not
> validate that any retry occurred, just that the job was completed successfully.

Let me look at this one some more.

> 12.  testChunkItemListeners - All three runs validate against the read
> listener's constant (instead of the appropriate listener's constant per each
> run).  All of the constants also are set to the same value so the validation
> would be inconclusive.  Also the error methods on the listeners are not tested.

Ok, we'll separate out the exit status so each of the three runs is reader/processor/writer-specific.

Also.. you're right, the error methods are not tested.  Will add a test there.
Comment 5 mminella 2013-02-23 02:19:09 UTC
Scott,

#1  I think I may see where I'm going wrong in my understanding of this, but it leaves, IMHO, a more serious spec-related issue.  It seems to me that these tests assume that on restart the parameters provided in the last execution will be re-applied in the new execution (what leads me to that is that there is no null check on the readrecordfailNumberString in the DoSomethingArrayItemReaderImpl prior to the Integer.parseInt).  That can't work: it prevents you from removing a previously passed-in value.  I'll take a look to see if an issue is already open on that and, if not, open one.

#3  What I mean by "no validation that it was a true restart" is that there is no check that the restart provided a new JobExecution.  All it checks is that the JobExecution provided has the same JobInstance reference (which it would if we restarted the old JobExecution instead of creating a new one like the spec requires).
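Concretely, the missing check might look like this sketch (plain-Java stand-ins, not the javax.batch API; names are illustrative):

```java
// Illustrative sketch of the extra restart assertion: a restart must create
// a NEW JobExecution (new execution id) under the SAME JobInstance.
public class RestartAssertionSketch {

    // Stand-in for the (executionId, instanceId) pair a JobExecution exposes.
    static class Execution {
        final long executionId;
        final long instanceId;
        Execution(long executionId, long instanceId) {
            this.executionId = executionId;
            this.instanceId = instanceId;
        }
    }

    // The check a TCK test method could make after restarting the job.
    static void assertTrueRestart(Execution original, Execution restarted) {
        if (restarted.instanceId != original.instanceId) {
            throw new AssertionError("restart must reuse the same JobInstance");
        }
        if (restarted.executionId == original.executionId) {
            throw new AssertionError("restart must create a new JobExecution");
        }
    }

    public static void main(String[] args) {
        Execution first = new Execution(100L, 1L);
        Execution restart = new Execution(101L, 1L); // new execution, same instance
        assertTrueRestart(first, restart);
        System.out.println("true restart verified");
    }
}
```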

#9  I do feel that it is important to validate that the skip limit was the reason for the job failure.  When I think of all of the ways people could code a runtime like this up, we need to be sure we are being as explicit as possible in validating the executed functionality.
Comment 6 ScottKurz 2013-02-23 06:02:43 UTC
(In reply to comment #5)
> Scott,
> #1  ...
Michael
...I'm going to read through that carefully a bit later

> #3  What I mean by "no validation that it was a true restart" is that there is
> no check that the restart provided a new JobExecution.  All it checks is that
> the JobExecution provided has the same JobInstance reference (which it would if
> we restarted the old JobExecution instead of creating a new one like the spec
> requires).

I see... well, we hadn't thought this was that interesting to test.  Would you agree that if we added this assertion to a few of these chunk restart tests, we wouldn't need it on every test in the whole TCK that performs a restart?   That doesn't seem like that tricky a thing to get right (restart API = new object instance).

> #9  I do feel that it is important to validate that the skip limit was the
> reason for the job failure.  When I think of all of the ways people could code
> a runtime like this up, we need to be sure we are being as explicit as possible
> in validating the executed functionality.

So how would you propose validating this was the reason for failure?
Comment 7 ScottKurz 2013-02-23 15:37:32 UTC
I'm going to add one more shortcoming I found:

13. Spec says "The item-count attribute is ignored for "custom" checkpoint policy."    TCK should verify this... (I just tried this and the RI has a bug in this regard... it wrongly honors item-count).
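One way such a TCK test could work, sketched with plain-Java stand-ins (not actual TCK code): run with checkpoint-policy="custom", an algorithm that checkpoints every 3 items, and a deliberately conflicting item-count, then assert the observed chunk sizes came from the algorithm alone.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of chunk processing under a custom checkpoint policy.
// With checkpoint-policy="custom", only the CheckpointAlgorithm decides the
// chunk boundaries; a conflicting item-count attribute must be ignored.
public class CustomPolicySketch {

    // Stand-in for a CheckpointAlgorithm#isReadyToCheckpoint() that fires
    // after every 3 items.
    static boolean readyToCheckpoint(int itemsInChunk) {
        return itemsInChunk == 3;
    }

    // Simulate reading totalItems items and record each chunk's size.
    static List<Integer> runStep(int totalItems) {
        List<Integer> chunkSizes = new ArrayList<>();
        int inChunk = 0;
        for (int i = 0; i < totalItems; i++) {
            inChunk++;
            if (readyToCheckpoint(inChunk)) {
                chunkSizes.add(inChunk);
                inChunk = 0;
            }
        }
        if (inChunk > 0) chunkSizes.add(inChunk); // final partial chunk
        return chunkSizes;
    }

    public static void main(String[] args) {
        // Even if the JSL also said item-count="5", every full chunk must be 3
        // for a conforming runtime to pass.
        System.out.println(runStep(10)); // prints [3, 3, 3, 1]
    }
}
```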
Comment 8 ScottKurz 2013-02-23 16:22:14 UTC
(In reply to comment #5)
> Scott,
> #1  I think I may see where I'm going wrong on my understanding of this, but it
> leaves, IMHO a more serious spec related issue.  It seems to me that these
> tests are assuming on restart that the parameters provided in the last
> execution will be re-applied in the new execution (What leads me to that is
> that there is no null check on of the readrecordfailNumberString in the
> DoSomethingArrayItemReaderImpl prior to the Integer.parseInt).  That can't
> work.  It prevents you from removing a previously passed in value.  I'll take a
> look to see if an issue is already opened on that and if not, open one.

Michael, thanks for noticing this.  Here at one time the RI/TCK + spec were in sync but have diverged.

The RI/TCK is indeed assuming that on restart you simply pass the "override" parameters, which are added/merged to the properties submitted on the original start.

The spec does not currently say that this is the correct behavior... so currently the RI/TCK is "wrong".

I opened new Bug 4715 to track this question of how to treat this in the spec.
Comment 9 mminella 2013-02-25 18:30:34 UTC
With regards to both #3 and #9, if it's interesting enough to be in the spec, it needs to be tested.  If we have a behavior that we cannot validate via a test, that tells me that the user will be in the same trouble when trying to debug that behavior.

#3.  That being said, as long as the TCK validates the behavior thoroughly, I'm ok with it.  

#9.  Within SB, we store the reason for a failure in the job repository so validation would be easier there.  For the TCK, what is the way the batch runtime is expected to communicate why it failed?  While failing a job is important, so is telling the user why.  This sounds like a miss in the spec...
Comment 10 cvignola 2013-08-29 14:46:55 UTC
This looks to have been sufficiently addressed, and at least it seems clear there are no spec issues that were raised but not addressed.