Jul 8th, 2008, 11:26 AM
Inefficient ItemSkipPolicy impl.
Please advise me on the following scenario:
Suppose I have 5 items (1, 2, 3, 4, 5) to process in a batch job. I'd like to skip any faulty records and proceed to the next one. I know I can do this using LimitCheckingItemSkipPolicy and setting the skipLimit. As I understand it, the logic goes like this:
(Transaction_1) - skipLimit= 10;
1 ..| process OK
2 ..| OK
3 ..| Error!
When an error record is encountered, the skip policy does the following:
. Put the error record into a HashMap.
. Increment the error counter (and check it is still within the limit).
. Roll back the transaction.
. Repeat the whole process from the start, i.e. read records 1, 2, ... 5 again; since record 3 has been recorded as faulty, it is skipped this time.
. Commit the transaction if everything is OK.
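The steps above can be sketched as a small simulation. This is not Spring Batch code; the class name, the processChunk method, and the hard-coded FAULTY record are all hypothetical, just to show the record-the-failure, roll-back, replay-and-skip loop described above:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the rollback-and-skip behaviour described above;
// processChunk and FAULTY are illustrative names, not Spring Batch API.
public class SkipRollbackDemo {

    // Pretend record 3 always fails on write.
    static final int FAULTY = 3;

    // Returns the records that are finally committed, mimicking:
    // fail -> remember the bad item -> roll back -> replay the chunk, skipping it.
    static List<Integer> processChunk(List<Integer> chunk) {
        Set<Integer> skipped = new LinkedHashSet<>();
        while (true) {
            List<Integer> written = new ArrayList<>();   // uncommitted work
            boolean rolledBack = false;
            for (Integer record : chunk) {
                if (skipped.contains(record)) {
                    continue;                            // previously marked faulty
                }
                if (record == FAULTY) {
                    skipped.add(record);                 // remember it...
                    rolledBack = true;                   // ...and roll back
                    break;
                }
                written.add(record);
            }
            if (!rolledBack) {
                return written;                          // commit
            }
            // loop: replay the whole chunk from the start
        }
    }

    public static void main(String[] args) {
        System.out.println(processChunk(List.of(1, 2, 3, 4, 5)));
        // prints [1, 2, 4, 5] -- record 3 is skipped after one rollback/replay
    }
}
```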
Now, if the above is correct, what happens if, say, I have 10K records and the error happens to be in the 9,998th record? Would it have to start all over again from the first record? Isn't that inefficient?
Our traditional way of doing batch jobs (without Spring Batch) is to commit after every record. Can I achieve the same thing with Spring Batch? That is, in the event of an error it would still proceed to the next record and iterate until finished, and the good records would still get committed.
Jul 10th, 2008, 10:12 AM
I'm new at Spring Batch myself, but I think you're talking specifically about the LimitCheckingItemSkipPolicy implementation of the ItemSkipPolicy interface. Sounds like you either want to use the AlwaysSkipItemSkipPolicy or perhaps your own implementation of ItemSkipPolicy instead.
I think what happens with LimitCheckingItemSkipPolicy is that it bombs out with a SkipLimitExceededException that kills the job, so you'd have to start it again. That's by design, for cases where there's a threshold of acceptable errors beyond which the job should be considered failed.
In your case, it sounds like you don't care how many fail; you just want to keep processing more records as long as there are records to process. I believe that's what the AlwaysSkipItemSkipPolicy is for, but I could be wrong.
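To make the difference between the two policies concrete, here is a standalone sketch. The ItemSkipPolicy interface is redeclared locally so the example compiles on its own; in Spring Batch it lives under org.springframework.batch.core.step.skip, and the real LimitCheckingItemSkipPolicy also takes a list of skippable exception classes, which is omitted here:

```java
// Standalone sketch of the two skip policies discussed above.
// The interface is a simplified local copy, not the real Spring Batch one.
public class SkipPolicyDemo {

    interface ItemSkipPolicy {
        boolean shouldSkip(Throwable t, int skipCount);
    }

    // Skip every failure, no matter how many have occurred.
    static class AlwaysSkipItemSkipPolicy implements ItemSkipPolicy {
        public boolean shouldSkip(Throwable t, int skipCount) {
            return true;
        }
    }

    // Skip failures only while the running count is below the limit.
    static class LimitCheckingItemSkipPolicy implements ItemSkipPolicy {
        private final int skipLimit;

        LimitCheckingItemSkipPolicy(int skipLimit) {
            this.skipLimit = skipLimit;
        }

        public boolean shouldSkip(Throwable t, int skipCount) {
            return skipCount < skipLimit;
        }
    }

    public static void main(String[] args) {
        ItemSkipPolicy always = new AlwaysSkipItemSkipPolicy();
        ItemSkipPolicy limited = new LimitCheckingItemSkipPolicy(2);
        System.out.println(always.shouldSkip(new RuntimeException(), 9999));  // true
        System.out.println(limited.shouldSkip(new RuntimeException(), 1));    // true
        System.out.println(limited.shouldSkip(new RuntimeException(), 2));    // false
    }
}
```

Plugging your own ItemSkipPolicy implementation in place of the limit-checking one is how you'd get "skip everything, keep going" behaviour.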
Jul 10th, 2008, 01:40 PM
Spring Batch only keeps track of skipped records within the scope of the current chunk. In your scenario, if your commit interval is 10, any items skipped during those 10 writes are cached and skipped again after the rollback. Once the transaction commits, they no longer matter and are not held on to.
I think you're right that lots of write failures (the only type of failure that causes a rollback; read failures do not) would mean lots of rollbacks, and that you would likely benefit from a lower commit interval than if the data were cleaner (or had more read errors). However, lower commit intervals are much less efficient than higher ones, since starting and committing transactions is expensive.
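The trade-off can be put in rough numbers. A back-of-envelope sketch (hypothetical arithmetic, not measured Spring Batch behaviour): with chunk-scoped skips, a failure at record 9,998 only replays the current chunk, never the whole job, but the smaller the chunk, the more transactions you pay for:

```java
// Back-of-envelope illustration of the commit-interval trade-off discussed
// above. Hypothetical arithmetic only, not measured Spring Batch behaviour.
public class CommitIntervalDemo {

    // Transactions needed to commit n records at a given commit interval.
    static int commits(int n, int interval) {
        return (n + interval - 1) / interval;   // ceil(n / interval)
    }

    // Worst-case records replayed when one record in a chunk fails:
    // only the current chunk is rolled back and re-run.
    static int replayedOnOneFailure(int interval) {
        return interval;
    }

    public static void main(String[] args) {
        int records = 10_000;
        // interval 1: per-record commits, nothing replayed, but 10,000 transactions
        System.out.println(commits(records, 1) + " commits, replay <= "
                + replayedOnOneFailure(1));
        // interval 100: only 100 commits, at most 100 records replayed on a failure
        System.out.println(commits(records, 100) + " commits, replay <= "
                + replayedOnOneFailure(100));
    }
}
```

So a commit interval of 1 reproduces the "commit every record" style from the original question, at the cost of one transaction per record.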