Jul 7th, 2008, 01:18 PM
Ignore or Skip Processed Lines Correctly
New to Spring Batch and I need some direction on how to set up a job that can ignore lines in a flat file that have already been processed. Each day I receive a new file; it is redelivered several times during the day, and each delivery may contain new data. The data in the file corresponds to that day's transactions, and I need to load it (to the DB) at least three times a day. So the first run loads everything, and later runs should load only the new data.
I decided to load the data into a staging table before processing it, which lets me mark each transaction as processed after it has successfully loaded. The easy part was loading the first run into the database; I am less sure about the second time the job runs. I thought I could just catch and log the duplicate-record DataIntegrityViolationException in the ItemWriter.write() method. This does load the new data, but I want to make sure I use the framework correctly so the step can be restarted if needed. Is there a better way to handle this situation?
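To make the idea concrete, here is a minimal self-contained sketch of that catch-and-log approach. It is not real Spring Batch code: the class and method names are illustrative, an in-memory set stands in for the staging table's unique key, and an IllegalStateException stands in for the DataIntegrityViolationException a real insert would raise.

```java
import java.util.*;

// Stand-in for the staging table: a unique key on the transaction id,
// so inserting a duplicate throws (like DataIntegrityViolationException).
class StagingTable {
    private final Set<String> keys = new HashSet<>();

    void insert(String txnId) {
        if (!keys.add(txnId)) {
            throw new IllegalStateException("duplicate key: " + txnId);
        }
    }

    int rowCount() {
        return keys.size();
    }
}

// Analogous to an ItemWriter: try each insert, and log-and-skip
// any record that is already in the staging table.
public class DuplicateSkippingWriter {
    private final StagingTable staging = new StagingTable();
    private int duplicatesLogged = 0;

    public void write(List<String> txnIds) {
        for (String txnId : txnIds) {
            try {
                staging.insert(txnId);
            } catch (IllegalStateException duplicate) {
                duplicatesLogged++; // in a real writer: log it and move on
            }
        }
    }

    public int loadedCount() {
        return staging.rowCount();
    }

    public int duplicateCount() {
        return duplicatesLogged;
    }
}
```

Running the writer twice over an overlapping delivery loads only the new rows, which is the behavior the second daily run needs.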
Jul 8th, 2008, 12:18 AM
I think that sounds about right. If your input data contains a large fraction of bad data, it is better to acknowledge that and deal with it in the business logic (catching an exception if that's what it takes). A preprocessing step that creates a clean input file might also make sense, though it would obviously extend the running time of your job.
If you know the data will always be in the same order every time you run the job, I suppose it is also feasible to use the built-in restart capabilities to position the input file reader at the last committed record before starting. To do that you have to make sure the previous executions of the step fail, e.g. by manipulating the exit code in a step listener. It does mean that post-processing steps have to wait for the restartable step to finish before they can do anything (the step that needs to be restarted could become a bottleneck without some manipulation). We didn't design restart for this use case, so it might not work entirely as expected (I'd be happy to try and fix it if it doesn't).
Jul 8th, 2008, 08:44 AM
I agree with Dave: it sounds like you have a lot of skips, so I would treat records that already exist in the database as no-ops rather than letting the framework skip them. It's a little better in 1.1, since you can prevent rollback on write errors, but it's still probably better not to skip in this scenario.
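The no-op variant, sketched the same way (in-memory set standing in for an existence check against the database, names illustrative): instead of catching the constraint violation, the writer filters out already-loaded records up front, so no exception is raised and the framework's skip machinery never gets involved.

```java
import java.util.*;

// Treat already-loaded records as no-ops: check for existence first,
// rather than inserting and catching the duplicate-key exception.
public class NoOpAwareWriter {
    private final Set<String> alreadyLoaded = new HashSet<>();

    // Returns the records that were actually inserted this call;
    // records seen before are silently treated as no-ops.
    public List<String> write(List<String> txnIds) {
        List<String> inserted = new ArrayList<>();
        for (String txnId : txnIds) {
            if (alreadyLoaded.add(txnId)) { // false means it was already there
                inserted.add(txnId);
            }
        }
        return inserted;
    }
}
```

Against a real database the existence check would be a query (or an insert-if-absent), but the shape is the same: duplicates become ordinary business logic instead of write errors that trigger rollback.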
Jul 10th, 2008, 09:25 PM
Thank you, this is exactly what I was getting hung up on: skip logic versus setting the batch job up to fail until the final run so it can pick up where it left off. I will keep it set up as is and see how it goes.