Aug 27th, 2008, 01:29 PM
I need some guidance on how I could use Spring Batch to solve a problem we are having.
Here's the scenario:
A user uploads a large data file from a webpage.
This data file is then parsed, and for each record I insert a row in the database and then send a JMS message to a queue (where some further processing takes place). This all happens within the context of a JMS receive event (the original document is parsed, serialized into XML, and sent to a queue for processing). I'm hitting a limit on the file size I can process because the time taken by the JMS send and the db update for each record causes my transaction to time out.
My fuzziness is around how to dynamically create a job that can survive container restarts. Almost all of the examples show jobs reading a static file. I need to read dynamic file content, and I need to be able to resume the job execution if the container is shut down and restarted.
I'm already using clustered quartz for restartable job execution in other areas. I'm just not sure how I'd implement batch job restarts.
Aug 28th, 2008, 12:53 AM
File Size Limit
I have encountered such an issue before, where the file needs a long processing time. What I did was store the file in temporary storage and tell the user that the job had been queued.
In this case, the file upload transaction will not time out. An email or similar notification is sent to the end user at the end of the process.
Thus, there is a higher-level job queue at the uploaded-file level, and your application can then process the records one by one.
Aug 28th, 2008, 01:46 AM
If you process a large file in a single transaction, I don't see how storing the file in the filesystem helps with the timeout. If it's too large, it's still too large.
But you do need to get that file off the messaging middleware so you can deal with it at leisure. I see two options: either use the filesystem or a relational database staging table (a big BLOB/CLOB). Either would work and be recoverable across restarts in most failure scenarios. The database is more robust only if you can use XA/JTA, since then there is no way for the file to be saved while the message rolls back and is re-presented. The file will be easier to use with Spring Batch because Spring has a Resource implementation out of the box for files (you would have to supply your own for the database object).
The trick you need for Spring Batch is just to launch your job with a JobParameter that locates the file dynamically (which is different from the samples for obvious reasons - they have to be repeatable). The JobParameter value would be either a file URL or a database primary key, depending on where you put the data.
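For example, something along these lines (just a sketch; the "input.file" key and the importJob/jobLauncher names are made up and would come from your own application context):
[CODE]
// Minimal sketch: launch the job with the dynamically discovered file location as a
// JobParameter. Launching again later with the same parameters addresses the same
// JobInstance, which is what makes restart work.
import java.io.File;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class StagedFileJobLauncher {

    private JobLauncher jobLauncher;
    private Job importJob;

    public void launchFor(File stagedFile) throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", stagedFile.getAbsolutePath())
                .toJobParameters();
        jobLauncher.run(importJob, params);
    }

    public void setJobLauncher(JobLauncher jobLauncher) { this.jobLauncher = jobLauncher; }
    public void setImportJob(Job importJob) { this.importJob = importJob; }
}
[/CODE]
The job then only has to turn that parameter value into a Resource (e.g. a FileSystemResource) for its reader, or into a primary key lookup for the database case.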
Aug 28th, 2008, 08:58 AM
Ok, so here's my intended approach:
1) Receive the document via JMS and serialize it back out to disk
2) Kick off a Spring Batch Job pointed at the dynamic file name
3) Kick off a Quartz job with the filename as a parameter in the JobDataMap
4) The Quartz job will periodically try to re-launch the Spring Batch Job. If attempting to re-run the job throws a JobExecutionAlreadyRunningException, go back to sleep. If it throws a JobInstanceAlreadyCompleteException, unschedule the Quartz job; we are done. Otherwise, execute the job (because it would appear it isn't running). A rough sketch of what I mean follows this list.
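Roughly, I'm picturing something like this for that Quartz job (just a sketch; the "fileName" key, the job names and the bean wiring are placeholders):
[CODE]
// Rough sketch of (4): a Quartz job that keeps trying to (re)launch the batch job
// until Spring Batch reports that it is already complete. All names here are
// placeholders, and the Quartz 1.x trigger API is assumed (2.x uses TriggerKey).
import org.quartz.JobDataMap;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.SchedulerException;

import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.repository.JobExecutionAlreadyRunningException;
import org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException;

public class ResumeBatchJob implements org.quartz.Job {

    // Assumed to be injected, e.g. via a Spring-aware JobFactory or the scheduler context.
    private JobLauncher jobLauncher;
    private org.springframework.batch.core.Job importJob;

    public void execute(JobExecutionContext context) throws JobExecutionException {
        JobDataMap dataMap = context.getMergedJobDataMap();
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", dataMap.getString("fileName"))
                .toJobParameters();
        try {
            // Not running and not complete: launch (or restart) it.
            jobLauncher.run(importJob, params);
        }
        catch (JobExecutionAlreadyRunningException e) {
            // Still running somewhere in the cluster: go back to sleep until the next trigger.
        }
        catch (JobInstanceAlreadyCompleteException e) {
            // Done: unschedule this Quartz job.
            try {
                context.getScheduler().unscheduleJob(
                        context.getTrigger().getName(), context.getTrigger().getGroup());
            }
            catch (SchedulerException se) {
                throw new JobExecutionException(se);
            }
        }
        catch (Exception e) {
            // Launch failed for some other reason: leave the trigger in place and retry later.
            throw new JobExecutionException(e);
        }
    }

    public void setJobLauncher(JobLauncher jobLauncher) { this.jobLauncher = jobLauncher; }
    public void setImportJob(org.springframework.batch.core.Job importJob) { this.importJob = importJob; }
}
[/CODE]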
The only thing I'm really fuzzy on is (4). I'm not sure what happens in Spring Batch if the container is shut down in the middle of a batch. When it comes back up and I try to re-run the job, will it be in the correct state so my algorithm will work, or will the state saved in the database be inconsistent (e.g. the job or a step is not complete but appears to be running)?

This is especially important since my application runs across a cluster of half a dozen machines. Any one of them can receive the original JMS message, and any other can process the Quartz job (since Quartz is clustered). I'm using the db-backed Spring Batch job repository. We have a shared filesystem (SAN) mounted on each of the servers, so they will all be able to see a file that any of them writes out.
In my case, the processing of this file needs to be completely automated and needs to be failsafe (i.e. needs to restart itself in case of failure). Restarting failed jobs manually is not an option.
For (2), I'm simply using a composite writer to write to the db and to the JMS queue in the same step (although in my case one of the writers updates the object being written, so my composite writer invokes its delegates in a fixed order to make this possible).
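Conceptually it looks something like this (sketch only; the table, queue and field names are made up, and the signature would be adapted to the ItemWriter interface of whichever Spring Batch version is in use):
[CODE]
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jms.core.JmsTemplate;

// Sketch of the ordered writer for (2): insert the record first (which can mutate it,
// e.g. assign a generated key), and only then put it on the queue for downstream work.
public class DbThenJmsWriter {

    private JdbcTemplate jdbcTemplate;
    private JmsTemplate jmsTemplate;

    // Adapt this signature to the ItemWriter interface of your Spring Batch version
    // (write(Object) in 1.x, write(List<? extends T>) in 2.0).
    public void write(UploadedRecord record) {
        // 1) database insert; the downstream consumer relies on this row being there
        jdbcTemplate.update("insert into uploaded_record (name, payload) values (?, ?)",
                record.name, record.payload);

        // 2) JMS send happens after the db write, in the same chunk transaction
        jmsTemplate.convertAndSend("record.processing.queue", record.name);
    }

    public void setJdbcTemplate(JdbcTemplate jdbcTemplate) { this.jdbcTemplate = jdbcTemplate; }
    public void setJmsTemplate(JmsTemplate jmsTemplate) { this.jmsTemplate = jmsTemplate; }

    // Minimal record type just for the sketch.
    public static class UploadedRecord {
        public String name;
        public String payload;
    }
}
[/CODE]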
Last edited by chudak; Aug 28th, 2008 at 09:01 AM.
Aug 28th, 2008, 09:14 AM
4) sounds about right. Your concerns are valid, but it should be fine for any normal failure scenario. The only way it can break is a true lights-out or "kill -9" of the database or the job launcher, because in that case the semantics for Spring Batch 1.x are not well defined. The intermediate state should be OK up to the last commit point, so manually changing the StepExecution in the database to status=FAILED and restarting produces a happy outcome. Spring Batch 2.0 has some features to deal with this in the API.
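Something like this would do that status change programmatically (illustration only; the table and column names come from the Spring Batch metadata schema, and depending on the version you may also want to fix up EXIT_CODE and END_TIME):
[CODE]
import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;

// Illustration only: mark executions left in STARTED by a hard crash as FAILED so
// the next launch restarts them instead of reporting "already running". In practice
// you would restrict this to the specific execution you know is dead.
public class StuckExecutionCleaner {

    private final JdbcTemplate jdbcTemplate;

    public StuckExecutionCleaner(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    public void failExecution(long jobExecutionId) {
        jdbcTemplate.update(
                "update BATCH_STEP_EXECUTION set STATUS = 'FAILED' "
                        + "where JOB_EXECUTION_ID = ? and STATUS = 'STARTED'",
                jobExecutionId);
        jdbcTemplate.update(
                "update BATCH_JOB_EXECUTION set STATUS = 'FAILED' "
                        + "where JOB_EXECUTION_ID = ? and STATUS = 'STARTED'",
                jobExecutionId);
    }
}
[/CODE]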
Aug 28th, 2008, 09:32 AM
Thanks Dave. BTW, when is 2.0 scheduled for release? End of October? Or are you gonna do another milestone?