Apr 7th, 2010, 10:22 AM
Partitioned Execution Restart Requirements
I am attempting to reason through how job restarting might work in the case of partitioned batch execution. It seems to me that an implicit requirement of the restart logic is that any partitioned step executions from a previous (failed) run need to be stopped before a partitioned job can be restarted.
I am particularly interested in the case where partitioned execution is done over a processing grid that is a cluster. There are two distinct cases:
- A cluster node to which a partitioned step execution has been dispatched fails. In this case, I expect the cluster's failover logic to reassign that work to another node. Restarting the step execution makes sense, and the step execution picks up from the last committed chunk.
- The node where the partitioned step execution was initiated fails (i.e., where the PartitionHandler executed). This node is typically in a state where it is polling synchronously, waiting for the partitioned step executions to complete. If the job is restarted (either explicitly or via failover), it can't simply restart each of the partitions, since some of them could be executing or queued for execution on other nodes in the grid. The restart support in Spring Batch doesn't handle this automatically, nor could it be expected to, since it has no knowledge of the state of those step executions on the grid.

In other words, for the restart logic in Spring Batch to work, any processing from previous job executions needs to be terminated (by the PartitionHandler) before processing can be restarted. Does my reasoning make sense?
Apr 8th, 2010, 09:21 AM
For #1, the retry logic is supposed to be provided by the grid system you are using.
For #2, Spring Batch knows the state of your step, since the status of the related step execution is updated in the database. You can also check that it is still processing by checking whether the read/process/write counters are being updated (we used that, in particular, to implement a progress monitoring service). Now, I don't know what happens if you try to restart the job (e.g., whether Spring Batch will check the statuses of the underlying step executions and prevent a restart).
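A minimal sketch of the counter-based liveness check described above. `ProgressMonitor`, `Snapshot`, and `isProgressing` are hypothetical names, not Spring Batch API; in a real monitoring service the two snapshots would come from polling `StepExecution.getReadCount()`/`getWriteCount()` (e.g. via a JobExplorer) at two instants:

```java
// Sketch of the progress check described above: a step is considered to be
// "making progress" if its read or write counter advanced since the last poll.
public class ProgressMonitor {

    // Counter values captured from one step execution at one polling instant
    // (hypothetical stand-in for reading them off a StepExecution).
    public record Snapshot(long readCount, long writeCount) {}

    // A step execution looks live if any counter moved between two polls.
    public static boolean isProgressing(Snapshot previous, Snapshot current) {
        return current.readCount() > previous.readCount()
            || current.writeCount() > previous.writeCount();
    }
}
```

Note that an unchanged counter between two polls does not prove the step is dead (it may be mid-chunk), so in practice this check is combined with the persisted step execution status.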
Apr 8th, 2010, 11:19 AM
I am thinking specifically of what happens in SimpleStepExecutionSplitter.shouldStart, which does check whether the status of the lastStepExecution is UNKNOWN or COMPLETED, but it doesn't check, for example, whether the lastStepExecution might be STARTED - and if a step execution from some previous execution is still executing, it isn't part of this partitioned step execution.
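To make the gap concrete, here is a plain-Java sketch of the decision being discussed. The `BatchStatus` values mirror Spring Batch's enum of the same name, but `Decision` and `decide` are hypothetical; this is not the actual SimpleStepExecutionSplitter source, only an illustration of where a STARTED execution from a previous run falls outside the COMPLETED/UNKNOWN checks:

```java
// Hypothetical model of the restart decision for one partition's last
// step execution, as discussed in the thread.
public class RestartDecision {
    public enum BatchStatus { COMPLETED, STARTED, STOPPED, FAILED, UNKNOWN }
    public enum Decision { SKIP, RESTART, FAIL }

    public static Decision decide(BatchStatus last) {
        switch (last) {
            case COMPLETED: return Decision.SKIP;    // already done, don't re-run
            case UNKNOWN:   return Decision.FAIL;    // state is indeterminate
            case STARTED:   return Decision.FAIL;    // may still be running on the grid
            default:        return Decision.RESTART; // FAILED/STOPPED: safe to restart
        }
    }
}
```

The STARTED branch is the case the splitter does not currently distinguish: restarting it blindly risks two copies of the same partition running at once.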
It would be nice to be able to adopt the currently running step execution into the current step execution, but I don't see how that is possible. The best I can see doing within the current framework is to stop any previous step executions and then restart them within the context of the current step execution.
Apr 9th, 2010, 06:46 AM
I agree, this is interesting. How #1 plays out depends very much on the grid fabric (but I agree with Stephane that it is the fabric's responsibility). As far as #2 goes, the splitter could certainly be smarter than it is about existing step executions that are neither UNKNOWN nor COMPLETED (please raise a JIRA if you are interested in seeing changes). The easiest and least disruptive thing to do for 2.1.x would be simply to fail the current partition step execution with a message saying that there are active executions in the grid from a previous JobExecution. The user can then decide whether to wait for them to finish or try to stop them. I think it is a business decision whether to wait or attempt to stop the step executions that are in flight, but maybe it could be automated with some strategy in a splitter, so I would be keen to hear from anyone who tries it.
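A sketch of the fail-fast guard suggested above. `ActiveExecutionGuard`, `PriorExecution`, and `checkRestartable` are hypothetical names; in real code this check would sit in (or wrap) the StepExecutionSplitter, and the prior executions would be loaded from the job repository:

```java
import java.util.List;

// Hypothetical guard: before splitting, fail fast if any step execution
// from a previous JobExecution might still be active on the grid.
public class ActiveExecutionGuard {
    public enum BatchStatus { COMPLETED, STARTING, STARTED, STOPPING, STOPPED, FAILED }

    // One step execution left over from a previous JobExecution.
    public record PriorExecution(String stepName, BatchStatus status) {}

    // Treat anything not yet in a terminal state as possibly still running.
    static boolean isActive(BatchStatus s) {
        return s == BatchStatus.STARTING
            || s == BatchStatus.STARTED
            || s == BatchStatus.STOPPING;
    }

    // Throws instead of restarting, so the operator can decide whether to
    // wait for the in-flight partitions to finish or stop them first.
    public static void checkRestartable(List<PriorExecution> previous) {
        for (PriorExecution e : previous) {
            if (isActive(e.status())) {
                throw new IllegalStateException(
                    "Cannot restart: step '" + e.stepName()
                        + "' from a previous JobExecution is still active ("
                        + e.status() + ")");
            }
        }
    }
}
```

Failing with an explicit message keeps the wait-or-stop choice with the operator, which matches the "business decision" framing above; an automated strategy could later replace the throw with a wait loop or a stop request.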