
Thread: resume the job after power failure

  1. #1
    Join Date
    Jul 2008
    Posts
    29

    Default resume the job after power failure

    Hi, is there any way to resume a job after a power failure?

    I found that if I manually hang the server, restart it, and then try to start the job again with the same job and parameters, it says the job is already running and fails with a "job is running" exception. Does anyone have any idea how to solve this kind of problem? Thanks.

  2. #2

    Default

    In your case the framework didn't get a chance to update the metadata with 'FAILED' status. You can do that manually and job will restart happily, but it's up to you to decide whether data is in consistent state - framework can give no correctness guarantees in case of power failure.

  3. #3
    Join Date
    Jul 2008
    Posts
    29

    Default

    I don't think it is a good approach for us to change the status to failed manually, unless we can detect that there was a power failure when we start the JBoss server. I use a scheduler to start the job, and it always checks whether the job is running or not.

  4. #4
    Join Date
    Dec 2006
    Posts
    1,061

    Default

    Unfortunately, it's the best option we have right now for 1.1. It's something we will be addressing in 2.0, however.

  5. #5
    Join Date
    Jul 2008
    Posts
    29

    Default

    OK. Hoping 2.0 fixes this issue! Thanks for the reply.

  6. #6
    Join Date
    Aug 2008
    Posts
    16

    Default The solution for restarting batches after hard stops

    Quote Originally Posted by lucasward View Post
    Unfortunately, it's the best option we have right now for 1.1. It's something we will be addressing in 2.0, however.
    We are planning to use SpringBatch as a complement to our huge number of COBOL batches, and we are aiming to run Java batches in the very same way as we run our COBOL batches. This includes using the same scheduler and the same skilled operators that we have today. There is no chance of having the operators manually update the repository tables for a number of batches in the rare case of a hard stop.

    For us, it's a requirement to have this solved by the framework before we can use it on a larger scale. Therefore, we are very interested in learning what the solution will look like in 2.0. Is it possible for you to share information with us on this subject at this point in time?

    BTW, what is the plan for releasing 2.0?

    Thanks in advance, Len...

  7. #7
    Join Date
    Dec 2008
    Location
    Toulouse, France
    Posts
    22

    Default

    Just an idea!

    Maybe you can verify the persisted data at server startup: any jobs still marked as running in the database would be changed to failed status at startup.
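
The startup check suggested above could be sketched roughly as follows. This is a minimal plain-Java model, not Spring Batch code: `ExecutionRecord`, `StartupRecovery` and `failOrphanedExecutions` are hypothetical names standing in for whatever reads and writes the job metadata. It also assumes a single server owns the repository; if several servers share it, a blanket sweep would wrongly fail jobs that are still alive elsewhere.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one row of job metadata; not a Spring Batch class.
final class ExecutionRecord {
    enum Status { RUNNING, COMPLETED, FAILED }

    final long id;
    Status status;
    Instant endTime; // stays null until the execution is closed

    ExecutionRecord(long id, Status status) {
        this.id = id;
        this.status = status;
    }
}

final class StartupRecovery {
    // Intended to be called once at server startup, before the scheduler
    // launches anything: at that moment nothing can really be running in
    // this JVM, so every RUNNING record is treated as left over from a crash.
    static List<Long> failOrphanedExecutions(List<ExecutionRecord> records, Instant now) {
        List<Long> changed = new ArrayList<>();
        for (ExecutionRecord record : records) {
            if (record.status == ExecutionRecord.Status.RUNNING) {
                record.status = ExecutionRecord.Status.FAILED;
                record.endTime = now;
                changed.add(record.id);
            }
        }
        return changed; // ids that were flipped, e.g. for an operator log
    }
}
```

Returning the changed ids leaves the decision about whether the data is consistent enough to restart with whoever reads that log, which matches the caveat in post #2.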

  8. #8
    Join Date
    Jun 2005
    Posts
    4,241

    Default

    Quote Originally Posted by lenhen View Post
    Is it possible for you to share information with us on this subject at this point-in-time?
    2.0 has a JobExplorer interface that allows you to pull out the JobExecution and stop it, then save back to database with the JobRepository. JobOperator is also available as a wrapper for those operations using primitives. It should be easy for you to provide your operators with a UI for carrying out that operation.

    BTW, what is the plan for releasing 2.0?
    The schedule is in JIRA: http://jira.springframework.org/browse/BATCH. We don't anticipate any changes right now, but you never can be sure.

  9. #9
    Join Date
    Aug 2008
    Posts
    16

    Default Where to find JobExplorer documented?

    Quote Originally Posted by Dave Syer View Post
    2.0 has a JobExplorer interface that allows you to pull out the JobExecution and stop it, then save back to database with the JobRepository. JobOperator is also available as a wrapper for those operations using primitives. It should be easy for you to provide your operators with a UI for carrying out that operation.
    Thanks for your reply. I downloaded the User's Guide and the only change I could find was that "1.0" was changed to "2.0". And I can't find any information about the JobExplorer interface in the User's Guide. Where can I read about JobExplorer?

    Furthermore, I don't think the solution is to provide a UI where operators change values in the JobExecution object. In our current batch environment (COBOL) we have a flag in our own repository that has two values: either the job instance has completed ('OK') or it has not completed ('NC'). The 'NC' status means that the job is either still running or has been abnormally terminated for some reason, for example a power outage. The responsibility then lies with the scheduler to determine whether the job should be rescheduled or not.

    In SpringBatch the status can be "running" which must be interpreted as if the job IS running or that it has been abnormally terminated. Someone then has to determine which of the interpretations is correct and if the job has been abnormally terminated, change the status manually and then reschedule the job.

    In a large batch environment (we have thousands of batch jobs) it would not be feasible with all these manual interventions to get the batch jobs running again after for example a power outage.

    So, we hope for a change in SpringBatch to make it optional to be able to restart jobs without the need to clear the status manually.

    /Len...

  10. #10
    Join Date
    Jun 2005
    Posts
    4,241

    Default

    The User Guide is not up to date, but the Javadocs are, and the interfaces are quite self-explanatory for the JobExplorer and JobOperator (I hope).

    I'm not sure how you expect to achieve a recovery from a power failure without changing the status of the existing JobExecutions. Surely you would need to provide your operators with some tools to signal to the Batch system that there had been an abnormal and catastrophic event (either manually or automatically)? The system on its own can't figure out that the existing executions are not still running - someone has to send a signal to something to say that all those RUNNING status values are not actually valid. I don't think the number of jobs is relevant - after a power failure you would either know that all the existing jobs were running or not. Perhaps you could implement a timeout (of your choosing) - if you haven't heard from a job for x hours then consider it dead. The low level APIs to implement those kinds of features are basically in place in 2.0 (suggestions for tweaks welcome).
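
The timeout idea above comes down to one comparison. Here is a minimal sketch, assuming the metadata carries some "last heard from" timestamp for each execution; `StaleExecutionCheck` and `isPresumedDead` are hypothetical names, and the timeout value is a policy choice for the site, not something the framework prescribes.

```java
import java.time.Duration;
import java.time.Instant;

final class StaleExecutionCheck {
    // True when the execution has been silent for strictly longer than the
    // chosen timeout, i.e. "we haven't heard from it for x hours".
    static boolean isPresumedDead(Instant lastUpdated, Instant now, Duration timeout) {
        Duration silence = Duration.between(lastUpdated, now);
        return silence.compareTo(timeout) > 0;
    }
}
```

A scheduler could apply this check only to executions still marked as running, and mark the ones it presumes dead as failed before trying to restart them. A long step that legitimately goes quiet for longer than the timeout would be misjudged, so the value needs to be chosen with the slowest job in mind.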

    If you have some concrete suggestions for improvements, features, or use cases that we could implement, please tell us what is needed soon: time is running out for 2.0.
