Apr 16th, 2010, 03:40 AM
Filtering and cleaning out directory - how best to set this up?
We are in the process of setting up a batch job which should do the following:
- for a given directory, scan all the files in it and move all files which are not in a specific format (i.e. EDIFACT) to some "error1" directory
- for every remaining file, read it and store the transactions it contains in a database
- for every remaining file, based on the result of the previous step move it to some "done" directory, or to "error2" if processing had problems.
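For the first of those steps, here is a minimal sketch of the check-and-move logic in plain Java (not Spring Batch API); it assumes a simple heuristic for "is this EDIFACT" - an interchange starts with an optional "UNA" service-string advice, otherwise with the "UNB" interchange header - and all class/method names are made up:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class EdifactFilter {

    // Heuristic: an EDIFACT interchange begins with an optional "UNA"
    // service-string advice segment, otherwise with the "UNB" header.
    static boolean looksLikeEdifact(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(3);
            if (head.length < 3) {
                return false;
            }
            String prefix = new String(head, StandardCharsets.US_ASCII);
            return prefix.equals("UNA") || prefix.equals("UNB");
        }
    }

    /** Moves every non-EDIFACT file in inputDir to error1Dir; returns the count moved. */
    static int filterDirectory(Path inputDir, Path error1Dir) throws IOException {
        Files.createDirectories(error1Dir);
        int moved = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f) && !looksLikeEdifact(f)) {
                    Files.move(f, error1Dir.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                    moved++;
                }
            }
        }
        return moved;
    }
}
```

Wrapped in a Tasklet (or called from a partitioner, see below), this would cover the "error1" routing.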
Basically we have created readers / writers / tasklets etc. to perform each of the individual steps above, and have tested them successfully. What eludes us, however, is how to wire it all up into one single 3-step job.
There seem to be lots of options, however which one would be "the" way to do it?
Some of the alternatives I can think of:
- run step 1 as a tasklet, then run step 2 as a partitioned step with a slave for every file. However, step 2 can't move the files away after processing - I found you can't do this in a chunked step. So every partition of step 2 would have to save its status and filename in the job execution context for step 3 (a tasklet step) to pick up??
We have all code for this except the storing and retrieving of status info in the context.
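As a sketch of the missing piece, here is the bookkeeping modelled in plain Java - a Map stands in for the execution context. In the real job each slave would presumably write its key into its step ExecutionContext and promote it to job scope (e.g. via Spring Batch's ExecutionContextPromotionListener), but the decision logic would be the same; all names here are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the job ExecutionContext: each partition records the
// outcome for its file, and the final tasklet step reads the entries
// to decide where each file should be moved.
public class StatusRegistry {

    private final Map<String, String> statusByFile = new ConcurrentHashMap<>();

    /** Called by each partition after processing its file. */
    public void record(String fileName, boolean success) {
        statusByFile.put(fileName, success ? "DONE" : "FAILED");
    }

    /** Called by the step-3 tasklet: "done" only for files recorded as successful. */
    public String targetDirectory(String fileName) {
        return "DONE".equals(statusByFile.get(fileName)) ? "done" : "error2";
    }
}
```

Defaulting unknown files to "error2" is deliberate: a file with no recorded status means its partition never completed, which shouldn't count as success.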
- another idea was to have a Partitioner for step 2 which also performs the filtering described for step 1. But then the partitioner would also be tasked with moving non-compliant files to "error1", which somehow feels unnatural to me. Passing info from step 2 to step 3 would be done the same way.
We have all code for this, with a custom partitioner which scans the files. The partitioner is not tested yet (we wrote it yesterday afternoon).
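Roughly what such a filtering partitioner could look like, again modelled as plain Java: a Map<String, String> stands in for the map of per-partition ExecutionContexts that a real Partitioner would return, and the EDIFACT check (optional "UNA" advice or "UNB" header as the first segment) plus all names are assumptions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class FilteringPartitioner {

    /**
     * Scans inputDir, moves non-EDIFACT files to error1Dir, and returns
     * one named partition per surviving file (partition name -> file path).
     */
    static Map<String, String> partition(Path inputDir, Path error1Dir) throws IOException {
        Files.createDirectories(error1Dir);
        Map<String, String> partitions = new LinkedHashMap<>();
        int i = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inputDir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) {
                    continue;
                }
                if (isEdifact(f)) {
                    partitions.put("partition" + i++, f.toString());
                } else {
                    // the part that feels unnatural: routing to error1 from here
                    Files.move(f, error1Dir.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return partitions;
    }

    // Heuristic check: first segment tag is "UNA" or "UNB".
    static boolean isEdifact(Path f) throws IOException {
        try (InputStream in = Files.newInputStream(f)) {
            byte[] head = in.readNBytes(3);
            if (head.length < 3) {
                return false;
            }
            String tag = new String(head, StandardCharsets.US_ASCII);
            return tag.equals("UNA") || tag.equals("UNB");
        }
    }
}
```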
- yet another idea is to have step 1 be a tasklet as before, and then create a partitioned step where each slave is a FlowStep containing a conditional flow: first process the file, then choose a move step based on the success or failure of the processing.
On the face of it this seems like the most natural way to do it; OTOH it might get very complicated configuration-wise. As far as code is concerned we have all the necessary parts.
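Configuration-wise, a rough sketch of that third alternative in Spring Batch XML might look like the following. All bean names (filterTasklet, filePartitioner, ediReader, dbWriter, taskExecutor, and the two move tasklets) are placeholders for our existing components, and the exact transition syntax would need checking against the batch namespace docs:

```xml
<batch:job id="ediJob">
    <batch:step id="filterFiles" next="processFiles">
        <batch:tasklet ref="filterTasklet"/>
    </batch:step>
    <batch:step id="processFiles">
        <batch:partition step="handleOneFile" partitioner="filePartitioner">
            <batch:handler grid-size="10" task-executor="taskExecutor"/>
        </batch:partition>
    </batch:step>
</batch:job>

<!-- each slave is a FlowStep wrapping this flow -->
<batch:step id="handleOneFile">
    <batch:flow parent="handleOneFileFlow"/>
</batch:step>

<batch:flow id="handleOneFileFlow">
    <batch:step id="loadFile">
        <batch:tasklet>
            <batch:chunk reader="ediReader" writer="dbWriter" commit-interval="10"/>
        </batch:tasklet>
        <!-- branch on outcome: failures go to error2, everything else to done -->
        <batch:next on="FAILED" to="moveToError2"/>
        <batch:next on="*" to="moveToDone"/>
    </batch:step>
    <batch:step id="moveToDone">
        <batch:tasklet ref="moveToDoneTasklet"/>
    </batch:step>
    <batch:step id="moveToError2">
        <batch:tasklet ref="moveToError2Tasklet"/>
    </batch:step>
</batch:flow>
```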
If anybody could offer some advice it would be most helpful, since at the moment we are more or less stuck trying to figure out which would be the most solid solution.
The input directory we read from is in fact the output of an earlier shell script and could basically contain anything. Volumes, however, are pretty low; I would expect no more than ca. 10-15 files per run.
Thanks in advance for any wisdom you might have to offer.