Suppose you have a lookup/code database table of +100,000 records:

Database dataset:
1 - Value1 - valid
2 - Value2 - valid
3 - Value3 - valid
...

Every now and then (daily or hourly), a single file comes that contains the full dataset to be used update this table. Any additional rows the database table has that the file dataset does not include should be marked as 'invalid':

New dataset (file):

1 - NewValue1
2 - Value2
4 - Value4

Result:

1 - NewValue1 - valid
2 - Value2 - valid
3 - Value3 - invalid
4 - Value4 - valid

At no point should the database table contain a only a partially updated subset of the data - that is, the full table should be updated as a single unit of work. (Our current logic is: first mark all the data as invalid, and for every insert/update, set the row as valid.)

We wish to use 'batches' such that we can use JDBC's batchUpdate functionality, but we only wish to have one commit to update the entire table at once. At first, we considered setting commit-interval to Integer.MAX_VALUE but we feel that job results should not be dependent on commit-interval. Furthermore, we do not want all the records to be stored in memory at the same time.

So then, we wish to define a JDBC batchUpdate size, but not a commit-size. Is the best solution to code everything in a Tasklet? Or can spring batch's chunking pattern (ItemReader/ItemProcessor/ItemWriter) be used in a special way such that only one commit is performed at the end of the job?

We have considered writing an ItemReader<Iterable<T>> and an ItemWriter<Iterable<T>> such that it is sort of a 'stream' sent between the reader and writer, and the iterable created in the ItemReader gets its data from a delegate ItemReader, and the ItemWriter writes it's data to a delagate ItemWriter. The ItemWriter then could keep count of how much is actually read, something like:

Code:
public class IterableWriter<T> implements ItemWriter<Iterable<T>>, StepExecutionListener {

	private int readCount = 0;
	private int batchSize = 1000;
	private ItemWriter<T> delegate;

	@Override
	public void write(List<? extends Iterable<T>> items) throws Exception {
		// read 'stream' from Iterable and keep track of actual read count
		// send data to delegate everytime the batchSize is reached
	}

	@Override
	public ExitStatus afterStep(StepExecution stepExecution) {
		stepExecution.setReadCount(readCount);
		return null;
	}
}
We have also considered using a helper table (not an actual temporary table, because Oracle empties them upon commit or rollback, and with a connection pool, temporary tables using session retention would be unpredictable). The logic would be to first empty the helper table, populate it in chunks/several commits, then at the end use an execution step to use Oracle's MERGE statement to move the data to the actual table in a single commit. This solution has parallel processing problems such that only a single job can use on the helper table at a time. Also, it seems like a much more complex solution to a simple problem.

The 'lookup' table is given as an example as this is easy to relate to, but we also have several other batch jobs that have the same problem in that many records must be committed in a single transaction, but we still wish to have 'chunking' in relation to JDBC batchUpdate to speed up the import, and the business logic can be easily divided into reader/processor/writer parts.

I guess what I am looking for is an interface that the ItemWriter could implement such that it could control the transaction directly - chunk commits do nothing (or perhaps issue a savepoint), but job commits are the actual commit.

Any suggests?

Thanks!
Paul