Jan 2nd, 2008, 06:52 PM
large transaction handling strategies?
One of the common problems I always run into when running really large batch load processes (millions of updates) is how to handle large transaction sizes. More often than not, I have to add two extra columns to my data and perform lots of mini transactions (each its own new transaction), followed by one large "update" that makes all the new records live and/or deletes the records being replaced. The first added column ties a record to a particular update job, and the second indicates whether that record is live or still in the process of being added (this usually coincides with a database view over the data for the production instance which only shows "live" records). This is a particularly tricky operation, and cleaning up after a failed batch process is no longer as easy as relying on a normal transaction to roll back the work. What I'd like to see is something to make this easier... any ideas?
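To make the two-column scheme concrete, here is a minimal sketch of the idea using Python and SQLite purely for illustration (the table, view, and column names are invented; the original context is a Java/Spring Batch setup, but the pattern is database-level and language-agnostic): records are loaded invisibly in small independent transactions, and a single cheap "flip" transaction at the end makes the whole job live at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        id      INTEGER PRIMARY KEY,
        payload TEXT,
        job_id  INTEGER,            -- column 1: ties the record to a batch job
        live    INTEGER DEFAULT 0   -- column 2: 0 = staged, 1 = visible
    )
""")
# Production only ever reads through a view restricted to "live" records.
conn.execute(
    "CREATE VIEW live_records AS SELECT id, payload FROM records WHERE live = 1"
)

JOB_ID = 42
staged = ["row-%d" % i for i in range(10)]

# Phase 1: load in many small transactions; rows stay invisible (live = 0),
# so a crash mid-load never exposes partial data to the view.
CHUNK = 3
for i in range(0, len(staged), CHUNK):
    with conn:  # each chunk commits independently
        conn.executemany(
            "INSERT INTO records (payload, job_id, live) VALUES (?, ?, 0)",
            [(p, JOB_ID) for p in staged[i:i + CHUNK]],
        )

# Phase 2: one small transaction flips the whole job live at once
# (replaced records could be deleted in this same transaction).
with conn:
    conn.execute("UPDATE records SET live = 1 WHERE job_id = ?", (JOB_ID,))

print(conn.execute("SELECT COUNT(*) FROM live_records").fetchone()[0])  # -> 10
```

Cleanup after a failed job is then a targeted delete rather than a rollback, e.g. `DELETE FROM records WHERE job_id = ? AND live = 0`, which is the part that is awkward to do by hand and where framework support would help.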
I guess one option would be to generalise some of the code I wrote to do this and let people plug in their DAO of choice to actually store the records. The project I was working on previously was an open source biodiversity portal, so there shouldn't be any issues as far as code sharing goes - just the time needed to generalise the code into an acceptable solution.
Jan 3rd, 2008, 06:57 AM
I think you are describing a variant of the "process indicator pattern". We have seen this quite a lot; it is described in the user guide and has a very basic implementation in the parallelJob sample. But it is definitely a concern that the framework can help with.
If I were you I'd wait until m4 is out before trying to merge in any code of your own, but we'd certainly be happy to accept contributions whenever you are ready (raise a JIRA and attach a patch with tests).