Sep 5th, 2007, 09:29 AM
CSV file process - multi-thread; which class to use?
Here is what I need to do:
- Receive file ( large CSV file )
- Import records using one TX but multiple threads ( split file? )
- The import process uses existing Spring/Hibernate objects.
- Commit or rollback depending on success
- Do NOT need restart.
Which class / method / strategy would be best using Spring Batch?
Sep 5th, 2007, 11:01 AM
Before diving into answering the specific points, I'm curious to ask why multiple threads are needed to load this file? What is the average file size to load? Is the large time to run the job related to processing? If so, using a staging table would be a good approach. In general, I would try loading the file using spring-batch without multi-threading (but using a huge commit interval, assuming the data is fairly clean) and if it isn't performing, start thinking about splitting the file, etc.
To move on to the specific points: you could kick off multiple threads from the ItemProvider, and there has been some discussion about this, but we don't have any concrete examples to refer you to. A TaskExecutorRepeatTemplate could be used at the chunk level, or a CompositeItemProvider could be used, but there would be issues synchronizing the file with the transaction, since Spring's TransactionSynchronizationManager stores the classes it notifies in a ThreadLocal.
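The single-threaded, large-commit-interval approach suggested above can be sketched in plain Java. This illustrates only the chunking idea, not the Spring Batch API; CsvLoader and flushChunk are hypothetical names, and the flush step stands in for a Hibernate session flush plus commit.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the single-threaded "huge commit interval" approach: read the CSV
// line by line, buffer records, and flush (commit) once per chunk rather than
// once per record. CsvLoader and flushChunk are illustrative names, not part
// of the Spring Batch API.
public class CsvLoader {

    private final int commitInterval;
    private int commits = 0;       // how many chunk commits were issued
    private int recordsLoaded = 0; // total records processed

    public CsvLoader(int commitInterval) {
        this.commitInterval = commitInterval;
    }

    public void load(List<String> lines) {
        List<String> chunk = new ArrayList<>();
        for (String line : lines) {
            chunk.add(line);
            if (chunk.size() == commitInterval) {
                flushChunk(chunk);
            }
        }
        if (!chunk.isEmpty()) {
            flushChunk(chunk); // commit the final partial chunk
        }
    }

    private void flushChunk(List<String> chunk) {
        // In a real job this is where the Hibernate session would be flushed
        // and the transaction committed.
        recordsLoaded += chunk.size();
        commits++;
        chunk.clear();
    }

    public int getCommits() { return commits; }
    public int getRecordsLoaded() { return recordsLoaded; }
}
```

With a commit interval of 1000, a 2500-record file costs three commits instead of 2500, which is often enough of a win on its own.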
Sep 5th, 2007, 01:53 PM
Processing time, in answer to your above question.
If I split the files - why run serially?
Prior to your reply I was considering using a queue and letting several threads work on it.
Thanks for your thoughts on this.
Sep 5th, 2007, 06:48 PM
Are the files arriving split? or do you have to split them? If they're already split, then I agree that it makes sense to try and process them in parallel. If you want to do so within one job, you could use a queue in between the provider and the processor to help, but there would still be issues in synchronizing the disparate file input sources with the transaction.
If it's the processing that takes a while, and not the I/O, I would still recommend loading the file directly into a staging table, then doing the processing you need once the data has been loaded into the database.
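The queue between the provider and the processor mentioned above could look roughly like this: a single reader thread feeds a BlockingQueue and several worker threads drain it, with a poison-pill marker to shut the workers down. All class and constant names here are illustrative, and the transaction-synchronization problem discussed earlier is deliberately not addressed.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a queue sitting between the file reader (producer) and several
// record processors (consumers). One poison pill per worker signals shutdown.
public class QueuedCsvProcessor {

    private static final String POISON_PILL = "__EOF__";

    public static int process(List<String> lines, int workers) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger processed = new AtomicInteger();

        // Consumers: each takes lines off the queue until it sees the pill.
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (POISON_PILL.equals(line)) break;
                        processed.incrementAndGet(); // real record processing would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }

        // Producer: the single reader feeds the queue, then one pill per worker.
        for (String line : lines) queue.put(line);
        for (int i = 0; i < workers; i++) queue.put(POISON_PILL);

        for (Thread t : threads) t.join();
        return processed.get();
    }
}
```

This keeps the file read strictly sequential (one reader) while fanning out the expensive per-record work, which matches the "processing time, not I/O" situation described above.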
Sep 6th, 2007, 07:44 AM
The file arrives UN-split.
If each process/thread can have its own transaction, then which approach do you see best? What is the advantage of the staging table?
I apologize for the questions. There are so many classes in the API that I need to understand which implementations are intended for each use.
Sep 6th, 2007, 09:51 AM
Can each thread in a chunkOperations have its own transaction (out of the box)? I'm no expert, but from my (small) knowledge you may need to make sure the simpleStepExecutor's transaction manager is a "dummy" and add transaction support around each repeatIterator (this could be done with a RepeatInterceptor or with AOP around the chunkOperations Tasklet).
Just my 2 cents.
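A rough sketch of the per-thread transaction idea above, under the assumption that the step's own transaction manager is a no-op and each worker demarcates its own transaction (as the AOP / RepeatInterceptor suggestion would arrange). ChunkTransaction and PerThreadTxWorker are hypothetical names. Note the trade-off: with independent per-thread transactions, a failure rolls back only that thread's chunk, so the original single-transaction requirement is relaxed.

```java
import java.util.List;

// Sketch of per-thread transaction demarcation: the step itself runs without
// a transaction, and each worker wraps its own chunk in its own
// begin/commit/rollback. ChunkTransaction stands in for whatever AOP advice
// or RepeatInterceptor would supply; all names are hypothetical.
public class PerThreadTxWorker implements Runnable {

    public interface ChunkTransaction {
        void begin();
        void commit();
        void rollback();
    }

    private final List<String> chunk;
    private final ChunkTransaction tx;
    private final List<String> committed; // shared, thread-safe sink for committed records

    public PerThreadTxWorker(List<String> chunk, ChunkTransaction tx, List<String> committed) {
        this.chunk = chunk;
        this.tx = tx;
        this.committed = committed;
    }

    @Override
    public void run() {
        tx.begin();
        try {
            committed.addAll(chunk); // real work: the Hibernate inserts
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback(); // only this thread's chunk is undone
        }
    }
}
```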
Sep 6th, 2007, 05:04 PM
A staging table is a good idea if transactional integrity is required.
You might want to use database temporary tables for staging. Here's one article about this approach for DB2 database:
High performance inserts using JDBC Type 4 in a constrained environment: Leverage DB2 declared global temporary tables
Sep 7th, 2007, 09:07 AM
Ok, I am on board with that line of thinking.
I am thinking through the pros/cons of using a Queue as the repository (aka database).
Sep 7th, 2007, 09:53 AM
Do you mean, using a Queue as your 'staging table'?
Sep 7th, 2007, 11:30 AM
Yes - Queue as staging table.