Aug 3rd, 2010, 08:05 AM
Multi-threaded FlatFileItemReader
I am working with a huge CSV input file, where the staging approach may be a performance killer because the actual processing is very simple (an SQL insert).
I wanted to partition the step so that each partition streams part of the file (line 0 to line 9999, line 10000 to line 19999, etc.). Restartability of each partition could be guaranteed by storing the chunk boundary in the execution context and streaming the file again upon restart (from start + read.count).
If this is not built into Spring Batch, I'd like to understand what is wrong with my reasoning. Or is it simply a valid feature that has not been implemented?
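To make the idea concrete, here is a minimal sketch of the line-range arithmetic in plain Java. It does not use the Spring Batch API; the class `LineRangePartitions`, the type `LineRange`, and the methods `split` and `resumeAt` are hypothetical names chosen for illustration, and ranges are assumed to be half-open (start inclusive, end exclusive).

```java
import java.util.ArrayList;
import java.util.List;

public class LineRangePartitions {

    /** Half-open line range [start, end) assigned to one partition (hypothetical type). */
    public static final class LineRange {
        public final long start;
        public final long end;

        public LineRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        /** Line to resume from after a restart: start + number of items already read. */
        public long resumeAt(long readCount) {
            return Math.min(start + readCount, end);
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Splits totalLines into at most gridSize contiguous ranges of near-equal size. */
    public static List<LineRange> split(long totalLines, int gridSize) {
        if (totalLines < 0 || gridSize <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and gridSize > 0");
        }
        List<LineRange> ranges = new ArrayList<>();
        long size = (totalLines + gridSize - 1) / gridSize; // ceiling division
        for (long start = 0; start < totalLines; start += size) {
            ranges.add(new LineRange(start, Math.min(start + size, totalLines)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // One million lines over four partitions, as an example.
        for (LineRange range : split(1_000_000, 4)) {
            System.out.println(range);
        }
    }
}
```

In a real job, the `start` and `end` values and the read count would presumably be stored in each partition's execution context, so that a restarted partition can call something like `resumeAt` to find where to continue.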
Aug 5th, 2010, 11:25 AM
It should work as you describe just fine, but we never provided support for this explicitly in the framework because I'm not convinced it helps. Especially if the output side of the transaction is quite simple, I'm not sure if it will help to partition the file this way because all partitions will have to read the file up to the point where they can start processing anyway. If you try it and it helps let me know.
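The cost mentioned above can be illustrated with a small sketch in plain Java (not the Spring Batch `FlatFileItemReader`; `SkippingLineReader` and `readRange` are hypothetical names). Because lines have variable length, a partition cannot seek directly to its first line: it has to read and discard every line before `start`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SkippingLineReader {

    /**
     * Returns the lines with index in [start, end), counting from 0.
     * Every line before start is still read from the source and then discarded.
     */
    public static List<String> readRange(Reader source, long start, long end) {
        List<String> items = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            long index = 0;
            while (index < end && (line = reader.readLine()) != null) {
                if (index >= start) {
                    items.add(line); // inside this partition's range
                }
                index++; // lines before start are skipped, but were still read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        String csv = "a,1\nb,2\nc,3\nd,4\n";
        // The partition owning lines 2 and 3 still reads lines 0 and 1 first.
        System.out.println(readRange(new StringReader(csv), 2, 4)); // prints [c,3, d,4]
    }
}
```

With N partitions, the last one reads almost the whole file before producing its first item, which is why the read side may not benefit as much as the write side.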
Aug 9th, 2010, 08:47 AM
Originally Posted by Dave Syer
We did implement the multi-threaded file reader and the related partitioner to give it a shot. The partitioner is restricted to one line = one item, but it can quite easily be extended.
Here are the results with a CSV file containing one million entries. The job only reads each item, parses it into a value object (VO) and passes it to a JPA DAO backed by Hibernate.
• Single thread (standard SB reader), commit-interval=5: 15 min 54 sec
• Single thread (standard SB reader), commit-interval=50: 7 min 44 sec
• Single thread (standard SB reader), commit-interval=100: 7 min 06 sec
• Multi-reader with a partition, gridSize=10, commit-interval=50: 6 min 18 sec
• Multi-reader with a partition, gridSize=4, commit-interval=50: 3 min 47 sec
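For a like-for-like comparison, the two multi-reader runs can be set against the single-thread run with the same commit-interval (50). The arithmetic below is only a convenience sketch over the timings listed above; `SpeedupCheck` and its helpers are hypothetical names. It gives roughly 1.2x for gridSize=10 and roughly 2.0x for gridSize=4.

```java
import java.util.Locale;

public class SpeedupCheck {

    /** Converts a "min sec" timing to seconds. */
    static long seconds(int minutes, int secs) {
        return minutes * 60L + secs;
    }

    /** Speedup of a candidate run relative to a baseline run. */
    static double speedup(long baselineSeconds, long candidateSeconds) {
        return (double) baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        long baseline = seconds(7, 44); // single thread, commit-interval=50
        long grid10 = seconds(6, 18);   // gridSize=10, commit-interval=50
        long grid4 = seconds(3, 47);    // gridSize=4, commit-interval=50

        System.out.println(String.format(Locale.ROOT, "gridSize=10: %.2fx", speedup(baseline, grid10)));
        System.out.println(String.format(Locale.ROOT, "gridSize=4: %.2fx", speedup(baseline, grid4)));
    }
}
```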
You are right: I/O is completely saturated while the threads are moving their cursors to the right spot. Once this is done, processing is obviously much faster.
If you are interested in the code of the reader and the partitioner, my company is happy to contribute it back.
Aug 10th, 2010, 02:54 AM
Interesting data. Can you open a JIRA issue and post your code there?
Aug 10th, 2010, 07:15 AM
Here you go:
This was on a MacBook Pro with an SSD, by the way, but we also have a benchmarking infrastructure with non-SSD drives. The setup is slightly different, but the performance increase factor is almost the same.