Thread: multi threaded FlatFileItemReader

  1. #1

    Multi-threaded FlatFileItemReader

    I am working on a job that takes a huge CSV file as input. The staging functionality may be a performance killer here, since the actual processing is very simple (a SQL insert).

    I would like to partition the step so that each partition streams part of the file (lines 0 to 9999, lines 10000 to 19999, etc.). Each partition could be made restartable by storing its chunk boundary in the execution context and streaming the file again upon restart, skipping ahead to start + read.count.
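
    The line-range split and the restart position described above can be sketched in plain Java. This is only an illustration under stated assumptions: `LineRangePartitioner`, `LineRange`, `split` and `resumeLine` are hypothetical names, not Spring Batch classes, and the total line count is assumed to be known up front. In a real job the ranges would be placed into the `ExecutionContext` instances returned by a Spring Batch `Partitioner`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spring Batch): splits a file of
// totalLines lines into at most gridSize contiguous line ranges.
public class LineRangePartitioner {

    /** Half-open line range [start, end) assigned to one partition. */
    public static final class LineRange {
        public final long start;
        public final long end;

        public LineRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    public static List<LineRange> split(long totalLines, int gridSize) {
        if (totalLines < 0 || gridSize <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and gridSize > 0");
        }
        List<LineRange> ranges = new ArrayList<>();
        long base = totalLines / gridSize;
        long remainder = totalLines % gridSize;
        long start = 0;
        for (int i = 0; i < gridSize; i++) {
            // The first 'remainder' partitions take one extra line each.
            long size = base + (i < remainder ? 1 : 0);
            if (size == 0) {
                break; // fewer lines than partitions: no empty ranges
            }
            ranges.add(new LineRange(start, start + size));
            start += size;
        }
        return ranges;
    }

    /** Line to resume from on restart: start + read count, capped at the range end. */
    public static long resumeLine(LineRange range, long readCount) {
        return Math.min(range.start + readCount, range.end);
    }
}
```

    With one million lines and a grid size of 4, this yields four ranges of 250,000 lines each. On restart, a partition that had already read 2 lines of the range [4, 7) would resume at line 6.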

    If this is not built into Spring Batch, I'd like to understand what is wrong with my reasoning. Or is it simply a valid feature that has not been implemented?

    Thanks!

  2. #2

    It should work as you describe just fine, but we never provided support for this explicitly in the framework because I'm not convinced it helps. Especially if the output side of the transaction is quite simple, I'm not sure if it will help to partition the file this way because all partitions will have to read the file up to the point where they can start processing anyway. If you try it and it helps let me know.
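
    The cost mentioned here, that every partition must read the file up to its own starting point, shows up in a minimal sketch of such a reader. `RangeLineReader` is a hypothetical class written for illustration, not the Spring Batch `FlatFileItemReader` and not the code later attached to the JIRA issue. It assumes one line equals one item.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical reader (not a Spring Batch class): returns only the lines
// in [start, end), but still has to scan every line before 'start'.
public class RangeLineReader implements AutoCloseable {

    private final BufferedReader reader;
    private final long end;
    private long currentLine; // index of the next line to be read

    public RangeLineReader(Reader source, long start, long end) {
        this.reader = new BufferedReader(source);
        this.end = end;
        try {
            // Skipping is a sequential scan: a partition starting at line N
            // pays for reading N lines before it produces its first item.
            while (currentLine < start && reader.readLine() != null) {
                currentLine++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next line of the range, or null once the range or the input is exhausted. */
    public String read() {
        if (currentLine >= end) {
            return null;
        }
        try {
            String line = reader.readLine();
            if (line != null) {
                currentLine++;
            }
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Position to store for restart; equals start + number of lines read so far. */
    public long getCurrentLine() {
        return currentLine;
    }

    @Override
    public void close() {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    On restart, the same class can be reused by passing the stored position as the new `start`, which again costs a scan from the beginning of the file.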

  3. #3

    Originally posted by Dave Syer:
    It should work as you describe just fine, but we never provided support for this explicitly in the framework because I'm not convinced it helps. Especially if the output side of the transaction is quite simple, I'm not sure if it will help to partition the file this way because all partitions will have to read the file up to the point where they can start processing anyway. If you try it and it helps let me know.
    Hi Dave,

    We implemented the multi-threaded file reader and the related partitioner to give it a shot. The partitioner is restricted to one line = one item, but it could quite easily be extended.

    Here are the results with a CSV file containing one million entries. The job only reads each item, parses it into a value object (VO) and passes it to a JPA DAO backed by Hibernate.

    • Single thread (standard Spring Batch reader), commit-interval=5: 15 min 54 sec
    • Single thread (standard Spring Batch reader), commit-interval=50: 7 min 44 sec
    • Single thread (standard Spring Batch reader), commit-interval=100: 7 min 06 sec
    • Multi-reader with a partition, gridSize=10, commit-interval=50: 6 min 18 sec
    • Multi-reader with a partition, gridSize=4, commit-interval=50: 3 min 47 sec
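
    To compare like with like, the two multi-reader runs can be set against the single-threaded run at the same commit interval (50). The small sketch below only redoes that arithmetic from the timings listed above; `SpeedupCalc` is a throwaway name for illustration.

```java
import java.util.Locale;

// Arithmetic on the timings listed above; the numbers come from the
// benchmark list, the class itself is only an illustration.
public class SpeedupCalc {

    public static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    /** Speedup factor of a candidate run relative to a baseline run. */
    public static double speedup(int baselineSeconds, int candidateSeconds) {
        return (double) baselineSeconds / candidateSeconds;
    }

    public static void main(String[] args) {
        int baseline = seconds(7, 44); // single thread, commit-interval=50
        System.out.printf(Locale.ROOT, "gridSize=10: %.2fx%n", speedup(baseline, seconds(6, 18)));
        System.out.printf(Locale.ROOT, "gridSize=4:  %.2fx%n", speedup(baseline, seconds(3, 47)));
    }
}
```

    This works out to roughly 1.23x for gridSize=10 and 2.04x for gridSize=4, so in this test the smaller grid gave the larger gain.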


    You are right: I/O is completely saturated while the threads are moving their cursors to the right spot. Once that is done, processing is obviously much faster.

    If you are interested in the code of the reader and the partitioner, my company is happy to contribute it back.

    Best,
    Stéphane

  4. #4

    Interesting data. Can you create a JIRA issue and post your code there?

  5. #5

    Here you go:
    https://jira.springframework.org/browse/BATCH-1613

    By the way, this was on a MacBook Pro with an SSD, but we also have a benchmarking infrastructure with non-SSD drives. The setup there is slightly different, but the performance increase factor is nearly the same.
