
Thread: CSV file process - multi-thread; which class to use?

  1. #1
    Join Date
    Jan 2005
    Posts
    10

    CSV file process - multi-thread; which class to use?

    Here is what I need to do:

    - Receive a file (a large CSV file)
    - Import records using one transaction but multiple threads (split the file?)
    - The import process uses existing Spring/Hibernate objects.
    - Commit or roll back depending on success
    - Do NOT need restart.

    QUESTION:
    Which class, method, or strategy in Spring Batch is best for this?
    Please advise.

    Thanks!

  2. #2
    Join Date
    Dec 2006
    Posts
    1,061

    Before answering the specific points: why are multiple threads needed to load this file? What is the average file size? Is the long run time due to processing? If so, using a staging table would be a good approach. In general, I would first try loading the file with spring-batch without multi-threading (but with a huge commit interval, assuming the data is fairly clean), and only if that doesn't perform well start thinking about splitting the file, etc.
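
    To illustrate what a commit interval means in the single-threaded approach, here is a minimal sketch in plain Java. It is not Spring Batch API: `CommitIntervalSketch`, `loadInChunks`, and the `Consumer` that stands in for "write this chunk and commit" are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: single-threaded load with one commit per chunk of records.
class CommitIntervalSketch {

    // Hands records to commitChunk in groups of at most commitInterval
    // and returns how many commits were made.
    static int loadInChunks(Iterable<String> records, int commitInterval,
                            Consumer<List<String>> commitChunk) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be at least 1");
        }
        List<String> chunk = new ArrayList<>();
        int commits = 0;
        for (String record : records) {
            chunk.add(record);
            if (chunk.size() == commitInterval) {
                commitChunk.accept(new ArrayList<>(chunk)); // full chunk: write and commit
                chunk.clear();
                commits++;
            }
        }
        if (!chunk.isEmpty()) {
            commitChunk.accept(new ArrayList<>(chunk)); // final, partially filled chunk
            commits++;
        }
        return commits;
    }
}
```

    A larger interval means fewer commits, but more work is lost if a chunk fails.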

    On the specific points: you could kick off multiple threads from the ItemProvider, and there has been some discussion about this, but we don't have any concrete examples to refer you to. A TaskExecutorRepeatTemplate could be used at the chunk level, or a CompositeItemProvider could be used, but there would be issues synchronizing the file with the transaction, since Spring's TransactionSynchronizationManager stores the objects it notifies in a thread-local.
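
    The thread-local point can be shown with plain Java. The sketch below is a simplified stand-in, not Spring's code: `ThreadBoundSynchronizations` and its methods are hypothetical. It shows only that something registered on the calling thread is invisible to a worker thread.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for thread-bound transaction state; NOT Spring's actual implementation.
class ThreadBoundSynchronizations {

    private static final ThreadLocal<List<String>> SYNCHRONIZATIONS = new ThreadLocal<>();

    // Registers a name with the current thread only.
    static void register(String name) {
        List<String> list = SYNCHRONIZATIONS.get();
        if (list == null) {
            list = new ArrayList<>();
            SYNCHRONIZATIONS.set(list);
        }
        list.add(name);
    }

    // Number of registrations visible to the current thread.
    static int countOnCurrentThread() {
        List<String> list = SYNCHRONIZATIONS.get();
        return list == null ? 0 : list.size();
    }

    // Registers on the calling thread, then reports {count seen by caller, count seen by a new worker}.
    static int[] callerVersusWorker() {
        SYNCHRONIZATIONS.remove();
        register("file-reader");
        int[] seenByWorker = new int[1];
        Thread worker = new Thread(() -> seenByWorker[0] = countOnCurrentThread());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        int seenByCaller = countOnCurrentThread();
        SYNCHRONIZATIONS.remove();
        return new int[] { seenByCaller, seenByWorker[0] };
    }
}
```

    The worker sees nothing that the caller registered, which is why a file reader tied to one thread's transaction is not automatically notified when work runs on other threads.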

  3. #3
    Join Date
    Jan 2005
    Posts
    10

    In answer to your question above: processing time.
    If I split the file, why run the pieces serially?
    Before your reply I was considering using a queue and letting several threads work on it.

    Thanks for your thoughts on this.

  4. #4
    Join Date
    Dec 2006
    Posts
    1,061

    Are the files arriving split, or do you have to split them? If they're already split, then I agree that it makes sense to try to process them in parallel. If you want to do so within one job, you could use a queue in between the provider and the processor to help, but there would still be issues in synchronizing the disparate file input sources with the transaction.
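
    A queue between the reading side and the processing side might look roughly like the following. This is a plain-Java sketch, not Spring Batch code: `QueueHandOffSketch`, the queue capacity of 100, and the end-of-input marker are all assumptions made for illustration. It does not address the transaction synchronization issue; each worker would still have to manage its own transaction.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the calling thread feeds a bounded queue, several workers drain it.
class QueueHandOffSketch {

    // Distinct object used as an end-of-input marker; compared by identity, not by content.
    private static final String END_OF_INPUT = new String("<end-of-input>");

    // Pushes every line through the queue and returns how many lines the workers handled.
    static int process(List<String> lines, int workerCount) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        AtomicInteger processed = new AtomicInteger();
        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (String line = queue.take(); line != END_OF_INPUT; line = queue.take()) {
                        processed.incrementAndGet(); // the real per-record work would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (String line : lines) {
                queue.put(line); // blocks when the queue is full, so the reader cannot run far ahead
            }
            for (int i = 0; i < workerCount; i++) {
                queue.put(END_OF_INPUT); // one marker per worker so every worker stops
            }
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while importing", e);
        }
        return processed.get();
    }
}
```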

    If it's processing time that takes a while, and not the I/O, I would still recommend loading the file directly into a staging table, then doing the processing you need once the data has been loaded into the database.

  5. #5
    Join Date
    Jan 2005
    Posts
    10

    The file arrives UN-split.

    If each process/thread can have its own transaction, which approach do you see as best? What is the advantage of the staging table?

    I apologize for all the questions. I see a myriad of classes in the API, so I need to understand how each of them is intended to be used.

    thanks!

  6. #6

    Can each thread in a chunkOperations have its own transaction (out of the box)? I'm no expert, but from my (limited) knowledge you may need to make sure the simpleStepExecutor's transaction manager is a "dummy" and add transaction support around each repeatIterator (this could be done with a RepeatInterceptor or with AOP around the chunkOperations Tasklet).
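
    The idea of one transaction per thread can be sketched without Spring at all. Everything here is hypothetical (`PerChunkTransactionSketch`, the in-memory list that stands in for the database, and the rule that an empty record fails the chunk). Note that this gives one commit or rollback per chunk, not the single transaction for the whole file that the original question asked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each chunk is handled on a pool thread with its own commit or rollback.
class PerChunkTransactionSketch {

    // Returns the records of all chunks that "committed"; a failing chunk contributes nothing.
    static List<String> importChunks(List<List<String>> chunks, int threads) {
        List<String> committed = new CopyOnWriteArrayList<>(); // stands in for the database
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> chunk : chunks) {
                pending.add(pool.submit(() -> {
                    List<String> uncommitted = new ArrayList<>(); // "begin": private to this task
                    try {
                        for (String record : chunk) {
                            if (record.isEmpty()) {
                                throw new IllegalArgumentException("empty record");
                            }
                            uncommitted.add(record);
                        }
                        committed.addAll(uncommitted); // "commit": publish the whole chunk at once
                    } catch (RuntimeException e) {
                        // "rollback": the uncommitted work is dropped; other chunks are unaffected
                    }
                }));
            }
            for (Future<?> task : pending) {
                task.get(); // wait for every chunk to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException("import did not complete", e);
        } finally {
            pool.shutdown();
        }
        return committed;
    }
}
```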

    Just my 2 cents.
    Regards
    AB

  7. #7
    Join Date
    Jul 2005
    Location
    Helsinki, Finland
    Posts
    12

    staging table

    Quote (originally posted by lucasward):
    If so, using a staging table would be a good approach.
    A staging table is a good idea if transactional integrity is required.
    You might want to use database temporary tables for staging. Here's an article about this approach for DB2:
    High performance inserts using JDBC Type 4 in a constrained environment: Leverage DB2 declared global temporary tables

  8. #8
    Join Date
    Jan 2005
    Posts
    10

    Ok, I am on board with that line of thinking.
    I am thinking through the pros/cons of using a Queue as the repository (aka database).

  9. #9
    Join Date
    Dec 2006
    Posts
    1,061

    Quote (originally posted by epleisman):
    I am thinking through the pros/cons of using a Queue as the repository (aka database).
    Do you mean, using a Queue as your 'staging table'?

  10. #10
    Join Date
    Jan 2005
    Posts
    10

    Yes - Queue as staging table.
