Thread: Maintaining partition order

  1. #1
    Join Date
    Apr 2009
    Posts
    4

    Maintaining partition order

    I'm processing a series of files, which are named with datestamps. I need the filename because I've got to do some record keeping, so I followed the partitioning approach found in this thread.

    The issue is that I need the files processed in order, but this isn't happening. Although the files are partitioned in order (that is, the partitioned steps are named in the correct order), they're processed out of order (that is, the step IDs are not in the correct order).

    I believe this is because, after the files are partitioned and assigned to step executions, the step executions are stored in a HashMap. A HashMap makes no guarantee about iteration order, which would explain why the steps are processed in arbitrary order.

    The fix I employed was to create a version of MultiResourcePartitioner that uses LinkedHashMap instead of HashMap in partition(), and a version of SimpleStepExecutionSplitter that uses LinkedHashSet instead of HashSet in split(). This maintains the order.
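A minimal sketch of why the swap matters, in plain Java rather than the actual Spring Batch classes. `OrderedPartitionSketch`, `partition`, and `processingOrder` are hypothetical names used only for illustration, and the `"partition" + index` key naming is an assumption about how the partitions are labelled. The point is only that a `LinkedHashMap` iterates in insertion order, while a `HashMap` promises nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderedPartitionSketch {

    // Hypothetical stand-in for a partitioner: one entry per file, keyed by
    // partition name, inserted in the order the files were supplied.
    static Map<String, String> partition(List<String> fileNames, Map<String, String> target) {
        int index = 0;
        for (String fileName : fileNames) {
            target.put("partition" + index++, fileName);
        }
        return target;
    }

    // With a LinkedHashMap, iterating the values replays the insertion order.
    static List<String> processingOrder(List<String> fileNames) {
        Map<String, String> ordered = partition(fileNames, new LinkedHashMap<>());
        return new ArrayList<>(ordered.values());
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("mon.csv", "tue.csv", "wed.csv");
        // Iteration order of a plain HashMap is unspecified and may differ from input order.
        System.out.println("HashMap:       " + partition(files, new HashMap<>()).values());
        // Iteration order of a LinkedHashMap matches insertion order.
        System.out.println("LinkedHashMap: " + processingOrder(files));
    }
}
```

The same reasoning applies to replacing a `HashSet` with a `LinkedHashSet`: both changes are needed, because an unordered collection anywhere along the way loses the order again.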

    Does this sound reasonable?

  2. #2
    Join Date
    Jun 2005
    Posts
    4,230

    Originally posted by Aidan:
    "Does this sound reasonable?"

    I don't really understand how process ordering can be important for business reasons, and there is no way that a concurrent system can or should try to guarantee the order. Maybe I misunderstood the requirement. But if you say a LinkedHashMap is useful to you, that's fine.

  3. #3
    Join Date
    Apr 2009
    Posts
    4

    Sorry, I was a bit unclear about the requirements in my original post.

    I'm processing catalog data, which is received daily. So I have one file each for Monday, Tuesday, and Wednesday; these aren't really different files, though. They're essentially different versions of the same file. I need to process Monday, then Tuesday, then Wednesday (and in this case I can't simply discard Monday and Tuesday, because we can get partial updates).

    Ideally I process Monday's data on Monday, then Tuesday's on Tuesday. But there are cases where we need to process multiple days' worth in one run (for example, if the client was late getting the data to us).
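For the catch-up case, the files have to reach the partitioner oldest first. A minimal sketch of sorting by the datestamp in the filename, assuming a hypothetical naming scheme of `catalog-yyyyMMdd.csv` (the thread does not give the real format); `DatestampOrderSketch`, `stampOf`, and `oldestFirst` are illustrative names only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DatestampOrderSketch {

    // Assumed datestamp format: yyyyMMdd, e.g. 20090406.
    private static final DateTimeFormatter STAMP = DateTimeFormatter.BASIC_ISO_DATE;

    // Extracts the date between the last '-' and the last '.' of the file name.
    static LocalDate stampOf(String fileName) {
        int dash = fileName.lastIndexOf('-');
        int dot = fileName.lastIndexOf('.');
        return LocalDate.parse(fileName.substring(dash + 1, dot), STAMP);
    }

    // Returns a copy of the file names sorted by datestamp, oldest first.
    static List<String> oldestFirst(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        sorted.sort(Comparator.comparing(DatestampOrderSketch::stampOf));
        return sorted;
    }
}
```

Sorting on the parsed date rather than on the raw string keeps the order correct even if the name prefix varies between files.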

    Partitioning the files works, now that I've got the partition order sorted out. Also, as you mentioned, I can't process them concurrently, so I use a SyncTaskExecutor.
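A minimal sketch of what synchronous execution buys here, using a plain `java.util.concurrent.Executor` that runs each task on the calling thread as a stand-in for Spring's `SyncTaskExecutor`. `SynchronousExecutionSketch` and `runInOrder` are hypothetical names; the lambda that records the file name stands in for the real per-file step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

class SynchronousExecutionSketch {

    // Runs one task per file on the calling thread and records the order
    // in which the tasks actually ran.
    static List<String> runInOrder(List<String> fileNames) {
        // Executes each task immediately, in the caller's thread, so one task
        // finishes before the next one is submitted.
        Executor callerThread = Runnable::run;

        List<String> processed = new ArrayList<>();
        for (String fileName : fileNames) {
            callerThread.execute(() -> processed.add(fileName));
        }
        return processed;
    }
}
```

With a thread-pool executor in place of `callerThread`, the submission order would still be correct but the completion order would not be guaranteed, which is why ordered partitions alone are not enough.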
