Thread: Wishlist / Coding Examples for the following...

  1. #1
    Join Date
    Apr 2005
    Location
    New York
    Posts
    35

    I've started evaluating Spring Batch and it looks promising. I've worked on several projects that do a fair amount of file batch processing. From my experience, there are some features that the framework needs to support before I can comfortably recommend that our group use it.


    Specifically, does the framework support:
    • native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"? For us it's fine if this breaks the transaction boundary demarcations.
    • multiple record types (i.e. header/detail-rectype-1/detail-rectype-2/trailer)?
    • optional vs required fields (I assume you'd have to use something like the ValidatingItemProvider for this)
    • optional vs required record types (I assume you'd have to use something like the ValidatingItemProvider for this)
    • field padding (left vs right and padding char)
    • field masks (i.e. mask="MM/dd/yy", or masks similar to the java.text.Format)
    • field-by-field default values for empty/null fields (i.e. if field1 is empty or blank, default it to today)
    • On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n)? Perhaps the FieldSet class could contain an attribute for that info?
    • Record separators for files that don't use a CR or CR/LF for the end of the line (i.e. field1|field2|field3|field4!lastfieldinrecord~). Perhaps the solution is to use the RecordSeparatorPolicy and/or SuffixRecordSeparatorPolicy?
    • What if you want to do multi passes of the file - one to validate it (especially useful for files containing multiple record types), then one to process it?



    I'm very familiar with Spring but Spring Batch is totally new to me so perhaps it does support what I'm asking and I just overlooked how to accomplish what I'd like.

    Can someone please help point me at an example of how to do some of the things I have a question on?
    Tony Falabella

  2. #2

    A few quick pointers:

    native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"?
    No, the framework does not provide support for vendor-specific database features.

    multiple record types (i.e. header/detail-rectype-1/detail-rectype-2/trailer)?
    take a look at multilineOrderJob

    On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n).
    you'll want to implement a custom LineTokenizer

    What if you want to do multi passes of the file
    Consider making each pass over the file a separate step in the job.
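
    A minimal sketch of the custom-tokenizer idea for mixed delimiters, assuming the delimiter after each field is known in advance. The class SequentialDelimiterTokenizer and its tokenize method are hypothetical names; the sketch is plain Java and does not implement the framework's LineTokenizer interface or build a FieldSet.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a line using a different delimiter after each field. */
class SequentialDelimiterTokenizer {

    private final char[] delimiters;

    /** @param delimiters the delimiter expected after field 1, field 2, ... in order */
    SequentialDelimiterTokenizer(char... delimiters) {
        this.delimiters = delimiters.clone();
    }

    /** Returns delimiters.length + 1 fields, or throws if an expected delimiter is missing. */
    List<String> tokenize(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (char delimiter : delimiters) {
            int end = line.indexOf(delimiter, start);
            if (end < 0) {
                throw new IllegalArgumentException(
                        "Expected '" + delimiter + "' after field " + (fields.size() + 1));
            }
            fields.add(line.substring(start, end));
            start = end + 1;
        }
        fields.add(line.substring(start)); // the last field runs to the end of the line
        return fields;
    }
}
```

    Because each field is only searched for its own delimiter, a field may contain the characters used to delimit other fields.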

  3. #3

    native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"?
    A tasklet that executes a system command might be what you are looking for: http://jira.springframework.org/browse/BATCH-152
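
    As a rough illustration of what such a step would do internally, here is a plain-Java sketch that launches an external command and waits for it. SystemCommandRunner is a hypothetical helper, not the tasklet from the JIRA issue; the command and its arguments (for example a bulk-load tool) are supplied by the caller.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for launching an external bulk-load tool from a batch step. */
class SystemCommandRunner {

    /**
     * Runs the command, waits up to timeoutSeconds, and returns its exit code.
     * The command's output is forwarded to this JVM's stdout/stderr.
     */
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        if (command == null || command.isEmpty()) {
            throw new IllegalArgumentException("command must not be empty");
        }
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly(); // do not leave a runaway loader behind
            throw new IllegalStateException(
                    "Timed out after " + timeoutSeconds + "s: " + command);
        }
        return process.exitValue();
    }
}
```

    A step would typically treat a non-zero exit code as a failure. The external tool commits on its own, so its work sits outside the job's transaction boundaries, which the original question said was acceptable.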

  4. #4
    Join Date
    Dec 2006
    Posts
    1,061

    native database bulk-copy commands like Sybase "bcp" or Oracle "bulk load"? For us it's fine if this breaks the transaction boundary demarcations.
    As Robert mentioned, there aren't any platform-specific item readers; however, there is no reason why you can't call an Oracle-specific class. I've worked with multiple clients that have done so easily.

    multiple record types (i.e. header/detail-rectype-1/detail-rectype-2/trailer)?
    There is a sample job for this (multi-line job). All you need is to define a LineTokenizer for each record type.

    optional vs required fields (I assume you'd have to use something like the ValidatingItemProvider for this)
    This is something we've discussed, and it is definitely possible with the FixedLengthTokenizer by not including a range within the column definition. However, it isn't possible with the DelimitedLineTokenizer. You could choose not to map a field to a particular object, but with 'automapping' there would be issues. It should be added as an issue in Jira. Can you add one with an example business case where you use optional fields?

    optional vs required record types (I assume you'd have to use something like the ValidatingItemProvider for this)
    By default, if you use the PrefixMatchingCompositeLineTokenizer, every record type would be optional. However, you could easily write your own LineTokenizer that knows which record types are optional or required.
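
    One way to picture a tokenizer that "knows" which record types are required is a small registry keyed by line prefix. The sketch below is plain Java with hypothetical names (RecordTypeChecker, register, missingRequired) and made-up prefixes; it only reports which required record types never appeared and does not tokenize fields.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical check: which required record types never appeared in a file? */
class RecordTypeChecker {

    // prefix -> true if at least one record with that prefix is required
    private final Map<String, Boolean> recordTypes = new LinkedHashMap<>();

    RecordTypeChecker register(String prefix, boolean required) {
        recordTypes.put(prefix, required);
        return this;
    }

    /** Returns the first registered prefix the line starts with, or throws if none matches. */
    String classify(String line) {
        for (String prefix : recordTypes.keySet()) {
            if (line.startsWith(prefix)) {
                return prefix;
            }
        }
        throw new IllegalArgumentException("Unknown record type: " + line);
    }

    /** Returns the required prefixes that no line matched, in registration order. */
    Set<String> missingRequired(List<String> lines) {
        Set<String> missing = new LinkedHashSet<>();
        recordTypes.forEach((prefix, required) -> {
            if (required) {
                missing.add(prefix);
            }
        });
        for (String line : lines) {
            missing.remove(classify(line));
        }
        return missing;
    }
}
```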

    field padding (left vs right and padding char)
    Padding should work for input (see BATCH-261), and there are setters for padding of fields in the FixedLengthAggregator. However, it should probably be more fine-grained than it is currently.
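
    For reference, per-field padding with a choice of side and pad character amounts to very little code. This is a framework-independent sketch; FieldPadder and its Alignment enum are hypothetical and are not the aggregator's actual setters.

```java
/** Hypothetical helper for fixed-width output fields. */
class FieldPadder {

    enum Alignment { LEFT, RIGHT }

    /**
     * Pads value to exactly width characters. LEFT alignment pads on the right,
     * RIGHT alignment pads on the left. Values longer than width are rejected
     * rather than silently truncated.
     */
    static String pad(String value, int width, char padChar, Alignment alignment) {
        if (value.length() > width) {
            throw new IllegalArgumentException(
                    "Value '" + value + "' is longer than " + width + " characters");
        }
        StringBuilder padding = new StringBuilder();
        for (int i = value.length(); i < width; i++) {
            padding.append(padChar);
        }
        return alignment == Alignment.LEFT ? value + padding : padding + value;
    }
}
```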

    field masks (i.e. mask="MM/dd/yy", or masks similar to the java.text.Format)
    Supported

    field-by-field default values for empty/null fields (i.e. if field1 is empty or blank, default it to today)
    Tokenizers don't do this by default, although a FieldSetMapper that you write could easily do it.
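
    The defaulting logic such a mapper would contain can be sketched on its own. The example below assumes a date field with the MM/dd/yy mask from the question and defaults blank input to today; DefaultingDateField is a hypothetical class, and it uses java.time purely for brevity rather than any framework API. The clock is injected so the "today" default can be tested.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical field reader: parses a MM/dd/yy date, defaulting blank input to today. */
class DefaultingDateField {

    private static final DateTimeFormatter MASK = DateTimeFormatter.ofPattern("MM/dd/yy");

    private final Clock clock;

    DefaultingDateField(Clock clock) {
        this.clock = clock;
    }

    /** Returns today's date (per the clock) for null or blank input, else the parsed date. */
    LocalDate read(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return LocalDate.now(clock);
        }
        return LocalDate.parse(raw.trim(), MASK);
    }
}
```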

    On delimited files, what about files that use different delimiters for separating each field (i.e. field1~field2|field3|field4!lastfield\n). Perhaps the FieldSet class could contain an attribute for that info?
    There is a setter for the delimiter type in the DelimitedTokenizer, however, it will be used for every field in the file. I'm curious what the use-case would be for having multiple delimiters per file?

    Record separators for files that don't use a CR or CR/LF for the end of the line (i.e. field1|field2|field3|field4!lastfieldinrecord~). Perhaps the solution is to use the RecordSeparatorPolicy and/or SuffixRecordSeparatorPolicy?
    There is a RecordSeparatorPolicy as part of the FlatFileItemReader.
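
    To make the requirement concrete, here is a framework-independent sketch of reading records that end with a terminator character such as ~ instead of a line break. TerminatedRecordReader is a hypothetical class; it does not use or implement the RecordSeparatorPolicy interface mentioned above.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader: splits a character stream into records on a terminator character. */
class TerminatedRecordReader {

    /**
     * Reads all records ending in the terminator. Line breaks are treated as
     * ordinary data, except that trailing whitespace after the last terminator
     * is tolerated. Anything else left over is reported as an unterminated record.
     */
    static List<String> readRecords(Reader reader, char terminator) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c == terminator) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append((char) c);
            }
        }
        if (!current.toString().trim().isEmpty()) {
            throw new IOException("Unterminated trailing record: " + current);
        }
        return records;
    }
}
```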

    What if you want to do multi passes of the file - one to validate it (especially useful for files containing multiple record types), then one to process it?
    You could easily have multiple steps that correspond to these 'passes'.
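
    The shape of a validate-then-process job can be sketched without the framework: open the source once per pass, and only start the second pass if the first one succeeds. Everything here is an assumption for illustration, including the TwoPassJob name and the record layout (detail lines start with D, the trailer is T followed by the detail count).

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical two-pass job: pass 1 validates the file, pass 2 processes it. */
class TwoPassJob {

    /** Pass 1: the trailer ("T" + count) must match the number of detail ("D") records. */
    static void validate(Stream<String> lines) {
        long details = 0;
        long declared = -1; // stays -1 if the file has no trailer
        Iterator<String> it = lines.iterator();
        while (it.hasNext()) {
            String line = it.next();
            if (line.startsWith("D")) {
                details++;
            } else if (line.startsWith("T")) {
                declared = Long.parseLong(line.substring(1).trim());
            }
        }
        if (declared != details) {
            throw new IllegalStateException(
                    "Trailer declares " + declared + " detail records, found " + details);
        }
    }

    /** Pass 2: collects the detail payloads; stands in for the real processing. */
    static List<String> process(Stream<String> lines) {
        return lines.filter(line -> line.startsWith("D"))
                .map(line -> line.substring(1))
                .collect(Collectors.toList());
    }

    /** Opens the source once per pass; the second pass only runs if validation passed. */
    static List<String> run(Supplier<Stream<String>> source) {
        try (Stream<String> firstPass = source.get()) {
            validate(firstPass);
        }
        try (Stream<String> secondPass = source.get()) {
            return process(secondPass);
        }
    }
}
```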

  5. #5
    Join Date
    Apr 2005
    Location
    New York
    Posts
    35

    Robert and Lucas,

    Thanks for all the info - that's very helpful. I'll start to try out your suggestions a bit tonight and will try to open that Jira ticket for the optional/required fields in the next day or two.

    As far as Lucas's question regarding:

    There is a setter for the delimiter type in the DelimitedTokenizer, however, it will be used for every field in the file. I'm curious what the use-case would be for having multiple delimiters per file?
    I guess a use case would be a file where we use pipe delimiters, except for a few fields in the record that might themselves contain pipes. It's not a great example, since one could argue that we should pick a delimiter like x00 that is guaranteed never to occur in any of our fields, but unfortunately we're limited to the characters that the system producing the data file can generate. I've also used the batch processing tool Ab Initio (look it up on Wikipedia if you're not familiar with it), and it can handle a different delimiter character for each field.

    I'll let you know how I make out.
    Tony Falabella

  6. #6
    Join Date
    Dec 2006
    Posts
    1,061

    Interesting, I've seen a lot of projects pick pipe over comma delimited because of the likelihood of commas being part of the data, but usually pipes are relatively safe. It could be added to the Tokenizer, but it seems like a minority use case and probably out of scope for Release 1. However, please add it to JIRA, and if a lot of others need the feature, it could be moved up.

    Also, if a reliable delimiter can't be chosen, is using a fixed-length format a possibility?

  7. #7
    Join Date
    Aug 2006
    Location
    Now Germany, previously Ukraine
    Posts
    1,546

    Originally posted by lucasward:
    Interesting, I've seen a lot of projects pick pipe over comma delimited because of the likelihood of commas being part of the data, but usually pipes are relatively safe. It could be added to the Tokenizer, but it seems like a minority use case and probably out of scope for Release 1. However, please add it to JIRA, and if a lot of others need the feature, it could be moved up.

    Also, if a reliable delimiter can't be chosen, is using a fixed-length format a possibility?
    Then why not support escaping, like "\\" in Java (i.e. a single delimiter is a delimiter, and a doubled delimiter is the literal value of a single delimiter)? Normally it is not complicated to double the delimiters inside fields on output, certainly no more complicated than using different delimiters for different fields. And this solution is 100% safe.

    Regards,
    Oleksandr

  8. #8
    Join Date
    Jun 2005
    Posts
    4,230

    N.B. The DelimitedLineTokenizer adopts the Microsoft-inspired convention that a field containing a delimiter (line or field delimiter) can be escaped by quoting it. Inside such a field a quote character is escaped by repeating it. This is what you get from Excel (for instance) when you do Save As... -> CSV, so it covers a large constituency already. The Javadocs mention this behaviour in the setter for the quote character (which defaults to ").
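
    The convention described above can be illustrated with a small stand-alone parser. This is a sketch of the convention only, not the DelimitedLineTokenizer's source; QuotedFieldParser is a hypothetical name, and it handles a single line, so quoted fields that span line breaks are out of scope here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for the quote-to-escape convention on a single line. */
class QuotedFieldParser {

    /** Splits on delimiter, except inside quotes; a doubled quote inside quotes is a literal quote. */
    static List<String> parse(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote && i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    field.append(quote); // doubled quote -> literal quote
                    i++;
                } else if (c == quote) {
                    inQuotes = false;    // closing quote
                } else {
                    field.append(c);     // delimiters are plain data inside quotes
                }
            } else if (c == quote) {
                inQuotes = true;         // opening quote
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unterminated quoted field: " + line);
        }
        fields.add(field.toString());
        return fields;
    }
}
```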

  9. #9
    Join Date
    Aug 2006
    Location
    Now Germany, previously Ukraine
    Posts
    1,546

    It is good that this convention is supported, but it seems slightly overcomplicated: simply doubling the delimiter is easier to produce and to parse, and it should give (marginally) better performance, which does no harm in batch applications. Also, processing a quoted string requires virtually unlimited look-ahead (especially if the file being processed is malformed), whereas simple delimiter doubling requires only single-character look-ahead and is much safer in this respect.

    So it is quite reasonable to support such a strategy as well. In any case, it is not very likely (though still possible) that CSV files for batch processing would be created by Excel, as Excel is mostly an interactive tool.
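
    A minimal sketch of the doubling strategy, to show the single-character look-ahead. DoubledDelimiterParser is a hypothetical class, not something the framework provides. One assumption worth stating: under this convention two adjacent delimiters always mean a literal delimiter, so an empty field between two delimiters cannot be expressed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser: a doubled delimiter is a literal delimiter, a single one separates fields. */
class DoubledDelimiterParser {

    static List<String> parse(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c != delimiter) {
                field.append(c);
            } else if (i + 1 < line.length() && line.charAt(i + 1) == delimiter) {
                field.append(delimiter); // doubled delimiter -> literal delimiter
                i++;                     // consume the second character of the pair
            } else {
                fields.add(field.toString()); // single delimiter -> field boundary
                field.setLength(0);
            }
        }
        fields.add(field.toString());
        return fields;
    }
}
```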

    Regards,
    Oleksandr


  10. #10
    Join Date
    Dec 2006
    Posts
    1,061

    One thing I would also like to point out is that delimiters are set for an entire file: there is a single setter for the delimiter. Setting a delimiter per field would require significantly more configuration than there is currently, for very little value. At a minimum, if this feature is needed, it would have to be a separate tokenizer altogether, so that the more common use case stays easy to configure. That said, I still don't understand what a per-field delimiter would add that couldn't more easily be accommodated by fixed-length formatting.
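
    For comparison, the fixed-length alternative needs no delimiter at all, so field content can include any character. The sketch below is plain Java with a hypothetical FixedLengthParser class and an invented column layout; it is not the FixedLengthTokenizer's configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser: extracts fields by column position instead of by delimiter. */
class FixedLengthParser {

    /**
     * @param ranges inclusive, 1-based {start, end} column pairs, one per field
     * @return the trimmed content of each column range
     */
    static List<String> parse(String line, int[][] ranges) {
        List<String> fields = new ArrayList<>();
        for (int[] range : ranges) {
            int start = range[0] - 1; // convert to a 0-based, end-exclusive substring range
            int end = range[1];
            if (start < 0 || start >= end || end > line.length()) {
                throw new IllegalArgumentException(
                        "Range " + range[0] + "-" + range[1] + " does not fit line of length "
                                + line.length());
            }
            fields.add(line.substring(start, end).trim());
        }
        return fields;
    }
}
```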
