Oct 1st, 2007, 02:39 AM
InputSource vs. ItemProvider
Currently there are two "input" interfaces: InputSource and ItemProvider. Often the ItemProvider is nothing more than a trivial wrapper around an InputSource.
Wouldn't it make sense to remove the InputSource interface completely and refactor all existing InputSources to implement ItemProvider? Or is the distinction important?
I think conforming to a single interface would remove boilerplate from typical configurations, which currently use the InputSourceItemProvider, and wouldn't hurt flexibility: custom ItemProviders can wrap the standard ones. There would also be no need to wonder what makes one class an ItemProvider and another an InputSource. I guess the difference is not really clear.
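To make the boilerplate concrete, here is a minimal sketch of such a wrapper. The two interfaces are simplified stand-ins, not the framework's real ones: `read()` returning null at end of input is an assumption, and `DelegatingItemProvider` and `ListInputSource` are hypothetical names used only for illustration.

```java
import java.util.Iterator;
import java.util.List;

// Simplified stand-in; the read() signature is an assumption.
interface InputSource {
    Object read(); // assumed to return null when the input is exhausted
}

// Simplified stand-in for the provider interface.
interface ItemProvider {
    Object next(); // assumed to return null when there are no more items
}

// The "trivial wrapper": every call is passed through unchanged.
class DelegatingItemProvider implements ItemProvider {
    private final InputSource source;

    DelegatingItemProvider(InputSource source) {
        this.source = source;
    }

    @Override
    public Object next() {
        return source.read();
    }
}

// Hypothetical in-memory source, present only to exercise the wrapper.
class ListInputSource implements InputSource {
    private final Iterator<String> lines;

    ListInputSource(List<String> lines) {
        this.lines = lines.iterator();
    }

    @Override
    public Object read() {
        return lines.hasNext() ? lines.next() : null;
    }
}
```

If the source implemented the provider interface directly, the middle class (and its bean definition in the job configuration) would simply go away.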
Just to note, the situation is similar for ItemProcessor and OutputSource.
Oct 2nd, 2007, 03:42 PM
I have been having some of the same thoughts lately, especially considering BATCH-140:
With this fix, the FileInputSource will no longer return a field set that needs to be mapped, but will return a mapped object. Once this is done, all input sources will return an object, which is the same as what ItemProvider.next() returns.
With the similarities, I can understand some confusion as to why two interfaces are needed. And perhaps they should be combined. However, I have a couple issues with just using ItemProvider or ItemProcessor and ditching InputSource and OutputSource:
My first issue is the scenario where there needs to be business logic in between reading and writing. For example, you might read in a line from a file and use its data to read in other information from, say, the database. The same scenario would exist for business logic that should be applied before writing the record out. This could be accomplished with CompositeItemProvider, assuming we're okay with recommending that pattern to developers. We would probably need good documentation, but that's not necessarily a bad thing. The same would be true for processing, with a composite ItemProcessor.
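A rough sketch of what that composite pattern might look like on the input side. This is not the framework's CompositeItemProvider: `EnrichingItemProvider` and `CustomerLookup` are hypothetical names, the lookup stands in for a database query, and `ItemProvider` is a simplified stand-in interface.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the provider interface.
interface ItemProvider {
    Object next(); // assumed to return null when there are no more items
}

// Hypothetical lookup standing in for a database query.
interface CustomerLookup {
    String findName(String customerId);
}

// Hypothetical composite: pulls a raw item from a delegate provider,
// then uses it to fetch additional data before handing the item on.
class EnrichingItemProvider implements ItemProvider {
    private final ItemProvider delegate;
    private final CustomerLookup lookup;

    EnrichingItemProvider(ItemProvider delegate, CustomerLookup lookup) {
        this.delegate = delegate;
        this.lookup = lookup;
    }

    @Override
    public Object next() {
        Object raw = delegate.next();
        if (raw == null) {
            return null; // end of input propagates unchanged
        }
        String id = (String) raw; // assumes the delegate yields customer ids
        Map<String, String> item = new HashMap<>();
        item.put("id", id);
        item.put("name", lookup.findName(id));
        return item;
    }
}
```

The business logic lives in the outer provider, so the inner one stays a plain source of input, which is roughly the split the two interfaces express today.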
The second issue is one of semantics. Having a FlatFileItemProvider that is essentially what the FlatFileInputSource is seems like it might be confusing. This is because input source is saying something about what the class is: the source of input. Whereas ItemProvider is much more generic. It seems to me that users of the framework might be confused about what an ItemProvider actually is for this reason.
What I think it really boils down to is: do we want to have two explicit concepts in the framework, one for providing input, and another for taking that input, validating it based on business requirements, and gathering any other data? Or do we want one concept that can expand to mean something similar, but not quite the same, that is, composite ItemProviders or ItemProcessors? It perhaps doesn't seem like a big issue on input, but with output, there will almost always be business logic before actually writing the record.
Personally, I feel like InputSource/OutputSource and ItemProvider/ItemProcessor are two separate batch domain concepts that I would like to keep separated. However, there is a lot of overlap, and I'm not sure I feel really strongly one way or the other. Perhaps a renaming of ItemProvider/ItemProcessor could help?
Oct 3rd, 2007, 10:25 AM
User Entry Point
There is an important distinction between the two patterns:
An ItemProvider just gives you objects one at a time, as would an iterator - it does not provide any background behavior or mechanisms for resource parsing.
An InputSource is designed to allow you to manually control (via the creation of Tokenizers, etc) the way in which a particular item is created from a resource.
I personally don't use the InputSource pattern in any of my jobs so far. I am dealing mainly with parsing data from legacy COBOL systems, and there is no easy way to write a Tokenizer for that, so I simply wrote an ItemProvider that gives me one line of a file at a time as a String, and an ItemProcessor that handles the conversion of Strings to objects, etc. I am currently of the opinion that the ItemProviderProcessTasklet (or the Restartable version thereof) is the most practical and useful pattern for most of an end-user's needs.
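Roughly, that approach could look like the sketch below. It is illustrative only: the fixed-width layout (a 6-character id followed by the owner name), the `Account` class, and both implementation class names are made up, and the two interfaces are simplified stand-ins with assumed signatures.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins; signatures are assumptions.
interface ItemProvider {
    Object next(); // assumed to return null at end of input
}

interface ItemProcessor {
    void process(Object item);
}

// Hands back one line of the underlying resource at a time, as a String.
class LineItemProvider implements ItemProvider {
    private final BufferedReader reader;

    LineItemProvider(Reader reader) {
        this.reader = new BufferedReader(reader);
    }

    @Override
    public Object next() {
        try {
            return reader.readLine(); // null once the reader is exhausted
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical domain object.
class Account {
    final String id;
    final String owner;

    Account(String id, String owner) {
        this.id = id;
        this.owner = owner;
    }
}

// Converts each fixed-width line into an Account and collects the results.
// The layout (columns 0-5 = id, remainder = owner) is invented for this sketch.
class FixedWidthAccountProcessor implements ItemProcessor {
    final List<Account> accounts = new ArrayList<>();

    @Override
    public void process(Object item) {
        String line = (String) item;
        String id = line.substring(0, 6).trim();
        String owner = line.substring(6).trim();
        accounts.add(new Account(id, owner));
    }
}
```

A tasklet would then just loop: call `next()` until it returns null and pass each line to `process()`.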
I think that as a matter of best practices, it is important to distinguish between infrastructure pieces that should not be touched by users and user entry-point classes. It makes a lot more sense to make the ItemProvider/ItemProcessor paradigm the preferred entry point for users, as it is the least restrictive pattern and does not tie the user to the concept of a 'resource' as the source for data or the target (hear that, Lucas? TARGET) for output.
To that end, I would say that the InputSource/OutputSource paradigms are valuable enough that they should not disappear, but should perhaps become infrastructural components, wrapped for end-users in framework-provided classes (i.e. "InputSourceItemProvider" and "OutputSourceItemProcessor"). These could then be used to perform custom tokenization and field mapping, while still maintaining a single, simple yet flexible design pattern for users to apply when designing jobs.
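On the output side, a wrapper in that spirit might look something like the following. It is only a sketch under assumptions: `FormattingOutputProcessor` and `ListOutputSource` are hypothetical names rather than framework classes, the `write()` and `process()` signatures are assumed, and the formatter stands in for whatever field mapping the user needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Simplified stand-ins; signatures are assumptions.
interface OutputSource {
    void write(Object item);
}

interface ItemProcessor {
    void process(Object item);
}

// Hypothetical wrapper: the user only sees an ItemProcessor, while the
// OutputSource stays behind it as an infrastructure detail.
class FormattingOutputProcessor implements ItemProcessor {
    private final Function<Object, String> formatter;
    private final OutputSource target;

    FormattingOutputProcessor(Function<Object, String> formatter, OutputSource target) {
        this.formatter = formatter;
        this.target = target;
    }

    @Override
    public void process(Object item) {
        // Apply the user-supplied mapping, then hand the result to the target.
        target.write(formatter.apply(item));
    }
}

// Hypothetical in-memory target, present only to exercise the wrapper.
class ListOutputSource implements OutputSource {
    final List<Object> written = new ArrayList<>();

    @Override
    public void write(Object item) {
        written.add(item);
    }
}
```

The user-facing contract stays a single `process()` call, and the target can be swapped without touching job code.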
Whether or not the "ReadProcessTasklet" remains useful in this case is another topic with which I won't pollute this thread.