Apr 12th, 2008, 09:09 AM
File sorting in batch
In our COBOL mainframe program, which we are migrating to Java, there is a batch job which reads an input file (in text format) and then sorts it in a particular order (like ORDER BY in an SQL query).
Sometimes it orders by the first three characters, then by characters 24-25, and then by characters 50-55 of each line (multiple-column ordering), so the whole file is reordered into a particular format.
I do not know whether this is a Spring Batch related question, but I wanted to check whether the gurus out there are aware of a way to do this using Spring Batch. This needs to be a separate job, so it has to be written in Spring Batch.
Or is the best mechanism to load each record in the file into a database, fire an ORDER BY query, and write the result to a new file? I am worried about the performance here.
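For illustration, the fixed-position, multi-key ordering described above can be sketched in plain Java with a chained Comparator. This is only a sketch: the substring offsets below are assumptions derived from the 1-based column positions mentioned in the post (characters 1-3, 24-25, 50-55), and would need to match the real record layout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FixedColumnSort {
    // Compare by characters 1-3, then 24-25, then 50-55 (1-based positions,
    // so 0-based substring indices 0-3, 23-25, 49-55). Assumed layout.
    static final Comparator<String> BY_COLUMNS =
            Comparator.comparing((String s) -> s.substring(0, 3))
                      .thenComparing(s -> s.substring(23, 25))
                      .thenComparing(s -> s.substring(49, 55));

    public static void main(String[] args) {
        // Synthetic 55-character records: key1 + filler + key2 + filler + key3.
        List<String> lines = new ArrayList<>(List.of(
            "BBB" + " ".repeat(20) + "02" + " ".repeat(24) + "ZZZZZZ",
            "AAA" + " ".repeat(20) + "02" + " ".repeat(24) + "ZZZZZZ",
            "AAA" + " ".repeat(20) + "01" + " ".repeat(24) + "ZZZZZZ"));
        lines.sort(BY_COLUMNS);
        lines.forEach(System.out::println);
    }
}
```

This handles the ordering itself, but not the memory question for large files, which is the real issue discussed below.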
Apr 12th, 2008, 09:03 PM
Before diving into possible solutions, I'm extremely curious as to what the use case is that requires you to order the flat file?
Apr 12th, 2008, 10:07 PM
I do not have a good answer to that question! Our current batch programmers (mainframe/COBOL) do all the sorting and deletion of records in the file itself, without any database interaction.
As far as I know, the best way to do it is to read all the data into a database and then do the processing, but I'm curious whether there are any possibilities to do it in the file itself; are any fast utilities or methods available?
If not, then I don't believe it's worth spending much time on it; better to just load the data into the database.
Apr 12th, 2008, 11:35 PM
Loading into the database first is certainly the easiest solution, especially in Spring Batch, since you could take advantage of its ItemReaders and ItemWriters. You'd hardly have to write any code (assuming you don't have any business logic to apply, which it sounds like you don't).
However, it depends on the size of the file and where this process fits in your batch solution. If the file is reasonably small (i.e. not 40 gigs), it wouldn't really matter that you've put the data in a database first, even if you're immediately going back to a file; it would probably be fast enough regardless. Even for a larger file, it depends on how the result is going to be used. If the file is just sorted and then uploaded to another system, you might even think about using something like the unix sort command and uploading the result. However, if sorting the file is the first step in a larger 'stream' of jobs that work off the same file, you might gain an advantage by loading all the data into the database and working on it completely from there (assuming you're able to rewrite the other jobs as well).
Sorry to answer your question with 'it depends'. There isn't any file sorting utility in Spring Batch, or in Java in general, that I'm aware of. I personally like to get data into the database as fast as possible. In my experience supporting batch applications, operations that work with files tend to be the most brittle: the majority of the time that I was paged at 3 in the morning, it was because some job dealing with files had an error. Still, it depends on the situation, and at times you have to be pragmatic.
Apr 13th, 2008, 09:17 AM
Thanks a lot for your input! Since we will be doing a lot more with the data being read, I believe I will go with reading the file data into the database first and then processing it.
If that turns out to be a huge performance hit, we might consider other alternatives.
I completely agree with your point that updating data in the file while we read it would be a very delicate and brittle operation, and it is better not to get called at 3:00 AM.
Thanks again for your help!
Apr 13th, 2008, 05:23 PM
Sorting and filtering files
For what it's worth, I have found that sorting and filtering files can speed up the process even if you choose to stage the data into database tables for processing. I know of one project that used syncsort - http://www.syncsort.com/products/ss/home.htm - and another that I was on simply used the unix sort facilities. However, I don't believe the unix sort utility will filter records out of the dataset, which would make syncsort the better option. Regardless, sorting the files before loading them into the database significantly improved performance.
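If syncsort isn't available, a filter-plus-sort pre-pass can also be sketched in plain Java before the database load. This is a minimal, hypothetical example: the "first character is 'D' means deleted" rule and the 3-character sort key are made up for illustration and are not from the original posts.

```java
import java.util.Comparator;
import java.util.List;

public class FilterAndSort {
    // Hypothetical rule: records whose first character is 'D' are deletions
    // and are dropped before the sort/load; remaining records are sorted
    // by an assumed 3-character key at the start of each line.
    static List<String> filterAndSort(List<String> lines) {
        return lines.stream()
                .filter(l -> !l.startsWith("D"))
                .sorted(Comparator.comparing(l -> l.substring(0, 3)))
                .toList();
    }

    public static void main(String[] args) {
        List<String> in = List.of("BBB record", "D   deleted", "AAA record");
        filterAndSort(in).forEach(System.out::println);
    }
}
```

For genuinely large files this would still need to stream from disk rather than hold a List in memory, but it shows the filter-before-load idea.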
Apr 17th, 2008, 02:12 AM
We had a similar concern about file sorting, and we have to cope with the fact that the host can sort 8 million records (say, 300GB) in about a minute.
Assuming we try to find an approach to solving this problem, we are more than certain we'd have to engage in parallel processing of the file. We're not actually dealing with it (yet), though.
Nov 30th, 2012, 09:52 AM
Did anyone find a Spring Batch based solution for the sorting? I have a similar sort that needs to be done on the fly to checksum and compare two files.
Nov 30th, 2012, 10:23 AM
In my past experience, when we had to sort a file (which we avoided as best we could using a number of techniques), the syncsort product mentioned above was the best option. Otherwise, the easiest approach in a purely Spring Batch solution would be to import the file into a database table and then generate a new file from it.
Obviously you could write a tasklet that sorts a file, but you'd then be responsible for reading the entire file into memory, sorting it, and so on.
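For anyone who does need such a tasklet but can't fit the file in memory, the classic technique is an external merge sort: sort fixed-size chunks, spill each sorted "run" to a temp file, then k-way merge the runs with a priority queue. The sketch below is self-contained plain Java with made-up names, not Spring Batch specific; a Tasklet could simply delegate to a method like sortFile.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    // Sort 'input' into 'output', holding at most maxLines lines in memory.
    static void sortFile(Path input, Path output, int maxLines,
                         Comparator<String> cmp) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader r = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = r.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= maxLines) runs.add(spill(chunk, cmp));
            }
            if (!chunk.isEmpty()) runs.add(spill(chunk, cmp));
        }
        merge(runs, output, cmp);
    }

    // Sort one in-memory chunk and write it to a temp file (a sorted "run").
    static Path spill(List<String> chunk, Comparator<String> cmp) throws IOException {
        chunk.sort(cmp);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, chunk);
        chunk.clear();
        return run;
    }

    // K-way merge: keep the current head line of each run in a priority queue.
    static void merge(List<Path> runs, Path output, Comparator<String> cmp) throws IOException {
        record Head(String line, BufferedReader reader) {}
        PriorityQueue<Head> pq = new PriorityQueue<>((a, b) -> cmp.compare(a.line(), b.line()));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String first = r.readLine();
            if (first != null) pq.add(new Head(first, r));
        }
        try (BufferedWriter w = Files.newBufferedWriter(output)) {
            while (!pq.isEmpty()) {
                Head h = pq.poll();
                w.write(h.line());
                w.newLine();
                String next = h.reader().readLine();
                if (next != null) pq.add(new Head(next, h.reader()));
                else h.reader().close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("in", ".txt");
        Files.write(in, List.of("CCC", "AAA", "DDD", "BBB"));
        Path out = Files.createTempFile("out", ".txt");
        sortFile(in, out, 2, Comparator.naturalOrder());
        Files.readAllLines(out).forEach(System.out::println);
    }
}
```

Memory use is bounded by maxLines plus one buffered reader per run, which is essentially what utilities like unix sort and syncsort do under the hood.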
Jan 17th, 2013, 02:17 PM
I'm trying to get my head around how I might use Spring Batch to replace Informatica, a GUI-centric Extract/Transform/Load tool commonly used to load a data warehouse. In Informatica, a mapping contains one or more sources (similar to ItemReaders) and one or more targets (similar to ItemWriters). In between the source(s) and target(s) are zero or more transformations, similar to ItemProcessors. Common transformations include sorting, aggregating, joining, filtering, lookup, and routing (there are a number of others). I see how an ItemProcessor could be used to implement filtering and routing. A lookup transformation (given a natural key, go find the dimension key) is also pretty straightforward, using a key/value DB.
Originally Posted by mminella
I'm struggling with how Spring Batch would model the transformations that have to operate on sets of data, such as sorting and aggregating. The suggested solution is to send the set to the database, do the sort and write the results to a flat file. That sure seems like a lot of network traffic, especially when the sets contain millions of records. Would it be a better use of Spring Batch just to get all of the sources into the DB and then write some stored procedures to do the Transformations and load the target tables? This way, once the data is in the DB, it stays in the DB.
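To make the set-versus-record distinction concrete, here is a minimal aggregation sketch in plain Java (the Sale record and region/amount fields are illustrative, not from the posts). It is the in-JVM equivalent of SELECT region, SUM(amount) FROM sales GROUP BY region, and it shows why such a transformation can't be a per-item ItemProcessor: it needs the whole set in hand, which is exactly why pushing it into the database can be attractive at warehouse volumes.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregateExample {
    // Illustrative record type; in a real job this would be the staged row.
    record Sale(String region, int amount) {}

    // Aggregate total amount per region: a set-level operation that must
    // see every record before it can emit any output.
    static Map<String, Integer> totalsByRegion(List<Sale> sales) {
        return sales.stream().collect(
            Collectors.groupingBy(Sale::region,
                                  Collectors.summingInt(Sale::amount)));
    }

    public static void main(String[] args) {
        List<Sale> sales = List.of(
            new Sale("EAST", 100), new Sale("WEST", 50), new Sale("EAST", 25));
        System.out.println(totalsByRegion(sales));
    }
}
```

Doing the same thing with GROUP BY in a stored procedure keeps the millions of rows inside the database, which is the trade-off the post is weighing.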