May 14th, 2011, 08:30 PM
Middleware vs. multi-threaded approach
I am writing a one-shot migration tool to transfer critical data from a legacy system. The tool will process around 2 million files, roughly 50 TB in total.
Because the anticipated execution time is large (several months, based on prior migration exercises), I am hoping to use some sort of chunked processing. The Spring Batch FAQ indicates that using middleware (JMS, etc.) is highly beneficial, but I'm not sure if it's overkill in my situation and whether a multi-threaded approach would do just as well.
The migration tool does minimal data processing; the bottleneck is anticipated to be retrieving the data from the legacy system and having it picked up by the new system. To that end, the tool will run on the same host as the legacy system's disks.
Therefore, I anticipate that the physically distributed processing afforded by middleware would hinder rather than help. If all consumers will be running on the same box, is there any benefit to this approach? I have no practical experience with JMS and the like, so I'm not sure whether there is other robust functionality I would be missing out on by using Spring Batch alone.
May 15th, 2011, 02:20 AM
Well, it all depends on where your known bottleneck is. If you think a single machine is enough for your needs, configure a partitioned step with a task executor and have a set of threads run in parallel.
If you need to scale to more than one machine, more solutions are available.
But if your job can run on a single one, don't bother considering those techniques for now.
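To make the single-machine option concrete, here is a minimal sketch of the idea in plain `java.util.concurrent` rather than Spring Batch's own `Partitioner`/`TaskExecutor` API: the file list is split into slices and each slice is handed to a worker thread. The class and method names (`PartitionedMigration`, `migrate`, the commented-out `copyToNewSystem`) are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: partition the file list and process each slice on
// its own thread, mirroring what a Spring Batch partitioned step with a
// thread-pool task executor does on a single box.
public class PartitionedMigration {

    // Split the work items round-robin into `partitions` roughly equal slices.
    static List<List<String>> partition(List<String> files, int partitions) {
        List<List<String>> slices = new ArrayList<>();
        for (int i = 0; i < partitions; i++) slices.add(new ArrayList<>());
        for (int i = 0; i < files.size(); i++) {
            slices.get(i % partitions).add(files.get(i));
        }
        return slices;
    }

    // Process every partition concurrently; returns the total number of
    // files handled so the caller can verify nothing was dropped.
    static int migrate(List<String> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> slice : partition(files, threads)) {
                results.add(pool.submit(() -> {
                    int done = 0;
                    for (String file : slice) {
                        // copyToNewSystem(file) would go here (hypothetical)
                        done++;
                    }
                    return done;
                }));
            }
            int total = 0;
            for (Future<Integer> r : results) total += r.get();
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 1000; i++) files.add("file-" + i + ".dat");
        System.out.println("migrated " + migrate(files, 8) + " files");
    }
}
```

In Spring Batch terms, the equivalent is a partitioned step whose partition handler is given a `TaskExecutor` backed by a thread pool; restartability and chunk-level commit/restart metadata then come from the framework rather than hand-rolled code like the above.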
May 15th, 2011, 04:00 AM
Thanks Stéphane; much appreciated. I thought that was the case but just wanted to clarify, as both Spring Batch and the whole notion of middleware are new territory for me.