Thread: Parallel downloading of files from FTP servers

  1. #1

    Parallel downloading of files from FTP servers

    Hi,
    I am new to Spring Integration. We currently use commons-vfs to download files from FTP servers, and sometimes we need to download more than 20 files. Right now the files are downloaded sequentially, in a single thread. Does Spring Integration provide any feature for downloading them in multiple threads? That would drastically reduce the time taken by our batch jobs.

    regards,
    Ram

  2. #2
    Join Date
    May 2007
    Location
    Netherlands
    Posts
    614

    If you configure concurrency for the poller on an FtpSource, this should work. Each receive() call blocks until a download is finished and then gives you a Message<File>. There is no reason why this shouldn't work if you call receive() from multiple threads. Please try it out and let me know how it goes. We're currently busy refactoring FtpSource, so your input might come at just the right time.
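
To make the idea of calling receive() from several threads concrete, here is a minimal sketch in plain Java. It does not use the Spring Integration API: `ToyFileSource`, its `receive()` method, and `drain` are hypothetical stand-ins, and the download is simulated with a short sleep. The only point is that several workers can block on the same source as long as each file is handed to exactly one caller.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a source whose receive() blocks until one download completes. */
class ToyFileSource {
    private final Queue<String> remoteFiles;

    ToyFileSource(List<String> remoteFiles) {
        this.remoteFiles = new ConcurrentLinkedQueue<>(remoteFiles);
    }

    /** Returns the next "downloaded" file name, or null when nothing is left. */
    String receive() {
        String name = remoteFiles.poll();   // atomic hand-off: each file goes to one caller only
        if (name == null) {
            return null;
        }
        try {
            Thread.sleep(50);               // simulated blocking download (assumption)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return name;
    }
}

class ConcurrentReceiveSketch {
    /** Drains the source using the given number of worker threads. */
    static List<String> drain(ToyFileSource source, int workers) {
        List<String> received = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String file;
                while ((file = source.receive()) != null) {
                    received.add(file);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        ToyFileSource source = new ToyFileSource(List.of("a.csv", "b.csv", "c.csv", "d.csv"));
        System.out.println(drain(source, 4));   // order depends on thread scheduling
    }
}
```

With four workers and four files the simulated downloads overlap instead of running one after another.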

  3. #3
    Join Date
    Jul 2008
    Posts
    15

    I have a similar use case, where about 50 FTP servers are polled in parallel. I read all the required FTP servers from a database and configure them dynamically. So far I have only tested 2 in parallel, which works fine.
    However, every FTPSource processes only 1 file per poll, so the poll interval has to be small, and a new connection is established for every file. (To me this smells like a bug; I expected the source to keep polling until no more messages are processed. I'll open a JIRA entry for this.)

    Regards,
    Maarten

  4. #4
    Join Date
    May 2007
    Location
    Netherlands
    Posts
    614

    Funny you mention this: my changes (which have not been checked in yet) actually remove the disconnect from poll so that the connection stays open. We have been discussing using a connection pool of some sort.

    Does it make sense to always close the connection if there are no more messages? This would still cause overhead when polling an empty FTP directory though.

    I'd love to hear your thoughts on this.

  5. #5
    Join Date
    Jul 2008
    Posts
    15

    If all remaining messages are polled on every poll interval, the polling interval of an FTP source will, in the usual use case, probably be much longer than it is now (e.g. 1 hour or even more). So the connection should be closed when there are no more messages. (Of course, this could be configurable behaviour.)
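
A rough sketch of that "drain everything, then close" behaviour, in plain Java. `ToyConnection`, `fetchNext()` and `pollOnce` are made-up names for illustration and are not part of FtpSource; a real connection would wrap an FTP client.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical connection; a real one would wrap an FTP client. */
class ToyConnection {
    private final Deque<String> pending;
    private boolean open = true;

    ToyConnection(List<String> pending) {
        this.pending = new ArrayDeque<>(pending);
    }

    /** Returns the next file name, or null when the remote directory is empty. */
    String fetchNext() {
        return pending.poll();
    }

    void close() {
        open = false;
    }

    boolean isOpen() {
        return open;
    }
}

class DrainThenCloseSketch {
    /** One poll: fetch every remaining file, then close the connection, even if a fetch fails. */
    static List<String> pollOnce(ToyConnection connection) {
        List<String> fetched = new ArrayList<>();
        try {
            String file;
            while ((file = connection.fetchNext()) != null) {
                fetched.add(file);
            }
        } finally {
            connection.close();   // nothing left to do until the next (long) poll interval
        }
        return fetched;
    }
}
```

Whether to close at the end of each poll could be a flag on the source, as suggested above; the sketch hard-codes the closing variant.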

    It would be very good to have some control over the maximum number of open FTP connections, e.g. by using something like an FTP connection pool. In our use case we may configure about 50 FTP sources that all connect to the same FTP server with different usernames/passwords. At some point a firewall could interpret this as a denial-of-service attack, so some control over the maximum number of open FTP connections would be nice.
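
One simple way to cap the number of simultaneously open connections, short of a full connection pool, is a counting semaphore shared by all sources. The sketch below is plain Java with hypothetical names (`ConnectionLimiterSketch`, `withConnection`, `simulate`); the "connection" is just a sleeping task, and the peak counter exists only to show that the cap holds.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ConnectionLimiterSketch {
    private final Semaphore permits;
    private final AtomicInteger open = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    ConnectionLimiterSketch(int maxOpenConnections) {
        this.permits = new Semaphore(maxOpenConnections);
    }

    /** Runs the task while holding one connection permit; blocks if the cap is reached. */
    void withConnection(Runnable task) {
        permits.acquireUninterruptibly();
        try {
            int now = open.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);   // remember the highest concurrency seen
            task.run();
        } finally {
            open.decrementAndGet();
            permits.release();
        }
    }

    int peakOpenConnections() {
        return peak.get();
    }

    /** Simulates many sources polling at once and returns the peak number of open connections. */
    static int simulate(int sources, int maxOpen) {
        ConnectionLimiterSketch limiter = new ConnectionLimiterSketch(maxOpen);
        ExecutorService pool = Executors.newFixedThreadPool(sources);
        for (int i = 0; i < sources; i++) {
            pool.submit(() -> limiter.withConnection(() -> {
                try {
                    Thread.sleep(20);                // simulated transfer (assumption)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return limiter.peakOpenConnections();
    }
}
```

For example, `ConnectionLimiterSketch.simulate(50, 5)` runs 50 simulated sources but never lets more than 5 hold a connection at once.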

  6. #6
    Join Date
    May 2007
    Location
    Netherlands
    Posts
    614

    Originally posted by mdond:
        If on every poll interval all remaining messages are polled, the polling interval of a ftp source in the usual use case probably will be much bigger than it is now (e.g. something like 1 hour or even more). So the connection should be closed when there are no more messages. (Of course it could be a configurable behaviour.)
    I've modified the FtpSource to extend AbstractDirectorySource<List<File>>, meaning that it will return messages that contain lists of files. You can configure the size of the batches by setting maxFilesPerPayload on the FtpSource; the default is -1 (take all).

    There are some concerns with multithreading though.

    Originally posted by mdond:
        It would be a very good thing to have an influence on the maximal open ftp connections, e.g. by using something like a ftp connection pool. In our use case it is possible that we configure about 50 ftp sources which all connect to the same ftp server with different usernames/passwords. At some point some firewall could register this as a denial of service attack. Therefore some control on the maximal open ftp connections would be nice.
    Multithreading on the downloads doesn't work properly yet, but you can already try it out if you check out the head and build it yourself.
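
To illustrate the maxFilesPerPayload setting described above, here is a small stand-alone sketch of the batching rule, with -1 meaning "take all". It is not the FtpSource code: `PayloadBatchingSketch` and `toPayloads` are hypothetical, and file names stand in for File objects.

```java
import java.util.ArrayList;
import java.util.List;

class PayloadBatchingSketch {
    /**
     * Splits file names into payloads of at most maxFilesPerPayload entries.
     * A negative value (such as the default -1) puts all files into one payload.
     */
    static List<List<String>> toPayloads(List<String> fileNames, int maxFilesPerPayload) {
        if (maxFilesPerPayload == 0) {
            throw new IllegalArgumentException("maxFilesPerPayload must be non-zero");
        }
        List<List<String>> payloads = new ArrayList<>();
        if (fileNames.isEmpty()) {
            return payloads;
        }
        int size = maxFilesPerPayload < 0 ? fileNames.size() : maxFilesPerPayload;
        for (int from = 0; from < fileNames.size(); from += size) {
            int to = Math.min(from + size, fileNames.size());
            payloads.add(new ArrayList<>(fileNames.subList(from, to)));
        }
        return payloads;
    }
}
```

With five files and a limit of 2 this yields three payloads (2, 2 and 1 files); with -1 it yields a single payload of five.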

  7. #7
    Join Date
    Jul 2008
    Posts
    15

    Quote:
        There are some concerns with multi threading though.
    I didn't notice any multithreading problems with FTPSource. Do the concerns apply only to the new List<File> version, or also to FTPSource in 1.0M5?

    I'll have a look at the new version.

  8. #8
    Join Date
    May 2007
    Location
    Netherlands
    Posts
    614

    The multithreading problem now is that if you call receive() before the onSend relating to the previous receive() has run (if that makes sense), you'll get the same list of messages. I don't have an integration test that exposes this in real life, but it is theoretically possible from the FtpSource's point of view. You can see this if you move the onSend call in FtpSourceTests.retrieveMaxFilesPerPayload(). That is, if I have committed that in the meantime.
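
The effect can be reproduced with a toy model, shown here as a sketch. It is not the real FtpSource: `UnconfirmedSourceSketch` is hypothetical, and the `onSend(List<String>)` signature is an assumption made for illustration. The model only forgets files once onSend confirms them, so two receive() calls without an onSend in between return the same list.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy source that only forgets files once onSend confirms the previous message. */
class UnconfirmedSourceSketch {
    private final List<String> backlog;

    UnconfirmedSourceSketch(List<String> remoteFiles) {
        this.backlog = new ArrayList<>(remoteFiles);
    }

    /** Returns a copy of the backlog without removing anything from it. */
    List<String> receive() {
        return new ArrayList<>(backlog);
    }

    /** Confirms delivery; only now are the delivered files dropped from the backlog. */
    void onSend(List<String> delivered) {
        backlog.removeAll(delivered);
    }

    public static void main(String[] args) {
        UnconfirmedSourceSketch source = new UnconfirmedSourceSketch(List.of("a.txt", "b.txt"));
        List<String> first = source.receive();
        List<String> second = source.receive();   // called before onSend: same files again
        System.out.println(first.equals(second));
        source.onSend(first);
        System.out.println(source.receive().isEmpty());
    }
}
```

Even in a single thread the duplicate shows up as soon as the onSend call is moved after the second receive(), which matches the test change described above.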
    Last edited by iwein; Jul 30th, 2008 at 10:07 AM. Reason: premature posting

  9. #9
    Join Date
    Jul 2008
    Posts
    15

    Quote:
        The multithreading problem now is that if you call receive before onSend relating to the previous receive (if that makes sense) you'll get the same list of messages.
    It doesn't make sense to poll again before the last poll has finished, does it? So this shouldn't happen.

    Can it happen at the moment? SourceEndpoint.poll() isn't synchronized, but it probably doesn't have to be: presumably one thread is dedicated to polling the SourceEndpoint and won't poll again until the previous poll has finished. I hope it is implemented this way?

    I don't think it makes sense to solve this in FTPSource or SourceEndpoint. If the polling interval is too short for a poll to complete and more than one thread starts polling, the result is a growing number of threads waiting for each other, which ends in disaster anyway.
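
For comparison, this is how a dedicated polling thread behaves with the JDK scheduler. The sketch below is plain Java and says nothing about how SourceEndpoint is actually implemented; `NonOverlappingPollerSketch` and `maxConcurrentPolls` are hypothetical names. A single-threaded `ScheduledExecutorService` with `scheduleWithFixedDelay` starts the next poll only after the previous one has finished, even when the delay is much shorter than the poll itself.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class NonOverlappingPollerSketch {
    /** Runs a slow poll on a short schedule and reports the most polls ever active at once. */
    static int maxConcurrentPolls(long pollMillis, long delayMillis, long runForMillis) {
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max);
            try {
                Thread.sleep(pollMillis);            // simulated slow poll (assumption)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                active.decrementAndGet();
            }
        }, 0, delayMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(runForMillis);              // let the schedule run for a while
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdownNow();
        return maxActive.get();
    }
}
```

Because the delay is measured from the end of one run to the start of the next, polls never pile up here; the queue of waiting threads described above can only appear when several threads are allowed to poll the same source.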

  10. #10
    Join Date
    May 2007
    Location
    Netherlands
    Posts
    614

    Currently the source is intended to be used from a single thread. Since the poller is going to support concurrent scenarios, it is theoretically possible to hook up concurrent workers to the same source. This might be the desired behavior if there are many small files in the remote directory and you want to download them in parallel to optimize network usage.

    So we can't keep it single threaded in general, we need to make the receive method thread safe somehow. Reading your comment though I'm wondering if maybe just synchronising receive() would be good enough for now?
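
A minimal sketch of what synchronising receive() could look like, under the assumption that files are claimed (removed from the backlog) inside receive() itself rather than later in onSend. This is a toy model and not FtpSource: `SynchronizedSourceSketch`, `receive(int)` and `drainConcurrently` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class SynchronizedSourceSketch {
    private final Deque<String> backlog;   // not thread safe on its own

    SynchronizedSourceSketch(List<String> remoteFiles) {
        this.backlog = new ArrayDeque<>(remoteFiles);
    }

    /** Claims up to maxFiles names; synchronized so two callers never get the same file. */
    synchronized List<String> receive(int maxFiles) {
        List<String> claimed = new ArrayList<>();
        while (claimed.size() < maxFiles && !backlog.isEmpty()) {
            claimed.add(backlog.poll());
        }
        return claimed;
    }

    /** Drains a fresh source with several workers and returns everything they received. */
    static List<String> drainConcurrently(List<String> remoteFiles, int workers) {
        SynchronizedSourceSketch source = new SynchronizedSourceSketch(remoteFiles);
        List<String> received = new CopyOnWriteArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                List<String> batch;
                while (!(batch = source.receive(3)).isEmpty()) {
                    received.addAll(batch);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

The lock covers only the bookkeeping of which files are claimed. If the download itself also happened inside the synchronized method, concurrent workers would queue behind each other and the parallelism would be lost, so that trade-off would need to be decided separately.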
