I'm trying to apply Spring Integration's (SI) feed support to this problem: I need to consume an Atom feed from a remote server that has a fairly high traffic volume. It's very important that I not miss any feed entries.
Reading through the source for the feed:inbound-channel-adapter, this is how I understand the general process that occurs:
1) Read all the entries from the given feed URL into memory. This is likely to be the "current page" of an Atom feed, yielding the most recent N entries.
2) Sort the entries by lastModified (if available) or publishedDate.
3) Discard entries from the head of the list until one is found with a lastModified/publishedDate that is after the "high water mark", which is stored in the MetadataStore.
4) Put all remaining entries into the queue of entries to be received by SI.
5) When polled, return the entry at the head of the queue. When we run out of entries, go back to 1).
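To make sure I'm reading the source correctly, here is the poll cycle above as a minimal sketch. The class and member names (Entry, highWaterMark, refresh, receive) are mine, not Spring Integration's; the real adapter stores the mark in a MetadataStore rather than a field.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the adapter's poll cycle as I understand it.
public class FeedPollSketch {
    record Entry(String id, long publishedMillis) {}

    private long highWaterMark;                  // stand-in for the MetadataStore value
    private final Deque<Entry> queue = new ArrayDeque<>();

    public FeedPollSketch(long initialHighWaterMark) {
        this.highWaterMark = initialHighWaterMark;
    }

    /** Steps 1-4: take the current page, sort it, drop entries at or before the mark, enqueue the rest. */
    public void refresh(List<Entry> currentPage) {
        List<Entry> sorted = new ArrayList<>(currentPage);
        sorted.sort(Comparator.comparingLong(Entry::publishedMillis));  // step 2
        for (Entry e : sorted) {
            if (e.publishedMillis() > highWaterMark) {                  // step 3
                queue.add(e);                                           // step 4
            }
        }
    }

    /** Step 5: hand the oldest queued entry to the poller and advance the mark. */
    public Entry receive() {
        Entry e = queue.poll();
        if (e != null) {
            highWaterMark = e.publishedMillis();
        }
        return e;
    }
}
```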
Assuming my understanding is correct, I see one glaring problem. It seems highly possible, if not likely, that on a high-volume feed the page will have changed so much between subsequent invocations of step 1) that some entries will have rolled off onto the second page before ever being fetched, and SI doesn't seem to have a way to account for that.
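A toy illustration of the failure mode I'm worried about (the numbers are invented, not measured behaviour): if more entries arrive between polls than fit on the feed's current page, a timestamp high-water mark silently skips the overflow.

```java
import java.util.ArrayList;
import java.util.List;

// Entries are represented by their publish timestamps for brevity.
public class PageOverflowDemo {
    /** The newest pageSize entries, oldest first, like an Atom "current page". */
    static List<Long> currentPage(List<Long> allEntries, int pageSize) {
        int from = Math.max(0, allEntries.size() - pageSize);
        return allEntries.subList(from, allEntries.size());
    }

    /** Entries the adapter would deliver: those on the page newer than the mark. */
    static List<Long> delivered(List<Long> page, long highWaterMark) {
        List<Long> out = new ArrayList<>();
        for (long t : page) {
            if (t > highWaterMark) out.add(t);
        }
        return out;
    }
}
```

With a page size of 3, a mark of 3, and entries 1 through 8 published so far, the page only contains 6, 7, 8, so entries 4 and 5 are never delivered to anyone, and nothing in the poll cycle notices.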
Does anyone have a suggestion for how to handle this problem? I've forked the SI repo and am going to experiment with a couple of ideas of my own.
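For what it's worth, one idea I'm toying with (entirely hypothetical, not existing SI code) is to remember the ID of the last entry consumed in addition to the high-water mark: if the freshly fetched page no longer contains that ID anywhere, entries must have fallen off the page since the last poll, so a gap is at least possible and could be logged or handled specially.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical gap check: lastSeenId would be persisted alongside the
// high-water mark (e.g. in the same MetadataStore).
public class GapDetector {
    /** True when lastSeenId is absent from the page, i.e. a gap may have opened. */
    static boolean possibleGap(List<String> pageIds, String lastSeenId) {
        return lastSeenId != null && !new HashSet<>(pageIds).contains(lastSeenId);
    }
}
```

This only detects the gap rather than closing it, but detection alone would already be an improvement over silently losing entries.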