Thread: Failure recovery, total loss of a broker

  1. #1
    Join Date
    Aug 2004
    Location
    London, UK
    Posts
    339

    Failure recovery, total loss of a broker

    Hi,

    I'm testing the AMQP/Rabbit code, specifically in failover and recovery situations. My test client uses a MessageListenerContainer and a POJO to receive messages which I send to the broker cluster from a different client or host. My rabbit cluster (3 machines) is behind a load balancer that detects if a broker is up or down using a port check every 10 seconds. The cluster and LB work fine in normal situations.

    If my Spring client detects a shutdown of the broker it is connected to, then it logs the shutdown exception, reconnects (back through the same VIP on the LB) to a different cluster member and continues to receive messages - this happens very quickly.

    However, if the connection between client and cluster is simply severed (by removing a network cable, or by powering off the broker machine without any server or OS shutdown), then the client doesn't reconnect at all and simply stops receiving messages. It never seems to detect the loss of connectivity, even after the broker is restarted or the network link is restored. Only if I kill the client app and restart it will it obtain a new connection to the cluster and receive all of the messages that had built up in the cluster.

    Is this a limitation of the AMQP or Rabbit Spring code, or of the broker, or is it something that can be configured on the client side?

  2. #2
    Join Date
    Mar 2010
    Location
    Gtr Philadelphia, PA
    Posts
    2,020

    This is a classic problem with TCP connections.
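
    To illustrate why: when a cable is pulled or a machine loses power, no FIN or RST packet ever reaches the client, so a socket that is only waiting to read sees nothing wrong. The usual remedy is an application-level liveness check that presumes the peer dead after a period with no inbound traffic. Below is a minimal sketch of that idea in plain Java. The class and method names are hypothetical (they are not part of the Rabbit client or spring-amqp), and the "two silent intervals" threshold is an assumption chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical liveness tracker; not part of the Rabbit client or spring-amqp. */
final class LivenessMonitor {
    private final long timeoutMillis;
    private final AtomicLong lastSeenMillis;

    LivenessMonitor(long heartbeatIntervalMillis, long nowMillis) {
        // Assumption: give up after two full intervals with no inbound traffic.
        this.timeoutMillis = 2 * heartbeatIntervalMillis;
        this.lastSeenMillis = new AtomicLong(nowMillis);
    }

    /** Call whenever any bytes (data or a heartbeat) arrive from the peer. */
    void recordInbound(long nowMillis) {
        lastSeenMillis.set(nowMillis);
    }

    /** True once the peer has been silent for the whole timeout window. */
    boolean isPeerPresumedDead(long nowMillis) {
        return nowMillis - lastSeenMillis.get() >= timeoutMillis;
    }
}
```

    The timestamps are passed in rather than read from the system clock so the behaviour is easy to check in isolation; a real client would run a check like this on a timer and close the connection when it fires.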

    You can enable heartbeats on the underlying Rabbit ConnectionFactory...

    http://www.rabbitmq.com/javadoc/com/...Heartbeat(int)

    http://rabbitmq.1065348.n5.nabble.co...at-td1977.html
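
    As a rough sketch of what that could look like when the connection factory is built in code: the host name, port and the 10-second value below are placeholders of mine, not recommendations, and the class name is hypothetical. It assumes the Rabbit Java client's setRequestedHeartbeat(int) and the spring-amqp CachingConnectionFactory constructor that wraps an existing Rabbit ConnectionFactory.

```java
import com.rabbitmq.client.ConnectionFactory;
import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;

/** Hypothetical helper class; only the two factory types come from the libraries. */
public class HeartbeatConfig {

    /** Builds a Spring connection factory whose Rabbit connections request heartbeats. */
    public static CachingConnectionFactory connectionFactory() {
        ConnectionFactory rabbit = new ConnectionFactory();
        rabbit.setHost("rabbit-vip.example.com"); // placeholder: the load balancer's VIP
        rabbit.setPort(5672);                     // default AMQP port; adjust for your LB
        rabbit.setRequestedHeartbeat(10);         // seconds; illustrative value only
        // Wrap the configured Rabbit factory so the listener container uses it.
        return new CachingConnectionFactory(rabbit);
    }
}
```

    The value is only a request; the client and broker negotiate the interval actually used. A shorter interval detects a dead peer sooner at the cost of more idle traffic.
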
    Gary P. Russell
    Spring Integration Team
    SpringSource, a division of VMware

  3. #3
    Join Date
    Aug 2004
    Location
    London, UK
    Posts
    339

    Thanks Gary. I tried that and I get very erratic results with it; certainly nothing that would give me any confidence about using it in a production system. The heartbeat errors show up after the configured heartbeat interval, and the consumer then reconnects to a different server via the LB. But then the consumer begins to receive messages at a much slower rate than they are being produced, and only receives every second message. This is with a producer that has a stable, unbroken connection to the cluster and is sending messages at a rate of 1 every 250ms in my test.

    Only after the failed server machine is restored to the cluster does the consumer catch up with both the messages that appeared to go missing, and the production rate. Very odd, but repeatable every time I run the tests and even with a completely rebuilt cluster.

    D.
    Last edited by davison; Oct 27th, 2012 at 02:54 PM.
    Darren Davison.
    Public Key: 0xE855B3EA

  4. #4
    Join Date
    Mar 2010
    Location
    Gtr Philadelphia, PA
    Posts
    2,020

    Hmmm... can you share a DEBUG (or preferably TRACE) level log for the consumer showing the good->erratic->good transition?

    Also, given that spring-amqp is a thin layer on top of the rabbit client, this is something you might want to bring up on the rabbit list (https://lists.rabbitmq.com/cgi-bin/m...bbitmq-discuss), but I'd be happy to take a quick look at a spring-amqp log if you like.
    Gary P. Russell
    Spring Integration Team
    SpringSource, a division of VMware
