5 Replies Latest reply on Mar 3, 2014 5:04 PM by andrew.morgan

    Clustered replicated live-backup not failing back on live restart

    andrew.morgan

      I'm using JBoss EAP 6.2 (JBoss AS 7.3 and HornetQ 2.3.12.Final). I've configured a live-backup pair on two separate servers using the attached configurations. All multicast addresses have been changed to the same one, as our backend is configured to allow only one. I've confirmed multicast communication between the two servers.
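A minimal sketch of the kind of multicast check I mean (the group and port here are EAP's stock messaging defaults — adjust to your setup; run the receiver half on one server and the sender on the other, or run it as-is on a single host as a loopback smoke test):

```python
import socket
import struct

GROUP, PORT = "231.7.7.7", 9876  # EAP's default messaging-group address/port

# Receiver: bind and join the multicast group on the default interface.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
rx.settimeout(5)

# Sender: in a real two-server test this part runs on the other box;
# on one host the packet is looped back (IP_MULTICAST_LOOP is on by default).
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"multicast-check", (GROUP, PORT))

data, sender = rx.recvfrom(1024)
print(data.decode())
```

If the receiver times out, multicast is being dropped somewhere between the hosts.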

       

      The problem is the following:

      1. Start live server - no problems

      2. Start backup server - no problems, logs show replication from live to backup

      3. Kill (or shutdown) live server - no problems, logs show HornetQ starting up and JMS objects being bound to JNDI

      4. Restart live server: the backup HornetQ server does not stop; "HQ212034: There are more than one servers on the network broadcasting the same node id" messages appear in the backup server log every 2 - 5 seconds until I shut the backup down.

       

      I've looked at others' examples of working configurations and can't see anything missing or significantly different in mine.

        • 1. Re: Clustered replicated live-backup not failing back on live restart
          jbertram

          The server you're configuring as your "live" needs to have this:

           

          <check-for-live-server>true</check-for-live-server>
          

           

          See Chapter 39. High Availability and Failover.
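For context, that element goes directly inside the <hornetq-server> element of the messaging subsystem in the live server's profile, alongside the other HA settings — roughly like this (a sketch, not a complete configuration):

```xml
<hornetq-server>
    <check-for-live-server>true</check-for-live-server>
    <shared-store>false</shared-store>
    <!-- ...rest of the live server's configuration... -->
</hornetq-server>
```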

          1 of 1 people found this helpful
          • 2. Re: Re: Clustered replicated live-backup not failing back on live restart
            andrew.morgan

            Thanks for the quick response. My apologies, the uploaded file was not the only configuration I tried.

            The first attempt was as follows:

            Live

                       <hornetq-server>
                            <backup>false</backup>
                            <check-for-live-server>true</check-for-live-server>
                            <shared-store>false</shared-store>
                            <failover-on-shutdown>true</failover-on-shutdown>
                            <backup-group-name>TestingBackupGroup</backup-group-name>
            

            Backup

                        <hornetq-server>
                            <backup>true</backup>
                            <shared-store>false</shared-store>
                            <failover-on-shutdown>true</failover-on-shutdown>
                            <allow-failback>true</allow-failback>
                            <backup-group-name>TestingBackupGroup</backup-group-name>
            

             

            I've also tried with <allow-failback>true</allow-failback> on both nodes (as well as <check-for-live-server> on both).

            • 3. Re: Re: Re: Clustered replicated live-backup not failing back on live restart
              jbertram

              As a sanity test, I checked this myself just now using EAP 6.2.  I created 2 local instances starting with standalone-full-ha.xml on both.  Here's what I added or modified:

               

              Live:

                              <shared-store>false</shared-store>
                              <check-for-live-server>true</check-for-live-server>
                              <cluster-password>${jboss.messaging.cluster.password:secret}</cluster-password>
              

               

              Backup:

                              <shared-store>false</shared-store>
                              <backup>true</backup>
                              <allow-failback>true</allow-failback>
                              <max-saved-replicated-journal-size>10</max-saved-replicated-journal-size>
                              <cluster-password>${jboss.messaging.cluster.password:secret}</cluster-password>
              

               

              Then I:

              1. Started the live.
              2. Started the backup.
              3. Killed the live (e.g. via kill -9 <pid>).
              4. Observed the backup take over.
              5. Started the live.
              6. Observed the backup cede to the live and become a backup again.

               

              I repeated steps 3-6 several times.  I did not observe any problems.

               

              Do these steps work for you?

              • 4. Re: Re: Re: Clustered replicated live-backup not failing back on live restart
                andrew.morgan

                Yes, that works perfectly, both running two instances on my workstation (starting without a -b parameter) and running one on my workstation and another on my co-worker's workstation (using -b $HOSTNAME).

                 

                There are several differences between these tests:

                1. The workstation tests used physical machines; the original tests used VMs.
                2. The workstations have a single physical interface; the VMs have 3 virtual ones.
                3. The workstations are connected to a physical switch; the VMs are bridged on the host (I think).

                I did a second test between the two workstations with all of the multicast addresses set to the same value (231.7.7.7) to eliminate that as a problem with the VM configuration.
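For reference, the multicast address used by HornetQ's broadcast and discovery groups comes from the messaging-group socket binding; the stock entry in standalone-full-ha.xml looks like this (the defaults shown are EAP's own):

```xml
<socket-binding name="messaging-group" port="0"
                multicast-address="${jboss.messaging.group.address:231.7.7.7}"
                multicast-port="${jboss.messaging.group.port:9876}"/>
```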

                I'll talk to the sysadmin who set up the VMs to see how they are connected and try to determine if the problem is there.

                • 5. Re: Re: Re: Clustered replicated live-backup not failing back on live restart
                  andrew.morgan

                  Back from my vacation, and after a few hours of testing the problem is resolved.

                  I ran tests with tcpdump to examine the heartbeats passing between the servers. The UDP messages between live and backup appeared identical, and fragmented, in both directions, even though one HornetQ instance could receive them and the other couldn't.

                   

                  In addition to disabling multicast snooping on the HOST server as described in https://bugzilla.redhat.com/show_bug.cgi?id=880035,

                  echo 0 > /sys/class/net/virbr0/bridge/multicast_snooping

                  turning off udp-fragmentation-offload on the interface of the GUEST virtual machines allowed each server to see the other's heartbeats:

                  ethtool -K eth0 ufo off 

                   

                  Because the error was asymmetrical and the servers should have been identically configured, we will attempt the test again with freshly deployed VMs to determine whether this happens for all such setups or whether some undocumented configuration change on one of the two servers was causing the problem.