-
1. Re: initial connection may fail if node already failed over
ataylor Feb 3, 2010 4:51 AM (in response to ataylor) Comments, anyone? -
2. Re: initial connection may fail if node already failed over
radhikasivaraj Feb 8, 2010 4:51 AM (in response to ataylor) Hi, should I apply this fix in the createSession method of org.hornetq.core.client.impl.FailoverManagerImpl.java? I am wondering whether I can build the code myself with this fix from the 2.0.0 GA Java source. Please advise.
-
3. Re: initial connection may fail if node already failed over
timfox Feb 8, 2010 4:55 AM (in response to ataylor) It isn't fixed yet. We're still discussing it. -
4. Re: initial connection may fail if node already failed over
ataylor Feb 9, 2010 3:58 AM (in response to timfox) Just a few thoughts. Obviously, if we allow clients to fail over when an initial connection fails, we may create a split-brain situation if the original server is still actually active. One thing we could do is have the secondary server return a specific code if it hasn't failed over yet. In this instance the client would then retry the first server. The problem here is that if there were no current live clients to force the failover, it would still never connect. Is it possible somehow for the secondary server to check the status of the first server? I know at some point there was a discussion about the master server writing its status to a shared file. This would work in the shared store configuration, but would we be able to detect this when replication is being used? -
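To make the "specific code" idea concrete, here is a minimal sketch of how a client might react to the backup's reply. Everything in it (BackupReplySketch, Reply, Action, onBackupReply) is a hypothetical name made up for illustration; none of it is existing HornetQ API, and the real wire protocol would look different.

```java
// Hypothetical sketch only: these types are not part of HornetQ.
final class BackupReplySketch {

    // What the backup could answer when a client connects to it first.
    enum Reply { CONNECTED, NOT_FAILED_OVER, UNREACHABLE }

    // What the client does next.
    enum Action { USE_BACKUP, RETRY_LIVE, GIVE_UP }

    // Map the backup's reply to the client's next step.
    static Action onBackupReply(Reply reply) {
        switch (reply) {
            case CONNECTED:
                // The backup has already taken over, so it is safe to use.
                return Action.USE_BACKUP;
            case NOT_FAILED_OVER:
                // The backup still considers the live node to be in charge.
                return Action.RETRY_LIVE;
            default:
                // Neither node can be reached.
                return Action.GIVE_UP;
        }
    }
}
```

The RETRY_LIVE branch is where the weakness described above shows up: if no connected client ever triggers the failover, a new client keeps being sent back to a live node it cannot reach.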
5. Re: initial connection may fail if node already failed over
clebert.suconic Feb 9, 2010 12:42 PM (in response to ataylor) "if an initial connection fails we may create a split brain situation if the original server is still actually active. One thing we could do is for the secondary server to return a specific code if it hasn't failed over yet. In this instance the client would then retry the first server"
Isn't this the split-brain quorum task? https://jira.jboss.org/jira/browse/HORNETQ-66
Also, if we make any changes to failover and backups, keep in mind that we will need to install backup nodes in the middle of operation, in order to support adding a backup to a live node:
-
6. Re: initial connection may fail if node already failed over
ataylor Feb 9, 2010 1:26 PM (in response to clebert.suconic) It's related to HORNETQ-66 but not the same. It's more to do with the order in which nodes are initially chosen to connect to. I don't think it's related at all to the second. -
7. Re: initial connection may fail if node already failed over
timfox Feb 17, 2010 4:15 AM (in response to ataylor) Sorry for coming so late to this discussion.
Andy, I think what you say makes sense.
To summarise:
We introduce a new param, failoverOnInitialConnection.
When a client starts, it will attempt to connect to the live node up to reconnectAttempts times. If it has not connected after reconnectAttempts attempts then, if failoverOnInitialConnection = true, it will attempt to connect to the backup, if specified (do we also try the backup reconnectAttempts times?); otherwise it will fail.
By default failoverOnInitialConnection will be false.
We have to be a bit careful with failoverOnInitialConnection=true since in an environment where you have a symmetric cluster of nodes, each with backups, each live node on startup will try to make cluster connections to each other node.
When the cluster is brought up, if the live nodes aren't brought up in time, node N could instead make connections to backup nodes, putting the cluster in an inconsistent state. This is actually the reason why we currently don't always try the backup if the live node is not available. But I think if this is guarded by a flag it should be OK.
Regarding the split-brain stuff, I think that is a bit off topic and is handled by a different JIRA.
-
8. Re: initial connection may fail if node already failed over
ataylor Feb 18, 2010 5:06 AM (in response to timfox) I think we should probably try the backup only once. If it doesn't work, then it's probably a client-side network issue or the whole intranet is down.
Your point about bringing the nodes up in time is a good one. I will make sure that this is well documented and, like you say, we will default to false.
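For reference, the behaviour discussed above could be sketched roughly as follows. This is only an illustration of the proposal, not the actual FailoverManagerImpl code: InitialConnectSketch, Result, connect, tryLive and tryBackup are hypothetical names, the two suppliers stand in for real connection attempts, and retry intervals are left out.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the proposed behaviour; not HornetQ code.
final class InitialConnectSketch {

    enum Result { LIVE, BACKUP, FAILED }

    // tryLive and tryBackup stand in for real connection attempts and
    // return true on success. tryBackup is null when no backup is configured.
    static Result connect(BooleanSupplier tryLive,
                          BooleanSupplier tryBackup,
                          int reconnectAttempts,
                          boolean failoverOnInitialConnection) {
        // Try the live node up to reconnectAttempts times.
        for (int attempt = 0; attempt < reconnectAttempts; attempt++) {
            if (tryLive.getAsBoolean()) {
                return Result.LIVE;
            }
        }
        // Only when the flag is set and a backup is specified,
        // try the backup exactly once.
        if (failoverOnInitialConnection
                && tryBackup != null
                && tryBackup.getAsBoolean()) {
            return Result.BACKUP;
        }
        return Result.FAILED;
    }
}
```

With failoverOnInitialConnection left at false, the backup is never contacted on the initial connection, which is what keeps a starting cluster node from attaching to backups when the live nodes are slow to come up.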