4 Replies Latest reply: Mar 27, 2012 10:53 AM by borges

Observations about Replication Activation

Clebert Suconic Master

A few notes for Francisco

 

- As you commented, we need to retry in case of a failure.

 

- Why did you need to use an executor here for the live activation? The live server should just use the parent thread and block it until the server is activated.

 

- These Quorum tests are failing because the activation is not working properly in these tests.

  • 1. Re: Observations about Replication Activation
    borges Novice

    An update on these:

     

    • The retry was there, but an incorrect comment indicated otherwise. Another complication was that the retry was not happening (in many unit tests) because the clusterConfiguration was not being set up correctly. Svn r12256 fixed that, and svn r12260 r12262 fixed many other tests that were instantiating ClusterConnectionConfiguration objects with incorrect parameters.
    • The reason for the executor in the live activation was to avoid adding a reference to HornetQServer into ClientSessionFactoryImpl.Channel0Handler, as the latter is also packaged for hornetq-client. As we discussed on IRC, we'll add a new interface to communicate errors from the channel handler, as it avoids the reference and is also useful in other cases (even if it does not replace IOCriticalErrorListener).
    • Turned out that the QuorumVoting tests were failing (like many others) because replicating backups were not receiving topology notifications as they should after the recent Topology merge. Svn r12259 fixed that. The QuorumVoting code still needs work, but now that the Topology road-blocker is gone this should be straightforward.
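A minimal sketch of the interface idea from the second bullet, assuming a simple callback shape. Every name below (ChannelErrorListener, SketchChannelHandler, SketchServer) is hypothetical; the thread does not say what the real interface looks like. The point is only that the handler depends on a small interface that can ship with the client jar, while the server implements it.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical callback; the thread does not name the real interface.
interface ChannelErrorListener {
    void onChannelError(Exception cause);
}

// Stand-in for a client-side handler such as Channel0Handler: it only
// knows the listener interface, never a server class.
class SketchChannelHandler {
    private final ChannelErrorListener listener;

    SketchChannelHandler(ChannelErrorListener listener) {
        this.listener = listener;
    }

    // Forwards a failure to whoever registered as listener.
    void handleFailure(Exception cause) {
        listener.onChannelError(cause);
    }
}

// Stand-in for the server side, which registers itself as the listener.
class SketchServer implements ChannelErrorListener {
    final AtomicReference<Exception> lastError = new AtomicReference<>();

    @Override
    public void onChannelError(Exception cause) {
        // Assumption: a real server would react here (for example, stop or fail over).
        lastError.set(cause);
    }
}

class ChannelErrorDemo {
    public static void main(String[] args) {
        SketchServer server = new SketchServer();
        SketchChannelHandler handler = new SketchChannelHandler(server);
        handler.handleFailure(new IllegalStateException("replication link lost"));
        System.out.println("server saw: " + server.lastError.get().getMessage());
    }
}
```

With this shape the handler needs neither an executor nor a HornetQServer reference to report a problem; whether the real fix blocks the calling thread is not something this sketch settles.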

     

    FWIW, I've re-enabled the replication tests locally. Jenkins hung on one test, which I have already fixed. I'm running Jenkins again; once I get a full run, I'll re-enable the replication tests on SVN.

     

    FWIW 2, my fixes not yet committed to SVN can be seen at https://github.com/FranciscoBorges/HornetQ-SVN

  • 2. Re: Observations about Replication Activation
    borges Novice

    Some hanging tests were fixed today, but others remain, so we can't re-enable the replication tests yet. See https://issues.jboss.org/browse/HORNETQ-877 for the latest blocker.

     

    QuorumVoting needs some work. I need to confirm that the topology will not remove a node when that node crashes or just disappears. Also, for the test we need to make the nodes stop sending notifications that they are leaving.

  • 3. Re: Observations about Replication Activation
    borges Novice

    Updates on Replication:

     

    • the quorumVoting was much improved;
      • now we actually perform a vote, but only if the server does not perform a regular shutdown;
      • if the tally allows, we stop taking votes early (giving us faster resolution);
      • if the connection fails but is restored, right now we just restart the backup. This could be improved, but I think we need to look first at how this feature will be used in practice;
    • more hangs were found, caused by interactions between changes recently merged from 2.2 and the code in trunk.
    • I've decoupled the locking of paging operations and access to the isPaging state. It should be an overall win in performance and reliability.
    • ReplicationPaged tests were (sometimes) hanging because an OperationContext would never finish its tasks. The underlying cause seems to be that the live would turn off its network too soon, miss a replication token from the backup, and then be unable to shut down, which in itself is a problem in the OperationContext AFAICT. For the moment, I've added a delay before crashing a live server in those conditions, and that fixes it, although this will need proper addressing.
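The early-stop tally mentioned in the quorum voting bullets above could look roughly like the following. This is a sketch under assumptions: the class name, the outcome names, and the simple-majority rule are all hypothetical and are not taken from the HornetQ code.

```java
// Hypothetical tally: names and the majority rule are assumptions.
class VoteTally {
    enum Outcome { UNDECIDED, LIVE_IS_DOWN, LIVE_IS_UP }

    private final int voters;
    private int down;
    private int up;

    VoteTally(int voters) {
        if (voters <= 0) throw new IllegalArgumentException("need at least one voter");
        this.voters = voters;
    }

    // Records one vote (ignored once decided) and returns the outcome so far.
    Outcome record(boolean liveSeenDown) {
        if (outcome() == Outcome.UNDECIDED) {
            if (liveSeenDown) down++; else up++;
        }
        return outcome();
    }

    Outcome outcome() {
        int majority = voters / 2 + 1;
        if (down >= majority) return Outcome.LIVE_IS_DOWN;
        // "up" wins as soon as "down" can no longer reach a majority.
        if (up > voters - majority) return Outcome.LIVE_IS_UP;
        return Outcome.UNDECIDED;
    }
}

class VoteTallyDemo {
    public static void main(String[] args) {
        boolean[] votes = {true, false, true, true, false};
        VoteTally tally = new VoteTally(votes.length);
        int asked = 0;
        for (boolean vote : votes) {
            asked++;
            // Stop asking as soon as the remaining votes cannot change the result.
            if (tally.record(vote) != VoteTally.Outcome.UNDECIDED) break;
        }
        System.out.println(tally.outcome() + " after " + asked + " of " + votes.length + " votes");
    }
}
```

The loop in the demo stops polling once the result is fixed, which is where the faster resolution comes from. A regular shutdown would skip the vote entirely and is not modeled here.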
  • 4. Re: Observations about Replication Activation
    borges Novice

    Regarding some test failures involving replicated paging tests (e.g. ReplicatedPagedFailoverTest.testFailThenReceiveMoreMessagesAfterFailover2):

     

    Turns out the trouble was that after some pages are marked as completed, they are scheduled for clean-up (i.e. it happens asynchronously). In the test, we would ack some messages, and call live.stop() before the pages were cleaned.

    Since HQServer.stop() turns off the network as its first step, the scheduled page.clean() would still run, but only after we had cut communications with the backup, so the resulting journal updates were lost.

     

    I changed the shutdown sequence to first close all connections except the one to the backup, and then stop paging and the replication manager.
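To make the ordering problem concrete, here is a toy model of the two stop sequences. Nothing in it is HornetQ code: the class, the method names, and the idea that stopping paging runs the pending clean-up are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the stop sequence; none of these names are HornetQ's.
class StopSequenceSketch {
    final List<String> log = new ArrayList<>();
    private boolean backupLinkOpen = true;

    // Models a journal update produced by page clean-up; it reaches the
    // backup only while the link to the backup is still open.
    private void journalUpdate(String what) {
        log.add((backupLinkOpen ? "replicated:" : "lost:") + what);
    }

    // Old order: the network goes down first, so the late clean-up is lost.
    void stopNetworkFirst() {
        backupLinkOpen = false;
        journalUpdate("page-cleanup");
    }

    // New order as described in the post: keep the backup link until the end.
    void stopKeepingBackupLink() {
        log.add("closed-client-connections");
        journalUpdate("page-cleanup"); // assumption: stopping paging runs pending clean-up
        log.add("stopped-replication-manager");
        backupLinkOpen = false;
    }
}

class StopSequenceDemo {
    public static void main(String[] args) {
        StopSequenceSketch oldOrder = new StopSequenceSketch();
        oldOrder.stopNetworkFirst();
        StopSequenceSketch newOrder = new StopSequenceSketch();
        newOrder.stopKeepingBackupLink();
        System.out.println("old: " + oldOrder.log);
        System.out.println("new: " + newOrder.log);
    }
}
```

In the old order the clean-up update is logged as lost; in the new order it is replicated before the replication manager stops.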

     

    Comments?