3 Replies Latest reply on Feb 25, 2004 12:39 PM by crobert

    TCPGOSSIP clustering problem with server restart

    crobert

      Hello,

      I'm having the following clustering problem:

      I have two JBoss 3.2.3 servers on a local network. Clustering uses:

      --> TCP and TCPGOSSIP with the following settings (in cluster-service.xml)
      <TCP start_port="7800"/>
      <TCPGOSSIP initial_hosts="myserver[7500]"
          gossip_refresh_rate="10000" num_initial_members="2"
          up_thread="true" down_thread="true" />
      <MERGE2 min_interval="1000" max_interval="2000" />
      <FD shun="true" up_thread="true" down_thread="true"
          timeout="2000" max_tries="10" />
      <VERIFY_SUSPECT timeout="2000" num_msgs="3"
          up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
          up_thread="true" down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="2000"
          up_thread="true" down_thread="true" />
      <UNICAST timeout="2000" window_size="100" min_threshold="10"
          down_thread="true" />
      <FRAG frag_size="8192"
          down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="2000" join_retry_timeout="2000"
          shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />

      where "myserver" is the DNS name of the server that runs the GOSSIP service (a JBoss instance is also started on that machine).

      --> GOSSIP is started with the following command line arguments:

      -port 7500 -expiry 30000 -bindaddress myserver

      In most cases, the two servers see each other and everything goes well. However, I noticed that in the following scenario the cluster is not always formed:

      1. Start GOSSIP server
      2. Start JBoss on my first machine
      3. (After JBoss on the first machine has started successfully) Start JBoss on my second machine.
      4. The cluster is formed and I get load balancing.
      5. If I kill server 1 and restart it before the GOSSIP server detects that the first node is dead, the cluster usually does not re-form, even though the GOSSIP server (in trace mode) reports two servers in the cluster and accepts connections from both of them. The two nodes do not see each other, and I get neither failover nor load balancing: all requests go to a single server.
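
      To make the timing in step 5 concrete, here is a toy model of a registry that forgets an entry only after the 30000 ms expiry passed to GOSSIP. It is only a sketch under assumptions: the class `GossipExpiryModel`, the addresses, and the idea that the restarted node registers under a different address (port 7801 here) are all hypothetical, and this is not the JGroups implementation. It is written for a current JDK, not the 1.4.2 listed below. It shows how a stale entry for the killed node could remain visible next to the new one until the expiry elapses.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a registry with entry expiry; NOT the JGroups gossip code. */
public class GossipExpiryModel {
    private final long expiryMs;
    // address -> time (ms) of the most recent registration or refresh
    private final Map<String, Long> lastSeen = new HashMap<>();

    public GossipExpiryModel(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    /** Records (or refreshes) an address at the given time. */
    public void register(String address, long nowMs) {
        lastSeen.put(address, nowMs);
    }

    /** Counts entries whose last registration is younger than the expiry. */
    public int liveEntries(long nowMs) {
        int count = 0;
        for (long seen : lastSeen.values()) {
            if (nowMs - seen < expiryMs) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        GossipExpiryModel gossip = new GossipExpiryModel(30000); // -expiry 30000
        gossip.register("node1:7800", 0);
        gossip.register("node2:7800", 0);
        // Hypothetical: node1 is killed and restarted quickly,
        // and comes back under a different address.
        gossip.register("node1:7801", 8000);
        gossip.register("node2:7800", 10000); // periodic refresh by node2
        System.out.println(gossip.liveEntries(12000)); // stale entry still counted
        System.out.println(gossip.liveEntries(31000)); // stale entry has expired
    }
}
```

      In this model the first call counts three entries (the dead address plus the two live ones) and the second counts two. Whether the real GOSSIP server behaves like this is exactly what I am unsure about.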

      Is this the expected behavior?

      I also noticed that in some cases JBoss blocks for quite a while after the "GMS: address is ..." line (usually when it cannot find the cluster, and it is probably waiting for a timeout to expire).

      Robert

      Windows XP SP1
      JBoss 3.2.3 with Jetty (jboss-3.2.3_jetty-4.2.14.zip)
      Sun J2SE 1.4.2_03