1 Reply Latest reply on Mar 23, 2012 10:18 AM by brandonfuller

    AS 6.1 - HTTP Replication on Startup Not Working

    brandonfuller

      Summary

       

      I have 1 server running.  User is accessing the server just fine.  I bring up another server and add it to the cluster.  Via the load balancer, I send the user's next request to the new server.  The user receives an error because their session is invalid (not found) on that server.  I had assumed bringing a new server up and joining the cluster would transfer all the state to the new server so it could take requests.

       

      Components

       

      + JBoss AS 6.1 servers

       

      + Upgraded to JGroups 2.12.2.FINAL (to get JDBC_PING)

       

      + Running in Amazon EC2

       

      + Using TCP stack (not UDP issues)

       

      Description

       

      If I bring the 2 servers up and then let the user login, I see the requests load balanced in the cluster as I would expect.  I get no session errors of any kind.  Basically, the cluster operates just fine in that mode so I believe everythign is configured properly.

       

      The problem is specifically about when I bring a new server into the mix.  The new server doesn't seem to have the state from the other servers.  I am allowing that server to come up and giving it minutes before I try to access it.  I was allowing time for the state transfer to occur.  I believe it is according to the log.  However, when the user makes that next request that goes to this server it does not work.  It seems like the state transfer didn't happen correctly.

       

      Here is a log snippet from the 1st server (existing) for when the 2nd server comes online and asks it for a state transfer:

      {code}
      2012-03-23 00:05:25,881 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) new=[ip-10-62-83-172.ec2.internal:1099], suspected=[], leaving=[], new view: [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:25,894 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-1,null) Block processed at ip-10-32-45-187.ec2.internal:1099
      2012-03-23 00:05:25,900 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ip-10-32-45-187.ec2.internal:1099: flush coordinator  is starting FLUSH with participants [ip-10-32-45-187.ec2.internal:1099]
      2012-03-23 00:05:25,909 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-1,null) ip-10-32-45-187.ec2.internal:1099: received START_FLUSH, responded with FLUSH_COMPLETED to ip-10-32-45-187.ec2.internal:1099
      2012-03-23 00:05:25,914 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-3,null) ip-10-32-45-187.ec2.internal:1099: FLUSH_COMPLETED from ip-10-32-45-187.ec2.internal:1099, completed true, flushMembers [ip-10-32-45-187.ec2.internal:1099], flushCompleted [ip-10-32-45-187.ec2.internal:1099]
      2012-03-23 00:05:25,915 DEBUG [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) suspending message garbage collection
      2012-03-23 00:05:25,919 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-3,null) ip-10-32-45-187.ec2.internal:1099: all FLUSH_COMPLETED received
      2012-03-23 00:05:25,920 DEBUG [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) resume task started, max_suspend_time=33000
      2012-03-23 00:05:25,926 DEBUG [org.jgroups.protocols.pbcast.NAKACK] (Incoming-5,null)
      [setDigest()]
      existing digest:  ip-10-32-45-187.ec2.internal:1099: [241 : 246 (246)]
      new digest:       ip-10-32-45-187.ec2.internal:1099: [241 : 245 (245)], ip-10-62-83-172.ec2.internal:1099: [0 : 0]
      resulting digest: ip-10-32-45-187.ec2.internal:1099: [241 : 246 (246)], ip-10-62-83-172.ec2.internal:1099: [0 : 0 (0)]
      2012-03-23 00:05:25,926 DEBUG [org.jgroups.protocols.pbcast.GMS] (Incoming-5,null) ip-10-32-45-187.ec2.internal:1099: view is [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:25,930 DEBUG [org.jgroups.protocols.FD_SOCK] (Incoming-5,null) VIEW_CHANGE received: [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:25,937 DEBUG [org.jgroups.protocols.FD_SOCK] (FD_SOCK pinger,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ping_dest is ip-10-62-83-172.ec2.internal:1099, pingable_mbrs=[ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:25,946 DEBUG [org.jgroups.protocols.pbcast.STABLE] (Incoming-5,null) [ergonomics] setting max_bytes to 800KB (2 members)
      2012-03-23 00:05:25,947 INFO  [org.jboss.ha.framework.server.ClusterPartition.lifecycle.DefaultPartition] (Incoming-5,null) New cluster view for partition DefaultPartition (id: 1, delta: 1, merge: false) : [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:25,948 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-5,null) dead members: []
      2012-03-23 00:05:25,948 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-5,null) membership changed from 1 to 2
      2012-03-23 00:05:25,949 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (AsynchViewChangeHandler Thread) Begin notifyListeners, viewID: 1
      2012-03-23 00:05:25,950 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) I am (ip-10-32-45-187.ec2.internal:1099) received membershipChanged event:
      2012-03-23 00:05:25,950 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) Dead members: 0 ([])
      2012-03-23 00:05:25,951 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) New Members : 1 ([ip-10-62-83-172.ec2.internal:1099])
      2012-03-23 00:05:25,952 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.DefaultPartition] (AsynchViewChangeHandler Thread) All Members : 2 ([ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099])
      2012-03-23 00:05:25,952 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (AsynchViewChangeHandler Thread) End notifyListeners, viewID: 1
      2012-03-23 00:05:25,953 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-5,null) Received new cluster view: [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:25,956 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-5,null) ip-10-32-45-187.ec2.internal:1099: installing view [ip-10-32-45-187.ec2.internal:1099|1] [ip-10-32-45-187.ec2.internal:1099, ip-10-62-83-172.ec2.internal:1099]
      2012-03-23 00:05:26,054 DEBUG [org.jgroups.protocols.pbcast.STABLE] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) resuming message garbage collection
      2012-03-23 00:05:26,057 DEBUG [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ip-10-32-45-187.ec2.internal:1099: sending RESUME event
      2012-03-23 00:05:26,057 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (ViewHandler,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) ip-10-32-45-187.ec2.internal:1099: received RESUME, sending STOP_FLUSH to all
      2012-03-23 00:05:26,062 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-9,null) ip-10-32-45-187.ec2.internal:1099: received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2012-03-23 00:05:26,063 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-9,null) Unblock processed at ip-10-32-45-187.ec2.internal:1099
      2012-03-23 00:05:26,237 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (Incoming-11,null) Responding to state requester ip-10-62-83-172.ec2.internal:1099 with address 10.32.45.187:35530 and digest null
      2012-03-23 00:05:26,239 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530...
      2012-03-23 00:05:26,248 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530...
      2012-03-23 00:05:26,251 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Accepted request for state transfer from /10.62.83.172:55185 handing of to PooledExecutor thread
      2012-03-23 00:05:26,257 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Running on Thread[STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099,5,JGroups]. Accepted request for state transfer from /10.62.83.172:55185, original buffer size was 8192 and was reset to 8192, passing outputstream up...
      2012-03-23 00:05:26,261 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Generating state.  Can provide? true
      2012-03-23 00:05:26,374 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Writing 1 StoredEntries to stream
      2012-03-23 00:05:26,833 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State generated, closing object stream
      2012-03-23 00:05:26,834 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State writer is closing the socket
      2012-03-23 00:05:26,969 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-15,null) Block processed at ip-10-32-45-187.ec2.internal:1099
      2012-03-23 00:05:26,970 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-15,null) ip-10-32-45-187.ec2.internal:1099: received START_FLUSH, responded with FLUSH_COMPLETED to ip-10-62-83-172.ec2.internal:1099
      2012-03-23 00:05:26,973 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (Incoming-17,null) Responding to state requester ip-10-62-83-172.ec2.internal:1099 with address 10.32.45.187:35530 and digest null
      2012-03-23 00:05:26,975 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530...
      2012-03-23 00:05:26,975 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Accepted request for state transfer from /10.62.83.172:32951 handing of to PooledExecutor thread
      2012-03-23 00:05:26,976 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Running on Thread[STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099,5,JGroups]. Accepted request for state transfer from /10.62.83.172:32951, original buffer size was 8192 and was reset to 8192, passing outputstream up...
      2012-03-23 00:05:26,978 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) getState called for service DistributedReplicantManager
      2012-03-23 00:05:27,000 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State writer is closing the socket
      2012-03-23 00:05:27,156 DEBUG [org.jgroups.protocols.pbcast.FLUSH] (Incoming-19,null) ip-10-32-45-187.ec2.internal:1099: received STOP_FLUSH, unblocking FLUSH.down() and sending UNBLOCK up
      2012-03-23 00:05:27,157 DEBUG [org.jboss.ha.framework.server.ClusterPartition.DefaultPartition] (Incoming-19,null) Unblock processed at ip-10-32-45-187.ec2.internal:1099
      2012-03-23 00:05:27,254 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchKeyChangeHandler Thread) election result =true, electionPolicy=DistributedReplicantManager
      2012-03-23 00:05:27,255 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchKeyChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=true, viewID=356682581, partition=DefaultPartition
      2012-03-23 00:05:27,344 DEBUG [org.jboss.ha.framework.server.HARMIServerImpl$RefreshProxiesHATarget] (AsynchKeyChangeHandler Thread) replicantsChanged 'HAJNDI' to 2 (intra-view id: 356682581)
      2012-03-23 00:05:27,577 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (Incoming-5,null) Responding to state requester ip-10-62-83-172.ec2.internal:1099 with address 10.32.45.187:35530 and digest null
      2012-03-23 00:05:27,579 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER server socket acceptor,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) StateProviderThreadSpawner listening at 10.32.45.187:35530...
      2012-03-23 00:05:27,579 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Accepted request for state transfer from /10.62.83.172:39044 handing of to PooledExecutor thread
      2012-03-23 00:05:27,581 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Running on Thread[STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099,5,JGroups]. Accepted request for state transfer from /10.62.83.172:39044, original buffer size was 8192 and was reset to 8192, passing outputstream up...
      2012-03-23 00:05:27,582 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Generating state.  Can provide? true
      2012-03-23 00:05:27,582 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) Writing 2 StoredEntries to stream
      2012-03-23 00:05:27,589 DEBUG [org.infinispan.statetransfer.StateTransferManagerImpl] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State generated, closing object stream
      2012-03-23 00:05:27,596 DEBUG [org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER] (STREAMING_STATE_TRANSFER-sender-1,DefaultPartition-HAPartition,ip-10-32-45-187.ec2.internal:1099) State writer is closing the socket
      2012-03-23 00:05:27,844 DEBUG [org.jboss.ha.singleton.HASingletonElectionPolicySimple] (AsynchKeyChangeHandler Thread) election with 2 nodes and position=0, will choose node[0] picked node ip=ip-10-32-45-187.ec2.internal/10.32.45.187 port=60062
      2012-03-23 00:05:27,845 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchKeyChangeHandler Thread) election result =true, electionPolicy=org.jboss.ha.singleton.HASingletonElectionPolicySimple
      2012-03-23 00:05:27,845 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchKeyChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=true, viewID=356682581, partition=DefaultPartition
      2012-03-23 00:05:27,859 DEBUG [org.jboss.cache.invalidation.bridges.JGCacheInvalidationBridge] (AsynchKeyChangeHandler Thread) The list of replicants for the JG bridge has changed, computing and updating local info...
      2012-03-23 00:05:31,946 DEBUG [org.jgroups.protocols.FD] (Timer-3,) sending are-you-alive msg to ip-10-62-83-172.ec2.internal:1099 (own address=ip
      {code}

      Here is my jgroups-channelfactory-stacks.xml contents (with JDBC creds removed):

       

      {code:xml}
      

      <?xml version="1.0" encoding="UTF-8"?>

       

      <protocol_stacks>

             <stack name="tcp"

                 description="TCP based stack, with flow control and message bundling.

                              TCP stacks are usually used when IP multicasting cannot

                              be used in a network, e.g. because it is disabled (e.g.

                              routers discard multicast)">

              <config>

                  <TCP

                       singleton_name="tcp"

                       start_port="${jboss.jgroups.tcp.tcp_port:7600}"

                       tcp_nodelay="true"

                       loopback="true"

                       recv_buf_size="20000000"

                       send_buf_size="640000"

                       discard_incompatible_packets="true"

                       max_bundle_size="64000"

                       max_bundle_timeout="30"

                       use_incoming_packet_handler="true"

                       enable_bundling="false"

                       use_send_queues="false"

                       sock_conn_timeout="300"

                       skip_suspected_members="true"

                       timer.num_threads="12"

                       enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"

                       diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.75.75}"

                       diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

       

                         thread_pool.enabled="true"

                         thread_pool.min_threads="20"

                         thread_pool.max_threads="200"

                         thread_pool.keep_alive_time="5000"

                         thread_pool.queue_enabled="true"

                         thread_pool.queue_max_size="1000"

                         thread_pool.rejection_policy="discard"

       

                       oob_thread_pool.enabled="true"

                         oob_thread_pool.min_threads="20"

                         oob_thread_pool.max_threads="200"

                         oob_thread_pool.keep_alive_time="1000"

                         oob_thread_pool.queue_enabled="false"

                         oob_thread_pool.rejection_policy="discard"/>

       

                  <JDBC_PING connection_driver="com.mysql.jdbc.Driver" connection_username="username" connection_password="password" connection_url="jdbc:mysql://server:3306/db" />

                  <MERGE2 max_interval="100000" min_interval="20000"/>

                  <FD_SOCK start_port="${jboss.jgroups.tcp.fd_sock_port:57600}"/>

                  <FD timeout="6000" max_tries="5"/>

                  <VERIFY_SUSPECT timeout="1500"/>

                  <BARRIER/>

                  <pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"

                                 retransmit_timeout="300,600,1200,2400,4800"

                                 discard_delivered_msgs="true"/>

                  <UNICAST timeout="300,600,1200,2400,3600"/>

                  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

                                 max_bytes="400000"/>           

                  <VIEW_SYNC avg_send_interval="10000"/>              

                  <pbcast.GMS print_local_addr="true" join_timeout="3000"

                              view_bundling="true"

                              view_ack_collection_timeout="5000"

                              resume_task_timeout="7500"/>

                  <UFC max_credits="2000000" ignore_synchronous_response="true"/>

                  <MFC max_credits="2000000" ignore_synchronous_response="true"/>

                  <FRAG2 frag_size="60000"/>

                  <pbcast.STREAMING_STATE_TRANSFER/>

                  <pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>

              </config>

          </stack>

      </protocol_stacks>

      {code}

      Anyone have an ideas on what I missing?

       

      It seems to be related to https://issues.jboss.org/browse/ISPN-835