11 Replies Latest reply on Jul 23, 2008 6:33 PM by brian.stansberry

    JBPAPP-863 -- FC blocks during slow failure detection

    brian.stansberry

      Discussion of http://jira.jboss.com/jira/browse/JBPAPP-863 .

      Removing FC is not a general-purpose option, as it is highly likely that it will lead to OOM problems. Removing it in a 2 node TCP-based cluster like this test uses is OK, as FC serves no purpose in such a setup; see the "Why is FC needed on top of TCP?" section at http://wiki.jboss.org/wiki/JGroupsFC . But I don't think the purpose of these tests is to test a specialized config, so I don't think we should do it.

      To an extent, this result shows a configuration tradeoff in the FD protocol. The default config we ship with has this:

      <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
      <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
      


      The total time it takes to detect a failure and remove a dead member is (FD.timeout * FD.max_tries) + VERIFY_SUSPECT.timeout = (10000 * 5) + 1500 = 51500 ms. That is, unless FD_SOCK detects the failure fairly immediately, which it doesn't in this pull-the-plug scenario.

      51.5 secs is a long time; pretty much your "circa 1 minute after unplugging the other node." During that time, FC will block waiting for credits from the unreachable node (once it runs out of credits). Once the unreachable peer is removed from the group, FC will unblock.

      We have a high timeout to prevent false suspicions when nodes don't reply to heartbeats due to temporary CPU spikes, long GC pauses, etc. Perhaps 51500 ms is excessive. Let's discuss this with the group on Monday's clustering conf call. Users can always reduce this config to some lower value greater than X if they are confident that extended CPU spikes or GC pauses will not exceed X. I wouldn't recommend less than 30 seconds or so. It used to be 15, and we were constantly hearing about false suspicion issues.
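
      For illustration only, a reduced config along these lines (the values here are hypothetical, not a recommendation) would cut detection to (6000 * 5) + 1500 = 31500 ms:

      <!-- example values only; tune to your tolerance for CPU spikes / GC pauses -->
      <FD timeout="6000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
      <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>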

      That's the core issue. Now a side issue:

      The fact that one thread is trying to lock the session while another is trying to replicate it proves that there are multiple requests for the session. Those requests stack up and will put a very high replication load on the system. The question is why multiple requests are occurring; I want to understand this. Some possibilities:

      1) The test servlet is flushing the response (i.e. not waiting for JBossWeb to do it after the replication), the test client is reading the response, considering it completed, and making another request. Dominik, is this possible?

      2) The test client waits X seconds for a response; if it doesn't get one, it retries the request. This is analogous to a user getting impatient and hitting the refresh button. If this is what is happening, we need to know what X is and decide how reasonable that is; perhaps make it a random value inside a range (i.e. 100% of users don't hit refresh after exactly 10 secs).

      3) mod_jk is itself retrying the request, either on the same node or as an attempted failover (it looks like the same node). Do you have worker.xxx.reply_timeout configured in workers.properties?

        • 1. Re: JBPAPP-863 -- FC blocks during slow failure detection
          belaban

           It's not just FC that might block waiting for credits from members crashed by pulling the plug; all cluster-wide calls using GET_ALL will also either block or time out (if a timeout is configured), because they won't get a response from the 'pulled' members. Note that this of course doesn't apply to GET_FIRST or other modes, e.g. a response filter:
           a response filter could handle SUSPECT notifications and terminate a call if the only missing responses are from suspected members.

           If we decrease that timeout, we might run into false suspicions again. However, with 2.5 and later versions, heartbeats are sent as out-of-band (OOB) messages handled by a separate thread pool, so we should at least not run into the problem of a heartbeat going unhandled because it is stuck behind a regular message.
           So in this light, +1 for reducing the timeout in FD, BUT ONLY in 2.6.x, *NOT* 2.4.x!
           Switching from FD to FD_ALL in a future version will also help.
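
           For reference, an FD_ALL element would look something like the following: every member broadcasts heartbeats, and a peer is suspected once nothing has been heard from it for the timeout. The attribute values here are only an example; check the FD_ALL defaults for the JGroups version you actually use.

           <!-- illustrative only -->
           <FD_ALL interval="3000" timeout="30000"/>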

          • 2. Re: JBPAPP-863 -- FC blocks during slow failure detection
            dpospisil

             I am not flushing responses, and the test client does not resend a request before it receives a response - I am overriding the default HttpClient retry policy. But yes, I have configured mod_jk to send retries. This is the worker setup used:

            worker.jawa03.port=8009
            worker.jawa03.host=192.168.1.3
            worker.jawa03.type=ajp13
            worker.jawa03.lbfactor=1
            worker.jawa03.socket_timeout=30
            worker.jawa03.connect_timeout=5000
            worker.jawa03.prepost_timeout=8000
            worker.jawa03.retries=2
            worker.jawa03.reply_timeout=8000

            • 3. Re: JBPAPP-863 -- FC blocks during slow failure detection
              brian.stansberry
              • 4. Re: JBPAPP-863 -- FC blocks during slow failure detection
                brian.stansberry

                The issue of whether to include FC in a config should be better covered in the docs:

                http://jira.jboss.com/jira/browse/JBAS-5677

                Re: mod_jk doing retries, I expect the config described would cause test failures even if there were a much faster failure timeout (say 20-30 seconds). After 8 secs, mod_jk will retry, and this is a counter-based test, so the counter will increase by 2 and you'll get a failure.

                This illustrates why a retry when the request has already reached a server is dangerous. For EJB proxies, the clustering logic forbids this.

                If you want to enable retries, I suggest you use the recovery_options parameter (see http://tomcat.apache.org/connectors-doc/reference/workers.html ) and set it to 1 so that retries are not attempted once the request has reached the server.
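
                For example, adapting the worker config posted above (illustrative; the full recovery_options semantics are described on the workers.html page linked above):

                # only fail over / retry if the request never reached the backend
                worker.jawa03.recovery_options=1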

                OK, now back to this JIRA and the general testing program.

                First, besides the good points Bela makes about 2.4.x vs. 2.6.x, I wouldn't want to change config files in a micro release or EAP CP. So, if we don't change the defaults, this becomes a knowledge-base issue. Hence the docs update JIRA, and I know the support guys are aware of this thread.

                For the general testing program, a bigger cluster is probably better if you're testing what happens when the plug is pulled on a single node. *If* you need to drill further into what happens with a 2 node cluster, then removing FC from your TCP-based config makes sense. We already know the effect FC has, and removing FC from a 2 node TCP config is correct. So removing it will let you check for other effects.
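
                Concretely, that would just mean dropping (or commenting out) the FC element from the TCP stack shown later in this thread, e.g.:

                <!-- FC is unnecessary in a 2 node TCP-based stack; see the JGroupsFC wiki page
                <FC max_credits="2000000" down_thread="false" up_thread="false"
                    min_threshold="0.10"/>
                -->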

                • 5. Re: JBPAPP-863 -- FC blocks during slow failure detection
                  dpospisil

                  Yes, I want to move to a bigger cluster as soon as possible, but I am somewhat resource-limited at this point. At least I have access to a local switch now, so I should be able to run more tests more easily. Anyway, I was able to do a couple more test runs and I am still not convinced that FC works as expected. What I am seeing is that the failed node is suspected and removed from the cluster ~1 minute after the failure, but lots of threads keep blocking in FC more than 5 minutes later. I have attached logs to JBPAPP-863.

                  • 6. Re: JBPAPP-863 -- FC blocks during slow failure detection
                    brian.stansberry

                    Ok, thanks, I'll have a look. How much is this affecting throughput? Under heavy load I'd expect some threads blocking in FC.

                    • 7. Re: JBPAPP-863 -- FC blocks during slow failure detection
                      brian.stansberry

                      Also, please get rid of the

                      worker.jawa03.retries=2
                      worker.jawa03.reply_timeout=8000
                      


                      and see what happens. I'm wondering if that is causing a couple of attempts per node, then a failover to the other node, a couple more attempts, and so on. The effect is multiple semi-concurrent requests for the same session, leading to semi-deadlocks.

                      • 8. Re: JBPAPP-863 -- FC blocks during slow failure detection
                        vblagojevic

                        Hey guys,

                        If you search for "viewAccepted" in jawa04.log (attachment in JBPAPP-863), you will notice that the TCP-stack cache node does not receive a new view until about 15 minutes after the pulled plug is reconnected. The UDP-based cache nodes work as expected. Since FC relies on receiving views to function properly, we have all the threads blocked in FC#down.

                        Why do we have such a big delay in receiving that new view?


                        • 9. Re: JBPAPP-863 -- FC blocks during slow failure detection
                          brian.stansberry

                          Vladimir,

                          Below is what I think he's using as the config (the standard commented-out TCP config in the jboss-web-cluster SAR). Dominik, please post your config if this isn't it.

                          <TCP bind_addr="thishost" start_port="7810" loopback="true"
                               tcp_nodelay="true"
                               recv_buf_size="20000000"
                               send_buf_size="640000"
                               discard_incompatible_packets="true"
                               enable_bundling="true"
                               max_bundle_size="64000"
                               max_bundle_timeout="30"
                               use_incoming_packet_handler="true"
                               use_outgoing_packet_handler="false"
                               down_thread="false" up_thread="false"
                               use_send_queues="false"
                               sock_conn_timeout="300"
                               skip_suspected_members="true"/>
                          <TCPPING initial_hosts="thishost[7810],otherhost[7810]" port_range="3"
                               timeout="3000"
                               down_thread="false" up_thread="false"
                               num_initial_members="3"/>
                          <MERGE2 max_interval="100000"
                               down_thread="false" up_thread="false" min_interval="20000"/>
                          <FD_SOCK down_thread="false" up_thread="false"/>
                          <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
                          <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
                          <pbcast.NAKACK max_xmit_size="60000"
                               use_mcast_xmit="false" gc_lag="0"
                               retransmit_timeout="300,600,1200,2400,4800"
                               down_thread="false" up_thread="false"
                               discard_delivered_msgs="true"/>
                          <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                               down_thread="false" up_thread="false"
                               max_bytes="400000"/>
                          <pbcast.GMS print_local_addr="true" join_timeout="3000"
                               down_thread="false" up_thread="false"
                               join_retry_timeout="2000" shun="true"
                               view_bundling="true"/>
                          <FC max_credits="2000000" down_thread="false" up_thread="false"
                               min_threshold="0.10"/>
                          <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
                          <pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>


                          There's a stack3.txt attached to the JIRA which doesn't show any problems with any of the JGroups threads (e.g. a blocked IncomingPacketHandler that would prevent a view message from propagating). But that stack trace doesn't come from the same time as the logs; I'm not sure when it was taken. It might have been taken before the node failure should have been detected.

                          • 10. Re: JBPAPP-863 -- FC blocks during slow failure detection
                            vblagojevic

                            Brian,

                            I combed through all the logs and I still could not figure out what went wrong. Just as a precaution, we should, as you suggest, remove FC from the one-buddy configurations for the time being, until we track this one down. What puzzles me the most is how the views are not propagated upon merge, given that FC sits above GMS and of course MERGE2; I do not see how *not* having FC in the stack makes the test pass while having it does not. Unless maybe the view handler thread somehow gets blocked in FC.down?

                            Dominik,

                            Could you please repeat this "experiment", but this time with TRACE logging turned on for GMS and MERGE2 as well, not just FC. Thank you.
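
                            I believe something like this in conf/jboss-log4j.xml does it (the category names are the JGroups protocol classes; on older log4j setups the TRACE priority may additionally need class="org.jboss.logging.XLevel"):

                            <category name="org.jgroups.protocols.pbcast.GMS">
                              <priority value="TRACE"/>
                            </category>
                            <category name="org.jgroups.protocols.MERGE2">
                              <priority value="TRACE"/>
                            </category>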

                            • 11. Re: JBPAPP-863 -- FC blocks during slow failure detection
                              brian.stansberry

                              Dominik/Vladimir:

                              Any updates on this?