2 Replies Latest reply on Jul 17, 2012 12:03 PM by mcipperly

    mod_cluster reports 1 load using AverageSystemLoadMetric

    mcipperly

      Hi All,

       

      First of all, thanks very much for the hard work and great product in mod_cluster! I've used it in several deployments at this point and think it's a testament to the quality (and excellent documentation, in manuals and here with the community) that this is my first time having to reach out after hitting an issue.

       

      We're seeing a very strange issue with mod_cluster in one of our environments. Our current configuration is as follows:

       

      Web Front-end: Apache httpd 2.2.22 on Solaris 10

      JBoss Back-end: JBoss EAP 5.1.0 on Solaris 10, Java 1.6.0_24

      Mod_cluster: We've tried versions 1.2.0-Final and 1.2.1-Final with varying degrees of success, which I'll detail in a bit

       

      We have two environments: a pre-production/staging environment, which is configured exactly like production (as far as I can tell, definitely including the versions above), and the production environment itself. We first implemented mod_cluster in our pre-production environment with the following three metrics configured in mod_cluster-jboss-beans.xml:

       

                <inject bean="ActiveSessionsLoadMetric"/>

                <inject bean="AverageSystemLoadMetric"/>

                <inject bean="HeapMemoryUsageLoadMetric"/>

       

      We configured ActiveSessionsLoadMetric with a weight of 1, HeapMemoryUsageLoadMetric with a weight of 2, and AverageSystemLoadMetric with a weight of 3 (so system load average should be half of our total metric, with heap usage being ~33% and active sessions being ~16%). In the pre-production environment, this configuration gave us exactly what we wanted: load averages across our backend servers were consistent and no individual JVM ended up going crazy. We'd get a real number for our Load on the status page, and saw this when under load with tracing in our JBoss logs:

       

      2012-06-15 14:57:48,249 TRACE [org.jboss.modcluster.mcmp.impl.DefaultMCMPHandler] (ContainerBackgroundProcessor[StandardEngine[jboss.web]]) Sending command [org.jboss.modcluster.mcmp.impl.DefaultMCMPRequest{requestType=STATUS,wildcard=false,jvmRoute=staging-10com,parameters={Load=49}}] to proxy [web01/10.0.1.1:9000]
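
      As a sanity check on the weighting described above, here is a minimal sketch of the arithmetic. It only computes each metric's share of the total weight; it is not mod_cluster code, and the class and method names are made up for illustration. It assumes the weight of 3 belongs to AverageSystemLoadMetric, as the percentages imply.

```java
public class MetricWeightShares {

    /** Returns each weight's fraction of the sum of all weights. */
    static double[] shares(int... weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        double[] result = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            result[i] = (double) weights[i] / total;
        }
        return result;
    }

    public static void main(String[] args) {
        // Metric names and weights as described in this post.
        String[] names = {
            "ActiveSessionsLoadMetric",
            "HeapMemoryUsageLoadMetric",
            "AverageSystemLoadMetric"
        };
        double[] s = shares(1, 2, 3);
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.1f%%%n", names[i], s[i] * 100);
        }
    }
}
```

      With weights 1, 2 and 3 the shares come out to roughly 16.7%, 33.3% and 50%, which matches the split we were aiming for.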

       

      However, when we tried to install this in our production environment, we saw strange behavior: on the mod_cluster status page, the Load constantly read back as -1. At this point, we were using mod_cluster 1.2.0-Final. After some investigation, it looked like mod_cluster was sending a zero load metric to Apache at all times (after sending the initial -1 during server startup). This happened even under load, whether we sent load directly at the server, bypassing the web layer, or bumped load averages on the server up to 30~40 using do-nothing perl scripts. Based on https://issues.jboss.org/browse/MODCLUSTER-279 , the zero load caused Apache to continue to report -1 load and not send traffic to this server. Here's the excerpt from the logs:

       

      2012-06-15 15:26:44,870 TRACE [org.jboss.modcluster.mcmp.impl.DefaultMCMPHandler] (ContainerBackgroundProcessor[StandardEngine[jboss.web]]) Sending command [org.jboss.modcluster.mcmp.impl.DefaultMCMPRequest{requestType=STATUS,wildcard=false,jvmRoute=app01-10com,parameters={Load=0}}] to proxy [prdweb01/10.200.0.11:9000]

       

      After upgrading to mod_cluster 1.2.1, we did see the "sending zero load" behavior go away as we'd hoped, but instead saw the same STATUS message as above with a constant Load of 1 the whole time (which is what it should do based on the above MODCLUSTER-279). If we take out the AverageSystemLoadMetric, we do see valid/expected values being sent forward in the STATUS messages. I've checked that the environment does return valid values for the Java method getSystemLoadAverage(). At this point, I'm not too sure where else we should look and wanted to check in with the community to see if there are any thoughts on what could be causing this before trying to dive into the mod_cluster code to debug. The fact that it's happening in one of our environments but not the other does have me somewhat perplexed. I've tried to check everything plausible and relevant (JVM parameters, classpath, etc.), though if there's anything that *could* cause this which would be related to any of the aforementioned parameters, I'm definitely willing to check it out again. If there are any additional logs or configurations which would help, I'd be glad to provide them.
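
      For anyone who wants to repeat the getSystemLoadAverage() check, a standalone sketch along these lines should work on Java 6 and later. It reads the value from the standard OperatingSystemMXBean and also prints it divided by the processor count. The per-CPU division is only an assumption about how a load metric might normalize the value, not something confirmed from the mod_cluster source, and the class and helper names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class SystemLoadCheck {

    /**
     * Divides the load average by the processor count.
     * Returns -1.0 if the load average is unavailable (negative)
     * or the processor count is not positive.
     * NOTE: the per-CPU normalization is an assumption for illustration.
     */
    static double perCpuLoad(double loadAverage, int processors) {
        if (loadAverage < 0 || processors <= 0) {
            return -1.0;
        }
        return loadAverage / processors;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // getSystemLoadAverage() returns a negative value if the platform
        // cannot provide a load average.
        double loadAverage = os.getSystemLoadAverage();
        int processors = os.getAvailableProcessors();

        System.out.println("System load average: " + loadAverage);
        System.out.println("Available processors: " + processors);
        System.out.println("Load per processor:   " + perCpuLoad(loadAverage, processors));
    }
}
```

      Running this with the same JVM and user as the JBoss instance in each environment makes it easy to compare the raw value and the processor count side by side.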

       

      Thanks again!