9 Replies Latest reply: Nov 24, 2012 5:10 PM by jay shaughnessy

Resource's availability report shows Intermittent availability

Rafael Soares (Tuelho) Newbie

Hello!

 

I'm having some problems with an agent running on a box with too many resources being monitored.

For example, I have an agent running on a box with 9 JBoss instances, each hosting many apps (including many EJB3s, datasources, WARs, etc.). We observed that with the default configuration the agent couldn't collect all of the resources' metrics. The agent's 'Failed Collections Per Minute' metric shows the number of failures. I then tried tuning the agent parameters. We tuned the following params:

 

RHQ Agent > CONFIGURATION >

> Plugin Container

   > Measurement Collection Threadpool Size = 100 [default: 10]

 

> Client Sender

   > Queue Size = 100000 [default: 50000]

   > Maximum Concurrency = 20  [default: 5]

   > Send Throttling     = 200 commands : 1000 ms [default: 100:1000]

   > Queue Throttling    = 400 commands : 4000 ms [default: 200:2000]
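
One way to read the queue throttling pairs above is as a sustained rate. The sketch below assumes that `commands : ms` means "at most this many commands dequeued per period"; that reading is an assumption, not something confirmed in this thread, and `ThrottleRate` and `perSecond` are hypothetical names that are not part of RHQ. Under that assumption, 400 : 4000 allows the same sustained rate as the default 200 : 2000, only with a larger burst window.

```java
import java.util.Locale;

// Illustrative helper (not part of RHQ): converts a "commands : milliseconds"
// throttle pair into a sustained commands-per-second rate.
final class ThrottleRate {
    private ThrottleRate() {}

    // ASSUMPTION: the pair means "at most `commands` per `periodMillis`".
    static double perSecond(long commands, long periodMillis) {
        if (periodMillis <= 0) {
            throw new IllegalArgumentException("periodMillis must be positive");
        }
        return commands * 1000.0 / periodMillis;
    }

    public static void main(String[] args) {
        // Values taken from the Queue Throttling line above.
        System.out.printf(Locale.ROOT, "default 200:2000 -> %.1f commands/s%n", perSecond(200, 2000));
        System.out.printf(Locale.ROOT, "tuned   400:4000 -> %.1f commands/s%n", perSecond(400, 4000));
    }
}
```

If that reading holds, raising the command count and the period together changes burst size but not throughput.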

 

The attached screenshot shows the behaviour of the 'Currently Scheduled Measurements' and 'Failed Collections Per Minute' metrics before and after this tuning. So far, the problem with metric collection and reporting seems to have been resolved.

But there is another problem: the resource availability is intermittent. As you can see in the screenshot 'jboss-availability.png', it intermittently shows a DOWN event with a duration of 5 minutes (sometimes less). But this isn't true; the resource is always UP. I think the agent still has some difficulty reporting availability accurately to the server. Is there a way to identify this problem, or another agent param to tune? I already tried decreasing the Availability Scan Period to 180 seconds (3 min) to increase the frequency of the availability check, but with no success.

 

Thanks.

  • 1. Re: Resource's availability report shows Intermittent availability
    mazz Master

    Try to INCREASE the availability scan period to something like 10 minutes. What's probably happening is that you have so many resources that they are collectively taking too long to report back. You might also want to change the value of the server-side "Agent Quiet Time" system setting. If the agent doesn't report availability within a certain amount of time, the server will assume the agent is down and mark it as such. The default is 15 minutes IIRC, so maybe that isn't the issue (I find it hard to believe that it takes your agent longer than 15 minutes to collect all availabilities and report the changes up to the server), but you never know. In older versions, the agent quiet time setting on the server was 5 minutes, and in very large environments that was in fact too low for some people. So, see what value you have in there (Administration > System Settings).

     

    Anyway, those are two settings I would look at.
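
    As a rough sketch of how those two settings interact: the server treats an agent as down once no availability report has arrived within the quiet time, so the scan period has to leave enough slack for collecting and sending a report. The class and method names below are hypothetical and are not RHQ's actual server code; the exact comparison the server uses is also an assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of a quiet-time check (not RHQ's implementation).
final class QuietTimeCheck {
    private QuietTimeCheck() {}

    // True when the last availability report is older than the quiet time.
    static boolean isSuspectedDown(Instant lastReport, Instant now, Duration quietTime) {
        return Duration.between(lastReport, now).compareTo(quietTime) > 0;
    }

    // Time left for collecting and sending one report before the quiet time expires.
    static Duration slack(Duration scanPeriod, Duration quietTime) {
        return quietTime.minus(scanPeriod);
    }

    public static void main(String[] args) {
        Duration quietTime = Duration.ofMinutes(15);  // default mentioned above
        Duration scanPeriod = Duration.ofMinutes(10); // suggested scan period
        System.out.println("slack: " + slack(scanPeriod, quietTime).toMinutes() + " min");

        Instant last = Instant.parse("2012-01-01T00:00:00Z");
        System.out.println(isSuspectedDown(last, last.plus(Duration.ofMinutes(16)), quietTime));
    }
}
```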

     

    What version of JBoss AS instances do you have? Is it AS/EAP 4? or AS/EAP 5?

  • 2. Re: Resource's availability report shows Intermittent availability
    Rafael Soares (Tuelho) Newbie

    Hi John. Thanks for your reply.

     

    I'm using JON 2.4.1, so the 'agent quiet time' is already set to 15 min. The AS version is 4.2.2.GA - there is a plan to upgrade to EAP soon.

     

    I'll try to INCREASE the availability scan period to 10 minutes as you suggested.

     

    I'll report any news here.

     

    Thanks.

  • 3. Re: Resource's availability report shows Intermittent availability
    Rafael Soares (Tuelho) Newbie

    Hi Mazz.

     

    I've increased the availability scan period, but no success yet. As you can see in the screenshot, the "false positives" still occur in the Availability Report. The DOWN events are short (2 to 7 minutes), but they look strange when someone (e.g. an admin) looks at the report. Are there any other agent params that would be important to check? Is there a blog post or wiki page that discusses tuning agent params when running on boxes with a large number of resources?

     

    Thanks.

     

    Availability-FalseAlarms.png

  • 4. Re: Resource's availability report shows Intermittent availability
    Rafael Soares (Tuelho) Newbie

    I was wondering whether it would be worthwhile to have some kind of "dampening" for ResourceComponents' getAvailability() methods. This could help avoid some false positives regarding resource availability. What do you think?
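
    To make the idea concrete, here is a minimal sketch of such dampening: DOWN is only reported after several consecutive DOWN results from the raw check, while a single UP resets the count. Everything here is hypothetical (the `DampenedAvailability` class, the `Avail` enum, the threshold of 3, and the initial state of UP); it does not use the real RHQ plugin API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical wrapper around a raw availability check (not RHQ's API).
final class DampenedAvailability {
    enum Avail { UP, DOWN }

    private final Supplier<Avail> rawCheck;
    private final int downThreshold;
    private int consecutiveDown = 0;
    private Avail reported = Avail.UP; // ASSUMPTION: start out as UP

    DampenedAvailability(Supplier<Avail> rawCheck, int downThreshold) {
        if (downThreshold < 1) {
            throw new IllegalArgumentException("downThreshold must be >= 1");
        }
        this.rawCheck = rawCheck;
        this.downThreshold = downThreshold;
    }

    // Runs the raw check once and returns the dampened result.
    Avail check() {
        if (rawCheck.get() == Avail.UP) {
            consecutiveDown = 0;
            reported = Avail.UP;
        } else if (++consecutiveDown >= downThreshold) {
            reported = Avail.DOWN;
        }
        return reported;
    }

    public static void main(String[] args) {
        // A single failed check between two UP results is not reported as DOWN.
        Iterator<Avail> raw = Arrays.asList(Avail.UP, Avail.DOWN, Avail.UP).iterator();
        DampenedAvailability dampened = new DampenedAvailability(raw::next, 3);
        for (int i = 0; i < 3; i++) {
            System.out.println(dampened.check());
        }
    }
}
```

    The trade-off is that a real outage is reported later, after `downThreshold` scans instead of one.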

     

    regards.

  • 5. Re: Resource's availability report shows Intermittent availability
    Rafael Soares (Tuelho) Newbie

    Hello Guys!

     

    Has anyone else had this kind of issue with failures in metric collection and sending to the server?

    I observed this on RHQ agents running on boxes with too many resources and short metric collection intervals.

     

    I'd appreciate any pointers...

     

    Thanks.

  • 6. Re: Resource's availability report shows Intermittent availability
    mazz Master

    The only thing we could recommend is either increasing the metric collection intervals OR reducing the number of metrics that are actually collected (in other words, go through your resources and ONLY ENABLE those metrics that you really care about).

     

    That is for metric collections.

     

    As for availability collections, since availability is collected and reported differently than metrics, you can't just "disable" availability for a particular set of resources.

     

    There is talk about refactoring/redesigning the way availability is collected and reported, but nothing to date has been done on this.

     

    We actually haven't seen many people complain about this. Usually, people aren't monitoring so many resources per agent that availability collection slows down to the point where it causes problems. Tweaking the availability scan and quiet time settings is right now the only way to try to work around these problems.

  • 7. Re: Resource's availability report shows Intermittent availability
    Rafael Soares (Tuelho) Newbie

    "The only thing we could recommend is either increasing the metric collection intervals OR reducing the number of metrics that are actually collected (in other words, go through your resources and ONLY ENABLE those metrics that you really care about)."

     

    Mazz, what about tuning the RHQ Agent to be more robust in these scenarios? I observed that after tuning some agent parameters, such as those under:

    RHQ Agent > CONFIGURATION > Plugin Container and Client Sender

     

    the agent can handle more metrics and seems to be more stable. But there is no doc that discusses tuning the agent parameters. I found [1], but it has no info about agent performance tuning. Maybe the caching approach in [2] could address this kind of issue :-)


    [1] http://www.rhq-project.org/display/RHQ/Performance+Engineering

    [2] http://www.rhq-project.org/display/RHQ/Ideas+about+Caching

  • 8. Re: Resource's availability report shows Intermittent availability
    Vlad Craciunoiu Newbie

    Hi,

     

    We have a similar situation: an agent monitoring a machine with JBoss 4.2. There are many resources in JBoss, but otherwise JBoss is the only application running on that machine. The availability goes up, then down, and back again every minute (the availability metric is set to 1 minute), even though JBoss is running fine. We use RHQ 4.4. We have an alert set on it, and the admins are going crazy. We increased the collection interval for the availability metric, but it would be interesting to know why this happens.

     

    Regards,

    Vlad

  • 9. Re: Resource's availability report shows Intermittent availability
    jay shaughnessy Apprentice

    The advice above is not relevant to your RHQ 4.4 installation, as much has changed in the area of availability collection, agent quiet time, etc. (see the release notes for more). As for your situation, that's odd; it seems like the server's availability check is just failing quite often. Perhaps timing out? Look in your agent log for errors and see if it sheds any light on what may be happening. You can raise your avail collection interval, but I'm not sure that will really help the situation.