7 Replies Latest reply on Feb 11, 2014 6:23 PM by genman

    Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9

    genman
      17:49:03,425 INFO  [org.rhq.enterprise.server.measurement.MeasurementOOBManagerBean] (RHQScheduler_Worker-4) Finished calculating 32 OOBs in 3086252 ms
      17:49:03,425 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-4) Auto-calculation of OOBs completed in [3086260]ms
      
      
      

       

      I'm not sure why the OOB calculations would take this long. I'm digging into why, but if somebody knows what to look for here, that'd be awesome.

       

      I'm thinking it'd be nice to be able to turn off the OOB feature entirely, since I don't use it, and it looks like it's making my purge job take more than an hour.

       

      Seems to me the number of metrics is simply too large, and querying Oracle separately for each one takes more time than the database can keep up with. The answer might be to (a rough sketch of the first two options follows the list):

      1) Have an option to disable OOB completely

      2) Only do a partial calculation (like 30%)

      3) Optimize the round trips to the database, e.g. query all (or a portion of the) baselines at once, then merge the results.
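
      For (1) and (2), something as simple as the sketch below would probably do. To be clear, the property name and helper here are made up for illustration; nothing like this exists in RHQ today.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.Random;

          // Sketch only: illustrates options (1) and (2), not actual RHQ code.
          public class OobSamplingSketch {

              // Option 1: skip the whole OOB pass when a (hypothetical) flag is set.
              static boolean oobDisabled() {
                  return Boolean.getBoolean("rhq.server.oob.disable"); // made-up property name
              }

              // Option 2: keep roughly the given fraction (e.g. 0.3) of the schedules at random.
              static <T> List<T> sample(List<T> input, double fraction) {
                  Random random = new Random();
                  List<T> kept = new ArrayList<T>();
                  for (T item : input) {
                      if (random.nextDouble() < fraction) {
                          kept.add(item);
                      }
                  }
                  return kept;
              }
          }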

        • 1. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
          genman

          I'm thinking if you did a partial or optimized version of this, you'd change MetricsServer to return a Collection (or List) instead of Iterable<AggregateNumericMetric>, so you could see the size and either execute the OOB updates in parallel (multiple threads) or batch the queries as I suggest. E.g.

           

              public int calculateOOB(AggregateNumericMetric metric) {
                  List<MeasurementBaseline> baselines = entityManager.createQuery(
                      "select baseline from MeasurementBaseline baseline where baseline.schedule.id = :scheduleId")
                      .setParameter("scheduleId", metric.getScheduleId())
                      .getResultList();
                  if (baselines.isEmpty()) {
                      return 0;
                  }
          

           

          could be

           

              public int calculateOOB(List<AggregateNumericMetric> metrics) {
                  List<Integer> scheduleIds = new ArrayList<Integer>(metrics.size());
                  for (AggregateNumericMetric metric : metrics) {
                      scheduleIds.add(metric.getScheduleId());
                  }
                  List<MeasurementBaseline> baselines = entityManager.createQuery(
                      "select baseline from MeasurementBaseline baseline where baseline.schedule.id in :scheduleIds")
                      .setParameter("scheduleIds", scheduleIds)
                      .getResultList();
                  for (MeasurementBaseline baseline : baselines) {
                      // look up the aggregate for baseline.getScheduleId() and compute its OOB
                  }
          

           

          And so on...
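
          For the parallel route, a minimal generic helper could look like the sketch below; this is not RHQ code, and each Callable would just wrap the batched calculateOOB call for one chunk of schedules.

              import java.util.ArrayList;
              import java.util.List;
              import java.util.concurrent.Callable;
              import java.util.concurrent.ExecutorService;
              import java.util.concurrent.Executors;
              import java.util.concurrent.Future;

              // Sketch only: runs each chunk's OOB work on its own thread and sums the counts.
              public class ParallelOobSketch {

                  static int runInParallel(List<Callable<Integer>> chunkTasks, int threads) throws Exception {
                      ExecutorService pool = Executors.newFixedThreadPool(threads);
                      try {
                          List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
                          for (Callable<Integer> task : chunkTasks) {
                              futures.add(pool.submit(task));
                          }
                          int total = 0;
                          for (Future<Integer> future : futures) {
                              total += future.get(); // blocks; rethrows any per-chunk failure
                          }
                          return total;
                      } finally {
                          pool.shutdown();
                      }
                  }
              }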

          • 2. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
            genman

            Bug 1059863 created as a request to optionally disable this feature.

            • 3. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
              pilhuhn

              Elias,

              thanks for opening the BZ - I've commented there, and it's actually quite similar to what you are suggesting above.

              We would need to wrap the calculateOOB() method to work in blocks of 1,000 or fewer, as Oracle does not accept more than 1,000 items in an IN clause, but reducing the round trips should help a lot. Not sure that brings it down from 50 minutes to seconds, though.
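
              The blocking I have in mind is just a generic partition of the schedule id list, roughly like the sketch below (not actual RHQ code); each block would then feed one "where baseline.schedule.id in :scheduleIds" query.

                  import java.util.ArrayList;
                  import java.util.List;

                  // Sketch: split the schedule ids into blocks small enough for one Oracle IN clause.
                  public class InClauseBlocks {

                      private static final int ORACLE_IN_LIMIT = 1000;

                      static <T> List<List<T>> partition(List<T> ids) {
                          List<List<T>> blocks = new ArrayList<List<T>>();
                          for (int i = 0; i < ids.size(); i += ORACLE_IN_LIMIT) {
                              blocks.add(ids.subList(i, Math.min(i + ORACLE_IN_LIMIT, ids.size())));
                          }
                          return blocks;
                      }
                  }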

              How many schedules do you have?

              • 4. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
                genman

                I have 1,600,000 schedules, more or less.

                 

                Doing a little math: for a 60-minute window that means each select query has to return in about 2 ms (3,600,000 ms / 1,600,000 schedules ≈ 2.25 ms), which is possible but maybe not reasonable.

                 

                I'm well aware of the Oracle limitation. (I have submitted bugfixes for two cases...)

                 

                I have an untested patch that might work okay. There might be a way to optimize it, and maybe save some memory...

                 

                diff --git a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerBean.java b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerBean.java
                index f742c7f..0ca8341 100644
                --- a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerBean.java
                +++ b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerBean.java
                @@ -38,7 +38,7 @@
                
                
                import org.apache.commons.logging.Log;
                import org.apache.commons.logging.LogFactory;
                -
                +import org.jboss.marshalling.util.IntKeyMap;
                import org.rhq.core.db.DatabaseType;
                import org.rhq.core.db.DatabaseTypeFactory;
                import org.rhq.core.db.H2DatabaseType;
                @@ -200,13 +200,22 @@ public void computeOOBsForLastHour(Subject subject, Iterable<AggregateNumericMet
                         int count = 0;
                         long startTime = System.currentTimeMillis();
                         try {
                +            IntKeyMap<AggregateNumericMetric> map = new IntKeyMap<AggregateNumericMetric>(1024 * 10);
                             for (AggregateNumericMetric metric : metrics) {
                -                try {
                -                    count += oobManager.calculateOOB(metric);
                -                } catch (Exception e) {
                -                    log.error("An error occurred while calculating OOBs for " + metric, e);
                -                    throw new RuntimeException(e);
                +                map.put(metric.getScheduleId(), metric);
                +            }
                +            metrics = null; // save memory
                +            try {
                +                List<MeasurementBaseline> baselines = entityManager.createQuery(
                +                    "select baseline from MeasurementBaseline baseline")
                +                    .getResultList();
                +                for (MeasurementBaseline baseline : baselines) {
                +                    AggregateNumericMetric metric = map.get(baseline.getScheduleId());
                +                    if (metric != null) {
                +                        count += oobManager.calculateOOB(metric, baseline);
                +                    }
                                 }
                +            } catch (Exception e) {
                +                log.error("An error occurred while calculating OOBs", e);
                +                throw new RuntimeException(e);
                             }
                         } finally {
                             long endTime = System.currentTimeMillis();
                @@ -216,17 +225,8 @@ public void computeOOBsForLastHour(Subject subject, Iterable<AggregateNumericMet
                         }
                     }
                
                
                -    @SuppressWarnings("unchecked")
                     @TransactionAttribute(value = TransactionAttributeType.REQUIRES_NEW)
                -    public int calculateOOB(AggregateNumericMetric metric) {
                -        List<MeasurementBaseline> baselines = entityManager.createQuery(
                -            "select baseline from MeasurementBaseline baseline where baseline.schedule.id = :scheduleId")
                -            .setParameter("scheduleId", metric.getScheduleId())
                -            .getResultList();
                -        if (baselines.isEmpty()) {
                -            return 0;
                -        }
                -        MeasurementBaseline baseline = baselines.get(0);
                +    public int calculateOOB(AggregateNumericMetric metric, MeasurementBaseline baseline) {
                         Long upperDelta = null;
                         Long lowerDelta = null;
                
                
                diff --git a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerLocal.java b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerLocal.java
                index 853c5af..d9cc751 100644
                --- a/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerLocal.java
                +++ b/modules/enterprise/server/jar/src/main/java/org/rhq/enterprise/server/measurement/MeasurementOOBManagerLocal.java
                @@ -21,6 +21,7 @@
                import javax.ejb.Local;
                
                
                import org.rhq.core.domain.auth.Subject;
                +import org.rhq.core.domain.measurement.MeasurementBaseline;
                import org.rhq.core.domain.measurement.MeasurementSchedule;
                import org.rhq.core.domain.measurement.composite.MeasurementOOBComposite;
                import org.rhq.core.domain.util.PageControl;
                @@ -64,9 +65,12 @@
                      *
                      * @param metric The 1 hr metric that is used to determine whether or not an OOB should
                      *               be generated
                +     * @param baseline The existing baseline
                +     *
                      * @return 1 if an OOB is generated, 0 otherwise
                      */
                -    int calculateOOB(AggregateNumericMetric metric);
                +    int calculateOOB(AggregateNumericMetric metric, MeasurementBaseline baseline);
                +
                
                
                     /**
                      * Return OOB Composites that contain all information about the OOBs in a given time as aggregates.
                
                
                

                 

                One clever thing you can do here is delete the baseline if the metric isn't present.

                 

                The other is to size the map according to the number of aggregates.
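
                Against the patched loop above, those two tweaks would look roughly like this; it assumes the aggregates arrive as a Collection so the size is known up front, and whether simply removing the baseline here is safe would still need checking.

                    // Sketch against the patched method; "metrics", "baselines", "count",
                    // "oobManager" and "entityManager" are the fields/locals from the patch.
                    IntKeyMap<AggregateNumericMetric> map = new IntKeyMap<AggregateNumericMetric>(metrics.size());
                    for (AggregateNumericMetric metric : metrics) {
                        map.put(metric.getScheduleId(), metric);
                    }
                    for (MeasurementBaseline baseline : baselines) {
                        AggregateNumericMetric metric = map.get(baseline.getScheduleId());
                        if (metric != null) {
                            count += oobManager.calculateOOB(metric, baseline);
                        } else {
                            entityManager.remove(baseline); // no fresh 1-hour aggregate: drop the stale baseline
                        }
                    }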

                • 5. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
                  john.sanda

                  Elias,

                   

                  I do not have a bug created yet for OOB calculation performance improvements, but it is a known issue and on the radar. In the short term maybe we can get BZ 1059863 done for 4.10. In the long term, I think it makes sense to consider moving baselines and OOBs into Cassandra. Regardless of whether or not that happens, there are plenty of performance improvements that can be made to the OOB calculations. If we wind up migrating that data to Cassandra, I would prefer to make the optimizations after the migration.

                  • 6. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
                    pilhuhn

                    I've added a possible patch to BZ 1059412 that cuts OOB processing time for me from 116 s to 17 s for 124k enabled schedules (the number relevant for OOBs is obviously lower, as OOBs are only calculated for dynamic metrics and not for trends-up/down).

                     

                    I personally think that in the long run we need to get the processing of aggregation, baselines and OOBs directly into the storage node (process) to reduce all the round-tripping between server and storage.

                    • 7. Re: Ridiculous time taken to compute OOBs (on Oracle); RHQ 4.9
                      genman

                      I definitely would like to see the aggregation happen on the storage side, but from what I observe, the bottleneck (in terms of Cassandra performance) is primarily on the read side, and processing on the Cassandra side might not help it.

                       

                      When doing aggregation, data isn't processed in a bulk way, but by hitting specific schedules during a particular time range. I'm not sure if the data design or compression algorithm can be improved to speed aggregation. The straightforward approach would be to read in all data from a storage node (all schedules), keep the ones matching a time range, do the calculation in memory, then write out the data. Or when writing the raw metrics, use a counter for tracking the number of datapoints and number of samples for the past hour, then convert that to one hour data.