Business Activity Monitoring Design

Version 3

Created by objectiser on Mar 6, 2012 9:40 AM. Last modified by objectiser on Apr 4, 2012 8:10 AM.

This document outlines the proposed design for Business Activity Monitoring (BAM). The work also encompasses Service Activity Monitoring (SAM), which is treated as a sub-domain that may have specific views onto the collected information, to enable service administrators to perform relevant tasks.

The document is divided into four main sections, each dealing with a separate phase of the BAM architecture.

Activity Event: Collection, Reporting and Querying
Event Processor Network (EPN)
Active Collections
Presentation Layer

Activity Event: Collection, Reporting and Querying

The first stage of the Business Activity Monitoring design is the "Activity Server". This component has three responsibilities:

Collection

Although activity events can be reported directly to the server, it may also be useful to have a capability embedded within a service execution environment, to automate the collection of the information as much as possible.

The additional benefit of having a "behind the scenes" collection mechanism is that it will be able to infer relationships between various reported activities that are not necessarily visible to the reporting components themselves. A prime example is having an understanding of the XA transaction in which the activities are being reported, as a means to correlate them to the same business transaction.

The collection mechanism will provide batching capabilities to send the activity event list at configurable intervals or when the list size reaches a threshold.
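
As an illustration only, the batching behaviour described above could be sketched as follows. The class name `BatchingCollector`, its constructor parameters and the `reporter` callback are hypothetical and not part of the design; the current time is passed in explicitly so that the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical collector that buffers events and reports them as a batch. */
class BatchingCollector<E> {
    private final int maxBatchSize;            // assumed size threshold
    private final long maxIntervalMillis;      // assumed reporting interval
    private final Consumer<List<E>> reporter;  // stands in for the call to the Activity Server
    private final List<E> pending = new ArrayList<E>();
    private long lastFlushMillis;

    BatchingCollector(int maxBatchSize, long maxIntervalMillis,
                      Consumer<List<E>> reporter, long nowMillis) {
        this.maxBatchSize = maxBatchSize;
        this.maxIntervalMillis = maxIntervalMillis;
        this.reporter = reporter;
        this.lastFlushMillis = nowMillis;
    }

    /** Buffers an event and flushes when either threshold has been reached. */
    synchronized void record(E event, long nowMillis) {
        pending.add(event);
        if (pending.size() >= maxBatchSize
                || nowMillis - lastFlushMillis >= maxIntervalMillis) {
            flush(nowMillis);
        }
    }

    /** Reports the buffered events, if any; a timer could also call this. */
    synchronized void flush(long nowMillis) {
        lastFlushMillis = nowMillis;
        if (pending.isEmpty()) {
            return;
        }
        List<E> batch = new ArrayList<E>(pending);
        pending.clear();
        reporter.accept(batch);
    }
}
```

In this sketch the interval is only checked when an event arrives; a real collector would probably also call `flush` from a scheduled task so that a quiet period does not delay the last few events.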

Reporting

An 'ActivityServer' interface will be created to provide the reporting function. This interface can be used directly, if the collection and reporting capabilities are co-located, or via a service API (e.g. REST, JMS, etc.).

Activity events are reported to the server in a list (i.e. batch) for efficiency. Where reported remotely, they will be serialized to JSON, but represented as a Java object model internally within the server and between its sub-components.

The Activity Server will have two sub-components:

ActivityStore

This component will be responsible for persisting the activity information. The default implementation will be JPA.

ActivityNotifier

The Activity Server will be configured with zero or more ActivityNotifier implementations. These will be used to trigger further processing of the activity events and provide the link to the second stage of the Business Activity Monitoring design.

Querying

The ActivityServer interface will also support query operations to enable a client to retrieve previously reported activity events. These queries will be passed through to the configured ActivityStore implementation for processing.

Various query criteria should be supported, including date/time range, event source, etc. However this may be dependent upon the information within the event model.
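
A minimal sketch of how the reporting and querying responsibilities might fit together is shown below. Only the names ActivityServer, ActivityStore and ActivityNotifier come from this design; the method names, the `ActivityEvent` fields, the `ActivityQuery` criteria and the in-memory store (used here in place of the JPA default) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Placeholder event; the real model is still to be defined. */
class ActivityEvent {
    final String source;
    final long timestamp;
    ActivityEvent(String source, long timestamp) {
        this.source = source;
        this.timestamp = timestamp;
    }
}

/** Hypothetical criteria: optional event source plus an inclusive time range. */
class ActivityQuery {
    final String source; // null matches any source
    final long from;
    final long to;
    ActivityQuery(String source, long from, long to) {
        this.source = source;
        this.from = from;
        this.to = to;
    }
    boolean matches(ActivityEvent e) {
        return (source == null || source.equals(e.source))
                && e.timestamp >= from && e.timestamp <= to;
    }
}

interface ActivityStore {
    void store(List<ActivityEvent> events);
    List<ActivityEvent> query(ActivityQuery query);
}

interface ActivityNotifier {
    void activitiesReported(List<ActivityEvent> events);
}

/** In-memory stand-in for the JPA-backed default store. */
class InMemoryActivityStore implements ActivityStore {
    private final List<ActivityEvent> events = new ArrayList<ActivityEvent>();
    public synchronized void store(List<ActivityEvent> batch) {
        events.addAll(batch);
    }
    public synchronized List<ActivityEvent> query(ActivityQuery query) {
        List<ActivityEvent> result = new ArrayList<ActivityEvent>();
        for (ActivityEvent e : events) {
            if (query.matches(e)) {
                result.add(e);
            }
        }
        return result;
    }
}

/** Persists each reported batch, then hands it to every configured notifier. */
class SimpleActivityServer {
    private final ActivityStore store;
    private final List<ActivityNotifier> notifiers;
    SimpleActivityServer(ActivityStore store, List<ActivityNotifier> notifiers) {
        this.store = store;
        this.notifiers = notifiers;
    }
    void report(List<ActivityEvent> events) {
        store.store(events);
        for (ActivityNotifier notifier : notifiers) {
            notifier.activitiesReported(events);
        }
    }
    /** Queries are passed straight through to the configured store. */
    List<ActivityEvent> query(ActivityQuery query) {
        return store.query(query);
    }
}
```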

Model

TO BE DEFINED - although an essential part of the Business Activity Monitoring implementation, the actual information within the model does not impact the overall design, so the details of the model can be elaborated throughout the implementation, as well as supporting extensibility to incorporate future requirements.

Event Processor Network (EPN)

The second stage of the Business Activity Monitoring design is the Event Processor Network (EPN). The diagram above shows a single 'Event Processor Node' that forms a node within a graph that comprises the 'Event Processor Network'. As the name suggests, this network is responsible for processing a stream of events, with each node (and associated event processor) performing a particular function as part of this processing.

More than one Event Processor Network could co-exist within a Business Activity Monitoring infrastructure, each with a particular responsibility. For example, phase 1 of the SOA governance work primarily focuses on processing service metric information. This processing could be achieved using an EPN configured with the relevant Service Level Agreements (SLAs) being monitored - with the output from the network being the warnings and alerts to notify appropriate people of pending or actual violations.

A separate EPN could be configured to accept Activity Events (via a specific Activity Notifier implementation, as discussed previously), as a means of processing business transaction specific activity information, to determine other situations of interest. For example, for business transactions associated with a particular customer, an organisation may wish to monitor other metrics that would not otherwise be available with the service wide metrics collected. This mechanism can also be used to ensure that the business transaction is executing according to a valid protocol definition (e.g. derived from a BPMN2 choreography).

EPN Node

The Events will initially be introduced into a "well known" named node, which will then process the events and forward them to other nodes in the network if appropriate. As with the Activity Server, the events are passed around as a list (i.e. batch) to enable multiple events to be processed within the same transaction boundary.

When an event list is received by the EPN Node, it will perform the following tasks:

it will iterate over each event in the list, processing them individually
if an optional predicate has been configured, then the event must satisfy the predicate for it to be processed further
if appropriate, the event will be supplied to the associated Event Processor implementation
if the processed event returns a value (possibly transformed), this will be stored ready to be forwarded to other optionally configured nodes in the network
if the processed event does not return a value, then no further action is taken - this may occur if the processing is awaiting a number of events before it takes further action (e.g. temporal correlation between events), or it is possible that no further processing is necessary
if an exception occurs during the processing, then the original event is added to a retry list
when all events have been processed, if any events are in the retry list (and number of retries is less than threshold configured for the node) then the list is resubmitted to the node (possibly after some optional time delay, if supported by the container)
transformed events are forwarded to any other nodes that have been configured
if notifications have been enabled, then all successfully processed events should be reported to any interested listeners (note, this notification can be used as an input into stage 3 of the Business Activity Monitoring design)
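
The steps above can be sketched roughly as follows. This is an illustration under simplifying assumptions rather than the proposed implementation: the names `EpnNode` and `NodeResult` are hypothetical, the event processor is modelled as a function that returns null when no value results, and retries are performed inline instead of being resubmitted by the container after a delay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

/** Outcome of handling an event list at one node. */
class NodeResult<E> {
    final List<E> forward = new ArrayList<E>();   // results to pass to downstream nodes
    final List<E> processed = new ArrayList<E>(); // originals to report to listeners
    final List<E> retry = new ArrayList<E>();     // originals whose processing failed
}

class EpnNode<E> {
    private final Predicate<E> predicate;   // optional, may be null
    private final Function<E, E> processor; // returns null when there is nothing to forward
    private final int maxRetries;           // retry threshold configured for the node

    EpnNode(Predicate<E> predicate, Function<E, E> processor, int maxRetries) {
        this.predicate = predicate;
        this.processor = processor;
        this.maxRetries = maxRetries;
    }

    /** Single pass over the list: filter, process, collect results and failures. */
    NodeResult<E> process(List<E> events) {
        NodeResult<E> result = new NodeResult<E>();
        for (E event : events) {
            if (predicate != null && !predicate.test(event)) {
                continue; // event does not satisfy the predicate
            }
            try {
                E out = processor.apply(event);
                result.processed.add(event);
                if (out != null) {
                    result.forward.add(out);
                }
            } catch (RuntimeException ex) {
                result.retry.add(event);
            }
        }
        return result;
    }

    /** Resubmits failed events until none remain or the threshold is reached. */
    NodeResult<E> processWithRetries(List<E> events) {
        NodeResult<E> total = process(events);
        int retries = 0;
        while (!total.retry.isEmpty() && retries < maxRetries) {
            retries++;
            List<E> again = new ArrayList<E>(total.retry);
            total.retry.clear();
            NodeResult<E> next = process(again);
            total.forward.addAll(next.forward);
            total.processed.addAll(next.processed);
            total.retry.addAll(next.retry);
        }
        return total;
    }
}
```

After `processWithRetries` returns, a container would forward `forward` to the configured downstream nodes and, if notifications are enabled, report `processed` to interested listeners.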

If an Event Processor implementation maintains state, then it is the implementation's responsibility to handle any persistence and sharing of information with other instances of the EPN in a cluster. (A future enhancement could possibly be to provide container support for this).

Examples of Event Processor implementations could be CEP rules, BPMN2 based protocol validation, or bespoke implementations.

EPN Container

The container is responsible for managing the links between the nodes, handling the notification mechanism, as well as providing some initial entry point for external systems to supply events into the network.

The initial implementations for the container will be

(1) in-memory

(2) JEE (using JMS/MDBs)

Another possible future container is Storm (https://github.com/nathanmarz/storm/wiki). Setting up its infrastructure appears quite involved and would therefore be an administrative overhead, but its potential performance advantages may make it worth investigating in the future.

EPN Configuration and Deployment

The basic configuration of a network will be generic. The configuration will identify the details associated with each node (e.g. the event processor to use, whether notifications should be emitted, retry threshold, etc) and the connections between them, to form the graph.
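
To make the shape of such a generic configuration concrete, here is one possible object model. All class and field names (`NodeConfig`, `NetworkConfig`, `retryThreshold`, and so on) are assumptions chosen to mirror the details listed above; the actual configuration format is not defined in this document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-node settings. */
class NodeConfig {
    final String name;
    final String eventProcessorClass;   // the event processor to use
    final boolean notificationsEnabled; // whether notifications should be emitted
    final int retryThreshold;           // maximum number of retries
    final List<String> destinations = new ArrayList<String>(); // downstream node names

    NodeConfig(String name, String eventProcessorClass,
               boolean notificationsEnabled, int retryThreshold) {
        this.name = name;
        this.eventProcessorClass = eventProcessorClass;
        this.notificationsEnabled = notificationsEnabled;
        this.retryThreshold = retryThreshold;
    }

    NodeConfig to(String nodeName) {
        destinations.add(nodeName);
        return this;
    }
}

/** Hypothetical container-independent description of the graph. */
class NetworkConfig {
    final String name;
    final String rootNode; // the "well known" node that first receives events
    final Map<String, NodeConfig> nodes = new LinkedHashMap<String, NodeConfig>();

    NetworkConfig(String name, String rootNode) {
        this.name = name;
        this.rootNode = rootNode;
    }

    NetworkConfig add(NodeConfig node) {
        nodes.put(node.name, node);
        return this;
    }

    /** Lists structural problems: a missing root or links to unknown nodes. */
    List<String> validate() {
        List<String> problems = new ArrayList<String>();
        if (!nodes.containsKey(rootNode)) {
            problems.add("unknown root node: " + rootNode);
        }
        for (NodeConfig node : nodes.values()) {
            for (String destination : node.destinations) {
                if (!nodes.containsKey(destination)) {
                    problems.add(node.name + " -> unknown node: " + destination);
                }
            }
        }
        return problems;
    }
}
```

A check like `validate` is the kind of container-independent step that tooling could run before any container-specific post-processing.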

However part of the deployment step may be container specific, and therefore require (a) knowledge of the target container, and (b) the ability to post-process the initial generic configuration with target container implementation details.

Some tooling will be required to help with this, which also raises the issue of how updates will be handled. If the EPN is deployed as a single module within AS7 (e.g. using the MDB container approach), then when a new version is available, the EPN will need to be shut down and then restarted as a whole when deployed. This is probably the safest approach, as it ensures the network is atomically consistent. If we allow some of the event processor implementations to be changed mid-flight, it may have unknown consequences for the rest of the network, although each EPN node should equally be atomic.

It will be the container's responsibility to manage stopping and restarting the network. However this still leaves the situation where events queued up for a particular node find that the node no longer exists when the network is restarted. (JEE specific note - queues are created and deleted independently of the MDB deployments, which is good as they are not inadvertently deleted with remaining messages - but on the other hand it means new nodes need to have their queues created before they can be deployed, and removed nodes may need some tooling to know how to deal with messages left on the queue - e.g. remove or divert are probably the only options).

The proposal is to include migration details as part of the subsequent configuration. If a node will no longer exist in the new graph configuration, then the new configuration will include a temporary node to deal with remaining events - to consume them and either discard or redirect them to other nodes for further processing. A node therefore cannot simply disappear between versions; it has to enter a deprecated stage in which it has no input source.

4/4/12 GB: The new proposal is to use a versioning technique, similar to that used in BPM engines, where when a new version is available, any existing instances continue to use the old version, while new instances use the new version. In this case, once an initial event list has been triggered against a particular network, it will continue to be processed by that network until completion. However new events will be submitted to the most recent version of a network. Because some custom event processor implementations may accompany the network deployment, a network must remain deployed until all of its events have been processed.
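
One way to picture the versioning rule is a small registry that routes new event lists to the latest version and tracks which older versions still have work in flight. The class `NetworkVersionRegistry` and its methods are hypothetical; the design does not prescribe how in-flight work is tracked.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping for the versions of one named network. */
class NetworkVersionRegistry {
    // version -> number of event lists still being processed by that version
    private final Map<Integer, Integer> inFlight = new HashMap<Integer, Integer>();
    private int latest = -1;

    /** Registers a newly deployed version, which becomes the target for new events. */
    synchronized void deploy(int version) {
        if (version <= latest) {
            throw new IllegalArgumentException("version must increase");
        }
        inFlight.put(version, 0);
        latest = version;
    }

    /** Called when a new event list enters the network; returns the version to use. */
    synchronized int beginProcessing() {
        if (latest < 0) {
            throw new IllegalStateException("no network deployed");
        }
        inFlight.put(latest, inFlight.get(latest) + 1);
        return latest;
    }

    /** Called when an event list has been completely processed by its version. */
    synchronized void completeProcessing(int version) {
        Integer count = inFlight.get(version);
        if (count == null || count == 0) {
            throw new IllegalStateException("nothing in flight for version " + version);
        }
        inFlight.put(version, count - 1);
    }

    /** An old version may be removed only once it has no remaining work. */
    synchronized boolean canUndeploy(int version) {
        Integer count = inFlight.get(version);
        return count != null && version != latest && count == 0;
    }
}
```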

At a later stage it may be possible to segment the network into smaller graphs, where the smaller graphs have a dependency on individual JEE deployable artifacts that may be independently updated - but further tooling/validation may be required to support this.

Active Collections

The two previously discussed stages are system modules/components that process the activity and metric information against configured event processing criteria. As outlined above, the EPN nodes have the ability to generate notifications when events are successfully processed, as well as building up their own result set that may be of interest to end users.

This third module is slightly different, in that it is intended as a support mechanism within the user's session, to help manage the information being generated, and where appropriate emit active notifications of changes to the result set to inform the client application to update appropriately - hence the title "active collections".

Active Query Manager

The Active Query Manager is the component responsible for managing all pre-defined Active Collections. These are the collections that each have an associated Event Source, which supplies the information to be added or updated within the collection.

An example of an Event Source would be the notifications generated from an EPN Node (as discussed in the previous section). How the notifications are delivered from the EPN Node through an Event Source is dependent upon the EPN Container being used, but as an example notifications from a JEE based EPN container may be distributed via a JMS topic, and therefore the Event Source in this situation would be a JMS subscriber on that topic.

The Active Collections managed by the Active Query Manager are shared across all user sessions on the same host. This reduces any unnecessary information duplication.

Active Collections

Active collections are the same as any other collection with the additional features that they have optional predicates to govern what they can store, support notification of changes (add, remove, update) and have optional time/size restrictions, to prevent the result set growing too large.

All changes (add, remove or update) to an Active Collection can only be applied by the system - so direct modification of the collection will result in "unsupported operation" exceptions. These collections are intended to help organise and filter information, and actively notify of any changes. These changes are expected to be derived from system components.
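
The read-only contract, together with the optional predicate and size restriction, might look like the following sketch. The class name `ManagedActiveList` and the `systemAdd` entry point are assumptions; the sketch relies on `java.util.AbstractList`, whose default mutators throw `UnsupportedOperationException`, and it omits change notification and time-based eviction.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical collection that clients can read but only the system can change. */
class ManagedActiveList<T> extends AbstractList<T> {
    private final List<T> items = new ArrayList<T>();
    private final Predicate<T> predicate; // optional, may be null
    private final int maxSize;            // values <= 0 mean unbounded

    ManagedActiveList(Predicate<T> predicate, int maxSize) {
        this.predicate = predicate;
        this.maxSize = maxSize;
    }

    @Override
    public T get(int index) {
        return items.get(index);
    }

    @Override
    public int size() {
        return items.size();
    }

    // add, set and remove are inherited from AbstractList and therefore
    // throw UnsupportedOperationException when called by client code.

    /** System-side entry point; returns false if the predicate rejects the item. */
    boolean systemAdd(T item) {
        if (predicate != null && !predicate.test(item)) {
            return false;
        }
        items.add(item);
        if (maxSize > 0 && items.size() > maxSize) {
            items.remove(0); // evict the oldest entry to respect the size restriction
        }
        return true;
    }
}
```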

Locally Maintained Active Collections

As part of a user session, a user can create 'derived' active collections from an existing top level collection or from another 'derived' collection.

This newly created 'derived' collection will have its own predicate to refine the results in the parent collection. Predicates for 'derived' collections are not optional, as otherwise they would simply represent the same contents as the parent collection.

When the 'derived' collection is initially created, it will iterate through the contents of the parent collection, applying the predicate to determine if each object should be added. No change notifications will be emitted during this initialization stage - the client application will be expected to examine the contents and then receive notification for changes to that initial result set.

To avoid unnecessary duplication of objects in the 'derived' collection, the iteration of a 'derived' collection could be achieved by iterating the appropriate top level active collection, while applying the predicates associated with the 'derived' collection(s) to filter out inappropriate objects.

Once initialized, the 'derived' collection will be registered as an active change listener on the parent collection. Any additions to the parent collection will be evaluated to determine if the new object should also be added to the 'derived' collection. Any deletions will cause the same object to be deleted in the 'derived' collection if present. If an object is changed in the parent collection, then if that object previously existed in the 'derived' collection it will be re-evaluated to determine if it should remain - and if it was not previously present, it will be evaluated to determine if it should be added. NOTE: If the 'derived' collections do not actually maintain their own content, then the update notifications should include the 'before' and 'after' results to allow the 'derived' collection's predicates to determine if the change is relevant.
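
The lifecycle of a 'derived' collection described above could be sketched as follows. This variant maintains its own content rather than filtering the parent on iteration. The names `ActiveList`, `DerivedActiveList` and `ActiveChangeListener`, and all method names, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

interface ActiveChangeListener<T> {
    void inserted(T item);
    void updated(T before, T after);
    void removed(T item);
}

/** Minimal stand-in for an active collection with change notification. */
class ActiveList<T> {
    protected final List<T> items = new ArrayList<T>();
    private final List<ActiveChangeListener<T>> listeners =
            new ArrayList<ActiveChangeListener<T>>();

    void addListener(ActiveChangeListener<T> listener) { listeners.add(listener); }

    List<T> snapshot() { return new ArrayList<T>(items); }

    void insert(T item) {
        items.add(item);
        for (ActiveChangeListener<T> l : listeners) l.inserted(item);
    }

    void update(T before, T after) {
        int index = items.indexOf(before);
        if (index >= 0) {
            items.set(index, after);
            for (ActiveChangeListener<T> l : listeners) l.updated(before, after);
        }
    }

    void remove(T item) {
        if (items.remove(item)) {
            for (ActiveChangeListener<T> l : listeners) l.removed(item);
        }
    }
}

/** Derived collection: filters the parent through a mandatory predicate. */
class DerivedActiveList<T> extends ActiveList<T> implements ActiveChangeListener<T> {
    private final Predicate<T> predicate;

    DerivedActiveList(ActiveList<T> parent, Predicate<T> predicate) {
        this.predicate = predicate;
        // Initial population: no change notifications are emitted here.
        for (T item : parent.snapshot()) {
            if (predicate.test(item)) items.add(item);
        }
        parent.addListener(this);
    }

    public void inserted(T item) {
        if (predicate.test(item)) insert(item);
    }

    public void removed(T item) {
        remove(item); // no effect if the item was never in this collection
    }

    public void updated(T before, T after) {
        boolean wasPresent = items.contains(before);
        boolean shouldBePresent = predicate.test(after);
        if (wasPresent && shouldBePresent) update(before, after);
        else if (wasPresent) remove(before);
        else if (shouldBePresent) insert(after);
    }
}
```

Because `DerivedActiveList` is itself an `ActiveList`, a further 'derived' collection or an Active Change Listener can be attached to it in the same way as to a top level collection.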

The set of 'derived' active collections (and configurations) created by a particular user can be stored with their profile for later retrieval when the user logs back in - however this aspect is outside the scope of this module.

Active Change Listeners

A user session can listen for changes to either the top level (system) active collections, or any locally 'derived' active collections, by registering an Active Change Listener against the collection.

How those changes are then handled and pushed out to the client application is outside the scope of this module. It should also be considered whether all change notifications should be pushed to the clients - for example in a rapidly changing situation, possibly only a small sample of the changes should be transmitted. The level of filtering should be a consideration for the presentation layer.

Presentation Layer

The activity and metric information will primarily be presented through gadgets, managed through a new GWT based gadget server project.

Users will be able to retrieve pre-defined gadgets, and define their configuration values, to access information made available through the Business Activity Monitoring infrastructure. This could include:

Top level Active Collections presenting information generated from EPN nodes
Creating and presenting locally 'derived' Active Collections (e.g. gadget focused on the service metrics associated with the Order Service)
Other 'ad-hoc' queries managed either by the BAM infrastructure (independently of the Active Collections) or from other (e.g. RESTful) services

Other gadgets may include ones focused on presenting the warnings and alerts that may be generated by the EPN processing.
