Design Note: Activity Monitoring/Management in SAVARA

This design note will capture design information concerning the activity monitoring and management within Savara.

 

Note: the scope of this work is not intended to provide a full scale Service or Business Activity Monitoring (BAM) solution, where any range of events may need to be cross referenced/correlated using a CEP (Complex Event Processing) capability to derive higher level business events.

 

Instead, this effort is intended to provide primarily interaction based event reporting, persistence and validation through the use of protocols represented using WS-CDL and BPMN2. These events can then be used as part of a wider BAM/SAM solution as appropriate.

 

Status: work in progress

 

 


Activity Information

 

The types of information we need to consider can be divided into three categories:

 

Component Activity

 

A component could refer to a:

 

a) Process instance, running inside a process engine. This would have a ‘process definition’ name and/or id, and an instance id.

 

b) Task. This would have a ‘task definition’ name/id and an instance id.

 

c) Service. A service (such as a Web Service, ESB service etc) will have a type, but will generally not have an ‘instance id’ concept, as they are generally stateless.

 

d) Application. A general application, or possibly a web app providing a front end to a business solution. Like a service, it would generally only have a ‘type’ - although it’s possible that some frontend apps may have a user-defined concept of session, but this may just be encoded in the domain specific information.

 

So the common information across these categories is:

 

i) Component category - it might be useful to know whether this is a process, task, service, etc. - or something even more fine-grained, e.g. Process::BPEL - so the category should use a scoped namespace.

 

ii) Component type - so this would be the process definition, service type, etc. One issue is whether both a type name and an id are required - an id may be more internally relevant within the component’s execution environment, whereas the name is more human understandable. Both could be relevant as searchable aspects.

 

iii) Component instance - this may only be relevant in some situations, and would therefore be optional.
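As a minimal sketch of the three common fields, the following Python fragment models a component reference with a scoped category. The class name `ComponentRef`, its field names and the `::` separator handling are assumptions made for illustration, not part of any Savara API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ComponentRef:
    """Hypothetical identity of the component an activity event refers to."""
    category: str                       # scoped, e.g. "Process::BPEL"
    type_name: Optional[str] = None     # human-readable definition/service name
    type_id: Optional[str] = None       # id internal to the execution environment
    instance_id: Optional[str] = None   # optional: absent for stateless services

    def category_scopes(self) -> list[str]:
        """Split the scoped category into its parts, coarsest first."""
        return self.category.split("::")

    def is_a(self, category: str) -> bool:
        """True if this component falls within the given (coarser) category."""
        wanted = category.split("::")
        return self.category_scopes()[:len(wanted)] == wanted


process = ComponentRef("Process::BPEL", type_name="PurchaseOrder", instance_id="1234")
service = ComponentRef("Service", type_name="CreditCheckService")
```

With a scoped category, a search for "Process" would also match "Process::BPEL", while the instance id simply stays unset for services and applications.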

 

 

Component activity information may also be available in other formats (e.g. BPAF) from other data sources, e.g. RiftSaw, jBPM5, etc. Whether this information is transformed into the above representation, to help further analysis, or the above representation is only used to store information that is not available from other sources, is still to be considered.

 

Interaction Activity

 

Interaction activity is slightly different from component activity, and therefore I don’t believe it can be represented as a ‘category’ of component activity.

 

Interactions occur between components, and therefore it is possible that they could be related to the source or destination component in some way, but in many situations the component producing or consuming the message may be hidden - possibly under a number of layers of containing components.

 

If we have the means to tap into those components, and observe when they send or receive a message, then this approach could be used to add information to the event. However we must also cater for the situation where the sending/receiving components are anonymous.

 

In which case the common factors that we may be able to rely upon are:

 

a) Destination Type - depending upon the messaging technology being used, the destination may be a ‘service’ that provides a typed interface or contract. For example, if a web service, then this could be the service QName.

 

b) Destination Address - all destinations will have an endpoint address that the message is being sent to. This could be a HTTP URL, a JMS message queue, etc. One of the issues with addresses is that they could be re-routed by other network components to reach the actual service. This may result in problems if searching for activities based on a particular endpoint address, unless some form of registry/repository is available to provide any relevant mapping information.

 

c) Source Type - optional, as not always available, but where known it could describe the service type sending the message.

 

d) Source Address - optional, as not always available, but where known it could describe the endpoint address of the sender. This may be used to capture some correlation information between a request and response. For example, if the request carries a ‘reply to’ address, sometimes these addresses are temporary and therefore useless for correlating to a known endpoint. However, when the response is being returned, it could provide the request’s destination address as the source address, enabling analysis software to understand the relationship.

 

 

There are two options:

 

1) The interaction activity event could carry the source type/address information, which will generally not be available. In situations where it may be known, i.e. when handling a response to a request, it could be included for future analysis.

 

2) Only define the destination type and address, but also include a ‘reply-to’ address as a searchable field. Then the correlation to a subsequent response, that has a destination address associated with the previous reply-to address, would be possible.

 

 

Option (1) makes the relationship more explicit in the response interaction, and caters for situations where the source type/address is known in advance. However this requires the monitoring agents to maintain state to correlate the incoming request to the outbound response.

 

 

Option (2) makes the relationship more explicit in the request interaction, but does not cater for situations where the source is known. However this could be represented in ‘additional details’ that are not key fields. It does mean that no specific stateful behaviour is required in the monitoring agents.

 

 

I believe option (2) should be the one implemented.
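A minimal sketch of how option (2) might be used during post-analysis is shown below. It assumes the events are available in time order and that a response can be recognised by a destination address equal to an earlier request's reply-to address; the `Interaction` class and `correlate_replies` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """Hypothetical subset of the interaction activity fields."""
    event_id: str
    destination_address: str
    request: bool = True
    reply_to_address: Optional[str] = None


def correlate_replies(events: list[Interaction]) -> dict[str, str]:
    """Map each response event id to the id of the request it answers.

    A response is matched to the most recent earlier request whose reply-to
    address equals the response's destination address.
    """
    pending: dict[str, str] = {}   # reply-to address -> request event id
    matches: dict[str, str] = {}
    for event in events:           # assumption: events are in time order
        if event.request and event.reply_to_address:
            pending[event.reply_to_address] = event.event_id
        elif not event.request and event.destination_address in pending:
            matches[event.event_id] = pending.pop(event.destination_address)
    return matches


events = [
    Interaction("e1", "http://example.org/orders", reply_to_address="jms:queue/tmp-42"),
    Interaction("e2", "jms:queue/tmp-42", request=False),
]
```

The state needed for the match lives in the analysis step rather than in the monitoring agents, which is the property that motivates option (2).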

 

 

Correlation between Activities

 

Occasionally when an event is created, it will have additional information that may identify its context in a scope wider than the component or interaction for which it is being reported.

 

A typical example would be a business transaction context. BPEL and BPMN2 processes will derive identity information that will be used to correlate messages to a particular process instance. This information should be recorded against the relevant activity events generated by the respective processes.

 

Post-analysis of these process activity events would identify the business context associated with particular process instances, and from that enable those separate process instances to be analysed in the context of the single business transaction.

 

Although process engines may enable the context information to be determined when the event is created, other situations may require the information to be derived after the fact.

 

For example, an interaction activity event may identify a business message that has been sent. However it may not have any explicit business context information associated with it. When the interaction event is stored in the activity database, a post-analysis process could examine the business message, against predefined xpaths, to extract the relevant context information for the particular message type.
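The post-analysis step described above could look roughly like the following sketch. The `CONTEXT_XPATHS` table, the message type "BuyRequest" and the element names are invented for illustration, and Python's `xml.etree.ElementTree` supports only a subset of XPath, so a real implementation may need a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: message type -> {context name: XPath expression}.
CONTEXT_XPATHS = {
    "BuyRequest": {"OrderId": "./order/id", "Customer": "./customer"},
}


def extract_context(message_type: str, message_xml: str) -> dict[str, str]:
    """Return the context name/value pairs found in a stored business message."""
    root = ET.fromstring(message_xml)
    context = {}
    for name, xpath in CONTEXT_XPATHS.get(message_type, {}).items():
        node = root.find(xpath)
        if node is not None and node.text:   # skip paths that match nothing
            context[name] = node.text.strip()
    return context


message = "<BuyRequest><order><id>abc123</id></order></BuyRequest>"
```

Message types without configured expressions simply yield no context, so the event is stored unchanged.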

 

Derived Information

 

The activity events will go through various stages of analysis, some of which will be transient, and some that will need to be persisted against the events.

 

Therefore the activity event structure needs the ability to store additional domain specific derived information.

 

One purpose for this feature will be to enable Savara to store behavioural analysis results with the interaction events. When a 'send' or 'receive' event occurs, we want to be able to determine whether it correctly conformed to one or more protocols, and record the result against the event.

 

 

Schema

 

A proposed schema could be:

 

 

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.savara.org/activity" xmlns:tns="http://www.savara.org/activity" elementFormDefault="qualified">

    <complexType name="Activity">
        <sequence>
            <element name="analysis" type="tns:Analysis" minOccurs="0"
                maxOccurs="unbounded">
            </element>
            <element name="context" type="tns:Context" minOccurs="0"
                maxOccurs="unbounded">
            </element>
        </sequence>
        <attribute name="id" type="ID"></attribute>
        <attribute name="timestamp" type="dateTime"></attribute>
    </complexType>

    <complexType name="Analysis" abstract="true"></complexType>

    <complexType name="Context">
        <attribute name="name" type="string"></attribute>
        <attribute name="value" type="string"></attribute>
    </complexType>

    <complexType name="ComponentActivity">
        <complexContent>
            <extension base="tns:Activity">
                 <attribute name="instanceId" type="string"></attribute>
                 <attribute name="componentId" type="string"></attribute>
                 <attribute name="componentName" type="string"></attribute>
                 <attribute name="status" type="tns:Status"></attribute>
            </extension>
        </complexContent>
    </complexType>

    <simpleType name="Status">
        <restriction base="string">
            <enumeration value="Started"></enumeration>
            <enumeration value="Finished"></enumeration>
        </restriction>
    </simpleType>

    <complexType name="InteractionActivity">
        <complexContent>
            <extension base="tns:Activity">
                <sequence>
                    <element name="parameter" type="tns:MessageParameter"></element>
                </sequence>
                <attribute name="destinationType" type="string"></attribute>
                <attribute name="destinationAddress" type="string"></attribute>
                <attribute name="replyToAddress" type="string"></attribute>
                <attribute name="operationName" type="string"></attribute>
                <attribute name="faultName" type="string"></attribute>
                <attribute name="request" type="boolean" default="true"></attribute>
                <attribute name="outbound" type="boolean" default="true"></attribute>
            </extension>
        </complexContent>
    </complexType>

    <complexType name="MessageParameter">
        <sequence>
            <element name="value" type="anyURI"></element>
        </sequence>
        <attribute name="type" type="string"></attribute>
    </complexType>

    <complexType name="ProtocolAnalysis">
        <complexContent>
            <extension base="tns:Analysis">
                <attribute name="protocol" type="string"></attribute>
                <attribute name="role" type="string"></attribute>
                <attribute name="expected" type="boolean" default="true"></attribute>
            </extension>
        </complexContent>
    </complexType>

</schema>

 

Note: the 'status' enumerated fields need to be extended - possibly using a standard such as BPAF as the basis.
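To illustrate how the schema types relate to each other, the following sketch mirrors a subset of them as Python dataclasses and records a protocol validation result against an interaction event. The Python names are assumptions; the context is simplified to a dictionary even though the schema allows repeated names, and the protocol and role values are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Analysis:
    """Base for derived, domain specific results (abstract in the schema)."""


@dataclass
class ProtocolAnalysis(Analysis):
    protocol: str = ""
    role: str = ""
    expected: bool = True


@dataclass
class Activity:
    timestamp: datetime
    id: Optional[str] = None                                # allocated by storage
    analysis: list[Analysis] = field(default_factory=list)
    context: dict[str, str] = field(default_factory=dict)   # queryable name/value pairs


@dataclass
class InteractionActivity(Activity):
    destination_type: Optional[str] = None
    destination_address: Optional[str] = None
    reply_to_address: Optional[str] = None
    operation_name: Optional[str] = None
    request: bool = True
    outbound: bool = True


def is_unexpected(event: Activity) -> bool:
    """True if any protocol analysis flagged the event as not expected."""
    return any(
        isinstance(a, ProtocolAnalysis) and not a.expected for a in event.analysis
    )


event = InteractionActivity(
    timestamp=datetime.now(timezone.utc),
    destination_type="OrderService",
    operation_name="buy",
)
event.context["OrderId"] = "abc123"
event.analysis.append(ProtocolAnalysis(protocol="PurchaseGoods", role="Buyer", expected=False))
```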

 

 

Activity Event Collection Architecture

 

This section discusses the mechanism for collecting activity information from a distributed system containing a set of interacting components that will be subject to process "runtime" governance.

 

There are a number of configurations that may need to be supported:

 

(1) Local detection, validation and storage

 

(2) Local detection with central validation and storage

 

(3) Local detection and validation, with central validation and storage

 

On top of these configurations, we may also wish to have (a) analysers at the local or central locations, to derive additional information, and (b) filters locally or centrally to determine whether the event is of interest.

 

 

Activity Processor

 

The 'activity processor' component could be the overall component within remote and central environments. This component would have an API for reporting activity events, that could be used by other components in the environment. The activity processor could also have:

 

 

Analysers

 

The analysers are responsible for deriving additional information for the reported events, and associating the information with the event.

 

Filters

 

The filters are responsible for determining whether the event should be reported. If no filters are defined, then the event will be published.

 

Validators

 

The activity processor could have zero or more validators. These would validate the filtered events, and optionally store information against the event.

 

Analysis information can be stored in the 'analysis' elements, which will provide additional domain specific information derived about the event.

 

However, if the validator wishes to record information against the event that could be used within queries, then it should be defined using the 'context' elements.

 

One of the validators could be a 'protocol validator' that would verify whether interaction activities conform to defined protocols.

 

One question may be why both analysers and validators are required. Currently I can see two reasons to have the distinction:

 

a) The validators may operate on information that is not available in the original event, but is derived using the analysers. If there were only one construct (i.e. analysers), then we couldn't necessarily guarantee the order in which they would be called - and therefore the derived information may not yet be available when the validation is being performed.

 

b) Analysers are expected to derive, efficiently, the information required by the filters and subsequent components, so they can be performed on all events. The validators are going to be less efficient (e.g. protocol validation), and should therefore only be performed once the events have been filtered and any other supporting information that may determine whether validation is relevant has been obtained.

 

Storage

 

The storage component will either persist the event, or use a remoting mechanism to transfer it to a central location. This component may be optional, in which case events will be processed using the other components, potentially just to be sent out via a notifier.

 

Where defined, the storage component should associate a unique id against the event before persisting it against that id.

 

 

Notifiers

 

Zero or more notifiers could be defined that can publish the activity events.

 

Ideally a storage component should have been used prior to the event being published, so that it has a pre-allocated unique id. This is only really required if the subscriber to the event is likely to do further analysis on the event and wish to update its information in the 'activity store'.

 

One question would be whether different filters are required for storage and notifiers? This could possibly be based on different information returned from the filters.
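The ordering of the activity processor's parts can be sketched as a simple pipeline. Everything in this fragment is hypothetical: events are plain dictionaries, each part is a callable, and it assumes that every configured filter must accept the event (the note does not say how multiple filters combine).

```python
import uuid


class ActivityProcessor:
    """Hypothetical pipeline: analyse -> filter -> validate -> store -> notify."""

    def __init__(self, analysers=(), filters=(), validators=(), storage=None, notifiers=()):
        self.analysers = list(analysers)
        self.filters = list(filters)
        self.validators = list(validators)
        self.storage = storage              # optional, as in the note
        self.notifiers = list(notifiers)

    def report(self, event: dict) -> bool:
        """Process one event; return False if a filter rejected it."""
        for analyse in self.analysers:      # cheap, run on every event
            analyse(event)
        # Assumption: every filter must accept; with no filters all() is True.
        if not all(accept(event) for accept in self.filters):
            return False
        for validate in self.validators:    # costlier, run on filtered events only
            validate(event)
        if self.storage is not None:        # allocates the id before notification
            self.storage(event)
        for notify in self.notifiers:
            notify(event)
        return True


store, published = {}, []


def add_order_context(event):
    event["context"]["OrderId"] = event["message"]["orderId"]


def in_memory_storage(event):
    event["id"] = str(uuid.uuid4())
    store[event["id"]] = event


processor = ActivityProcessor(
    analysers=[add_order_context],
    filters=[lambda e: e["destinationType"] == "OrderService"],
    storage=in_memory_storage,
    notifiers=[published.append],
)

accepted = processor.report(
    {"destinationType": "OrderService", "message": {"orderId": "abc123"}, "context": {}}
)
rejected = processor.report(
    {"destinationType": "AuditService", "message": {"orderId": "zzz"}, "context": {}}
)
```

Running storage before the notifiers means a published event already carries its unique id, so a subscriber can later update the stored record.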

 

 

Distribution Mechanisms

 

The 'storage' component can be replaced with a proxy implementation within remote systems, to report the activity event information to a central location that may persistently store the activity information and trigger other processing. This section discusses some considerations for such a mechanism.

 

Styles

 

There are two possible styles that could be provided to accumulate the activity information from the distributed environment being monitored.

Immediate

 

This approach means that activity events are distributed in realtime (or near realtime), so that any analysis is performed as quickly as possible.

 

Store and Forward

 

For efficiency purposes, it may be appropriate to cache activity events in the local environment and transfer them to an intermediate or final (central) location periodically.

 

Obviously this is more efficient, but the downside is the delay in activity events being available for further more central analysis, and the issues that may arise if the node becomes unavailable - although ideally the cached events should be transferred again once the node comes up, but that assumes the cache is persistent in the local environment.
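A store-and-forward proxy might be sketched as follows. The class name, the batch-size trigger and the use of `ConnectionError` to signal an unreachable central location are assumptions; the cache here is held in memory only, so it would not survive a restart of the local node.

```python
class StoreAndForward:
    """Hypothetical proxy 'storage' that batches events before forwarding."""

    def __init__(self, forward, batch_size=3):
        self._forward = forward          # callable taking a list of events; may raise
        self._batch_size = batch_size
        self._pending = []               # in memory only; not persistent

    def store(self, event):
        self._pending.append(event)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Forward cached events; a periodic timer could also call this."""
        if not self._pending:
            return
        batch = list(self._pending)
        try:
            self._forward(batch)
        except ConnectionError:
            return                       # keep the events for a later retry
        del self._pending[:len(batch)]

    def pending_count(self):
        return len(self._pending)


sent = []
buffer = StoreAndForward(sent.extend, batch_size=2)
for n in (1, 2, 3):
    buffer.store({"n": n})
```

After the loop the first two events have been forwarded as one batch and the third is still cached until the next flush.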

 

Technologies

 

A range of technologies can be used, so a fixed list does not need to be defined.

 

One of the initial implementations is likely to be JMS.

 

 

Event Sources

 

A variety of event sources may be required, to capture the relevant information. This could be achieved based on a standard set of components that have appropriate configurations.

 

Technology Specific Interceptors

 

For monitoring interactions on technologies such as JAX-WS based web service stacks, ESB, etc. we may need to define specific 'interceptors' to observe the messages and pass them to the 'activity processor' component.

 

Log File Adapter

 

A log file adapter component may be required that can be configured with the location of the log file and some transformation configuration to process the log record format into the required representation.
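The transformation configuration could be as simple as a regular expression with named groups that map onto the component activity attributes. The log record format below is invented for illustration; a real adapter would read the pattern from its configuration.

```python
import re
from typing import Optional

# Hypothetical log record format, e.g.
#   2011-03-01T10:15:00Z Started process=PurchaseOrder instance=42
RECORD = re.compile(
    r"(?P<timestamp>\S+) (?P<status>Started|Finished) "
    r"process=(?P<componentName>\S+) instance=(?P<instanceId>\S+)"
)


def parse_record(line: str) -> Optional[dict]:
    """Turn one log line into component activity fields, or None if it does not match."""
    match = RECORD.fullmatch(line.strip())
    return match.groupdict() if match else None


parsed = parse_record("2011-03-01T10:15:00Z Started process=PurchaseOrder instance=42\n")
```

Lines that do not match the configured pattern are skipped rather than reported as events.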

 

Database Adapter

 

When log information is stored within a database, we need to provide an adapter to obtain new logged records and provide transformation configuration to process the log record format into the required representation.

 

 

 

Activity Query Architecture

 

The query aspect of the activity monitoring/management architecture is more straightforward in terms of architecture. The complexity with this mechanism is going to be how to support the range of queries that may be required by applications and users (via a suitable user interface).

 

In terms of implementation, it is likely that the same interface used to provide the 'activity store' in the activity collection architecture will be used for query as well. This would enable central and remote applications to similarly retrieve information.

 

The first distinction from a query perspective is the event type, currently just component and interaction based activity events. Within those groupings there may be other common queries that will be discussed in the following sub-sections.

 

Following the discussion on required query support, there will be a discussion on how a user interface may enable a user to navigate this information in an appropriate and effective manner.

 

Querying Activity Information

 

Common Queries

 

  • All events within a time range - this may be used in conjunction with any other query
  • Distinct context names - return list of names used in the context fields
  • Activity events with a particular context name and value - name is possibly optional, in which case search on value only

 

When querying context values, it would be useful to be able to query using set, pattern matching and range operators, as well as simple equivalence.
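The following sketch shows one way the common queries and the value operators could fit together, using an in-memory list in place of the activity store. The function names are hypothetical, pattern matching uses shell-style wildcards via `fnmatch`, and the range operator compares values as strings.

```python
from datetime import datetime
from fnmatch import fnmatchcase


# Value operators: each returns a predicate over a single context value.
def equals(expected):
    return lambda value: value == expected

def one_of(*allowed):
    return lambda value: value in allowed

def like(pattern):
    return lambda value: fnmatchcase(value, pattern)

def between(low, high):
    return lambda value: low <= value <= high


def distinct_context_names(events):
    return sorted({name for event in events for name in event["context"]})


def query(events, start=None, end=None, name=None, value_test=None):
    """Filter by time range [start, end) and/or a context value test.

    If name is None, the value test is applied to every context value.
    """
    results = []
    for event in events:
        if start is not None and event["timestamp"] < start:
            continue
        if end is not None and event["timestamp"] >= end:
            continue
        if value_test is not None and not any(
            value_test(value)
            for key, value in event["context"].items()
            if name is None or key == name
        ):
            continue
        results.append(event)
    return results


events = [
    {"id": "e1", "timestamp": datetime(2011, 3, 1), "context": {"OrderId": "abc123"}},
    {"id": "e2", "timestamp": datetime(2011, 3, 8), "context": {"OrderId": "abd999"}},
]
matching = query(events, name="OrderId", value_test=like("abc*"))
```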

 

Interaction based Activity

 

  • List of Activity Events associated with any combination of Destination Type and/or Address (this could be performed in combination with time range and/or context name/value list)
    • For example, list of interaction activities with destination type 'OrderService', over the last week, with OrderId = 'abc123'

 

Component based Activity

 

  • List of Activity Events associated with any combination of the Instance Id, Component Id and/or Component Name

 

 

 

User Interface for Visualising the Activity Information

 

Protocol Focused User Navigation

 

Protocol validation will be performed against the interaction based activity information, to determine if the interactions are valid. The protocol definitions can be retrieved from a common repository shared with the protocol validators.

 

These protocols represent business processes that span multiple services. Therefore users may wish to see a global view of these business processes, and view them in the context of the protocol definition (possibly graphically), being able to access the individual events and even replay the activities (again in a graphical context).

 

It may also be possible to tie the interaction based information into the component based activity information for the relevant services/processes, based on shared contextual information.

 

So the aim of this view will be to display the list of protocol definitions (i.e. business processes - or whatever appropriate name is used) to the user, and then allow them to select other search criteria within that scope. The other criteria could include time range, some contextual information, etc. Where contextual information is used, potentially filtered based on the available context names associated with the selected protocol definition, multiple fields could be defined.

 

This should result in a list of 'conversation instances' being displayed in a table - each entry representing distinct combinations of the selected information. When an individual instance is selected, it should then be possible to see all activity events related to that criteria (see Conversation Instance View).
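Deriving the table rows amounts to grouping events by the values of the selected context names. A minimal sketch, assuming events are dictionaries with a sortable timestamp and a context mapping (the function name is hypothetical):

```python
from collections import defaultdict


def conversation_instances(events, context_names):
    """Group event ids by the distinct values of the selected context names."""
    groups = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = tuple(event["context"].get(name) for name in context_names)
        if None not in key:          # skip events lacking a selected context field
            groups[key].append(event["id"])
    return dict(groups)


events = [
    {"id": "e2", "timestamp": 2, "context": {"OrderId": "abc123"}},
    {"id": "e1", "timestamp": 1, "context": {"OrderId": "abc123"}},
    {"id": "e3", "timestamp": 3, "context": {"OrderId": "xyz789"}},
    {"id": "e4", "timestamp": 4, "context": {}},
]
instances = conversation_instances(events, ["OrderId"])
```

Each key in the result would correspond to one row of the table, and the associated event ids, in time order, are what the selected instance would display.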

 

Conversation Instance View

 

A conversation instance represents a series of interaction based events, between different components, that are related by some contextual information.

 

This view will take a list of interaction based activity events, related by some common contextual information, and display them (hopefully) in a graphical representation (as well as a detailed table of the underlying events), offering the ability to step through the events and see the impact on the overall protocol definition.

 

Where additional analysis information is associated with a particular event, this should be flagged, and the user should be able to view the information.

 

As well as navigating to the 'conversation instance' view based on a filtered query, it should also be possible to navigate to this display by other means. The main input into this step is the protocol definition and the contextual information that identifies the conversation instance.

 

 

Error View

 

When validation is performed as part of the event collection, the information will be recorded in analysis fields. For example, validation against a protocol definition may flag an interaction based event as 'unexpected'.

 

QUESTION: Should events have a status level (e.g. info, warning, error) that can be used by the validation phases, and that can only be raised (i.e. a subsequent validation module cannot lower an error level to warning)? Then, when stored in the db, it would be easy to index the records based on their status.
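If such a status level were adopted, the "can only be raised" rule could be expressed as taking the maximum of the current and proposed levels. This is only a sketch of the idea under discussion; the level names and their ordering are assumptions.

```python
from enum import IntEnum


class Level(IntEnum):
    INFO = 0
    WARNING = 1
    ERROR = 2


def raise_level(current: Level, proposed: Level) -> Level:
    """Return the new level; a later validator can raise it but never lower it."""
    return max(current, proposed)


level = Level.INFO
for proposed in (Level.ERROR, Level.WARNING):   # two validators in sequence
    level = raise_level(level, proposed)
```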

 

This display should be actively updated to reflect new erroneous situations that occur. If an error relates to an interaction based event, then the display should enable the user to navigate to the appropriate 'conversation instance' view as introduced above.

 

 

Component Instance View

 

In a similar manner to the conversation instance view, it will also be useful to view a correlated set of component based events in a single (hopefully) graphical and tabular form. The correlation would be based on a common instance id, as well as the component id and/or name.