Having the infrastructure to collect activity information and present it to IT and business users is all very well - but we also need to consider what information to present, and in what way.
Monitoring low level activity will generate a large amount of data. To make this information useful, we need to understand in what ways it should be correlated and filtered, to avoid overwhelming users, but at the same time ensure they are actively notified when situations of interest occur.
So does this imply that all we really need is notifications? Possibly - being emailed or SMS'd when a warning or alert has occurred could be a useful way to inform the relevant people of actions that need to be taken. These notifications could also be used to trigger automated actions. However, care must be taken with this approach: when situations go bad, sending out thousands of notifications (texts, emails, etc.) can add to the problems.
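One way to limit a notification storm is to suppress repeats of the same situation within a time window. The following is only a minimal sketch of that idea; the class name, the `situation_key` convention and the five minute default are all assumptions made for illustration, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationThrottle:
    """Hypothetical helper: allow one notification per situation per time window."""

    window_seconds: float = 300.0  # assumed default, tune per deployment
    _last_sent: dict = field(default_factory=dict)
    _suppressed: dict = field(default_factory=dict)

    def should_send(self, situation_key: str, now: float) -> bool:
        """Return True if a notification for this situation should go out now."""
        last = self._last_sent.get(situation_key)
        if last is not None and now - last < self.window_seconds:
            # Still inside the window: count the repeat instead of sending it.
            self._suppressed[situation_key] = self._suppressed.get(situation_key, 0) + 1
            return False
        self._last_sent[situation_key] = now
        return True

    def suppressed_count(self, situation_key: str) -> int:
        """Number of repeats swallowed so far for this situation."""
        return self._suppressed.get(situation_key, 0)
```

The suppressed count could be included in the next notification that does go out, so the recipient still sees how often the situation recurred.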
Notifications are fine for known situations of interest, and have the benefit of not requiring a user to constantly monitor a UI - but what about new situations (ad hoc) that need to be investigated, or when a user receiving a notification needs to delve into the details of the issue?
So we come back to some form of UI - which presents the problem of how users navigate the large quantity of information. As a by-product of some of the analysis, high level correlation may have derived some abstractions that could be useful (e.g. conversation, transaction, etc.) but which abstractions are the most useful? Even with higher level abstractions, there are still potentially many thousands (or hundreds of thousands) of instances of these abstractions to deal with - so how will a user find the information of interest?
This is the problem we need to tackle. Suggestions welcome.
I've mentioned this before, but it seemed like the old Eclipse TPTP project had some good ideas. Specifically, the Common Base Event model seemed like a good basis for defining events that could be correlated across servers and processes. The tools project also supported event filtering and pattern recognition (i.e. problem diagnoses could be generated for event chains matching certain patterns). I know the project is stagnant, but maybe there's something useful there, if only ideas.
That said, it sounds like the other side of the problem is how to aggregate all that data into something a user can process quickly and easily. Stream based processing might be useful for aggregating certain aspects of the data over time. Those aggregators could themselves feed other stream processors that could eventually present the user with a distilled view of what is going on. Assuming all of the events are persisted, the user should easily be able to drill into a specific subset of data that relates to the distilled information.
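As a rough illustration of aggregators feeding other aggregators, here is a sketch using plain Python generators. The event shape (a timestamp and a service name) and both stage names are assumptions for the example; a real deployment would more likely use a dedicated stream processing engine.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, service name).
Event = Tuple[float, str]


def count_per_window(events: Iterable[Event], window: float) -> Iterator[Tuple[int, Counter]]:
    """First stage: turn time-ordered events into per-window counts by service."""
    current, counts = None, Counter()
    for ts, service in events:
        index = int(ts // window)
        if current is not None and index != current:
            # The window has closed, so pass its counts downstream.
            yield current, counts
            counts = Counter()
        current = index
        counts[service] += 1
    if current is not None:
        yield current, counts


def busiest_service(windows: Iterable[Tuple[int, Counter]]) -> Iterator[Tuple[int, str, int]]:
    """Second stage: distil each window down to its most active service."""
    for index, counts in windows:
        service, n = counts.most_common(1)[0]
        yield index, service, n
```

Chaining the stages as `busiest_service(count_per_window(events, 10.0))` gives one distilled record per window, and the window index can be used to look up the persisted raw events when drilling down.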
So, back to your question, what does that processing chain look like? Hmm...
Not sure if this is useful or not.
Yes I think the TPTP model could provide some useful guidance on a model, and the techniques you mention will be useful in processing the data.
I guess what I was looking for is more from the top down - wanting potential users of the technology, or consultants that deal with such users (and therefore have a good understanding of their requirements), to provide the "what", so that we can work out the "how" later.
Exploring this idea of drilling down into the information in more detail, we can broadly consider there to be three levels of interest, (1) Business, (2) Service and (3) IT.
I think the most important layer is (1), as layers (2) and (3) are ultimately just there to support (1), which generates the revenue for the business. So although each layer will have its 'administrators' who are concerned with managing that layer, it may be useful to initially explore this area from the top layer down.
So the first issue is, what proportion of an organisation's "business processes" are manageable based on some form of structured definition? By that I mean where the execution of the business process can be traced through a set of co-operating systems, to enable them to be monitored and reasoned over. If an organisation doesn't function through the use of some form of automated business process, then probably the most that can be achieved is to provide service activity monitoring support for (2).
If we assume that most large organisations have one or more automated "end to end" business processes, then that gives us a frame of reference when communicating issues to the end user.
Firstly, a system can be monitored against those business processes to build up a profile of how those processes perform under normal circumstances, and therefore give reference information (i.e. a template) upon which to identify deviations that may highlight issues.
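To make the idea of a profile concrete, here is a minimal sketch that summarises historic durations for a process and flags a new measurement that strays too far from them. The function names, the use of mean and standard deviation, and the tolerance of three standard deviations are all assumptions; other statistics (percentiles, for example) may suit real data better.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def build_profile(durations_ms: Sequence[float]) -> Dict[str, float]:
    """Summarise historic durations of a business process as a simple profile."""
    return {"mean": mean(durations_ms), "stdev": stdev(durations_ms)}


def is_deviation(profile: Dict[str, float], duration_ms: float, tolerance: float = 3.0) -> bool:
    """Flag a duration more than `tolerance` standard deviations from the mean."""
    return abs(duration_ms - profile["mean"]) > tolerance * profile["stdev"]
```

For example, a profile built from `[100, 102, 98, 101, 99]` would accept a duration of 103 but flag one of 110.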
Secondly, accumulating such historic information about these business processes can be useful in determining where improvements in efficiency can be made, as well as potentially providing data that can be used to simulate against a proposed new process, to determine an estimate of the improvements.
Thirdly, having this history information, in the context of a business process, enables the organisation to determine the costs associated with running that same process over different infrastructure (by plugging in an appropriate cost model), e.g. to determine the cost savings that may be made by moving some or all of their business process execution to different cloud providers.
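Plugging in a cost model could look something like the sketch below. The `CostModel` fields and the shape of the history records (CPU seconds and request count per process instance) are hypothetical; real provider pricing has many more dimensions.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CostModel:
    """Hypothetical pricing for one infrastructure option."""

    name: str
    cost_per_cpu_second: float
    cost_per_request: float


def process_cost(history: Iterable[Tuple[float, int]], model: CostModel) -> float:
    """Estimate the cost of replaying recorded process instances under a cost model.

    Each history record is (cpu_seconds, request_count) for one process instance.
    """
    return sum(
        cpu * model.cost_per_cpu_second + requests * model.cost_per_request
        for cpu, requests in history
    )
```

Running the same history through several `CostModel` instances gives a like-for-like comparison between infrastructure options.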
When users need to access the history information, drilling down based on selecting a business process, and then having options to filter out business transactions based on certain criteria, could be useful. In some cases the filtering may be based on specific transaction information (e.g. a transaction id), but in other cases it may be specific to the performance characteristics associated with a part of the process.
Once relevant transactions have been identified, they should be able to drill down into details of the activities related to the transaction, which may also include details of the deployment architecture (i.e. nodes in the clusters used to execute the transaction).
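The drill-down described above could be sketched as a filter over recorded transactions. The `Transaction` record and its field names are assumptions made for the example, as is the idea of filtering on the duration of a named step.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Transaction:
    """Hypothetical record of one business transaction."""

    tx_id: str
    process: str
    step_durations_ms: Dict[str, float] = field(default_factory=dict)


def filter_transactions(
    transactions: Iterable[Transaction],
    process: str,
    tx_id: Optional[str] = None,
    step: Optional[str] = None,
    min_duration_ms: Optional[float] = None,
) -> List[Transaction]:
    """Select transactions of one process, optionally by id or by a slow step."""
    selected = []
    for tx in transactions:
        if tx.process != process:
            continue
        if tx_id is not None and tx.tx_id != tx_id:
            continue
        if step is not None and min_duration_ms is not None:
            # Transactions that never reached the step are treated as not slow.
            if tx.step_durations_ms.get(step, 0) < min_duration_ms:
                continue
        selected.append(tx)
    return selected
```

A user might first select a process, then narrow to a single id, or instead ask for every transaction where a given step took longer than some threshold.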
From a service administrator's perspective, they may be more interested in how the services are performing. However, the context that can be provided by an understanding of the business processes could be invaluable in ensuring that they tackle the highest priority issues first. So although a particular service may indicate that it is performing poorly, if that service is very rarely used - or is only used by low priority business processes - then it is better for the service administrators to focus their efforts where it will have a more positive impact on the business.
Same applies to the IT level - the priority of IT issues can be determined by their impact on associated services, which indirectly impact the business goals.
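A simple way to express this prioritisation is to weight each degraded service by how heavily, and by which processes, it is used. The sketch below assumes a usage map of calls per process and a numeric priority per business process; both structures and the scoring formula are illustrative only.

```python
from typing import Dict, Iterable, List


def rank_services(
    degraded: Iterable[str],
    usage: Dict[str, Dict[str, int]],
    process_priority: Dict[str, int],
) -> List[str]:
    """Order degraded services by business impact, highest first.

    usage maps service -> {business process -> calls in the period};
    process_priority maps business process -> weight (higher is more important).
    """

    def impact(service: str) -> int:
        return sum(
            calls * process_priority.get(process, 0)
            for process, calls in usage.get(service, {}).items()
        )

    return sorted(degraded, key=impact, reverse=True)
```

The same scoring could be applied one level down, ranking IT issues by the impact of the services they affect.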
So possibly the initial focus should be on defining models that can be used as profiles/templates to:
(1) assess the executing system against normal parameters - and with the overall understanding of their role in the business context, this can help to assess priorities.
(2) generate appropriate notifications when the measured criteria are going out of bounds
(3) provide a reference model through which users can visualise the state of the system, progressively filtering out information based on various criteria
(4) reflect the layer of interest, with different models used for each layer, e.g. BPMN2 choreography/process models for the business processes, UML deployment diagrams for the logical service layer, etc.