JGroupsQualityRisksAS5

JGroups Quality Risks for AS 5

 

JGroups is moving from 2.4.1.SP4 in AS 4.3 to 2.6.2 in AS 5, and many new features have been introduced (multiplexing, FLUSH, partial/streaming state transfer, view bundling, concurrent stack, out-of-band messages, RPCDispatcher filtering, etc.). At least one new feature (multiplexing) is being used in AS 5, which will require tests over and above those present in AS 4.3. This page serves as a shared whiteboard for recording ideas on how the new features of JGroups will affect the requirements for testing in AS 5.

 

 

Key Changes by Release

 

| Version | Changes To JGroups Features |
|---|---|
| 2.4 | multiplexing channels, FLUSH, partial and streaming state transfer, view bundling, FD_ALL, FD_ICMP |
| 2.5 | concurrent stack, OOB messages, concurrent multiplexer, SFC, FLUSH fully supports virtual synchrony |
| 2.6 | join and perform state transfer at same time, UNICAST bundling, RPCDispatcher filtering, adding data to a view, reincarnation prevention, shared transport, FLUSH subset of cluster, eager lock release in NAKACK and UNICAST, thread factory hooks, PING and multiple Gossip Routers, TCPPING parallel discovery |

 

Potential Integration Problems with AS 5 due to JGroups changes

 

Risks are classified as functional, performance or stress related. A priority could be added later.

 

Functional

 

| ID | Feature | Potential Problem | Action | Reporter |
|---|---|---|---|---|
| 1 | Shared channels | Channel multiplexing is being used for sharing channels. Use of mux is no longer recommended. Shared transport is now the recommended approach. | Check with developers (see Note1). Test that all apps are using shared transport in AS 5. | Bela |
| 2 | Concurrent stack | Application callbacks are not thread-safe. With the old stack, callbacks such as MembershipListener, MessageListener and ChannelListener were never called concurrently. With the new stack, these callbacks can be called concurrently for events (state transfers, message arrivals) from different senders (see Note2). Where the stack contains ordering protocols, this may restrict the degree of concurrency. For example, with FIFO ordering (e.g. pbcast.NAKACK for UDP mcast and UNICAST for UDP unicast), events from the same sender will not be concurrent, but processed in FIFO order. In general, though, the application should assume that callbacks will be called concurrently. | Devise tests for JBC, JBM and Clustering that check correctness under high concurrency from multiple peers. For AS Clustering: JBAS-5432 | Brian |
| 3 | Shared transport | AS behaves differently under shared transport than under non-shared transport. Seeing that the shared transport is so pervasive, this is a possibility. | Run AS testsuite in both configurations and compare | Richard |
| 4 | Shared transport | Conflicting configurations. When channels share a common transport, each still defines a full stack, including the transport layer and its properties. The properties may differ between the transport definitions. Depending on the order of instantiation, desired property settings can be discarded without warning. | Possibly emit a warning when conflicting configurations arise. Devise tests to check that warnings are emitted | Brian |
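The check proposed for risk 4 could look roughly like the sketch below. It is plain Java and does not touch JGroups: the class `TransportConfigCheck`, its method, and the property names and values in `main` are hypothetical and only illustrate comparing two transport definitions so that a warning can be emitted. A property set in only one definition is also reported, which is an assumption about what should count as a conflict.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper, not part of JGroups or the AS. */
final class TransportConfigCheck {

    /**
     * Returns every property whose value differs between two transport
     * definitions, mapped to a "first vs second" description.
     * A property missing on one side is shown as "<unset>".
     */
    static Map<String, String> conflicts(Map<String, String> first,
                                         Map<String, String> second) {
        // Union of the property names used by either definition.
        Set<String> keys = new TreeSet<>(first.keySet());
        keys.addAll(second.keySet());

        Map<String, String> result = new TreeMap<>();
        for (String key : keys) {
            String a = first.getOrDefault(key, "<unset>");
            String b = second.getOrDefault(key, "<unset>");
            if (!a.equals(b)) {
                result.put(key, a + " vs " + b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative property names and values only.
        Map<String, String> channelA = new TreeMap<>();
        channelA.put("mcast_port", "45566");
        channelA.put("ip_ttl", "2");

        Map<String, String> channelB = new TreeMap<>();
        channelB.put("mcast_port", "45566");
        channelB.put("ip_ttl", "8");

        // A test would assert that a warning is logged for each entry.
        conflicts(channelA, channelB).forEach((property, values) ->
                System.out.println("WARN conflicting transport property "
                        + property + ": " + values));
    }
}
```

A test along these lines would instantiate the channels in both orders and assert that the same warnings appear either way.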

 

Note1: JBM and JBC currently use createChannelFactory to create a multiplexed channel. Changing this would require (i) a change of interface in JGroups to expose the appropriate API, plus a new release, and (ii) changes to the JBM and JBC code, plus new releases. Instead, Brian will modify the creation of channels so that (i) a call to createChannelFactory will return a shared transport and not a mux channel and (ii) a shared transport use_singleton property is injected into the transport if one does not exist. In this way, all apps will use a shared transport by default. Some additional work is needed for Hibernate standalone.

 

Note2: Among the callbacks, view changes are not delivered concurrently. However, it may be good practice to ensure that all callbacks are thread-safe.
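A minimal sketch of the kind of concurrency test suggested for risk 2 follows. It uses only java.util.concurrent. `CountingReceiver` is a hypothetical stand-in for an application callback such as MessageListener.receive(), and `simulate` plays the role of several peers delivering messages at the same time. A real test would drive an actual channel instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for an application-level message callback. */
final class CountingReceiver {
    private final Map<String, AtomicLong> perSender = new ConcurrentHashMap<>();
    private final AtomicLong total = new AtomicLong();

    /** Safe to call from several threads at once: no unguarded shared state. */
    void receive(String sender, String payload) {
        perSender.computeIfAbsent(sender, s -> new AtomicLong()).incrementAndGet();
        total.incrementAndGet();
    }

    long count(String sender) {
        AtomicLong n = perSender.get(sender);
        return n == null ? 0 : n.get();
    }

    long total() {
        return total.get();
    }

    /** Simulates `senders` peers, each delivering `perSenderMsgs` messages concurrently. */
    static CountingReceiver simulate(int senders, int perSenderMsgs) {
        CountingReceiver receiver = new CountingReceiver();
        ExecutorService pool = Executors.newFixedThreadPool(senders);
        for (int i = 0; i < senders; i++) {
            String sender = "node-" + i;
            pool.execute(() -> {
                for (int m = 0; m < perSenderMsgs; m++) {
                    receiver.receive(sender, "msg-" + m);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("simulation did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return receiver;
    }

    public static void main(String[] args) {
        CountingReceiver r = simulate(4, 10_000);
        System.out.println("total delivered: " + r.total());
    }
}
```

If `total` were a plain `long` field incremented without synchronization, the final count could fall short of the number of messages sent. That lost-update symptom is what such a test is meant to expose.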

 

Performance

 

| ID | Feature | Potential Problem | Action | Reporter |
|---|---|---|---|---|
| 1 | Shared transport | AS does not perform significantly better with shared transport. The degree of improvement in performance is at present unknown. | Comparative test of performance of AS with and without shared transport | Richard |
| 2 | Shared transport | Transport properties out of alignment with sharing multiple channels. Depending on the degree of sharing, UDP/TCP transports will now be receiving more messages. UDP and TCP buffers and thread pools will be more likely to overflow. | Adjust UDP/TCP properties accordingly and reach the threshold of thread pool exhaustion. Is a warning message generated when the thread pool is full, so that users know? (Birna, Bela) We need to come up with settings (thread pool sizes, timer pool, etc.) for AS that take all 5 or 6 channels into account | Richard |
| 3 | Shared transport | AS startup time is increased. Moving from the multiplexer to shared transport increases the startup time for channels, and therefore for the AS. Currently, AS services are deployed by a single thread; concurrent deployment for independent services is being considered. | Test the impact of using shared channels on overall AS startup time | Bela |

 

Stress

 

| ID | Feature | Potential Problem | Action | Reporter |
|---|---|---|---|---|
| 1 | Shared transport | Classloader leaks in thread pool. Classloaders may leak in the JGroups thread pool over time. Effect on JGroups / applications? | Devise test to force a classloader leak in the JGroups thread pool | Brian |
| 2 | Shared transport | Starvation of one service by another. One greedy service causes starvation of thread pool resources for other services: for example, JBoss Web with many sessions under replication using up all threads in the JG thread pool | Set up AS "Seam-like" app that involves web and SFSB replication, entity clustering and JBM. Introduce faults into individual services to see how the overall system reacts | Brian |
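For stress risk 1, one mechanism that commonly pins a classloader in a long-lived pool is the context class loader a worker thread inherits when it is created. The sketch below demonstrates that mechanism with a plain java.util.concurrent pool. Whether the JGroups thread pool leaks in this way is an assumption that the proposed test would have to confirm. The class and method names are hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demo, using a plain JDK pool rather than the JGroups pool. */
final class PooledThreadLoaderDemo {

    /**
     * Returns the context class loader a pool thread still holds after the
     * "deployment" that caused the thread to be created has gone away.
     */
    static ClassLoader loaderSeenByPoolThread(ClassLoader deploymentLoader) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Thread caller = Thread.currentThread();
        ClassLoader original = caller.getContextClassLoader();
        try {
            // A service submits work while its own loader is the context loader.
            // The pool creates its worker now, and the worker inherits that loader.
            caller.setContextClassLoader(deploymentLoader);
            Runnable warmUp = () -> { };
            pool.submit(warmUp).get();

            // The service is "undeployed": the caller's loader is restored...
            caller.setContextClassLoader(original);

            // ...but the long-lived worker thread still references the old loader.
            Callable<ClassLoader> probe =
                    () -> Thread.currentThread().getContextClassLoader();
            return pool.submit(probe).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            caller.setContextClassLoader(original);
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        ClassLoader deployment = new URLClassLoader(new URL[0]);
        boolean retained = loaderSeenByPoolThread(deployment) == deployment;
        System.out.println("worker retained deployment loader: " + retained);
    }
}
```

A leak test could build on this by holding only a weak reference to the deployment loader and checking whether it is ever cleared while the pool is alive.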

 

Related Tests in JG, JBC, JBM and AS test suites

 

Identify here test suites which address the potential problems mentioned above. General indication is OK for now.

 

Functional

 

| Component | Tests | Description |
|---|---|---|
| JGroups | SharedTransportTest | Tests key features of the shared transport |
| JGroups | ConcurrentStackTest | Tests key features of the concurrent stack |

 

Performance

 

| Component | Tests | Description |
|---|---|---|
| sample component | name of suite | description of suite |

 

Stress

 

| Component | Tests | Description |
|---|---|---|
| sample component | name of suite | description of suite |

 

Comments

 

Brian Stansberry (17 March 2008):

 

General comments on new JGroups feature usage in AS 5 and testing thereof:

 

In general order of "problem" significance:

 

1) Concurrent stack.  Issue here is it is now possible for AS services and JBC to concurrently receive messages from JGroups.  AS codebase has zero tests targeted at checking handling of this.  Don't know about JBC.

 

2) Shared transport. General issue that different AS services will be sharing a JGroups resource.  Couple subissues come to mind:

 

a) Lifecycle of the shared transport protocol as services start and stop.  This is really something that's better tested directly in the JGroups testsuite; the AS isn't doing anything special here.

 

b) Different services are sharing the shared transport's thread pool, thus there is possibility of conflict between services over that resource (i.e. one rogue service consumes all threads).  This is an area that needs testing.  I've briefly talked with Dominik about the need for tests of a "Seam-like" app that involves web and SFSB replication, entity clustering and JBM. With such an app we can introduce faults into individual services to see how the overall system reacts.

 

3) OOB messages.  The AS code doesn't use these directly.  JBC might for  2PC COMMIT messages. Not sure what the risk is here; just something different.

 

4) VIEW_SYNC. Clebert asked a question Friday about how this changes the timing of receipt of view changes.  Perhaps could be an issue if services have an implicit assumption about timing (which they shouldn't, since different nodes will always get views at different times.)
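The thread pool conflict described in 2b can be reproduced in miniature with a plain java.util.concurrent pool, as sketched below. The "greedy" and "victim" services are hypothetical stand-ins for AS services sharing one transport thread pool. No JGroups API is used, and the pool size and wait time are arbitrary.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical demo of one service starving another on a shared pool. */
final class SharedPoolStarvationDemo {

    /**
     * Returns true if the victim's task could not complete within waitMillis
     * while a greedy service occupied every thread of the shared pool.
     */
    static boolean victimStarved(int poolSize, long waitMillis) {
        ExecutorService shared = Executors.newFixedThreadPool(poolSize);
        CountDownLatch greedyRunning = new CountDownLatch(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Greedy service: holds every pool thread until released.
            for (int i = 0; i < poolSize; i++) {
                shared.execute(() -> {
                    greedyRunning.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            greedyRunning.await();

            // Victim service: its task is queued behind the greedy ones.
            Callable<String> victimTask = () -> "handled";
            Future<String> victim = shared.submit(victimTask);
            try {
                victim.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException starved) {
                return true;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            shared.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("victim starved: " + victimStarved(2, 200));
    }
}
```

In the "Seam-like" app, the equivalent fault injection would be a service whose callbacks block or run slowly, with the test measuring how long the other services take to see their messages.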

 

Bela Ban (17 March 2008)

 

+1 on Brian's comments, plus

 

    Shared transport: since AS 5 will use it by default, we need to (a) see whether the current tests in JGroups cover all possible uses in AS and (b) add shared transport tests directly in the testsuites of AS (and, to a lesser degree, of JBC)

    See where we're still using MUX and make sure nobody uses it anymore; switch all users to the shared transport

 

The goal here is to have a valid replacement for the MUX, not to replace something flawed with something equally flawed

 

Richard Achmatowicz (2 April 2008)

 

1) Concurrent stack and callbacks: Applications interact with JGroups via callbacks such as MembershipListener, MessageListener and ChannelListener. One significant change between the old stack and the concurrent stack is that with the old stack, these callbacks were never called concurrently. With the new stack, these callbacks can be called concurrently for events (view changes, state transfers, message arrivals)  from different senders.

 

In the case where the stack contains ordering protocols, this may restrict the degree of concurrent callbacks. For example, with FIFO ordering (e.g. pbcast.NAKACK for UDP mcast and UNICAST for UDP unicast), now events from the same sender will not be concurrent, but processed in FIFO order. But in general, the application should assume that callbacks will be called concurrently.

 

As a consequence, at a minimum, all application level callbacks should be thread-safe.

 

2) Shared transport and concurrent thread pool: When using the shared transport, the number of incoming messages to the transport may double, triple, etc. depending on the number of channels sharing that transport. If thread pool sizes and thread pool queue sizes are not adjusted, this will probably result in events being rejected due to thread pool being overloaded. In the case that the thread pool is overloaded, the default rejection policy is to have the caller (multicast port, unicast port, TCP port) carry the event up the stack. This will effectively result in sequential processing of events (need to double check this) without any exception being raised. Thus, poor performance may be seen.

 

3) RpcDispatcher: Are all callbacks used internally in RpcDispatcher (and HAPartition, DistributedState) thread-safe?
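The rejection behaviour described in point 2 matches `ThreadPoolExecutor.CallerRunsPolicy` in java.util.concurrent. The sketch below shows that policy on a deliberately tiny pool, assuming the JGroups transport pool is configured with an equivalent "caller runs" policy, which, as noted above, still needs to be double-checked. The class and method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical demo of the caller-runs rejection policy on a full pool. */
final class CallerRunsDemo {

    /** Returns the name of the thread that ran the task the pool had no room for. */
    static String threadThatRanOverflowTask() {
        // One worker thread and one queue slot, so the third task overflows.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<String> overflowThread = new AtomicReference<>();
        try {
            // Task 1 occupies the only worker thread until released.
            pool.execute(() -> awaitQuietly(release));
            // Task 2 fills the only queue slot.
            pool.execute(() -> { });
            // Task 3 finds pool and queue full, so the policy runs it
            // synchronously in the submitting thread. No exception is raised.
            pool.execute(() -> overflowThread.set(Thread.currentThread().getName()));
            return overflowThread.get();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("overflow task ran on: " + threadThatRanOverflowTask());
    }
}
```

Because the submitting thread is busy running the overflow task, it cannot accept further work in the meantime. This is the silent serialization, and the resulting drop in throughput, that a performance test should look for.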