1 2 Previous Next 16 Replies Latest reply on Mar 12, 2008 4:14 PM by manik

    New state transfer in JBoss Cache

    manik

      Guys,

      Based on some earlier discussions, I've started this design thread as well as a wiki page (http://wiki.jboss.org/wiki/Wiki.jsp?page=JBossCacheNewStateTransfer) and a JIRA task (JBCACHE-1236) to contain ideas for a new state transfer mechanism.

      Please have a look and comment accordingly.

        • 1. Re: New state transfer in JBoss Cache
          belaban

          I'll take a look.
          Manik, have you thought about an iterative approach to transferring state ? E.g. a low-level, low-priority task on the state provider which iterates over the tree a few times and sends the state to new joiner in small slices, and is only active when there isn't much load on the tree ?
          IMO a number of applications could live with such a state transfer, e.g. active-passive configs

          • 2. Re: New state transfer in JBoss Cache
            manik

            Yes, active-passive could also work with these low-prio threads working off different disjoint data sets on different "providers" simultaneously.

            The only problem here is maintaining integrity of the state when there are other threads updating the state at the same time - the state may be outdated. But then again, assuming updates are idempotent and that the joiner is queueing up updates while waiting for state, this shouldn't be a problem.

            The only drawback is that it may take a lot longer (significantly longer if the state provider thread is low-prio) for the joiner to get state, but that shouldn't be a problem in an active-passive scenario.

            I think providing pluggable policies/mechanisms for state transfer may be a good thing as well, since a "one size fits all" solution for ST almost certainly won't work. Just means cleaner abstraction between state transfer and normal cache operations, but a lot of forethought into what ST may involve when designing an SPI for this.

            • 3. Re: New state transfer in JBoss Cache
              belaban

              I thought about state transfer today, and it is significantly more complex than I had previously thought.
              I'm not referring to correct state with respect to other messages, we can use digests for that.
              The complexity lies in getting the state at the state provider *while* it is being updated, e.g. by transactions from other nodes.
              More thought needed next week...

              • 4. Re: New state transfer in JBoss Cache
                brian.stansberry

                Do you mean you see a problem with getting the state even with MVCC in place?

                Either way, MVCC is quite a ways off and we discussed support for P/L and O/L for at least a year after it's available. So, I think we need to get a reasonable solution for the P/L and O/L cases ASAP.

                Re: the "queue messages during state transfer" approach discussed on the wiki, we have (in the 1.4.X branch at least) an existing implementation that is very close to what is discussed. Partial state transfer done during region activation implements the message queueing; it should be fairly trivial to extend the same logic to the full initial state transfer.

                Responding positively to prepare even though we just queue the message still worries me; one concern is D receives things in different order than A did due to message retransmission. One possible way to mitigate this is to add some intelligence to how the queue gets applied; e.g. before applying a prepare, look for rollback or commit; discard both if rollback, move commit up in queue so it gets applied immediately.

                The queuing bit for partial state transfer was implemented because we were using RPC calls for that rather than JGroups state transfer. With JG state transfer we could get the same effect from the JGroups layer using digests/retransmission. However, JGroups couldn't do the prepare/rollback/commit matching I discuss above. Also, JGroups will only detect and retransmit missing messages next time a sender sends, which potentially could be a long time (maybe never).

                • 5. Re: New state transfer in JBoss Cache
                  belaban

                   

                  "bstansberry@jboss.com" wrote:
                  Do you mean you see a problem with getting the state even with MVCC in place?


                  I've only looked at optimistic locking so far. Pessimistic is probably harder to get right because the chances of locks on the nodes are higher than with O/L.

                  My current ideas tend towards a queue for prepares, state transfer requests and commits/rollbacks.

                  Prepares acquire locks, commits copy data back into the tree from the workspace. I can argue that if D wants to acquire state from A {A,B,C,D} that (in random order)
                  - the first prepare in the queue will always succeed
                  - there is never a commit for TX-N that was started after D joined and asked for state, because D will not reply to prepares during state transfer
                  - A state request from D has to be pushed to the top of the queue (commits can be ahead of the state request because commits only copy data and release locks)
                  - The queue is suspended during state transfer. I don't think we should reply YES to a prepare for a TX if we cannot be 100% certain we'll be able to commit it later
                  - I can add queueing to JGroups, but I'm not sure it's a gain, because missing messages get retransmitted anyway. If we waited for more then LockAcquisitionTimeout ms, then all prepares will result in corresponding rollbacks later anyway

                  I need to write these ideas down in a more coherent form next week...


                  Either way, MVCC is quite a ways off and we discussed support for P/L and O/L for at least a year after it's available. So, I think we need to get a reasonable solution for the P/L and O/L cases ASAP.


                  Agreed.


                  Re: the "queue messages during state transfer" approach discussed on the wiki, we have (in the 1.4.X branch at least) an existing implementation that is very close to what is discussed. Partial state transfer done during region activation implements the message queueing; it should be fairly trivial to extend the same logic to the full initial state transfer.


                  Well, we cannot indiscriminately queue all messages. It is important that we know exactly *when* to start queueing, otherwise we get (a) duplicate messages or (b) will miss some messages. That's why I use consistent cuts (digests) in JGroups.


                  Responding positively to prepare even though we just queue the message still worries me; one concern is D receives things in different order than A did due to message retransmission. One possible way to mitigate this is to add some intelligence to how the queue gets applied; e.g. before applying a prepare, look for rollback or commit; discard both if rollback, move commit up in queue so it gets applied immediately.


                  Sounds complex. I agree with your assertion above that responding with YES to a prepare is probably not a good idea, as we are really not sure we can later commit. We can only be sure if we hold the lock to the real tree and have the data shipped with prepare in the workspace.


                  The queuing bit for partial state transfer was implemented because we were using RPC calls for that rather than JGroups state transfer. With JG state transfer we could get the same effect from the JGroups layer using digests/retransmission. However, JGroups couldn't do the prepare/rollback/commit matching I discuss above. Also, JGroups will only detect and retransmit missing messages next time a sender sends, which potentially could be a long time (maybe never).


                  No, JGroups uses stability messages to detect the 'last message missing'. So this is not an issue. Maybe some amount of queueing can be done in the JGroups stack itself, e.g. by Channel.getState() sending down a START_QUEUEING event and - when the state has been received - the complementary event to stop the queueing.
                  Queueing would reduce the number of messages that have to get retransmitted. Of course, any queue has to be bounded, otherwise we'd happily queue until an OOME occurs.

                  • 6. Re: New state transfer in JBoss Cache
                    manik

                    I agree that a solution needs to exist for O/L and P/L as well as MVCC - not only because we don't have MVCC right now, but also because even after we do, people will still have the choice to use the more tried-and-tested O/L and P/L.

                    Regarding responding positively to a prepare, perhaps I'm missing something here - why would it be wrong for new joiner - D - to respond positively?

                    Let's look at it the other way around. Why would an existing node, say, C, respond negatively? IMO there are only 2 scenarios where C would respond negatively to a prepare from A:

                    1. There is another local transaction on C that holds locks on the same data.
                    2. There is another concurrent prepare, say from B, on the same data.

                    (Have I missed anything?)

                    Now neither of these reasons for a negative response to a prepare should apply to the new joiner, D:

                    1. would never happen since the new joiner hasn't reached the STARTED state and isn't accepting any local transactions.
                    2. D doesn't care about conflicting remote prepares since at least someone else in the cluster would have detected this and the prepare would roll back anyway. E.g., if A and B broadcast prepares on the same data and D responds positively to both, either A or B will fail on each other's prepares and broadcast a rollback anyway.

                    Thoughts?

                    • 7. Re: New state transfer in JBoss Cache
                      brian.stansberry

                       

                      D doesn't care about conflicting remote prepares since at least someone else in the cluster would have detected this and the prepare would roll back anyway.


                      Here's the specific scenario I was thinking about when I mentioned "one concern is D receives things in different order than A did due to message retransmission."

                      You've got {A, B, C} with new joiner D.

                      Messages as seen by {A, B, C}

                      1) B:PREPARE-TX1
                      2) B:COMMIT-TX1
                      3) C:PREPARE-TX2
                      4) C:COMMIT-TX2

                      TX1 and TX2 have a locking conflict, but there is no issue due to above ordering.

                      But, D has a hiccup and drops message 2, requiring JGroups retransmission. Our JGroups stacks don't provide total ordering, just ordering from the same sender. So, now D sees and queues the sequence as:

                      1) B:PREPARE-TX1
                      2) C:PREPARE-TX2 -- oops blocks!!
                      3) B:COMMIT-TX1 -- doesn't get taken off queue due to block
                      4) C:COMMIT-TX2

                      When D drains the queue, message #2 will block due to lock conflict, preventing the needed message #3.

                      JGroups' concurrent stack avoids that kind of problem during normal operation by allowing the B and C messages to be processed in parallel. :-) But, as soon as you introduce a single queue, you can get the problem back again. (The message reordering thing I mentioned was meant to help deal with that by scanning the queue and trying to handle B:COMMIT-TX1 before C:PREPARE-TX2.)

                      This is kind of an edge case, but it's there.

                      • 8. Re: New state transfer in JBoss Cache
                        manik

                        I see what you mean. Perhaps some optimisation could be done on the queue before it is replayed? E.g., pull out all PREPAREs and corresponding ROLLBACKs, so you are now left with PREPARE and COMMIT combinations that you know should work as they have been applied on other nodes. Then make sure they are contiguous so that a PREPARE is followed immediately by it's corresponding COMMIT.

                        Would this have any ordering issues? I guess with P/L perhaps not since the prepare would logically hold locks until the commit runs so the PREPAREs would dictate order.

                        With O/L the result of a later commit would be a rollback which would be weeded out of the queue by this stage.

                        And with MVCC writes will behave a bit like P/L so I guess we are ok here.

                        • 9. Re: New state transfer in JBoss Cache
                          manik

                          Bela added some questions to the wiki page:

                          "Bela Ban" wrote:

                          How can remote transactions get applied successfully if the local transaction holds write-locks ? Are these TXs going to time out ?


                          I think one of the options around this was to proxy all local calls on A to, say, B so they get serialized with incoming calls. And will then be allowed to complete along with incoming calls.

                          • 10. Re: New state transfer in JBoss Cache
                            manik

                             

                            "Bela Ban" wrote:

                            What happens with asynchronous replication ? A committed TX will be sent asynchronously in a one-phase commit as a combined PREPARE/COMMIT to D. Does D queue these ?

                            ANSWER: we don't need to queue them because they will get applied to A's state, and so we have them as part of A's state...


                            No, but just like 2PC messages, D needs to respond positively to 1PC prepare/commit messages so that the tx originator can complete the txs. After responding positively, D would discard the messages assuming that they will either be incorporated in A's state or be a part of the tx log A sends along with the state.

                            "Bela Ban" wrote:

                            Idempotency: creation, deletion and moving of data/nodes is not imdempotent. Is idempotency required in this design ?


                            Need to think whether idempotency is still a requirement even after changes made in Orlando.

                            • 11. Re: New state transfer in JBoss Cache
                              belaban

                               

                              "manik.surtani@jboss.com" wrote:


                              I think one of the options around this was to proxy all local calls on A to, say, B so they get serialized with incoming calls. And will then be allowed to complete along with incoming calls.


                              Can you elaborate ? A makes a local put(/a/b/c), this translates to a remote put(/a/b/c) on B ? A commits, this translates to a commit on B ?

                              What happens if A made a local put(1/2/3) *before* D requested a state transfer ? A hold a write-lock on /1/2/3, even with MVCC...

                              • 12. Re: New state transfer in JBoss Cache
                                manik

                                Well, A's put on /a/b/c doesn't become a "remote" put, the entire call is proxied to B. I.e., B handles the put. It becomes local to B, and is treated just as any other tx on B.

                                The purpose of this is so that it can get into the TX Log that A is building up for D. The reason why it isn't just added directly is so that ordering can be maintained.

                                Regarding A's local put on /1/2/3 before D requests a state transfer, perhaps we need to wait for this tx to finish? But in the meanwhile, proxy new local txs on A to B, and start recording remote txs that come in to A? This would bring back the need for idempotency of operations though, since we don't know if the state generated would already contain stuff in the tx log or not.


                                • 13. Re: New state transfer in JBoss Cache
                                  belaban

                                  Waiting for a TX to complete brings us back to the initial solution (wait-then-break-the-locks), which we thought was not good as it would (1) need time to wait for completion and (2) cause TXs to get rolled back.

                                  Regarding proxying: who redirects the caller from A to B ? What if A has state e.g. pertaining to a session that is available only on A and its buddy C (but not B) ?

                                  • 14. Re: New state transfer in JBoss Cache
                                    jason.greene

                                     

                                    "bstansberry@jboss.com" wrote:
                                    So, I think we need to get a reasonable solution for the P/L and O/L cases ASAP.


                                    We talked about P/L in Orlando, and basically, we just can't do efficient/concurrent state transfer with P/L so what we decided was to leave the old transfer mechanism in place for P/L, and use the modern solution for O/L and MVCC.

                                    1 2 Previous Next