4 Replies Latest reply on Aug 9, 2012 11:34 AM by rhauch

    Backup and restore API

    rhauch

      We're going to provide an efficient backup and restore capability in ModeShape 3.x (see MODE-1581). This will work at the repository level, meaning the backups will contain all of the content in all of the workspaces of a single repository. This will be useful for times when a repository needs to be recovered to an earlier state (due to a corruption, hardware failure, etc.), and it will also be part of our solution for migrating ModeShape 2.x repositories to 3.x.

       

      Note: ModeShape 3.x can rely upon an Infinispan cache that is replicated (every node is stored on every process in the Infinispan cluster) or distributed (every node is stored on a fixed number of processes, e.g., 3 or 5, in the Infinispan cluster, but where the copies are balanced across the cluster to maximize availability). Both approaches increase availability even in the event of machine failures, and in many respects reduce the need for backups (since the cache essentially actively maintains its own backups via copies of each node). However, backups are far more important for very small clusters and single-process configurations.

       

      In ModeShape 3 there are two places where content can be stored: representations of the nodes are stored in the Infinispan cache, and (larger) binary values are stored in the BinaryStore. ("Smaller" binary values are stored with the regular node representations; the size limit for storing with the regular nodes or in the binary store is specified in the repository configuration, and defaults to 4kB.) The Infinispan cache and the BinaryStore each have multiple persistence options, and ModeShape's backup mechanism should be independent of these persistence mechanisms.

       

      Basic backup and restore algorithm

      The initial design is that the backup process will simply extract the node representations and binary values and write them to files in a directory on the file system. The node representations are actually Schematic documents, which are in-memory documents that have all the capability of both JSON and BSON, and can easily be written out in either format without loss of information. During a backup, the following steps will be performed:

       

      1. Iterate over the Infinispan cache entries, appending a number (e.g., 1000) of node documents in JSON format to a file. Whenever the maximum number of entries per backup file is reached, the file will be closed, a new file will be created, and the appending will continue. Note that we'll compress the files; the JSON format will compress quite well. We'll also experiment to determine an acceptable and practical value for the maximum number of entries per backup file, since a larger number will result in fewer but larger files, and a smaller number will result in a greater number of smaller files. UPDATE: By default, 100K nodes will be exported to a single backup file. So, if each node required about 200 bytes (compressed), the resulting files will be about 19 MB in size.
      2. Write out each of the binary values to a separate file.

       

      We'll use a naming convention and organization within a single directory so that the restore process can simply process all of these files and load them into the new repository's Infinispan cache and binary store.
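
      A minimal sketch of step 1 and the naming idea above, assuming each node document has already been serialized to a single-line JSON string. The class name BackupSketch, the method writeDocuments, and the documents_NNNNNN.json.gz file names are hypothetical; this is not ModeShape's actual implementation or naming convention, and it uses only standard java.util.zip and java.nio.file calls.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of batched, compressed document export; not ModeShape code. */
final class BackupSketch {

    /**
     * Writes the JSON documents into gzip-compressed files, one document per line,
     * with at most maxPerFile documents in each file. Assumes no document contains
     * a raw newline. Returns the files that were written, in order.
     */
    static List<Path> writeDocuments(Iterator<String> jsonDocs, Path dir, int maxPerFile)
            throws IOException {
        if (maxPerFile < 1) {
            throw new IllegalArgumentException("maxPerFile must be positive");
        }
        Files.createDirectories(dir);
        List<Path> files = new ArrayList<>();
        while (jsonDocs.hasNext()) {
            // Assumed naming convention: documents_000001.json.gz, documents_000002.json.gz, ...
            Path file = dir.resolve(String.format("documents_%06d.json.gz", files.size() + 1));
            try (Writer out = new OutputStreamWriter(
                    new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8)) {
                // Append documents until this file is full or the input is exhausted.
                for (int n = 0; n < maxPerFile && jsonDocs.hasNext(); n++) {
                    out.write(jsonDocs.next());
                    out.write('\n');
                }
            }
            files.add(file);
        }
        return files;
    }
}
```

      With the default of 100K documents per file mentioned above, a repository with 250K nodes would produce three files under this scheme. A restore could read the files back in name order with GZIPInputStream.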

       

      We think that we can easily make the process work even when the repository is in use, and this will greatly increase the value and lower the invasiveness of the backup process. This will be ideal for normal backup operations, when you simply want to periodically obtain a consistent backup copy; if anything goes wrong, you can restore your repository to the state of the last backup. Of course, if you're migrating a repository (e.g., from 2.x to 3.0, or from a repository with one Infinispan cache store to another repository with a different cache store), you will want to suspend all other repository users so that the backup accurately represents the current state.

       

      Migration from 2.x

      We're planning on providing a separate utility in 2.8.x that can read an existing configuration and write all of the content into a 3.0-compatible backup format. This utility would be run while the 2.x repository is no longer being used, and the resulting backup files can then be used to restore the 2.x content into a 3.0 repository. There's also one huge advantage to this approach: if the 2.x backup fails, simply clean up the files on the file system and start the backup process again. (If we were to directly write the 2.x content into a running 3.0 repository without an intermediate format, any failure would mean the 3.0 repository would also need to be cleaned up.)

       

      The APIs

      Where possible, we'd like the actual running repository to be able to perform the backup and restore processes. That makes it far easier for configurations that use external configurations for data sources, etc., and it also means that managing such operations can be included in the general administrative and monitoring activities of the repository.

       

      Although the JCR API doesn't have such repository-level operations, the JCR API does place similar methods (e.g., export and import) on the javax.jcr.Workspace interface. This means that users must first authenticate by obtaining a javax.jcr.Session, and the regular session-based authorization mechanism can be used to ensure that only privileged users can invoke the Workspace methods.

       

      We think that this is a good approach for introducing a backup API. ModeShape already provides an org.modeshape.jcr.api.Workspace interface that extends javax.jcr.Workspace (for adding re-indexing methods), so adding a new "backup" method here seems to make sense. This is also compatible with performing backups while the repository is in use (see above).

       

      However, the mechanism to restore a repository is a bit more challenging, primarily because the restore process will be completely replacing all content over a period of time, and therefore there should be no users of the repository. Also, if the repository is not empty, all of the existing content will need to be removed. I see two options that might work:

       

      Option 1: Specify an "initializeFromBackup" field in the configuration. Currently, when a repository starts up, it looks for some key pieces of information (e.g., repository metadata) and, if they are not there, initializes the repository metadata, "/jcr:system" content, indexes, etc. This initialization mechanism could be altered to simply load the content from the backup. Advantages are that the restore fits pretty cleanly into the initialization mechanism, and the repository is not considered "ready for use" until initialization is complete. One disadvantage is that having the "initializeFromBackup" in the configuration file seems a bit forced and unnatural. Another disadvantage is that this would really only work well if the repository were already empty, making it harder for cases where a running repository needs to be restored to a previously backed-up state.

       

      Option 2: Add a "restore" method to the org.modeshape.jcr.api.Workspace interface, along with the "backup" method discussed above. Like "backup", the repository would only allow users with proper privileges to invoke the method. Advantages are that it's fairly symmetric with "backup", it works with the existing privilege mechanism, and it's easy to first clean out any existing content. The primary disadvantage is that the restore cannot be done while there are other users logged in, so we'd need to add a different repository state (other than "running", "starting", "stopping", and "stopped").
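
      To make the extra state in Option 2 concrete, here is a rough sketch under stated assumptions. The class RepositoryStateSketch, the RESTORING state name, and the session counter are all hypothetical; the four other states are the ones listed above. It only illustrates refusing a restore while other sessions are logged in and refusing logins while a restore is running.

```java
/** Hypothetical sketch of a repository lifecycle with an added RESTORING state. */
final class RepositoryStateSketch {

    enum State { STARTING, RUNNING, RESTORING, STOPPING, STOPPED }

    private State state = State.RUNNING;
    private int activeSessions;

    synchronized State state() {
        return state;
    }

    /** New sessions are accepted only while the repository is RUNNING. */
    synchronized void login() {
        if (state != State.RUNNING) {
            throw new IllegalStateException("Repository is " + state + "; logins are refused");
        }
        activeSessions++;
    }

    synchronized void logout() {
        if (activeSessions > 0) {
            activeSessions--;
        }
    }

    /** Called from the one session doing the restore; fails if anyone else is logged in. */
    synchronized void beginRestore() {
        if (state != State.RUNNING) {
            throw new IllegalStateException("Cannot start a restore while " + state);
        }
        if (activeSessions > 1) {
            throw new IllegalStateException("Other sessions are still logged in");
        }
        state = State.RESTORING;
    }

    /** Returns the repository to normal use once the restore has finished. */
    synchronized void endRestore() {
        if (state != State.RESTORING) {
            throw new IllegalStateException("No restore is in progress");
        }
        state = State.RUNNING;
    }
}
```

      Whether a restore should fail outright or wait for other sessions to close is a design choice this sketch does not settle; it simply fails fast.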

       

      Are there other possibilities? How would you want to restore a repository to a previous state, if it were necessary? Thoughts?

        • 1. Re: Backup and restore API
          hchiorean

          Regarding the APIs: imo providing backup & restore methods on a Repository instance together with a new state (e.g. restoring) would be "cleaner". I don't think we should provide (at least initially) those methods on a Workspace though, as this kind of granularity would make the backup/restore process more complex. I think we should backup/restore an entire repository at a time (as opposed to a single workspace).

          • 2. Re: Backup and restore API
            rhauch

            Regarding the APIs: imo providing backup & restore methods on a Repository instance together with a new state (e.g. restoring) would be "cleaner".

            I agree that it's more ideal, but it's also less practical: there would be no way to authenticate and authorize the caller. And if we make it a public API, then anyone with a javax.jcr.Repository instance (i.e., everyone) could cast to our public interface and invoke the methods.

             

             

            I don't think we should provide (at least initially) those methods on a Workspace though, as this kind of granularity would make the backup/restore process more complex. I think we should backup/restore an entire repository at a time (as opposed to a single workspace).

            Even though I suggested adding the methods to Workspace, the backup and restore would apply to the whole Repository content. Again, using Workspace is merely a way to provide the functionality to authenticated/authorized users.

             

            This discussion seems awfully similar to one held a while back for JSR-333 (informally, "JCR 2.1"). In the issue (JSR_333-13), we proposed adding a RepositoryManager interface that is accessed (like the other manager interfaces) from Workspace. Here's the accepted proposal:

             

            /**
             * A <code>RepositoryManager</code> object represents a management view of the Session's Repository instance. This is useful for
             * applications that embed a JCR repository and need a way to manage the lifecycle of that Repository instance. Each
             * <code>RepositoryManager</code> object is associated one-to-one with a <code>Session</code> object and is defined by the
             * authorization settings of that session object.
             * <p>
             * The <code>RepositoryManager</code> object can be acquired using a {@link Session} by calling
             * <code>Session.getWorkspace().getRepositoryManager()</code> on a session object. Likewise, the Repository being managed
             * can be found for a given RepositoryManager object by calling <code>mgr.getWorkspace().getSession().getRepository()</code>.
             * </p>
             * 
             * @since 2.1
             */
            public interface RepositoryManager {
                
                /**
                 * Return the <code>Workspace</code> object through which this repository manager was created.
                 * 
                 * @return the {@link Workspace} object.
                 */
                Workspace getWorkspace();
            
                /**
                 * Closes the <code>Repository</code> by preventing the creation of new sessions and freeing all resources. The
                 * <code>immediate</code> parameter dictates whether existing sessions should be closed immediately or allowed to close
                 * naturally. Either way, this method always blocks until all sessions are closed and the repository has completely
                 * terminated.
                 * <p>
                 * Some repository implementations may not allow repositories to be closed, while other implementations might allow closing
                 * only for certain configurations (e.g., repositories embedded within an application). An implementation will throw an
                 * {@link UnsupportedRepositoryOperationException} if the particular repository cannot be closed, or an
                 * {@link AccessDeniedException} when the repository can be closed but the session does not have the authority to do so.
                 * </p>
                 * 
                 * @param closeSessionsImmediately true if all existing sessions should be closed immediately, or false if they are to be
                 *        allowed to close naturally.
                 * @throws AccessDeniedException if the caller does not have authorization to close the repository.
                 * @throws UnsupportedRepositoryOperationException if the repository implementation does not support or allow the repository
                 *         to be closed.
                 * @throws RepositoryException if an error occurred while shutting down the repository.
                 */
                void closeRepository( boolean closeSessionsImmediately )
                    throws AccessDeniedException, UnsupportedRepositoryOperationException, RepositoryException;
            }
            

             

            The "Workspace" interface would then have a new method to get the RepositoryManager instance.

             

            I don't suggest adding the "closeRepository(...)" method at this time, but we could still use the RepositoryManager concept and put the backup and restore methods there. That solves the impedance mismatch between the "Workspace" and repository-oriented methods. Plus, it keeps the Session-based authentication/authorization mechanism. Thoughts?
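
            As a rough illustration of that shape, here is a sketch with stand-in types so that it compiles without the JCR API on the classpath. Everything in it is an assumption: the names RepositoryManagerSketch, WorkspaceSketch, and InMemoryWorkspace, the method names backupRepository and restoreRepository, and the java.io.File parameter. A real version would extend javax.jcr.Workspace and would presumably declare RepositoryException.

```java
import java.io.File;

/** Hypothetical manager interface; method names and parameters are assumptions. */
interface RepositoryManagerSketch {
    /** The workspace (and therefore the session) through which this manager was obtained. */
    WorkspaceSketch getWorkspace();

    /** Backs up the whole repository, not just this workspace. */
    void backupRepository(File backupDirectory);

    /** Replaces the whole repository's content with the given backup. */
    void restoreRepository(File backupDirectory);
}

/** Stand-in for the workspace interface, reduced to the one new accessor. */
interface WorkspaceSketch {
    RepositoryManagerSketch getRepositoryManager();
}

/** Toy implementation that only records which operation was requested. */
final class InMemoryWorkspace implements WorkspaceSketch {
    String lastOperation;

    private final RepositoryManagerSketch manager = new RepositoryManagerSketch() {
        @Override
        public WorkspaceSketch getWorkspace() {
            return InMemoryWorkspace.this;
        }

        @Override
        public void backupRepository(File backupDirectory) {
            lastOperation = "backup:" + backupDirectory.getName();
        }

        @Override
        public void restoreRepository(File backupDirectory) {
            lastOperation = "restore:" + backupDirectory.getName();
        }
    };

    @Override
    public RepositoryManagerSketch getRepositoryManager() {
        return manager;
    }
}
```

            The point of the indirection is that callers only reach the manager through a session's workspace, so the session's credentials are always available when the repository-level methods are invoked.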

            • 3. Re: Backup and restore API
              hchiorean

              The Workspace - RepositoryManager idea sounds like a good one. I totally agree with the need to authenticate the user, but the fact that we only have this option at a Workspace level seems to "constrain us" to write the API like this, which feels unnatural to me in this case.

               

              What should happen if a user wants to back up a repository that has multiple workspaces, only one of which is accessible to the user? The user could log in using his credentials on that workspace and "export" the whole repository anyway. Since most of it is plain text, this might be seen as a security breach. What about adding a dedicated role and permission for backup/restore that would apply to the whole repository?

              • 4. Re: Backup and restore API
                rhauch

                What should happen if a user wants to back up a repository that has multiple workspaces, only one of which is accessible to the user? The user could log in using his credentials on that workspace and "export" the whole repository anyway. Since most of it is plain text, this might be seen as a security breach. What about adding a dedicated role and permission for backup/restore that would apply to the whole repository?

                I've already added a permission for backup and a permission for restore, and both are granted to the admin role. Calling either the backup or the restore operation without the corresponding permission causes the method to throw an exception.
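
                In rough terms the check looks like the sketch below. This is illustrative only: the class BackupPermissionSketch, the role table, the "readonly" and "readwrite" role names, and the use of SecurityException are assumptions made for the example (the post above only says that the admin role carries the backup and restore permissions and that an exception is thrown).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a permission check guarding repository-wide backup and restore. */
final class BackupPermissionSketch {

    // Assumed role-to-permission table; only "admin" carries "backup" and "restore".
    private static final Map<String, Set<String>> ROLE_PERMISSIONS = new HashMap<>();
    static {
        ROLE_PERMISSIONS.put("readonly", new HashSet<>(Arrays.asList("read")));
        ROLE_PERMISSIONS.put("readwrite", new HashSet<>(Arrays.asList("read", "write")));
        ROLE_PERMISSIONS.put("admin",
                new HashSet<>(Arrays.asList("read", "write", "backup", "restore")));
    }

    /** Throws SecurityException (a stand-in) unless one of the roles grants the permission. */
    static void checkPermission(Set<String> roles, String permission) {
        for (String role : roles) {
            Set<String> granted = ROLE_PERMISSIONS.get(role);
            if (granted != null && granted.contains(permission)) {
                return;
            }
        }
        throw new SecurityException("Missing '" + permission + "' permission");
    }

    /** Pretends to back up every workspace; the caller's own workspace access is irrelevant. */
    static String backupRepository(Set<String> callerRoles) {
        checkPermission(callerRoles, "backup");
        return "backup of all workspaces";
    }

    /** Pretends to restore every workspace, guarded by the separate "restore" permission. */
    static String restoreRepository(Set<String> callerRoles) {
        checkPermission(callerRoles, "restore");
        return "restore of all workspaces";
    }
}
```

                Because the permission is checked against the caller's roles rather than against the workspace they logged in to, a user who can read only one workspace cannot use backup to export the others.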

                 

                Oh, and as we discussed earlier, we're only going to provide a way of backing up the whole repository, not just part of it.