Feature Proposals/AutoBackup

= Automatic Region Backup =

== Basic Info ==

 * Summary: A new module for periodically saving per-region backup OARs according to rules defined in Regions.ini.
 * Developer Sponsor(s): allquixotic, justincc
 * Start Date: February 18, 2011
 * Status: Experimental; committed to git master as of May 12, 2011
 * Branches Targeted: git master
 * Code Repository: Maintained in OpenSimulator git master

== Idea ==
Currently, saving OARs in an organized manner requires significant hackery in the form of scripts or other external wrappers. This project is an attempt to make automatic OAR backup a core capability of OpenSimulator.


 * On a per-region basis, we want to give the user the opportunity:
   * To enable/disable automatic periodic backups of their region to an OAR.
   * To specify the length of time between automatic backups.
   * To invoke an external shell script or binary when a backup is taken (e.g. to transport it to an offsite backup over FTP or S3).
   * To choose between overwriting a single file over and over, or saving a series of files whose names include the timestamp of the OAR.
   * To enable or disable some simple logic that attempts to determine whether the sim is "busy", in which case no backup will be taken (because starting a backup can cause performance degradation).

Just to emphasize, all of these choices will be available completely separately for each region, regardless of whether the regions run in multiple instances or in one instance.

Ideally, users will be able to customize these settings in Regions.ini to conveniently specify when to take backups, and for which regions.

== Criteria ==

 * Scope: The primary focus of the feature will be a new region module that keeps track of when each region needs to be saved to an OAR, along with other state read from Regions.ini.
 * Software Requirements: No additional third-party dependencies or version bumps will be required.
 * Impact: Outside of the AutoBackupModule code itself, only OpenSimDefaults.ini has to be modified, to add the disabled-by-default setting.
 * Blockers: No blockers yet, but I'm sure as time goes on...

== Implementation Overview ==

 * Proposed Module Name: IRegionAutoBackupModule
 * Functionality Pseudocode:
   * Parse out the options from Regions.ini for each region.
   * Set timers to wake up when the interval for each region has elapsed. Optimization: if several regions share the same interval, merge their timers into one timer with a list of regions to process. This cuts down on the number of timers.
   * In a timer handler:
     * If the user wants us to check for busy conditions, go through the conditions and make sure they all pass. We haven't determined what those conditions should be yet, but an obvious start is to set a reasonable avatar threshold and to examine the sim's time dilation. If there are "a lot" of avatars, or the time dilation is "very low", we probably don't want to kill the sim with an OAR backup right now. The user can bypass our logic if they want. Maybe we could add yet more options for the thresholds themselves, if we settle on e.g. time dilation and avatar count.
     * If we checked some conditions and one or more of them failed, put off the OAR backup for half the interval (interval / 2) and then try again, where "interval" is the original interval specified in Regions.ini, not the interval we were just asleep for.
     * Else:
       1. Get the Scene object for the region(s) associated with this timer; get the IRegionArchiverModule interface; call HandleSaveOarConsoleCommand (or one of the other functions in the module?). The file name will be either the name of the region + ".oar", or the name of the region + a friendly-formatted timestamp + ".oar".
       2. Invoke the shell script associated with the region's backup event, as given in Regions.ini. This is a possible security issue, but then, allowing sysadmins to set the threat level to Severe is not exactly secure either.
       3. Reset the timer.
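The timer logic above can be sketched roughly as follows. This is a minimal Python illustration of the control flow only (OpenSimulator itself is written in C#). The names `merge_timers` and `handle_region`, and the `is_busy`, `save_oar`, and `run_script` callbacks, are hypothetical stand-ins and not OpenSimulator APIs.

```python
from collections import defaultdict


def merge_timers(interval_by_region):
    """Group regions by interval so that one timer can serve each group."""
    timers = defaultdict(list)
    for region, minutes in interval_by_region.items():
        timers[minutes].append(region)
    return dict(timers)


def handle_region(region, interval, busy_check, is_busy, save_oar, run_script=None):
    """Process one region when its timer fires.

    Returns the number of minutes to wait before the next attempt.
    """
    if busy_check and is_busy(region):
        # Postpone by half of the configured interval, not of the time just slept.
        return interval / 2
    oar_path = save_oar(region)   # step 1: write the OAR (hypothetical callback)
    if run_script is not None:
        run_script(oar_path)      # step 2: hand the file to the user's script
    return interval               # step 3: reset the timer to the full interval
```

With three regions where two share a 720-minute interval, `merge_timers` yields two timers instead of three. A busy region gets its next attempt scheduled after `interval / 2` minutes and nothing is saved.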

The only calls that other modules need to make into this one are for initialization: creating the implementation object and setting it up to do its work. After that, it basically runs in an infinite loop, waking up periodically to process regions. I anticipate being able to do everything in a single thread, but I'm not sure how threading is worked into the other modules. In particular, I need to know if IRegionArchiverModule is thread-safe, or if I need to enter into a specific thread or take some locks to use it.

== User Experience ==
This shouldn't be visible to end-users at all, but it will be very visible to OpenSimulator sysadmins. Sysadmins will edit Regions.ini to manage the config settings on a per-region basis. Maybe, for user-friendliness, we can have global config options in OpenSim.ini that serve as "default" options for the ones in Regions.ini, if they are not specified. The default will be to disable auto-backup altogether, but if a user enables auto-backup without specifying any other auto-backup config options, it'd be nice if they could set the rest of the options globally in OpenSim.ini. This would reduce the size of Regions.ini for very large numbers of regions, and make it trivial to merge the timers into one.
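The fallback idea can be illustrated with a small Python sketch that layers hard-coded defaults, global options, and per-region options. It is only a model of the lookup order: the `resolve` function is hypothetical, the setting names and the [AutoBackupModule] section come from the Configuration Settings section of this proposal, and the real module would read these values through OpenSimulator's own config classes rather than Python's `configparser`.

```python
import configparser

# Hard-coded defaults, as listed in the Configuration Settings section.
DEFAULTS = {
    "AutoBackup": "False",
    "AutoBackupInterval": "720",
    "AutoBackupBusyCheck": "True",
    "AutoBackupNaming": "Time",
    "AutoBackupDir": ".",
}


def _parse(text):
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original capitalization of keys
    parser.read_string(text)
    return parser


def resolve(opensim_ini, regions_ini, region):
    """Return the effective settings for one region.

    Lookup order: Regions.ini section for the region, then the
    [AutoBackupModule] section of OpenSim.ini, then hard-coded defaults.
    """
    settings = dict(DEFAULTS)
    layers = (
        (_parse(opensim_ini), "AutoBackupModule"),
        (_parse(regions_ini), region),
    )
    for parser, section in layers:
        if parser.has_section(section):
            for key, value in parser[section].items():
                if key in DEFAULTS:  # ignore unrelated keys such as RegionUUID
                    settings[key] = value
    return settings
```

A region that only sets `AutoBackup = True` in Regions.ini would pick up everything else from OpenSim.ini, and from the hard-coded defaults for anything OpenSim.ini leaves out.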

== Configuration Settings ==

 * Global (in OpenSim.ini under the [AutoBackupModule] section)
   * AutoBackupModuleEnabled: True/False. Default: False. If False, every function in the module is as close to a no-op as possible: it returns as soon as it sees that the module is not enabled. Otherwise the module will try to get as far as it can with auto-backup for each region.
 * Global (in OpenSim.ini) or Per-Region (in Regions/Regions.ini under the section named after the region)
   * IMPORTANT: Settings declared per-region in Regions/Regions.ini override settings in OpenSim.ini. Settings in OpenSim.ini, in turn, override hard-coded defaults.
   * AutoBackup: True/False. Default: False. If True, activate auto-backup functionality. This is the only required option for enabling auto-backup; the other options have sane defaults. If False, the auto-backup module becomes a no-op for the region, and all other AutoBackup* settings are ignored.
   * AutoBackupInterval: Integer, positive value. Default: 720 (12 hours). The number of minutes between each backup attempt. If a negative or zero value is given, it is equivalent to setting AutoBackup = False.
   * AutoBackupBusyCheck: True/False. Default: True. If True, we will only take an auto-backup if a set of conditions is met. These conditions are heuristics that try to avoid taking a backup when the sim is busy.
   * AutoBackupScript: String. Default: not specified (disabled). File path to an executable script or binary to run when an automatic backup is taken. argv[1] of the executed file/script will be the file name of the generated OAR. If the process can't be spawned for some reason (file not found, no execute permission, etc.), a warning is written to the console.
   * AutoBackupNaming: String. Default: Time. One of three strings (case insensitive):
     * "Time": The current timestamp is appended to the file name. An existing file will never be overwritten.
     * "Sequential": A number is appended to the file name. So if RegionName_x.oar exists, we'll save to RegionName_{x+1}.oar next. An existing file will never be overwritten.
     * "Overwrite": Always save to a file named "RegionName.oar", even if we have to overwrite an existing file.
   * AutoBackupDir: String. Default: "." (the current directory). A directory (absolute or relative) where backups should be saved. If the path is not a directory, or insufficient permissions are available, a warning will be printed to the console and no backups will be taken.
   * AutoBackupDilationThreshold: Float. Default: 0.5. Lower bound on the time dilation required for the BusyCheck heuristics to pass.
     * If the time dilation is below this value, don't take a backup right now.
   * AutoBackupAgentThreshold: Integer. Default: 10. Upper bound on the number of agents in the region required for the BusyCheck heuristics to pass.
     * If the number of agents is greater than this value, don't take a backup right now.
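To make the three AutoBackupNaming modes concrete, here is a hedged Python sketch of how the output path might be chosen. The `backup_path` function is hypothetical. The timestamp format, the underscore separator in "Time" mode, and starting the "Sequential" counter at 1 are assumptions, since the proposal does not fix them.

```python
import os
import re
from datetime import datetime


def backup_path(directory, region, naming, now=None):
    """Pick the OAR file path for a region according to AutoBackupNaming."""
    mode = naming.lower()  # the setting is case insensitive
    if mode == "overwrite":
        name = f"{region}.oar"
    elif mode == "time":
        # Assumed format; second resolution could still collide in theory.
        stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
        name = f"{region}_{stamp}.oar"
    elif mode == "sequential":
        pattern = re.compile(re.escape(region) + r"_(\d+)\.oar")
        used = []
        for entry in os.listdir(directory):
            match = pattern.fullmatch(entry)
            if match:
                used.append(int(match.group(1)))
        # One past the highest existing number (assumed to start at 1).
        name = f"{region}_{max(used, default=0) + 1}.oar"
    else:
        raise ValueError(f"unknown AutoBackupNaming value: {naming!r}")
    return os.path.join(directory, name)
```

For example, if the backup directory already holds `Foo_1.oar` and `Foo_3.oar`, the "Sequential" mode picks `Foo_4.oar`, so no existing file is overwritten.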

== Busy Heuristics ==
PROCRASTINATE means "don't save an OAR right now; wait AutoBackupInterval / 2 minutes and then try again." PROCEED means "try the next heuristic" -- all heuristic conditions must evaluate to PROCEED to actually save an OAR. Implementation note: As of May 2, we don't halve the interval again each time we PROCRASTINATE. Not sure if we need to worry about this.


 * Ideas for busy heuristics include:
 * Is the Time Dilation at the present time < 0.50? If so, PROCRASTINATE. Otherwise, PROCEED. At the risk of option bloat, introduce AutoBackupDilationThreshold to allow the user to set the minimum time dilation required for a PROCEED on this heuristic. Implementation difficulty: Low. Performance cost: Low.
 * Are there more than 10 Main Agents on the region right now? If so, PROCRASTINATE. Otherwise, PROCEED. At the risk of option bloat, introduce AutoBackupAgentThreshold to allow the user to set the maximum number of agents that can be present for a PROCEED on this heuristic. Implementation difficulty: Low. Performance cost: Low.
 * Avatar "density" (calculated by the average proximity between avatars on the sim)? This seems expensive for a test to avoid a big performance hit, so we may not want to do this. But this would be a pretty good heuristic for smartly detecting meetings and events, since that usually involves avatars in relatively close proximity. Implementation difficulty: Medium. Performance cost: Unknown/High.
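The first two heuristics can be written down as a short Python sketch. The `Verdict` enum and `busy_check` function are hypothetical names; the default thresholds are the ones proposed for AutoBackupDilationThreshold and AutoBackupAgentThreshold. How the time dilation and agent count are actually read from the scene is left out.

```python
from enum import Enum


class Verdict(Enum):
    PROCEED = "proceed"
    PROCRASTINATE = "procrastinate"


def busy_check(time_dilation, agent_count,
               dilation_threshold=0.5, agent_threshold=10):
    """Run the heuristics in order; every one must PROCEED to save an OAR."""
    if time_dilation < dilation_threshold:
        return Verdict.PROCRASTINATE  # sim is already running slowly
    if agent_count > agent_threshold:
        return Verdict.PROCRASTINATE  # too many agents present
    return Verdict.PROCEED
```

Note that values exactly at the thresholds (time dilation 0.5, 10 agents) still PROCEED, matching the "below" and "greater than" wording of the settings.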

== Reducing Impact Of OAR Backup ==
This is slightly out of the scope of this feature, but it could certainly ease acceptance of the feature and make the general practice of saving OARs more acceptable on busy sims. That would in turn reduce the need for the heuristics and allow more users to set AutoBackupBusyCheck = False and not worry about it. Here are a few brainstormed ideas for improving the performance of OAR saving; this might even belong in its own feature proposal:


 * Faster compression. Use a lower compression level for Gzip, or use a faster algorithm like LZO. Reference credible papers on lossless compression performance; find the codec that has a compatibly-licensed implementation (BSD/MIT) and the lowest CPU cost for compression; implement. Changing the OAR file format might require some kind of transition, e.g., detect which format the file given to "load oar" is in, and automatically support both Gzip and the new format (whichever that turns out to be). That shouldn't be too hard, because these formats commonly start with very conspicuous magic numbers.
 * JSON instead of XML. I've heard from sources in the embedded world that you can save quite a lot of CPU cycles by using JSON instead of XML, because there are fewer and less complex structural elements to parse (on the read) and generate (on the write).
 * Faster asset serialization -- writing out assets is probably where a lot of the time is consumed. We should really profile this code and figure out where most of the time is spent. Are images being encoded into compressed file formats, and then compressed again in the OAR? Note that compressing already-compressed data is a spectacular waste of time, because the outer compression codec hits its worst-case behavior as it searches for redundancy that is no longer there. Usually it ends up making the file size even larger due to file format overhead.
   * My recommendation is that we leave already-compressed assets as-is and just tar them up together, and have a .gz file inside the tar with the compressible data, like XML or JSON. This will minimize file size and possibly speed up compression, since the outer compression codec won't have to sweat the high-entropy compressed data.
 * Use the operating system scheduler to our advantage. Instead of writing the GZipStream data as quickly as possible, use asynchronous writes, and put some yields in at appropriate places. This would effectively make the archive process take longer, but give up some scheduling slots for other threads to execute. The other option would be to offload the most resource-intensive parts of the operation to a separate process (e.g. compression), and set that process's scheduling priority to minimal. The improvement you'd see from this would be extremely dependent on the OS: on Windows, starting a process has an extremely high overhead, especially with virus scanners and having to load another instance of .NET. Recent Linux 2.6 is much more efficient at sharing data and spawning processes quickly, and low priority scheduling has a noticeable impact on resource usage on Linux.
 * Amortize the cost of preparing data for archiving by using spare cycles to prepare the assets' memory stream. This could open up a whole different opportunity for activity-based backup scheduling: when the region is at a particularly low point of activity, you could just take a backup whenever you are able to reach a high level of confidence that things are not busy.
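As a rough illustration of the "tar the compressed assets as-is, gzip only the compressible data" idea, here is a minimal Python sketch using the standard `tarfile` module. It is not the actual OAR layout: the `build_archive` function and the member name `compressible.tar.gz` are hypothetical, and which assets count as already compressed is left to the caller.

```python
import io
import tarfile


def _add(tar, name, data):
    """Add an in-memory byte string to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_archive(compressed_assets, compressible_docs):
    """Return the bytes of an uncompressed tar holding:

    - already-compressed assets, stored untouched
    - one gzip-compressed inner tar with the compressible data (XML, JSON, ...)
    """
    inner = io.BytesIO()
    with tarfile.open(fileobj=inner, mode="w:gz") as inner_tar:
        for name, data in compressible_docs.items():
            _add(inner_tar, name, data)

    outer = io.BytesIO()
    with tarfile.open(fileobj=outer, mode="w") as tar:  # "w" = no compression
        for name, data in compressed_assets.items():
            _add(tar, name, data)
        _add(tar, "compressible.tar.gz", inner.getvalue())
    return outer.getvalue()
```

In this layout the compression codec only ever sees the text-like data, so no CPU time is spent recompressing assets such as encoded images.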