Feature Proposals/Deduplicating Asset Service
From OpenSimulator
Revision as of 19:55, 1 March 2012
Date
November 2011.
Status
Draft, with a test implementation in progress. Everything here is open for change and discussion.
Proposers
Justin Clark-Casey (justincc)
Introduction
The asset service stores practically all sim data (textures, scripts, etc.). Many assets are exact duplicates of existing assets, except stored under a different asset ID (and occasionally different metadata).
This feature would create an asset service that stores a hash of every asset, so that duplicate assets can be detected and only one copy stored.
The intention is not to replace existing facilities such as the Simple Ruby Asset Server but rather to raise the level of the default asset service in OpenSimulator, in order to cope with future demands placed on it by upload and other content import mechanisms.
Proposal
A high proportion of uploaded assets, whether uploaded via the viewer or through mechanisms such as OARs and IARs, are duplicates of assets already stored by the asset service. On OSGrid, for example, Nebadon Izumi, OSGrid president, found that 20-25% of all assets were duplicates when they switched over from the default OpenSim asset service to Simple Ruby Asset Server (SRAS).
Here, I propose to establish an asset service that hashes all assets and so detects and eliminates duplicates. This would function along similar lines to coyled's existing SRAS.
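As a minimal sketch of the detection idea, the following Python hashes asset bytes with SHA-256 (the 64-character hex digest matches the char(64) hash column proposed later). The function name `asset_hash` and the sample byte strings are hypothetical, chosen only for illustration.

```python
import hashlib


def asset_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the asset bytes as 64 hex characters."""
    return hashlib.sha256(data).hexdigest()


# Two uploads with different asset IDs but identical content (made-up bytes).
texture_a = b"\x89PNG example texture bytes"
texture_b = b"\x89PNG example texture bytes"
script = b"default { state_entry() { } }"

# Identical content yields the same hash, so the second upload is a duplicate.
same = asset_hash(texture_a) == asset_hash(texture_b)
# Different content yields a different hash.
different = asset_hash(texture_a) != asset_hash(script)
```

Because the hash depends only on the data, two assets with different IDs or metadata but the same bytes map to the same stored copy.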
Design
There are two major alternatives.
Design 1: Add hash and pointer column to existing asset table
The first design would see a hash column and a pointer column added to the existing asset table. The hash column would be a primary key.
Pseudo-code for adding a new asset
    On asset add
        Hash new asset
        Compare to existing hashes
        If match
            create new asset table entry storing metadata and pointer to existing asset hash
        else
            create new asset table entry storing data, hash and metadata
Pseudo-code for retrieving an asset
    On asset get
        select existing asset based on input id
        If match
            If asset contains data directly
                return existing asset
            else
                look up asset pointed to by reference
                return asset using initial metadata and pointer-referenced data
        else
            return no such asset
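The single-table flow can be sketched in memory as follows. This is an illustration under assumptions, not the proposed implementation: a dict stands in for the asset table, rows are keyed by asset ID for lookup, and the `hash_index` dict and the function names are hypothetical.

```python
import hashlib

rows = {}        # asset id -> row dict (stands in for the single asset table)
hash_index = {}  # hash -> id of the row that actually holds the data


def add_asset(asset_id, data, name):
    digest = hashlib.sha256(data).hexdigest()
    if digest in hash_index:
        # Duplicate: store metadata and a pointer to the existing hash, no data.
        rows[asset_id] = {"name": name, "hash": None, "pointer": digest, "data": None}
    else:
        # New content: store data, hash and metadata in the same row.
        rows[asset_id] = {"name": name, "hash": digest, "pointer": None, "data": data}
        hash_index[digest] = asset_id


def get_asset(asset_id):
    row = rows.get(asset_id)
    if row is None:
        return None  # no such asset
    if row["data"] is not None:
        return row["name"], row["data"]
    # Follow the pointer, but keep the metadata of the row that was requested.
    target = rows[hash_index[row["pointer"]]]
    return row["name"], target["data"]
```

Note how every row carries either data or a pointer but never both, which is the awkwardness raised in the discussion below.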
Design 2: Create two separate tables assetsmeta and assetsdata
The second design would see the creation of two separate tables. The assetsmeta table would be
column | type | notes |
---|---|---|
id | char(36) | Primary key |
hash | char(64) | |
name | varchar(64) | |
description | varchar(64) | |
asset_type | tinyint(4) | |
local | tinyint(1) | |
temporary | tinyint(1) | |
create_time | int(11) | |
access_time | int(11) | |
asset_flags | int(11) | |
creator_id | varchar(128) |
This matches the existing assets table except that the data column is no longer present and a hash column, holding the SHA-256 digest of the asset data, has been added instead.
The assetsdata table would be
column | type | notes |
---|---|---|
hash | char(64) | Primary key |
data | longblob |
The assetsdata table could be replaced by other storage mechanisms (e.g. the filesystem) in the future.
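The two tables above could be created roughly as follows. This sketch uses Python's built-in sqlite3 module purely so that it runs without a database server; the proposal's column types (tinyint, longblob and so on) suggest MySQL, so the types are approximated here. The REFERENCES clause and the NOT NULL constraints are assumptions that do not appear in the tables above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE assetsdata (
    hash CHAR(64) PRIMARY KEY,
    data BLOB NOT NULL                -- longblob in the proposal
);
CREATE TABLE assetsmeta (
    id          CHAR(36) PRIMARY KEY,
    hash        CHAR(64) NOT NULL REFERENCES assetsdata(hash),  -- assumption
    name        VARCHAR(64),
    description VARCHAR(64),
    asset_type  TINYINT,
    "local"     TINYINT,
    "temporary" TINYINT,
    create_time INT,
    access_time INT,
    asset_flags INT,
    creator_id  VARCHAR(128)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that now exist in the in-memory database.
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```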
Pseudo-code for adding a new asset
    On asset add
        Hash new asset
        Compare to existing hashes
        If match
            create new assetsmeta entry pointing to existing assetsdata entry
        If no match
            create new assetsdata entry
            create new assetsmeta entry pointing to new assetsdata entry
Pseudo-code for retrieving an asset
    On asset get
        select existing asset based on input id
        If match
            Fetch asset data from assetsdata
            Return asset metadata + data
        else
            return no such asset
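The add and get steps for the two-table design might look like the following sketch, again using sqlite3 as a stand-in database. The metadata is reduced to a single name column for brevity, and `add_asset` and `get_asset` are hypothetical names rather than OpenSimulator APIs.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assetsdata (hash TEXT PRIMARY KEY, data BLOB);
CREATE TABLE assetsmeta (id TEXT PRIMARY KEY, hash TEXT, name TEXT);
""")


def add_asset(asset_id, name, data):
    digest = hashlib.sha256(data).hexdigest()
    with conn:  # both inserts commit together or not at all
        # INSERT OR IGNORE stores the blob only if this hash is not yet present.
        conn.execute(
            "INSERT OR IGNORE INTO assetsdata (hash, data) VALUES (?, ?)",
            (digest, data),
        )
        conn.execute(
            "INSERT INTO assetsmeta (id, hash, name) VALUES (?, ?, ?)",
            (asset_id, digest, name),
        )
    return digest


def get_asset(asset_id):
    # Join metadata to data via the hash; returns None if there is no such asset.
    return conn.execute(
        "SELECT m.name, d.data FROM assetsmeta m "
        "JOIN assetsdata d ON d.hash = m.hash WHERE m.id = ?",
        (asset_id,),
    ).fetchone()
```

Letting the primary key on assetsdata.hash reject the duplicate insert folds the "compare to existing hashes" step into the insert itself, which is one possible choice among several.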
Discussion
I vastly prefer a two table design to extending the existing single table. To be honest, the single table design is only included for comparison purposes.
With a single table design one gets rows with no data but pointers to other assets, and rows with data but blank pointers. The point where metadata is retrieved is also non-obvious.
Also, the two table design can cope with asset removal (with the addition of reference counting to assetsdata). This is extremely difficult with a one table design.
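One way reference counting could support removal is sketched below. The `ref_count` column and the function names are assumptions for illustration; the proposal does not specify how the count would be stored.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assetsdata (hash TEXT PRIMARY KEY, data BLOB, ref_count INTEGER NOT NULL);
CREATE TABLE assetsmeta (id TEXT PRIMARY KEY, hash TEXT);
""")


def add_asset(asset_id, data):
    digest = hashlib.sha256(data).hexdigest()
    with conn:
        # Bump the count if the data already exists, otherwise store it with count 1.
        updated = conn.execute(
            "UPDATE assetsdata SET ref_count = ref_count + 1 WHERE hash = ?", (digest,)
        ).rowcount
        if updated == 0:
            conn.execute(
                "INSERT INTO assetsdata (hash, data, ref_count) VALUES (?, ?, 1)",
                (digest, data),
            )
        conn.execute("INSERT INTO assetsmeta (id, hash) VALUES (?, ?)", (asset_id, digest))


def remove_asset(asset_id):
    with conn:
        row = conn.execute("SELECT hash FROM assetsmeta WHERE id = ?", (asset_id,)).fetchone()
        if row is None:
            return False  # no such asset
        conn.execute("DELETE FROM assetsmeta WHERE id = ?", (asset_id,))
        conn.execute("UPDATE assetsdata SET ref_count = ref_count - 1 WHERE hash = ?", row)
        # The data row is deleted only once nothing references it any more.
        conn.execute("DELETE FROM assetsdata WHERE hash = ? AND ref_count = 0", row)
    return True
```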
Migration of asset data from the existing asset service might be slightly simpler with the single table design. However, migration is likely to be difficult in both cases (see below).
Migration
Migration of assets is extremely difficult due to the vast number of them.
In fact, I would propose that in this case, asset migration is not done within the OpenSimulator runtime. Rather, a parallel asset service would be created and a separate executable to take an asset from the original asset service and add it to the new one. Migration doesn't need to be done all at once - it can be done gradually over time and then the services switched over.
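A gradual, out-of-process migration could be sketched as a batch copier like the one below. Two in-memory sqlite3 databases stand in for the old and new asset services, the old table is reduced to id, name and data, and `migrate_batch` and `migrate_all` are hypothetical names. Paging by asset ID is just one simple way to make the copy resumable.

```python
import hashlib
import sqlite3

old = sqlite3.connect(":memory:")  # stands in for the existing asset service
new = sqlite3.connect(":memory:")  # stands in for the de-duplicating service
old.execute("CREATE TABLE assets (id TEXT PRIMARY KEY, name TEXT, data BLOB)")
new.executescript("""
CREATE TABLE assetsdata (hash TEXT PRIMARY KEY, data BLOB);
CREATE TABLE assetsmeta (id TEXT PRIMARY KEY, hash TEXT, name TEXT);
""")


def migrate_batch(after_id="", batch_size=100):
    """Copy up to batch_size assets with id > after_id.

    Returns the last id copied, or None when nothing is left.
    """
    rows = old.execute(
        "SELECT id, name, data FROM assets WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, batch_size),
    ).fetchall()
    if not rows:
        return None
    with new:  # each batch is committed as one transaction
        for asset_id, name, data in rows:
            digest = hashlib.sha256(data).hexdigest()
            # OR IGNORE makes re-running an interrupted batch harmless.
            new.execute(
                "INSERT OR IGNORE INTO assetsdata (hash, data) VALUES (?, ?)",
                (digest, data),
            )
            new.execute(
                "INSERT OR IGNORE INTO assetsmeta (id, hash, name) VALUES (?, ?, ?)",
                (asset_id, digest, name),
            )
    return rows[-1][0]


def migrate_all(batch_size=100):
    last_id = ""
    while True:
        result = migrate_batch(last_id, batch_size)
        if result is None:
            break
        last_id = result
```

Since the returned ID can be saved between runs, the copy can be stopped and resumed. A real tool would also need to pick up assets added to the old service after their ID range had already been passed.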
The de-duplicating asset service would be created as a parallel one to the existing asset service. Once ready, it would become the default, though the older asset service would remain and be deprecated. Only after a long period would the older asset service be removed from OpenSim.
Development
The service will be developed as a new package within OpenSimulator core. The original asset service package will never be altered. Choice of asset service will be done via config as usual.
Initial development of the service may also take place in a separate branch until the data schemas have been sorted out.