Feature Proposals/Deduplicating Asset Service

From OpenSimulator

Latest revision as of 16:12, 15 March 2013

Date

November 2011.

Status

Draft and being test implemented. Everything here is open for change and discussion.

Proposers

Justin Clark-Casey (justincc)

Introduction

The asset service stores practically all sim data (textures, scripts, etc.). Many assets are exact duplicates of existing assets, except stored under a different asset ID (and occasionally different metadata).

This feature would create an asset service which stores a hash of every asset, enabling duplicate assets to be detected so that only one copy of the data is stored.

The intention is not to replace existing facilities such as the Simple Ruby Asset Server but rather to raise the level of the default asset service in OpenSimulator, in order to cope with future demands placed on it by upload and other content import mechanisms.

Proposal

A high proportion of uploaded assets, whether uploaded via the viewer or through mechanisms such as OARs and IARs, are duplicates of existing assets stored by the asset service. On OSGrid, for example, Nebadon Izumi, OSGrid president, found that 20-25% of all assets were duplicates when OSGrid switched over from the default OpenSimulator asset service to Simple Ruby Asset Server (SRAS).

Here, I propose to establish an asset service that hashes all assets and so detects and eliminates duplicates. This would function along similar lines to coyled's existing SRAS.

Design

There are two major alternatives.

Design 1: Add hash and pointer column to existing asset table

The first design would see a hash column and a pointer column added to the existing asset table. The hash column would be a primary key.

Pseudo-code for adding a new asset

On asset add
  Hash new asset
  Compare to existing hashes
  If match
    create new asset table entry storing metadata and pointer to existing asset hash
  else
    create new asset table entry storing data, hash and metadata


Pseudo-code for retrieving an asset

On asset get
  select existing asset based on input id
  If match
    If asset contains data directly
      return existing asset
    else
      look up asset pointed to by reference
      return asset using initial metadata and pointer-referenced data
  else
    return no such asset
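
The single-table pseudo-code above can be sketched in Python as follows. This is only an illustrative in-memory model, not OpenSimulator code: the `rows` and `id_by_hash` dictionaries, the function names and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

# Hypothetical in-memory stand-in for the single extended asset table.
rows = {}        # asset id -> {"meta", "hash", "data", "pointer"}
id_by_hash = {}  # hash -> id of the row that holds the data directly


def add_asset(asset_id, data, meta):
    digest = hashlib.sha256(data).hexdigest()
    if digest in id_by_hash:
        # Duplicate: store metadata plus a pointer to the existing hash.
        rows[asset_id] = {"meta": meta, "hash": None, "data": None, "pointer": digest}
    else:
        # New data: store data, hash and metadata in the same row.
        rows[asset_id] = {"meta": meta, "hash": digest, "data": data, "pointer": None}
        id_by_hash[digest] = asset_id


def get_asset(asset_id):
    row = rows.get(asset_id)
    if row is None:
        return None  # no such asset
    if row["pointer"] is None:
        return row["meta"], row["data"]
    # Follow the pointer, but keep this row's own metadata.
    target = rows[id_by_hash[row["pointer"]]]
    return row["meta"], target["data"]


add_asset("a", b"same bytes", {"name": "original"})
add_asset("b", b"same bytes", {"name": "copy"})
```

Note how the row for "b" carries a pointer but no data, while the row for "a" carries data but no pointer. Both kinds of row live in the same table.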

Design 2: Create two separate tables assetsmeta and assetsdata

The second design would see the creation of two separate tables. The assetsmeta table would be

column       type          notes
id           char(36)      Primary key
hash         char(64)
name         varchar(64)
description  varchar(64)
asset_type   tinyint(4)
local        tinyint(1)
temporary    tinyint(1)
create_time  int(11)
access_time  int(11)
asset_flags  int(11)
creator_id   varchar(128)

This matches the existing assets table except that the data column is no longer present and a hash column has been added instead.

The assetsdata table would be

column  type      notes
hash    char(64)  Primary key
data    longblob

This could be replaced by other storage mechanism options (e.g. filesystem) in the future.
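
As a rough illustration, the two tables described above could be declared as below. The sketch uses Python's built-in sqlite3 module only so that it is runnable anywhere. The test implementation targets MySQL, so the column types here are approximations and the `create_schema` helper is a name invented for this example.

```python
import sqlite3

# Column names follow the tables above; types are SQLite approximations
# of the MySQL types listed in the proposal.
SCHEMA = """
CREATE TABLE assetsmeta (
    id          CHAR(36) PRIMARY KEY,
    hash        CHAR(64),
    name        VARCHAR(64),
    description VARCHAR(64),
    asset_type  TINYINT,
    "local"     TINYINT,
    "temporary" TINYINT,
    create_time INT,
    access_time INT,
    asset_flags INT,
    creator_id  VARCHAR(128)
);
CREATE TABLE assetsdata (
    hash CHAR(64) PRIMARY KEY,
    data BLOB
);
"""


def create_schema(conn):
    """Create both tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
meta_columns = [row[1] for row in conn.execute("PRAGMA table_info(assetsmeta)")]
```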

Pseudo-code for adding a new asset

On asset add
  Hash new asset
  Compare to existing hashes
  If match
    create new assetsmeta entry pointing to existing assetsdata entry
  else
    create new assetsdata entry
    create new assetsmeta entry pointing to new assetsdata entry

Pseudo-code for retrieving an asset

On asset get
  select existing asset based on input id
  If match
    fetch asset data from assetsdata using the stored hash
    Return asset metadata + data
  else
    return no such asset
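
The add and get steps for the two table design might look like the following minimal sketch. It models the two tables as Python dictionaries rather than a real database, and the `TwoTableStore` class, its method names and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib


class TwoTableStore:
    """Toy in-memory stand-in for the assetsmeta and assetsdata tables."""

    def __init__(self):
        self.assetsmeta = {}  # asset id -> metadata dict, including "hash"
        self.assetsdata = {}  # hex digest -> raw asset bytes

    def add(self, asset_id, data, **metadata):
        digest = hashlib.sha256(data).hexdigest()
        # Only create a new assetsdata entry if no identical blob exists.
        if digest not in self.assetsdata:
            self.assetsdata[digest] = data
        # Always create a new assetsmeta entry pointing at the blob.
        self.assetsmeta[asset_id] = dict(metadata, hash=digest)
        return digest

    def get(self, asset_id):
        meta = self.assetsmeta.get(asset_id)
        if meta is None:
            return None  # no such asset
        return meta, self.assetsdata[meta["hash"]]


store = TwoTableStore()
store.add("id-1", b"texture bytes", name="brick")
store.add("id-2", b"texture bytes", name="brick copy")
```

After these two calls there are two metadata entries but only one stored blob, which is the saving the proposal is after.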

Discussion

I vastly prefer a two table design to extending the existing single table. To be honest, the single table design is only included for comparison purposes.

With a single table design one gets rows with no data but pointers to other assets, and rows with data but blank pointers. The point where metadata is retrieved is also non-obvious.

Also, the two table design can cope with asset removal (with the addition of reference counting to assetsdata). This is extremely difficult with a one table design.

Migration of asset data from the existing asset service might be slightly simpler with the single table design. However, migration is likely to be difficult in both cases (see below).

Asset service interfaces

I do not envisage the OpenSimulator asset service interfaces (IAssetService) being changed in any way by this proposal. This means that existing asset services will require no alteration.

Asset deletion

At this stage, the xassetservice will not support deletion of asset data (though metadata can still be deleted). Deletion could be supported if the hash column in the metadata table became a secondary key, so that the references to a blob can be found, and if we assume, as the current asset service does, that an asset blob with only one reference is deletable. However, only a very small subset of assets ever receive deletion requests, so I do not regard this as a major issue.
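
One way deletion could work is sketched below, under the assumption that a blob is removed once its last metadata reference is deleted. This is not how the xassetservice behaves today. The `RefCountedStore` class and its explicit `refcount` counter are hypothetical, and a database implementation could instead count assetsmeta rows via the secondary key on hash.

```python
import hashlib
from collections import Counter


class RefCountedStore:
    """Toy model of assetsmeta/assetsdata with per-blob reference counts."""

    def __init__(self):
        self.assetsmeta = {}       # asset id -> hash
        self.assetsdata = {}       # hash -> raw bytes
        self.refcount = Counter()  # hash -> number of assetsmeta entries

    def add(self, asset_id, data):
        if asset_id in self.assetsmeta:
            # Replacing an asset releases its old blob reference first.
            self.delete(asset_id)
        digest = hashlib.sha256(data).hexdigest()
        self.assetsdata.setdefault(digest, data)
        self.assetsmeta[asset_id] = digest
        self.refcount[digest] += 1

    def delete(self, asset_id):
        digest = self.assetsmeta.pop(asset_id, None)
        if digest is None:
            return False  # no such asset
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            # Last reference gone: the blob itself can be removed.
            del self.assetsdata[digest]
            del self.refcount[digest]
        return True


store = RefCountedStore()
store.add("a", b"shared")
store.add("b", b"shared")
store.delete("a")
blob_kept_while_referenced = len(store.assetsdata) == 1
store.delete("b")
```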

Compression

Asset data will probably not be compressed. Though compression reduces storage size, it also requires CPU time to compress and decompress. As we aren't transmitting data over the wire, network considerations do not come into play.

Migration

Migration of assets is difficult due to the vast number of them. The options are

Do asset migration entirely outside of OpenSimulator runtimes

A parallel xassets service would be created, along with a separate executable that copies assets from the original asset service to the new one. Migration doesn't need to be done all at once - it can be done gradually over time (e.g. by periodic running of a migration tool) and the services then switched over via a config change.
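
A gradual, out-of-process migration of this kind could be structured as repeated small batches, as in the sketch below. Everything here is hypothetical: the `migrate_batch` function, the dictionary stand-ins for the old and new stores, and the omission of metadata copying are simplifications made for the example.

```python
import hashlib


def migrate_batch(legacy, assetsmeta, assetsdata, batch_size):
    """Copy up to batch_size not-yet-migrated assets; return how many were copied."""
    copied = 0
    for asset_id, data in legacy.items():
        if copied >= batch_size:
            break
        if asset_id in assetsmeta:
            continue  # already migrated on an earlier run
        digest = hashlib.sha256(data).hexdigest()
        assetsdata.setdefault(digest, data)  # deduplicate while copying
        assetsmeta[asset_id] = digest
        copied += 1
    return copied


# Old store: asset id -> raw bytes. Two of the three assets are duplicates.
legacy = {"a": b"x", "b": b"x", "c": b"y"}
assetsmeta, assetsdata = {}, {}

first_run = migrate_batch(legacy, assetsmeta, assetsdata, batch_size=2)
second_run = migrate_batch(legacy, assetsmeta, assetsdata, batch_size=2)
third_run = migrate_batch(legacy, assetsmeta, assetsdata, batch_size=2)
```

Because each run skips assets that are already present in the new store, the tool can be stopped and restarted at any point without redoing work.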

Pros

  • Relatively simple implementation.
  • Easy to switch back if something goes wrong and the user has kept their old asset database (albeit with some possible asset loss if new assets are in xassets and for some reason are not recoverable).

Cons

  • During migration, two copies of asset data will be in the database. Therefore, one will need up to double the hard drive space of the existing asset database. In practice, the requirement should be slightly less for large databases, since assets are deduplicated as they are copied.

Create a chained asset service

An alternative would be for XAssetService to optionally reference a backend vanilla asset database.

New assets would go in the xassets tables. Requests for assets would query the xassets db plugin first, then fall back to the older assets db plugin.

One could also choose that moment to migrate the asset.
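
The lookup order for a chained service, including the optional migrate-on-read step, can be sketched as follows. This is an assumption-laden model rather than the actual XAssetService code: the `ChainedStore` class, its `migrate_on_read` flag and the dictionary-backed stores are invented for illustration.

```python
import hashlib


class ChainedStore:
    """Toy xassets store that falls back to a legacy id -> bytes store."""

    def __init__(self, legacy, migrate_on_read=True):
        self.legacy = legacy  # stand-in for the older assets db plugin
        self.migrate_on_read = migrate_on_read
        self.assetsmeta = {}  # asset id -> hash
        self.assetsdata = {}  # hash -> raw bytes

    def add(self, asset_id, data):
        digest = hashlib.sha256(data).hexdigest()
        self.assetsdata.setdefault(digest, data)
        self.assetsmeta[asset_id] = digest

    def get(self, asset_id):
        # Query the new tables first.
        digest = self.assetsmeta.get(asset_id)
        if digest is not None:
            return self.assetsdata[digest]
        # Fall back to the legacy store.
        data = self.legacy.get(asset_id)
        if data is not None and self.migrate_on_read:
            self.add(asset_id, data)  # migrate at the moment of first access
        return data


store = ChainedStore(legacy={"old-1": b"legacy bytes"})
store.add("new-1", b"fresh bytes")
fetched = store.get("old-1")
```

In this sketch the legacy copy is left in place after migration. Whether and when to remove it is a separate decision that the proposal leaves open.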

Pros

  • Relatively simple for the user.
  • No extra space requirement.

Cons

  • If something goes wrong then recovery might be harder (although restoration from any database snapshot should be fine).
  • Implementation is more complex.

Status

The chained asset service is implemented in current git master code. It might still need some option (or perhaps even a console command) to complete migration.

Development

The service will be developed as a new package within OpenSimulator core. The original asset service package will never be altered. Choice of asset service will be done via config as usual.

Initial development of the service may also take place in a separate branch until the data schemas have been sorted out.

Testing

A test implementation is being done in the xassetservice OpenSimulator git branch. You can try this out by building the code and then putting the following config in StandaloneCommon.ini or any other of the *Common ini files in the existing [AssetService] section.

[AssetService]
   LocalServiceModule = "OpenSim.Services.AssetService.dll:XAssetService"
   StorageProvider = "OpenSim.Data.MySQL.dll:MySQLXAssetData"

There is currently only an implementation for MySQL. This implementation is not complete and should not be used for anything other than testing. It is subject to change without notice in ways that will not be backward compatible.

Further possible work

Hash querying

If hashes of assets are stored, then when an asset is uploaded to a simulator, instead of forwarding the uploaded blob to the asset service for hashing and comparison, the simulator could hash the asset itself and query the xassetservice as to whether it already has that asset data. If it does, then the simulator only needs to upload the new metadata rather than the data.

This involves three messages rather than one (=> query data hash, <= reply yes/no, => upload metadata/data) but results in far lower data upload. Whether this is a worthwhile trade-off is as yet unknown.

This also involves an extension of the existing asset service interface.
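
The three-message exchange could be modelled as below. This is a sketch of one possible shape for the extension, not an existing interface: `has_data`, `upload` and `simulator_upload` are hypothetical names, and the service-side hash check is an extra safeguard assumed for the example.

```python
import hashlib


class AssetService:
    """Toy service exposing a hypothetical hash query alongside upload."""

    def __init__(self):
        self.assetsmeta = {}  # asset id -> hash
        self.assetsdata = {}  # hash -> raw bytes

    def has_data(self, digest):
        # Messages 1 and 2: query data hash, reply yes/no.
        return digest in self.assetsdata

    def upload(self, asset_id, digest, data=None):
        # Message 3: metadata always, data only when the service lacks it.
        if data is not None:
            if hashlib.sha256(data).hexdigest() != digest:
                raise ValueError("hash does not match uploaded data")
            self.assetsdata.setdefault(digest, data)
        elif digest not in self.assetsdata:
            raise KeyError("service has no data for this hash")
        self.assetsmeta[asset_id] = digest


def simulator_upload(service, asset_id, data):
    """Return the number of asset data bytes actually sent to the service."""
    digest = hashlib.sha256(data).hexdigest()
    known = service.has_data(digest)
    service.upload(asset_id, digest, None if known else data)
    return 0 if known else len(data)


service = AssetService()
sent_first = simulator_upload(service, "id-1", b"0123456789")
sent_second = simulator_upload(service, "id-2", b"0123456789")
```

The second upload sends no asset data at all, at the cost of the extra query round trip that the first upload also had to pay.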
