[Opensim-dev] Fw: Re: Search server DB schema

Diva Canto diva at metaverseink.com
Tue Feb 5 01:54:57 UTC 2008


I'll try to get on the IRC later tonight. I'm in California.

There is a Lucene.NET project in the Apache Incubator, but it seems to 
have stalled because its developer ran out of time. The last news is 
from April 2007.
I have only used the original Java implementation of Lucene, and I 
wouldn't recommend any other unless it was clear that it was actively 
maintained at Apache. One time I tried the Ruby and Python ports, and 
that was a bad experience.

Is there a policy that OpenSim should only be C#? (I like C#, but I go 
where libraries take me)
Search (for users) is a relatively independent component of any system; 
it doesn't need to be coupled at all. Let me explain how Lucene (and 
Google, for that matter) works, and you'll see that it's a bit different 
from the model in relational DBs.

Lucene is, essentially, an index of words. There are two parts to it:
(a) the collection and indexing of the words, which are scraped from 
text sources (any sources, including relational DBs); this produces an 
"index" file. The index can have many "fields", sort of like a DB, and 
you can associate weights with each field (for example, the words in 
"title" are usually more important than the words in "description") 
and with each document. A document is similar to a DB "row", i.e. 
it's a collection of words, possibly divided into fields.
(b) the search over this index, which is blazing fast and fairly 
expressive.
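
To make the two parts concrete, here is a minimal Python sketch of the 
idea. It is a toy, not Lucene's API or its real scoring formula; the 
field weights, the document IDs and the sample texts are all made up 
for illustration.

```python
import re
from collections import defaultdict

# Hypothetical field weights: words in "title" count double.
FIELD_WEIGHTS = {"title": 2.0, "description": 1.0}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Part (a): map each word to {doc_id: summed field weight}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in documents.items():
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for word in tokenize(text):
                index[word][doc_id] += weight
    return index

def search(index, query):
    """Part (b): add up the weights of the query words, best match first."""
    scores = defaultdict(float)
    for word in tokenize(query):
        for doc_id, weight in index.get(word, {}).items():
            scores[doc_id] += weight
    # Sort by descending score, then by ID so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up sample documents, each divided into two fields.
documents = {
    "store-1": {"title": "Patchworks Main Store",
                "description": "costumes, hats, shoes"},
    "store-2": {"title": "Hat Shop",
                "description": "Patchworks hats and more"},
}
index = build_index(documents)
print(search(index, "patchworks"))
```

Both stores mention "patchworks", but store-1 has it in the title, so 
it ranks first. The search itself is just a dictionary lookup per query 
word, which is why this kind of lookup stays fast as the collection 
grows.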

Lucene has APIs for updating the index after it has been built, but, 
unless the updates are simple, they tend to screw up the optimizations, 
so people usually generate the whole index regularly (e.g. once a day, 
every couple of hours, or whatever fits the needs). It is possible that 
the latest release of Lucene has improved that. But the basic idea is 
this: scrape words, produce an index file, have the search use that index.
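
The "regenerate the whole index regularly" approach can be sketched in 
Python like this. It is only an illustration under assumptions: the 
JSON file format and the function names are hypothetical, and this is 
not how Lucene stores or swaps its index. The point is that a fresh 
index is built on the side and then swapped in, so searches never see 
a half-built file.

```python
import json
import os
import tempfile

def rebuild_index(sources, index_path):
    """Build a fresh word -> [doc_id] index, then swap it into place."""
    index = {}
    for doc_id, text in sources.items():
        for word in text.lower().split():
            postings = index.setdefault(word, [])
            if doc_id not in postings:
                postings.append(doc_id)
    # Write to a temporary file in the same directory, then rename it over
    # the old index; os.replace is atomic on the same filesystem.
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(index, tmp)
    os.replace(tmp_path, index_path)

def load_index(index_path):
    """What the search side does: read whatever index file is current."""
    with open(index_path) as f:
        return json.load(f)
```

A scheduler (cron, for instance) would call rebuild_index once a day or 
every couple of hours, while the search process keeps calling 
load_index on the same path.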

To give you a concrete idea of how this works, our collection 
of LL's data is about 2 million "documents" (document = product or 
parcel or sim or notecard). The bots generate XML representations of 
things, which I understand is what you are doing too. Here's an example:

<sim ...>
  <item type="place" scripts="yes" listed="yes" pcat="Shopping" 
prims="398" products="142">
    <LLUUID>d953e34a-f6b4-5317-b1bb-490546db8ae5</LLUUID>
    <title>Patchworks Main Store</title>
    <description>Bliss (wedding), The Daisy Patch (garden), costumes, 
hats, shoes, maternity, women's wear, animations by M &amp; M, SL 
Exchange</description>
    <location>Abydos/113/57/28</location>
    <owner>Abydos Ventures </owner>
    <area>29344</area>
    <image>0432aa34-5e1c-f864-8b44-8db507326da0</image>
  </item>
</sim>
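
As a rough sketch of the "scrape" step, here is how a file like the one 
above could be turned into field dictionaries in Python before 
indexing. The function name is hypothetical, the sample is trimmed and 
its <sim> element carries no attributes (the real ones are elided 
above), and the ampersand is written as &amp;amp; so that the parser 
accepts it.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the example; <sim> attributes are omitted.
SAMPLE = """<sim>
  <item type="place" scripts="yes" listed="yes" pcat="Shopping"
        prims="398" products="142">
    <LLUUID>d953e34a-f6b4-5317-b1bb-490546db8ae5</LLUUID>
    <title>Patchworks Main Store</title>
    <description>costumes, hats, shoes, animations by M &amp; M</description>
    <location>Abydos/113/57/28</location>
    <owner>Abydos Ventures </owner>
    <area>29344</area>
  </item>
</sim>"""

def items_to_documents(xml_text):
    """Return one dict per <item>: its attributes plus its child elements."""
    root = ET.fromstring(xml_text)
    documents = []
    for item in root.iter("item"):
        doc = dict(item.attrib)
        for child in item:
            # Collapse line wraps and stray spaces inside the text.
            doc[child.tag] = " ".join((child.text or "").split())
        documents.append(doc)
    return documents

for doc in items_to_documents(SAMPLE):
    print(doc["title"], "|", doc["location"], "|", doc["pcat"])
```

Each resulting dictionary corresponds to one "document", and its keys 
(title, description, owner, ...) are the natural candidates for index 
fields.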

It takes about 45 minutes to generate the index from these XML 
representations stored in 15,000+ XML files (one per sim), on a machine 
with 2M RAM.
Lucene is scalable to billions of documents and, if needed, it can also 
be partitioned among different servers.
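
One simple way to picture partitioning, assuming one XML file per sim 
as described above: hash each sim name to pick the server that indexes 
it. This is a generic sharding sketch in Python, not a description of 
Lucene's own mechanism; the function names and all sim names except 
Abydos are made up.

```python
import zlib

def shard_for(sim_name, num_shards):
    """Pick a shard with a stable hash (CRC32) of the sim name."""
    return zlib.crc32(sim_name.encode("utf-8")) % num_shards

def partition(sim_names, num_shards):
    """Group sim names into num_shards lists, one list per index server."""
    shards = [[] for _ in range(num_shards)]
    for name in sim_names:
        shards[shard_for(name, num_shards)].append(name)
    return shards

# Hypothetical sim names, apart from Abydos.
sims = ["Abydos", "ExampleSimA", "ExampleSimB", "ExampleSimC"]
for number, names in enumerate(partition(sims, 3)):
    print(number, names)
```

Each server then builds an index over its own share, and a query is 
sent to all of them with the ranked results merged afterwards.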

I'll talk more later.

David Wendt JR. wrote:
> Yahoo mail sent the reply to diva's e-mail instead of the whole ML. 
> I've forwarded the original response below.
>
> ----- Forwarded Message ----
> From: David Wendt JR. <dcrkid at yahoo.com>
> To: diva at metaverseink.com
> Sent: Monday, February 4, 2008 7:16:50 PM
> Subject: Re: [Opensim-dev] Search server DB schema
>
>     I read up on Lucene. It's Java, made by Apache and Apache 
> licensed. Only question, is there a .NET/Mono/C# client library 
> available? One that's BSD compatible? If we have that we can start 
> work on Lucene. As for your comments about relational not being good 
> for search... you might be right. Look how many times Old Search would 
> stop working... they eventually said "screw it" and went with a Google 
> Search Appliance, which works a lot better IMHO.
>
>     Anyway, looking at OpenSim.Framework.Data it may be possible to 
> have our cake and eat it too, provided said cake isn't a lie. Opensim 
> already has a good database abstraction framework that looks like it 
> could also support non-SQL/non-relational databases if we so desire. 
> All we would really need to do is write another database plugin for 
> Lucene. At first glance this seems like a very good idea: we can push 
> forward with Search and then write the plugin for the high-performance 
> Lucene search later.
>
>     Devs, what do you think of that?
>
>     UNRELATED NOTE: When we implement search, specifically HTTP 
> Search, we should have an osFunction that lets vendors publish their 
> contents to search.
>
> ----- Original Message ----
> From: Diva Canto <diva at metaverseink.com>
> To: opensim-dev at lists.berlios.de
> Sent: Monday, February 4, 2008 6:42:26 PM
> Subject: Re: [Opensim-dev] Search server DB schema
>
> Hello opensim-developers,
>
> First of all, thanks for putting opensim together, this is what we all
> needed! I just signed up for this list, so apologies if my comments are
> out of place. I did set up my own OpenSim, and it's great! (in spite of
> all the fights I've been having with Mono on Mac, but that's another 
> thread)
>
> Over the past 10 months or so, a colleague and I have developed an
> independent search engine for LL's SL, which can be accessed here:
> http://slbrowser.com The engine does not require special access to grid
> databases, it uses bots to collect inworld information, one sim at a
> time. We use libsecondlife. It has been working continuously for the
> past 6 months; we crawl the grid twice a week with only 14 bots, and
> have been able to find appropriate heuristics for many things.
>
> In general search can be looked at in two ways, and these are *not*
> incompatible: it can be a basic administrative function -- you want to
> know the data you serve; or it can be a basic user function -- you want
> people to be able to find things. The first type of search is really
> simple: stick a DB, and you've solved it. This works fairly well for
> small amounts of data, and for data that is fairly constrained. The
> second type of search is a lot more powerful, but it's not so simple,
> because you want to rank the huge amount of results in a semantically
> meaningful way. With SLBrowser, we've followed the latter. Crawling the
> live sims gives us a lot of useful aggregate information that we use to
> experiment with ranking in much more interesting ways. We don't use a
> relational DB on the backend, we use Lucene.
>
> I would hate for OpenSim to follow Linden Lab's steps with search without
> taking advantage of the lessons that even they already learned -- that
> relational schemas are not appropriate for modern information retrieval.
> I'll be happy to help set up this basic search service with Lucene,
> rather than with a relational DB. Lucene is, essentially, a highly
> optimized database for text search. For example, issues like this
> " I'm going to assume name == varchar(63) and description ==
> varchar(127), but it might be easier to just set everything to
> varchar(255) for flexibility."
> are a non-problem in Lucene -- you can use as little or as much text as
> you want in a field, you don't need to hard-code that.
>
> Can I help with plugging a Lucene-based search into OpenSim, please? (the
> thought of having a relational DB serving text search makes me shiver :-)
> I've never participated in an Open Source project as such, so I'm not
> sure how the process works. I did contribute to OS projects before --
> aspectj.org <http://aspectj.org>, as co-founder, and more recently I
> contributed plugins to XWiki with one of my students.
>
> Let me know.
>
> Crista Lopes / Diva Canto
> School of Information and Computer Sciences
> University of California, Irvine
>
>
>
> _______________________________________________
> Opensim-dev mailing list
> Opensim-dev at lists.berlios.de <mailto:Opensim-dev at lists.berlios.de>
> https://lists.berlios.de/mailman/listinfo/opensim-dev
>
