[Opensim-dev] Search server DB schema

Dalien Talbot dalienta at gmail.com
Tue Feb 5 01:44:38 UTC 2008


On Feb 5, 2008 12:42 AM, Diva Canto <diva at metaverseink.com> wrote:

> I would hate that OpenSim follows Linden Lab's steps with search without
> taking advantage of the lessons that even they already learned -- that
> relational schemas are not appropriate for modern information retrieval.


I soooo much want to print this sentence and hang it on the wall :-D

I did have an ambitious idea at one time to experiment with the SL search -
and had pulled the info from the userpicks of around 1 million SL users, and
pumped that into a MySQL table... the queries against that table were not
terribly fast (and yes, I did have an index  :-)



>
> I'll be happy to help setting up this basic search service with Lucene,
> rather than with a relational DB. Lucene is, essentially, a highly
> optimized database for text search. For example, issues like this
> " I'm going to assume name == varchar(63) and description ==
> varchar(127), but it might be easier to just set everything to
> varchar(255) for flexibility."
> are a non-problem in Lucene -- you can use as little or as much text as
> you want in a field, you don't need to hard-code that.


+1
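
To illustrate the point about field lengths, here is a toy inverted index in
Python. It is only a sketch of the general idea behind text search engines,
not Lucene's actual implementation or API; the class name TinyInvertedIndex
and the field names are made up for the example.

```python
from collections import defaultdict
import re


class TinyInvertedIndex:
    """Toy inverted index: maps each lowercase token to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> set of doc ids
        self.docs = {}                     # doc id -> stored fields

    def add(self, doc_id, **fields):
        # Fields can hold text of any length; nothing like varchar(N) is declared.
        self.docs[doc_id] = fields
        for text in fields.values():
            for token in re.findall(r"\w+", text.lower()):
                self.postings[token].add(doc_id)

    def search(self, query):
        # Return the ids of documents that contain every token of the query.
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = TinyInvertedIndex()
index.add("parcel-1", name="Sunny Beach",
          description="A quiet beach with " + "lots of " * 50 + "palm trees")
index.add("parcel-2", name="Beach Club", description="Live music every night")
print(sorted(index.search("beach")))       # ['parcel-1', 'parcel-2']
print(sorted(index.search("palm trees")))  # ['parcel-1']
```

The long description on parcel-1 needs no schema change, which is the
"non-problem" being described.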

There's an interesting paper:

http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf

which outlines the performance/data volumes we should keep in mind -
although something which seemed to be a winner in this analysis, zettair,
turned out to be a big disappointment for me - the code is sprinkled with
asserts, and there is at least one severe bug (database corruption when
loading the cache) in there. The good news is that it is BSD licensed, so it
might be an interesting candidate :)


>
> Can I help with plugging a Lucene-based search for OpenSim, please? (the
> thought of having a relational DB serving text search makes me shiver :-)


yes, please please :-)

Although as David absolutely correctly points out, Lucene is GPL - so
injecting it into the main BSD-licensed codebase is problematic.

So, we'd need to apply the license workaround and perform the linkage over
TCP :-) (probably HTTP would be the least cumbersome choice)
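
A rough sketch of what the HTTP linkage could look like from the region side,
in Python for brevity. Everything here is an assumption made for illustration:
the host name, the /search path and the JSON field names are hypothetical, no
such protocol has been agreed on. The request is only built, not sent.

```python
import json
import urllib.request


def build_search_request(base_url, query, search_type="places"):
    """Build (but do not send) an HTTP POST carrying a search query as JSON."""
    # Hypothetical payload layout: a search type plus the free-text query.
    payload = json.dumps({"type": search_type, "query": query}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",   # hypothetical endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_search_request("http://search.example.org:8080/", "beach club")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req, timeout=5)
```

The search engine then runs as its own process behind that URL, and the sim
only ever sees HTTP.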

Given that all the items are always centered around some *place*, I think it
might be logical to have the search completely grid-agnostic (although this
does not prevent us from pushing the "search server data" from UGA onto the
sim in the future, to avoid too much manual configuration).

Also, since one might potentially want to have their sim talk to more than
one search engine, it would be useful to keep this in mind.

Bot-based crawl is useful - but assumes the flatness of the universe, which
is true for SL - but is no longer true for OpenSim - at the moment there are
a few parallel universes out there (deepgrid, osgrid, just to name a couple)
- and there appears to be a demand for bringing those together.

So things become a bit more complicated :-) And I think it warrants a
"push" mechanism for the regions to supply the contents to the search
engines, rather than the "pull" mechanism that currently exists.

Another thing to keep in mind for the future - it would be nice to be able
to talk to more than one search engine, while being on the sim (assuming we
have the "multigrid" regions in the future).

So, to me it looks like there is a set of functions which would be
characteristic of any search engine being used:

1) region registers the associated objects on it upon coming online

2) region notifies the search engine about it going offline. This can easily
be a no-op for the beginning, but again useful to keep in mind.

3) region talks to the search engine to do the actual search and interprets
its results depending on the type of search (assuming we do implement the
"old-style" searches.)

4) region sends the add-update and delete requests to the search engine
while running, as needed (this is similar to (1) and (2)), and could involve
things like updating the search rankings based on the userpicks, or just
objects being set for sale, etc.
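
The four functions above could be sketched as an abstract interface plus a
trivial in-memory implementation. This is written in Python only to keep it
short; the names SearchBackend, InMemoryBackend and the item layout
(id / kind / text) are hypothetical, not an existing OpenSim API.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface covering functions (1)-(4) above."""

    @abstractmethod
    def register_region(self, region_id, items): ...   # (1) region comes online

    @abstractmethod
    def region_offline(self, region_id): ...           # (2) region goes offline

    @abstractmethod
    def search(self, query, kind=None): ...            # (3) the actual search

    @abstractmethod
    def upsert(self, region_id, item): ...             # (4) add-update

    @abstractmethod
    def delete(self, region_id, item_id): ...          # (4) delete


class InMemoryBackend(SearchBackend):
    def __init__(self):
        self.items = {}  # (region_id, item_id) -> item dict

    def register_region(self, region_id, items):
        for item in items:
            self.upsert(region_id, item)

    def region_offline(self, region_id):
        # A first version could leave this as a no-op; here the region's items are dropped.
        self.items = {k: v for k, v in self.items.items() if k[0] != region_id}

    def search(self, query, kind=None):
        # Case-insensitive substring match, optionally restricted to one kind of item.
        q = query.lower()
        return [item for item in self.items.values()
                if q in item["text"].lower() and (kind is None or item["kind"] == kind)]

    def upsert(self, region_id, item):
        self.items[(region_id, item["id"])] = item

    def delete(self, region_id, item_id):
        self.items.pop((region_id, item_id), None)


backend = InMemoryBackend()
backend.register_region("region-a", [
    {"id": "obj-1", "kind": "object", "text": "Red sofa for sale"},
    {"id": "parcel-1", "kind": "place", "text": "Sofa showroom"},
])
backend.upsert("region-a", {"id": "obj-1", "kind": "object", "text": "Blue sofa for sale"})
backend.delete("region-a", "parcel-1")
print([item["text"] for item in backend.search("sofa")])  # ['Blue sofa for sale']
backend.region_offline("region-a")
print(backend.search("sofa"))                             # []
```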

In the tradition of modularization, it might make sense to abstract these
functions into the region module "search", rather than snapping it directly
atop the DB abstraction layer - since the freetext search engines and the
DBs would have quite different interfaces, IMHO.

Then we might have additional region modules, which would know how to
translate the above, into the engine-specific API (e.g.: database-search,
lucene-search, etc.)

Those would register to the "search" module and take care of talking
potentially to more than one search engine at once. The "search" module
would dispatch the communication to the underlying backend-specific engines,
and aggregate the search results (and return "nothing" in the case there are
no backend-specific plugins loaded).
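
A minimal sketch of that dispatching, again in Python just for illustration.
SearchModule and ListBackend are made-up names, and de-duplicating hits by
item id is my own assumption about how the aggregation might work.

```python
class ListBackend:
    """Stand-in for a backend-specific module (e.g. database-search or lucene-search)."""

    def __init__(self, name, items):
        self.name = name
        self.items = items  # list of {"id": ..., "text": ...}

    def search(self, query):
        q = query.lower()
        return [item for item in self.items if q in item["text"].lower()]


class SearchModule:
    """Dispatches a query to every registered backend and merges the results."""

    def __init__(self):
        self.backends = []

    def register_backend(self, backend):
        self.backends.append(backend)

    def search(self, query):
        merged, seen = [], set()
        for backend in self.backends:
            for item in backend.search(query):
                if item["id"] not in seen:   # drop duplicates reported by several engines
                    seen.add(item["id"])
                    merged.append({**item, "source": backend.name})
        return merged


module = SearchModule()
print(module.search("sofa"))   # no backends loaded, so this prints []

module.register_backend(ListBackend("database-search", [{"id": "obj-1", "text": "Red sofa"}]))
module.register_backend(ListBackend("lucene-search", [
    {"id": "obj-1", "text": "Red sofa"},
    {"id": "obj-2", "text": "Sofa showroom"},
]))
for hit in module.search("sofa"):
    print(hit["id"], hit["source"])
```

With both backends registered, obj-1 is reported once (from database-search)
and obj-2 comes from lucene-search.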

This approach in my opinion would avoid the lock-in into some specific
search backend, and allow the relatively easy integration with enterprise
search engines for those who have a desire to do so.

David - indeed, the os* functions would be a very good idea, and would
interface to the same API as the (1,2,4) above.

(on a side note, an os* function to actually programmatically perform the
search seems also like a good thing to have)


> I've never participated in an Open Source project as such, so I'm not
> sure how the process is.


"respectful anarchy", as Charles has noted the other day :-)

Also, indeed, the IETF principle of "working implementations win" holds
true :-)

My reason for suggesting the regionmodule approach is that it might be one
of the most weakly coupled ways to do it - which is almost always a good
thing :-)

Also - this would allow us to accommodate the "low-dependency" approach of
database search for local objects and for the case of a standalone sim, as
well as allow a Lucene-based (or whatever-else-based) search for larger
scale deployments.
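
For the low-dependency end of that range, a database search can be as simple
as a LIKE query. The sketch below uses Python with an in-memory SQLite
database purely so that it is self-contained; the table and column names are
hypothetical, and a real standalone sim would use whatever database it
already has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_items (id TEXT PRIMARY KEY, name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO search_items VALUES (?, ?, ?)",
    [
        ("obj-1", "Red sofa", "Comfortable, copy/mod"),
        ("obj-2", "Lamp", "Goes well next to a sofa"),
        ("obj-3", "Palm tree", "Low prim"),
    ],
)


def db_search(conn, term):
    """Substring search over name and description using LIKE."""
    # Note: % and _ inside term act as wildcards here; escaping is left out of this sketch.
    pattern = "%" + term + "%"
    return conn.execute(
        "SELECT id, name FROM search_items "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY id",
        (pattern, pattern),
    ).fetchall()


print(db_search(conn, "sofa"))  # [('obj-1', 'Red sofa'), ('obj-2', 'Lamp')]
```

This is fine for a handful of local objects; the leading wildcard means every
row gets scanned, which is where a dedicated text search engine starts to pay
off.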

One other reason why it is nice to start with both - they are so different
that making an abstraction that works for both would make it quite easy to
add future search backends, IMHO.

What do you think of such an approach ?

Might be interesting to get together on IRC and discuss it further - what
are your timezones ?

/d




> I did contribute to OS projects before --
> aspectj.org, co-founder, and more recently contributed plugins to XWiki
> with one of my students.
>
> Let me know.
>
> Crista Lopes / Diva Canto
> School of Information and Computer Sciences
> University of California, Irvine