[Opensim-dev] Concerning recent grid instability and inconsistency

Brian Wolfe brianw at terrabox.com
Fri Feb 29 03:08:13 UTC 2008


inline responses follow. ;) Please use a block of salt as you read my
opinions.

On Fri, 2008-02-29 at 01:27 +0100, Dalien Talbot wrote:
> Hiro,
> 
> excellent post. 
> comments inline.
> 
> On Wed, Feb 27, 2008 at 6:59 PM, James Stallings II
> <james.stallings at gmail.com> wrote:
>         
>         Early this morning there was (yet another) instance where
>         certain regions could be readily accessed by some and could
>         not be accessed at all by others.
>         
>         This led to an insightful discussion on IRC this morning,
>         which I will attach for your convenience.
>         
>         Here are some high points:
>         
>         Various users were able to access WP reliably last night while
>         others were completely unable to do so. The consistent
>         relevant factor was a high ping time from the affected clients
>         to the region server in question. A properly executed
>         traceroute revealed that the network fault lay with level3
>         networks, between the affected clients and the region server
>         in question.
>         
>         recently I was graced with the company of many new neighbors
>         on the grid. With the new neighbors came complete instability
>         with respect to my region server. Some of the instabilities
>         were my own doing. Fixing them did not completely address the
>         problem. A subsequent relocation to a neighbor-free location
>         on the grid did.
> 
> I wonder if it was inter-region comms, region-grid comms, or
> region-client comms that were the issue - do we know by chance which
> of them were the culprits? Or the sequence of events that would help
> to pinpoint this?

We really need to do some work in isolating the source of the problem
before we attempt to solve it. Otherwise we're shooting in the dark and
probably rewriting things that don't need rewriting.

I do have good resources available to do some network diagnostics if you
want to plug a region into my grid and debug that way. If so, feel free
to ping me online or via email to set up a plan for tracking this beast
down.

>  
>         
>         one of our newer contributors has audited much of the
>         low-level packet handling code and found it to be in want of
>         some rearchitecting in the interest of efficiency. This is a
>         salient point; if the communications code upon which the grid
>         relies for region<>region and client<>region communications is
>         not optimum, we cannot count on operations to be optimum. This
>         also has profound implications for troubleshooting, as all
>         diagnostic data arrives via the conduits implemented by this
>         low-level comms code.
> 
> 
> I thought region<->region comms did not work by means of packet
> handling, and that packet handling was primarily for region<->client
> comms - do we have any more specifics?

Region <-> region is handled via .NET remoting so far as I know (aka
XMLRPC). Only the region <-> client connection uses UDP packet handling.
As such, there shouldn't be any evilly complex packet switching to cause
issues here. What I do see having problems is the remoting code being
CPU-heavy or inefficient in speed or validation. From a VERY cursory
glance over the code to date, I didn't see much error-catching logic.
8-( There is definitely much cleaning and security yet to be applied to
the remoting methods.
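To show the kind of defensive wrapper I have in mind, here's a rough
Python sketch (OpenSim itself is C#, and every name here is purely
illustrative - nothing from the actual tree):

```python
# Hypothetical sketch of defensive error catching around an
# inter-region call; names and retry policy are illustrative only.
def safe_remote_call(call, *args, retries=3):
    """Invoke an inter-region call, catching failures instead of
    letting one bad neighbour take the whole region down."""
    for attempt in range(retries):
        try:
            result = call(*args)
            if result is None:  # validate before trusting the reply
                raise ValueError("empty response from remote region")
            return result
        except Exception:
            continue  # log and retry in a real implementation
    return None  # caller treats None as "region unreachable"
```

The point isn't the retry loop itself so much as that every remoting
entry point ought to fail soft and validate what it gets back.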

>  
>         
>         adjacent misconfigured regions have been demonstrated to have
>         a significant impact on a given region's stability. Something
>         needs to be done to address this.
> 
> again, first thing to do imho is to find out what is the trigger.
>  
> (I will save my usual rant that the "centralized" model is not
> scalable in a "large" (as in "millions of regions") deployment model -
> we did have some discussion with Ahzz on this but did not have the
> time yet to make significant progress... The only thing I did for
> now is import a BSD-licensed DNS resolver - we were agreeing
> that (ab)using the TXT records in DNS is certainly a good way to go
> towards a more scalable system.)

:) Yes, there does need to be a way to fragment a grid without relying
on artificial constructs such as a reverse HTTP routing proxy or
load-balancing SQL partitioning magic. I've never understood why Linden
Lab didn't take steps to fragment their infrastructure a long time ago
instead of taking the monolithic approach that has proven time and again
to impose severe performance and scalability limitations on something
that is supposed to "replace" the majority of the internet. ;) One would
have thought that this feature would have been designed in from the
start. But hey, no one is perfect, eh?

> 
> 
>         
>         I will present some talking points in terms of these roles and
>         their goals.
>         
>         The set of roles and goals of the OSGrid may be summarized as
>         a brief mission statement which I characterise as follows:
>         
>                 OSGrid exists as a multirole support operation
>                 supporting and conducting testing pursuant to the
>                 development of OpenSim virtual reality/telepresence
>                 software. Further, it exists to provide a free parking
>                 service for the development and testing community,
>                 both for pure testing and content hosting. Finally,
>                 OSGrid exists to develop and drive development of
>                 grid-related tools and/or code, whether as
>                 freestanding utilities or source for contribution to
>                 the codebase of OpenSim.
>         
>         
>         Pursuant to fulfillment of these roles, a course of action
>         needs be devised which facilitates all of these objectives
>         simultaneously. Here is a plan which seeks to accomplish this.
>         Note that some key points are short term; others, less so:
>         
>         Packet handling technology critique. Involve directly those
>         who have worked on this code in the past with those who are
>         interested in carrying the work forward. Get either a
>         commitment from its original authors to return to active
>         development, or alternatively an informal hand-off of the work
>         to this group. Then perform a systematic audit on this code,
>         as a group. Document it as we go. Explaining how code works to
>         a piece of paper can show up just as many flaws as a trial
>         run. There's the side benefit of generating thorough
>         documentation.
> 
> 
> again - packet handling, afaik, pertains largely to client<->region
> communications - please correct me if I am wrong. Indeed this should
> become the area of focus if we determine it is the bottleneck, but I
> would think the troubles lie in the region<->region and
> region<->grid server comms, which are done by other means.

There are many things that can be done to improve the efficiency, but I
have to point out that we need stability before efficiency. The
UserThreadPool branch was done as a means of freeing up threads and
reducing context switching on heavy-use regions so that we could then
concentrate on the annoying bugs it uncovers by exposing multi-thread
access conflicts, which I'm guessing are at the core of MANY issues that
still remain. Fortunately, thread-concurrent access is being heavily
worked on by Ter and others. :)
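For anyone who hasn't chased one of these down, here's a minimal Python
illustration of the pattern those fixes follow - serialise access to
shared scene state so pool threads can't race each other (illustrative
names, not OpenSim code):

```python
import threading

# Illustrative sketch only: shared state guarded by a lock so that
# worker-pool threads can't corrupt it with concurrent writes.
class SceneState:
    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}

    def update_object(self, obj_id, data):
        with self._lock:  # one writer at a time
            self._objects[obj_id] = data

    def snapshot(self):
        with self._lock:  # consistent read, no torn state
            return dict(self._objects)
```

Without the lock, a dict being resized by one thread while another reads
it is exactly the kind of heisenbug the thread pool work keeps exposing.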

> 
>         
>         One sequence of events currently in progress with respect to
>         this, is that several of our commercial contributors are about
>         to introduce a major set of stability and load balancing
>         patches. This set of patches might well replace or repair the
>         affected code, so this may in fact become a non-issue.
>         
>         This does not mean we can afford to sit and wait for them to
>         make the drop. What we can take immediate action on, is to
>         prepare to profile the performance impact in quantitative ways
>         once this code has been dropped. In this interest, ChrisD will
>         be placing some regions alongside mine out in the open; our
>         intention is to maintain a separate 'island', where we can
>         exercise fine-grained control over who our neighbors are (and
>         more importantly, fine-grained control of the neighbor
>         configuration) for testing purposes. 
> 
> yes - so pertaining to my comments above - narrowing down what exactly
> is the trigger for the slowness is the key.

Sounds like a plan to me. Let us know what develops. I would go a step
further and use a firewall between your regions to introduce artificial
drops, latency, etc. to see if things are being too sensitive in that
regard.

>  
>         
>         We also hope to be able to prove our theory about brokenly
>         configured adjacent regions. If you bring up a region adjacent
>         to us, we will likely ask you to move it or remove it
>         ourselves if we are unable to reach you.
>         
>         Set up an automated vetting process for incoming regions on
>         the grid. Say, submit region xml files via HTTP/POST, and have
>         them programmatically checked for syntactical correctness,
>         geographical collisions, network reachability, port
>         configuration consistency, completeness of information, etc.
>         before allowing a final connection with grid services. 
> 
> +1. Once we understand what kind of misconfigurations are causing the
> issues - maybe we should go even further and address the root causes
> of them, rather than checking the configuration.

Hmm. I don't know if I would do that externally. The grid server should
be rejecting outright any misconfigurations itself since it IS the
master of the grid after all. The heartbeat code could be used to
perform an initial reachability test.
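As a rough sketch of what such a vetting pass might look like (Python
for brevity; the field names are my invention, not the real region XML
schema):

```python
# Hypothetical vetting pass for a submitted region config.
# Field names are illustrative, not the actual region XML schema.
REQUIRED = ("region_name", "sim_ip", "sim_port", "grid_x", "grid_y")

def vet_region(config, occupied):
    """Return a list of problems; an empty list means the region
    may proceed to connect to grid services."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in config]
    if not problems:
        coord = (config["grid_x"], config["grid_y"])
        if coord in occupied:  # geographical collision
            problems.append(f"location {coord} already taken")
        if not (0 < config["sim_port"] < 65536):
            problems.append("port out of range")
    return problems
```

A reachability probe against sim_ip:sim_port would slot in as one more
check before the final accept.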

>  
>         
>         This also introduces a logical point for a payment
>         collection/verification process for that day when we have a
>         salable service, for those who are interested in providing
>         for-pay grid services at that time.
> 
> This would provide a roadblock. The payment should still be a hook
> outside of the "automated" process. (like, providing a crypto token of
> sorts, which would be checked at the time of the region coming up)

*nod* If you want to control grid connectivity based on payment status,
then I think it would be more appropriate to make use of the existing
send and recv keypairs in the regions table. The region_login RPC method
does check these, and it would be rather simple for your external
website to change the keypair so that the customer no longer has the
valid pair and thus can't hook up to the grid.

I do like the idea of using the keypair as a set of pub/priv encryption
keys in order to secure our inter-region communications content. An
alternative is to use them as x.509 certs to compare on the SSL layer.
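A minimal sketch of the keypair gate I'm describing (Python,
illustrative; only the recvkey/sendkey column names come from the actual
regions table - the function and table shape here are made up):

```python
# Illustrative keypair gate: a region may join only if the key it
# offers matches what the grid has stored for it. Rotating the stored
# key is how a billing site would cut off a lapsed customer.
def region_may_login(regions_table, region_uuid, offered_key):
    row = regions_table.get(region_uuid)
    return row is not None and row["recvkey"] == offered_key
```

Usage: the external website simply rewrites the stored key and the next
region_login attempt with the old pair fails.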

>  
>         
>         Establish an effective heartbeat mechanism between region
>         servers and grid server. If the heartbeat goes away, the
>         region table entry is appropriately updated to reflect the
>         region state, and except for geographic purposes and region
>         heartbeat response, the region would be ignored by grid
>         services (as if it were not there). If/when the region
>         produces new signs of life, it would then be returned to
>         active participation with grid services. This could be taken a
>         step further with longterm monitoring of the region
>         heartbeat. 
> 
> I think Ahzz had some patch in mantis and his git repo on this. Since
> I am not much into centralized grid comms, I did not commit it - I 
> think it was introducing some changes to the SQL tables...

Yep. It added the basics of online and lastSeen to the regions table,
specifically for OSGrid's use in tracking which regions have logged in
recently and which are AWOL for an extended period.
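The bookkeeping is simple enough to sketch (Python; the thresholds are
made up - only the online/lastSeen idea comes from the patch):

```python
import time

# Illustrative heartbeat bookkeeping along the lines of the
# online/lastSeen columns. Thresholds are invented for the example.
OFFLINE_AFTER = 60           # seconds without a heartbeat -> offline
PURGE_AFTER = 30 * 86400     # a month AWOL -> candidate for purging

def classify(last_seen, now=None):
    """Map a region's lastSeen timestamp to its grid status."""
    now = now if now is not None else time.time()
    age = now - last_seen
    if age > PURGE_AFTER:
        return "purge"
    if age > OFFLINE_AFTER:
        return "offline"
    return "online"
```

Grid services would skip "offline" regions entirely and flag "purge"
rows for operator review or automatic deletion.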

>  
>         
>         Logic is implied that after some configurable long period
>         without a heartbeat the region could be made available for
>         purging from the tables completely, either automatically or
>         with operator intervention. The benefits in bandwidth and
>         cycles recovered from retry requests and threads locked while
>         waiting on unresponsive regions would far exceed the
>         relatively minor hit taken in producing and servicing a
>         heartbeat. 
> 
> I did start early experiments with using the DHT as a "grid service" -
> I've installed a couple of Bamboo DHT instances. Although obviously
> this drags in a few other issues (like the question of "trust" between
> the operators of instances of the DHT, etc.). Maybe having DNS records
> that are periodically updated could be a better solution. We were
> discussing that kind of stuff with Ahzz.. Again, the time issue :(

I still haven't looked at Bamboo. 8-( My primary concerns are load and
response times, as well as record concurrency between replicated nodes
on updates. DNS is already a beautifully distributed system; however,
it's not so hot on concurrent updates. I was thinking of DNS more for
the "where can I get X?" questioning instead of "asset_server =
http://blah.blah.blah:9003/" type config file entries.
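To illustrate the "where can I get X?" idea: assuming services were
published as key=value pairs in a TXT record (the record format is pure
assumption on my part, and a real client would query DNS rather than
take a string):

```python
# Sketch of service discovery via a DNS TXT record. The key=value
# payload format is an assumption for illustration.
def parse_service_txt(txt_record):
    """Turn a TXT payload like 'asset_server=http://... user_server=...'
    into a dict of service endpoints."""
    services = {}
    for pair in txt_record.split():
        key, _, value = pair.partition("=")
        if value:
            services[key] = value
    return services
```

A region would then resolve its grid's domain at startup instead of
hard-coding endpoint URLs in a config file.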

On the note of the heartbeat I do want to add a way for regions to
report to the grid server that another region was "unresponsive" so that
the grid server can then say to any inquiring regions "oh, 5 of 5 tries
died, hmm, it's dead Jim! Pass the red marker!" and get that bad boy
marked as offline and stop sending it out as a living and breathing
region.

> 
> My personal opinion is that having a "pay-to-join-the-grid" is the
> wrong business model as it forces the competition between the grid
> operators. It should be "pay-to-store-your-identity" or
> "pay-to-store-your-assets-and-other-junk" or "pay-to-store-your-sim" -
> and the architecture should be driven around that, rather than the
> central control point (aka single point of failure:) that the grid
> services really are.

This is where I see things happening as well. I suspect that
pay-to-connect grids are going to be more of a flash in the pan type
event than a real business model in the future. I also think there will
be several orders of magnitude more money in simply hosting the data or
services than in providing a gated colony. The people that will really
win will be the ones that provide a grid that's interconnected with any
other grid as we have today with the world wide web. Cross promotion and
"friend site" forging will continue to drive visitors on top of
uniqueness of content and word of mouth. Cool will always trump paid any
day of the week in my book.

As for why a gated grid will fail... the best comparison that I can
think of is what happened with the BBSes of the '80s. Most started out
as pay-to-use since they absolutely had to cover their costs of phone
lines and computers. They got away with it for a while because everyone
charged, except for the underground boards. Over time the free boards
slowly took over because, well... they were free! And they promoted
each other like mad. I foresee the same thing happening to LL and any
other company that attempts to continue the old paid-board business
model. We can also see this model slowly taking over in the web content
space today. Why people keep trying to run the old models that are time
and again trumped by the free models, I don't know. Maybe it's nostalgia
for the old days of AOL and CompuServe! *grin* They certainly were cool,
until people figured out that they too could create things for others to
see. ;)

> 
> But all of this is a more long-term view, for the short term the most
> important thing is to try to figure out what exactly is the reason for
> the issues - and try to address it.

Agreed 100%!!

> 
> /d
> _______________________________________________
> Opensim-dev mailing list
> Opensim-dev at lists.berlios.de
> https://lists.berlios.de/mailman/listinfo/opensim-dev
