[Opensim-dev] Concerning recent grid instability and inconsistency

Dalien Talbot dalienta at gmail.com
Fri Feb 29 00:27:31 UTC 2008


Hiro,

Excellent post.
Comments inline.

On Wed, Feb 27, 2008 at 6:59 PM, James Stallings II <
james.stallings at gmail.com> wrote:

>
> Early this morning there was (yet another) instance where certain regions
> could be readily accessed by some and could not be accessed at all by
> others.
>
> This led to an insightful discussion on IRC this morning, which I will
> attach for your convenience.
>
> Here are some high points:
>
> * Various users were able to access WP reliably last night
> while others were completely unable to do so. The consistent relevant factor
> was a high ping time from the affected clients to the region server in
> question. A properly executed traceroute revealed that the network fault lay
> with level3 networks, between the affected clients and the region server in
> question.
>
> * Recently I was graced with the company of many new
> neighbors on the grid. With the new neighbors came complete instability with
> respect to my region server. Some of the instabilities were my own doing.
> Fixing them did not completely address the problem. A subsequent relocation
> to a neighbor-free location on the grid did.
>

I wonder if it was inter-region comms, region-grid comms, or region-client
comms that were the issue - do we know by chance which of them was the
culprit? Or do we know the sequence of events that would help to pinpoint
this?


>
> * One of our newer contributors has audited much of the
> low-level packet handling code and found it to be in want of some
> rearchitecting in the interest of efficiency. This is a salient point; if
> the communications code upon which the grid relies for region<>region and
> client<>region communications is not optimum, we cannot count on operations
> to be optimum. This also has profound implications for troubleshooting, as
> all diagnostic data arrives via the conduits implemented by this low-level
> comms code.
>


I thought region<->region comms did not work by means of packet handling, and
that packet handling was used primarily for region<->client comms - do we have
any more specifics?


>
> * Adjacent misconfigured regions have been demonstrated to
> have a significant impact on a given region's stability. Something needs to
> be done to address this.
>

Again, the first thing to do imho is to find out what the trigger is.

(I will save my usual rant that the "centralized" model is not scalable in a
"large" (as in "millions of regions") deployment model - we did have some
discussion with Ahzz on this, but have not had the time yet to make
significant progress... The only thing I have done so far is import some
BSD-licensed DNS resolver code - we were agreeing that (ab)using TXT
records in DNS is certainly a good way to go towards a more scalable system.)
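
Just to make the TXT-record idea a bit more concrete, here is a rough Python
sketch (using the dnspython package; the zone name and the host=/port= field
layout are purely my own invention for illustration):

# Rough sketch: resolving a region's contact info from a DNS TXT record.
# Assumes a (hypothetical) naming scheme like <x>-<y>.regions.osgrid.example
# with a TXT payload of the form "host=... port=... uuid=...".
import dns.resolver  # pip install dnspython

def lookup_region(x, y, zone="regions.osgrid.example"):
    name = "%d-%d.%s" % (x, y, zone)
    answers = dns.resolver.resolve(name, "TXT")  # dns.resolver.query() on older dnspython
    for rdata in answers:
        txt = b"".join(rdata.strings).decode()
        return dict(field.split("=", 1) for field in txt.split())
    return None

# e.g. lookup_region(1000, 1000) -> {"host": "...", "port": "9000", "uuid": "..."}

The attraction for me is that caching, delegation and redundancy would then
come for free from the existing DNS infrastructure, instead of one central
grid server sitting in every lookup path.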


> I will present some talking points in terms of these roles and their
> goals.
>
> The set of roles and goals of the OSGrid may be summarized as a brief mission
> statement which I characterise as follows:
>
> OSGrid exists as a multirole support operation supporting and conducting
> testing pursuant to the development of OpenSim virtual reality/telepresence
> software. Further, it exists to provide a free parking service for the
> development and testing community, both for pure testing and content
> hosting. Finally, OSGrid exists to develop and drive development of
> grid-related tools and/or code, whether as freestanding utilities or source
> for contribution to the codebase of OpenSim.
>
>
>
> Pursuant to fulfillment of these roles, a course of action needs to be
> devised which facilitates all of these objectives simultaneously. Here is a
> plan which seeks to accomplish this. Note that some key points are short
> term; others, less so:
>
> * Packet handling technology critique. Involve directly
> those who have worked on this code in the past with those who are interested
> in carrying the work forward. Get either a commitment from its original
> authors to return to active development, or alternatively an informal
> hand-off of the work to this group. Then perform a systematic audit on this
> code, as a group. Document it as we go. Explaining how code works to a piece
> of paper can show up just as many flaws as a trial run. There's the side
> benefit of generating thorough documentation.
>


Again - packet handling, afaik, pertains largely to client<->region
communications - please correct me if I am wrong. Indeed this should become
the area of focus if we determine it is the bottleneck, but I would think the
troubles lie in the region<->region and region<->grid server comms, which
are done by other means.

>
> One sequence of events currently in progress with respect to this is that
> several of our commercial contributors are about to introduce a major set of
> stability and load balancing patches. This set of patches might well replace
> or repair the affected code, so this may in fact become a non-issue.
>
> This does not mean we can afford to sit and wait for them to make the
> drop. What we can take immediate action on is to prepare to profile the
> performance impact in quantitative ways once this code has been dropped. In
> this interest, ChrisD will be placing some regions alongside mine out in the
> open; our intention is to maintain a separate 'island', where we can
> exercise fine-grained control over who our neighbors are (and more
> importantly, fine-grained control of the neighbor configuration) for testing
> purposes.
>

Yes - so, pertaining to my comments above, narrowing down what exactly
triggers the slowness is the key.
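
Not OpenSim-specific, but to illustrate what I mean by narrowing it down: even
a crude probe like the Python sketch below, run from the region host against
the grid server URL and a neighbour region URL (both placeholders here, not
the real endpoints), would already tell us which leg is actually slow.

# Crude latency probe: time plain HTTP round trips to the grid server and to a
# neighbouring region, to see which leg is slow. The URLs are placeholders.
import time
import urllib.request

ENDPOINTS = {
    "grid server": "http://gridserver.example:8001/",
    "neighbour region": "http://neighbour-region.example:9000/",
}

def probe(url, attempts=5):
    times = []
    for _ in range(attempts):
        start = time.time()
        try:
            urllib.request.urlopen(url, timeout=10).read()
        except Exception as exc:
            print("  error talking to %s: %s" % (url, exc))
            continue
        times.append(time.time() - start)
    return times

for label, url in ENDPOINTS.items():
    results = probe(url)
    if results:
        print("%s: min %.0f ms / avg %.0f ms over %d tries" % (
            label, min(results) * 1000,
            sum(results) / len(results) * 1000, len(results)))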


>
> We also hope to be able to prove our theory about brokenly configured
> adjacent regions. If you bring up a region adjacent to us, we will likely
> ask you to move it or remove it ourselves if we are unable to reach you.
>
> * Set up an automated vetting process for incoming regions
> on the grid. Say, submit region xml files via HTTP/POST, and have them
> programmatically checked for syntactical correctness, geographical
> collisions, network reachability, port configuration consistency,
> completeness of information, etc. before allowing a final connection with
> grid services.
>

+1. Once we understand what kinds of misconfigurations are causing the issues,
maybe we should go even further and address their root causes, rather than
just checking the configuration.
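
To make the vetting idea concrete, here is a very rough Python sketch of what
such a check could look like (the XML field names and the existing-regions
lookup are just my guesses for illustration, not the actual OpenSim region
schema or grid API):

# Sketch of an automated vetting pass over a submitted region definition.
import socket
import xml.etree.ElementTree as ET

def vet_region(xml_text, existing_regions):
    errors = []

    # 1. syntactical correctness
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return ["region XML does not parse: %s" % exc]

    # 2. completeness of information
    wanted = ("name", "location_x", "location_y", "external_host", "internal_port")
    fields = {f: root.findtext(f) for f in wanted}
    missing = [f for f in wanted if not fields[f]]
    if missing:
        return ["missing fields: %s" % ", ".join(missing)]
    try:
        loc = (int(fields["location_x"]), int(fields["location_y"]))
        port = int(fields["internal_port"])
    except ValueError:
        return ["location_x/location_y/internal_port must be integers"]

    # 3. geographical collisions
    if loc in existing_regions:
        errors.append("location %s already taken by %s" % (loc, existing_regions[loc]))

    # 4. network reachability / port configuration consistency
    try:
        sock = socket.create_connection((fields["external_host"], port), timeout=5)
        sock.close()
    except OSError as exc:
        errors.append("cannot reach %s:%d (%s)" % (fields["external_host"], port, exc))

    return errors   # an empty list means the region may be connected to grid services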


>
> This also introduces a logical point for a payment collection/verification
> process for that day when we have a salable service, for those who are
> interested in providing for-pay grid services at that time.
>

This would create a roadblock. The payment should still be a hook outside
of the "automated" process (like providing a crypto token of sorts, which
would be checked at the time of the region coming up).
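
For example, something along these lines (a Python sketch; the token format
and the secret handling are entirely made up, just to show the shape of such a
hook):

# Illustrative only: an HMAC-based "join token" that could be issued after
# payment (out of band) and verified when the region comes up.
import hashlib
import hmac

GRID_SECRET = b"keep-this-off-the-wire"   # known only to the grid operator

def issue_token(region_uuid):
    mac = hmac.new(GRID_SECRET, region_uuid.encode(), hashlib.sha256).hexdigest()
    return "%s:%s" % (region_uuid, mac)

def verify_token(token):
    region_uuid, _, mac = token.partition(":")
    expected = hmac.new(GRID_SECRET, region_uuid.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(mac, expected)

The token could be handed out after the payment step, completely outside the
automated path, and the vetting pass would only need to call verify_token().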


>
> * Establish an effective heartbeat mechanism between region
> servers and the grid server. If the heartbeat goes away, the region table entry
> is appropriately updated to reflect the region state, and except for
> geographic purposes and region heartbeat response, the region would be
> ignored by grid services (as if it were not there). If/when the region
> produces new signs of life, it would then be returned to active
> participation with grid services. This could be taken a step further with
> long-term monitoring of the region heartbeat.
>

I think Ahzz had some patch on this in Mantis and in his git repo. Since I am
not much into centralized grid comms, I did not commit it - I think it was
introducing some changes to the SQL tables...
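
To illustrate - a minimal Python sketch of the heartbeat plus staleness
handling (the in-memory table, intervals and purge threshold below are
arbitrary stand-ins for whatever the grid server's SQL tables and config would
actually use); it also covers the purge-after-long-silence idea from the next
paragraph:

# Sketch of the heartbeat plus staleness/purge check on the grid server side.
import time

HEARTBEAT_INTERVAL = 30                   # seconds between region heartbeats
OFFLINE_AFTER = 3 * HEARTBEAT_INTERVAL    # miss three beats -> treated as gone
PURGE_AFTER = 7 * 24 * 3600               # a week of silence -> candidate for purging

regions = {}   # region_uuid -> {"last_seen": timestamp, "state": "online"/"offline"}

def record_heartbeat(region_uuid):
    # called whenever a region checks in; new signs of life bring it back
    entry = regions.setdefault(region_uuid, {})
    entry["last_seen"] = time.time()
    entry["state"] = "online"

def sweep():
    # run periodically on the grid server
    now = time.time()
    for uuid, entry in list(regions.items()):
        silence = now - entry["last_seen"]
        if silence > PURGE_AFTER:
            del regions[uuid]            # or just flag it for operator review
        elif silence > OFFLINE_AFTER:
            entry["state"] = "offline"   # ignored by grid services until it returns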


>
> Logic is implied that after some configurable long period without a
> heartbeat the region could be made available for purging from the tables
> completely, either automatically or with operator intervention. The benefits
> in bandwidth and cycles recovered from retry requests and threads locked
> while waiting on unresponsive regions would far exceed the relatively minor
> hit taken in producing and servicing a heartbeat.


I did start early experiments with using a DHT as a "grid service" - I've
installed a couple of Bamboo DHT instances. Obviously this drags in a few
other issues (like the question of "trust" between the operators of the DHT
instances, etc.). Maybe having DNS records that are periodically updated
could be a better solution. We were discussing that kind of stuff with
Ahzz... Again, the time issue :(
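
For the "periodically updated DNS records" variant, the update side could be
as simple as a TSIG-signed RFC 2136 dynamic update pushed from a timer - a
Python/dnspython sketch, with the zone, key material and record layout all
invented for the example:

# Sketch of a periodic dynamic DNS update announcing a region's presence.
import dns.query
import dns.tsigkeyring
import dns.update  # pip install dnspython

keyring = dns.tsigkeyring.from_text({"osgrid-key.": "bWFkZS11cC1iYXNlNjQta2V5"})

def announce_region(x, y, host, port, dns_server="10.0.0.53",
                    zone="regions.osgrid.example"):
    update = dns.update.Update(zone, keyring=keyring)
    # replace the TXT record for this map coordinate; the TTL is kept short so
    # a region that stops refreshing drops out of the lookup quickly
    update.replace("%d-%d" % (x, y), 300, "TXT",
                   '"host=%s port=%d"' % (host, port))
    return dns.query.tcp(update, dns_server, timeout=10)

# call announce_region(...) from a cron job / timer every few minutes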

My personal opinion is that "pay-to-join-the-grid" is the wrong business
model, as it forces competition between the grid operators. It should be
"pay-to-store-your-identity" or
"pay-to-store-your-assets-and-other-junk" or "pay-to-store-your-sim" - and
the architecture should be driven around that, rather than around the central
control point (aka single point of failure :) that the grid services really
are.

But all of this is a more long-term view; for the short term, the most
important thing is to try to figure out what exactly is causing the
issues - and try to address it.

/d