<br clear="all">Early this morning there was (yet another) instance where certain

regions could be readily accessed by some and could not be accessed at

all by others.<br><br>This led to an insightful discussion on IRC, which I will attach for your convenience.<br><br>Here are some high points:<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_arrow.gif" alt=":arrow:" title="Arrow">

Various users were able to access WP reliably last night while others

were completely unable to do so. The consistent relevant factor was a

high ping time from the affected clients to the region server in

question. A properly executed traceroute revealed that the network

fault lay in Level 3's network, between the affected clients and the

region server in question.<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_arrow.gif" alt=":arrow:" title="Arrow">

Recently I was graced with the company of many new neighbors on the

grid. With the new neighbors came complete instability with respect to

my region server. Some of the instabilities were my own doing. Fixing

them did not completely address the problem. A subsequent relocation to

a neighbor-free location on the grid did.<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_arrow.gif" alt=":arrow:" title="Arrow">

One of our newer contributors has audited much of the low-level packet

handling code and found it to be in want of some rearchitecting in the

interest of efficiency. This is a salient point; if the communications

code upon which the grid relies for region<>region and

client<>region communications is not optimum, we cannot count on

operations to be optimum. This also has profound implications for

troubleshooting, as all diagnostic data arrives via the conduits

implemented by this low-level comms code.<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_arrow.gif" alt=":arrow:" title="Arrow">

Adjacent misconfigured regions have been demonstrated to have a

significant impact on a given region's stability. Something needs to be

done to address this.<br><br>I will present some talking points in terms of OSGrid's roles and goals.<br><br>The set of roles and goals of OSGrid may be summarized as a brief <span style="font-style: italic;">mission statement</span>, which I characterise as follows:<br>

<br><blockquote class="uncited"><div>OSGrid

exists as a multirole support operation supporting and conducting

testing pursuant to the development of OpenSim virtual

reality/telepresence software. Further, it exists to provide a free

parking service for the development and testing community, both for

pure testing and content hosting. Finally, OSGrid exists to develop and

drive development of grid-related tools and/or code, whether as

freestanding utilities or source for contribution to the codebase of

OpenSim.</div></blockquote><br><br>Pursuant to fulfillment of these

roles, a course of action needs to be devised which facilitates all of

these objectives simultaneously. Here is a plan which seeks to

accomplish this. Note that some key points are short term; others, less

so:<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_idea.gif" alt=":idea:" title="Idea">

Packet handling technology critique. Involve directly those who have

worked on this code in the past with those who are interested in

carrying the work forward. Get either a commitment from its original

authors to return to active development, or alternatively an informal

hand-off of the work to this group. Then perform a systematic audit on

this code, as a group. Document it as we go. Explaining how code works

to a piece of paper can show up just as many flaws as a trial run.

There's the side benefit of generating thorough documentation.<br><br>One

sequence of events currently in progress with respect to this is that

several of our commercial contributors are about to introduce a major

set of stability and load balancing patches. This set of patches might

well replace or repair the affected code, so this may in fact become a

non-issue.<br><br>This does not mean we can afford to sit and wait for

them to make the drop. What we can do immediately is

prepare to profile the performance impact in quantitative ways once

this code has been dropped. In this interest, ChrisD will be placing

some regions alongside mine out in the open; our intention is to

maintain a separate 'island', where we can exercise fine-grained

control over who our neighbors are (and more importantly, fine-grained

control of the neighbor configuration) for testing purposes. <br><br>We

also hope to be able to prove our theory about misconfigured

adjacent regions. If you bring up a region adjacent to us, we will

likely ask you to move it, or remove it ourselves if we are unable to

reach you.<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_idea.gif" alt=":idea:" title="Idea">

Set up an automated vetting process for incoming regions on the grid.

Say, submit region XML files via HTTP POST, and have them

programmatically checked for syntactic correctness, geographical

collisions, network reachability, port configuration consistency,

completeness of information, etc. before allowing a final connection

with grid services. <br><br>This also introduces a logical point for a

payment collection/verification process for that day when we have a

salable service, for those who are interested in providing for-pay grid

services at that time.<br><br> <img src="http://osgrid.org/forums/images/smilies/icon_idea.gif" alt=":idea:" title="Idea">

Establish an effective heartbeat mechanism between region servers and

grid server. If the heartbeat goes away, the region table entry would be

appropriately updated to reflect the region state, and except for

geographic purposes and region heartbeat response, the region would be

ignored by grid services (as if it were not there). If/when the region

produces new signs of life, it would then be returned to active

participation with grid services. This could be taken a step further

with long-term monitoring of the region heartbeat. <br><br>This implies

logic whereby, after some configurable long period without a heartbeat,

the region could be made available for purging from the tables

completely, either automatically or with operator intervention. The

benefits in bandwidth and cycles recovered from retry requests and

threads locked while waiting on unresponsive regions would far exceed

the relatively minor hit taken in producing and servicing a heartbeat.<br><br><br>The

balance of the material posted here is offered to the group in the

interest of stimulating conversation about these observations. None of

this is authoritative, except in the sense that it represents my

observations and what I think can be done about them. Others and I

are eager to hear from you on these matters to the extent you are

willing and able to participate in resolving them. Note that I don't

claim to represent anyone else in this post, but I suspect that my

views are somewhat parallel with at least some in the community.<br><br>Please jump in and let us know what you think  <img src="http://osgrid.org/forums/images/smilies/icon_cool.gif" alt="8-)" title="Cool"><br><br>Cheers!<br>

Hiro Protagonist<br>-- <br>===================================<br>The wind<br>scours the earth for prayers<br>The night obscures them
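<br><br>To give the heartbeat proposal above something concrete to argue over, here is a minimal Python sketch of the bookkeeping it implies. Everything in it is an assumption made for illustration: the names <code>HeartbeatTable</code>, <code>RegionState</code> and <code>RegionEntry</code>, the in-memory dictionary standing in for the region table, and the timeout values. None of it reflects actual OpenSim or OSGrid code.

```python
import time
from dataclasses import dataclass
from enum import Enum


class RegionState(Enum):
    ACTIVE = "active"        # heartbeat seen recently; served normally
    IGNORED = "ignored"      # heartbeat missed; kept only for the map and for revival
    PURGEABLE = "purgeable"  # silent for the long period; candidate for removal


@dataclass
class RegionEntry:
    last_heartbeat: float
    state: RegionState = RegionState.ACTIVE


class HeartbeatTable:
    """Hypothetical stand-in for the grid server's region table."""

    def __init__(self, timeout=30.0, purge_after=7 * 24 * 3600.0,
                 clock=time.monotonic):
        self.timeout = timeout          # seconds of silence before a region is ignored
        self.purge_after = purge_after  # seconds of silence before it may be purged
        self.clock = clock              # injectable so the logic can be tested
        self.regions = {}

    def heartbeat(self, name):
        """Record a sign of life; a silent region returns to active service."""
        self.regions[name] = RegionEntry(last_heartbeat=self.clock())

    def sweep(self):
        """Re-evaluate each region's state from the age of its last heartbeat."""
        now = self.clock()
        for entry in self.regions.values():
            age = now - entry.last_heartbeat
            if age >= self.purge_after:
                entry.state = RegionState.PURGEABLE
            elif age >= self.timeout:
                entry.state = RegionState.IGNORED
            else:
                entry.state = RegionState.ACTIVE

    def routable(self):
        """Regions that grid services should still send requests to."""
        return sorted(name for name, entry in self.regions.items()
                      if entry.state is RegionState.ACTIVE)

    def purge_candidates(self):
        """Regions an operator (or a cron job) could remove from the table."""
        return sorted(name for name, entry in self.regions.items()
                      if entry.state is RegionState.PURGEABLE)
```

The grid server would call <code>heartbeat()</code> whenever a region checks in and <code>sweep()</code> on a timer. Whether purging then happens automatically or waits for an operator is left open here, as it is in the proposal.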