[Opensim-dev] networking issues

Melanie melanie at t-data.com
Mon Mar 28 18:19:30 UTC 2011


Hi,

sounds great.

Some things to consider:

- Some actions require explicitly sending a packet that is technically
an update packet but is used for special cases. Sit, stand, changing
group tags, and creating/joining groups are all cases where special
care needs to be taken.

- Resend is evil for static objects and avatars, but may be needed
to sync up dead reckoning with the real data on physical objects.
Just a feeling.

Melanie

Mic Bowman wrote:
> Over the last several weeks, Dan Lake & I have been looking at some of the
> networking performance issues in opensim. As always, our concerns are with
> the problems caused by very complex scenes with very large numbers of
> avatars. However, I think some of the issues we have found will generally
> improve networking with OpenSim. Since this represents a fairly
> significant change in behavior (though the number of lines of code is not
> great), I'm going to put this into a separate branch for testing (called
> queuetest) in the opensim git repository.
> 
> We've found several problems with the current
> networking/prioritization code.
> 
> * Reprioritization is completely broken for SceneObjectParts. On
> reprioritization, the current code looks up the entity by localid in the scene
> Entities list, but since the scene does not store SOPs by localid, that
> lookup always fails. So the original priority of the SOP continues to be
> used. This could be the cause of some problems since the initial
> prioritization assumes position 128,128. I don't understand all the possible
> ramifications, but suffice it to say, using the localid is causing
> problems.
> 
> Fix: the scene entity is already stored in the update, so just use that
> instead of the localid.
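> 
> A minimal standalone sketch of that fix (the types below are simplified
> stand-ins, not the actual OpenSim classes): the queued update carries a
> reference to the scene entity itself, so reprioritization reads the current
> state directly instead of resolving a localid through the scene.
> 
>     using System.Numerics;
> 
>     // Simplified stand-in for the scene entity the update refers to.
>     public interface ISceneEntity
>     {
>         uint LocalId { get; }
>         Vector3 AbsolutePosition { get; }
>     }
> 
>     public class EntityUpdate
>     {
>         public ISceneEntity Entity { get; }   // stored when the update is enqueued
>         public uint UpdateFlags { get; }
> 
>         public EntityUpdate(ISceneEntity entity, uint flags)
>         {
>             Entity = entity;
>             UpdateFlags = flags;
>         }
>     }
> 
>     public static class Prioritizer
>     {
>         // Recompute priority straight from the stored entity; no scene lookup,
>         // so it works for SceneObjectParts as well as scene presences.
>         public static double GetPriority(EntityUpdate update, Vector3 cameraPosition)
>         {
>             return Vector3.Distance(cameraPosition, update.Entity.AbsolutePosition);
>         }
>     }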
> 
> * We currently pull (by default) 100 entity updates from the entityupdate
> queue and convert them into packets. Once converted into packets, they are
> then queued again for transmission. This is a bad thing. Under any kind of
> load, we've measured the time in the packet queue to be up to many
> hundreds/thousands of milliseconds (and to be highly variable). When an
> object changes one property and then doesn't change it again, the time in
> the packet queue is largely irrelevant. However, if the object is
> continuously changing (an avatar changing position, a physical object
> moving, etc.) then the conversion from an entity update to a packet "freezes"
> the properties to be sent. If the object is continuously changing, then with
> fairly high probability, the packet contains old data (the properties of the
> entity from the point at which it was converted into a packet).
> 
> The real problem is that, in theory, to improve the efficiency of the
> packets (fill up each message) we are grabbing big chunks of updates. Under
> load, that causes queuing at the packet layer which makes updates stale.
> That is... queuing at the packet layer is BAD.
> 
> Fix: We implemented an adaptive algorithm for the number of updates to grab
> with each pass. We set a target time of 200ms for each iteration. That
> means, we are trying to bound the maximum age of any update in the packet
> queue to 200ms. The adaptive algorithm looks a lot like a TCP slow start:
> every time we complete an iteration (flush the packet queue) in less than
> 200ms we linearly increase the number of updates we take in the next
> iteration (add 5 to the count), and when we don't make it back in 200ms, we
> drop the number we take multiplicatively (cut the number in half). In our
> experiments with large numbers of moving avatars, this algorithm works
> *very* well. The number of updates taken per iteration stabilizes very
> quickly and the response time is dramatically improved (no "snap back" on
> avatars, for example). One difference from the traditional slow start...
> since the number of "static" items in the queue is very high when a client
> first enters a region, we start with the number of updates taken at 500.
> That gets the static items out of the queue quickly (and delay doesn't
> matter as much), and the number taken is generally stable before the
> login/teleport screen even goes away.
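> 
> A minimal sketch of what the adaptive take-count could look like, assuming a
> 200ms target per iteration (the class and member names here are illustrative,
> not the ones in the branch):
> 
>     using System;
>     using System.Diagnostics;
> 
>     public class AdaptiveUpdateTaker
>     {
>         private const int TargetMs = 200;      // bound on update age in the packet queue
>         private const int InitialTake = 500;   // large first batch to flush static objects
>         private const int LinearStep = 5;      // additive increase when we finish in time
> 
>         private int m_takeCount = InitialTake;
>         private readonly Stopwatch m_timer = new Stopwatch();
> 
>         // Call at the start of an iteration; returns how many updates to dequeue.
>         public int BeginIteration()
>         {
>             m_timer.Restart();
>             return m_takeCount;
>         }
> 
>         // Call once the packet queue for this batch has drained.
>         public void EndIteration()
>         {
>             m_timer.Stop();
>             if (m_timer.ElapsedMilliseconds < TargetMs)
>                 m_takeCount += LinearStep;                   // finished in time: grow linearly
>             else
>                 m_takeCount = Math.Max(1, m_takeCount / 2);  // too slow: cut the batch in half
>         }
>     }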
> 
> * The current prioritization queue can lead to update starvation. The
> prioritization algorithm dumps all entity updates into a single ordered
> queue. Let's say you have several hundred avatars moving around in a scene.
> Since we take a limited number of updates from the queue in each iteration,
> we will take only the updates for the "closest" (highest priority) avatars.
> However, since those avatars continue to move, they are re-inserted into the
> priority queue *ahead* of the updates that were already there. So... unless
> the queue can be completely emptied each iteration or the priority of the
> "distant" (low priority) avatars changes, those avatars will never be
> updated.
> 
> Fix: We converted the single priority queue into multiple priority queues
> and use fair queuing to retrieve updates from each. Here's how it works
> (more or less)... the current metrics (all of the current prioritization
> algorithms use distance at some point for prioritization) compute a distance
> from the avatar/camera to an object. We take the log of that distance and
> use that as the index for the queue where we place the update. So close
> things go into the highest priority queue and distant things go into the
> lowest priority queue. Since the area covered by a priority queue grows as
> the square of the radius, the distant (lowest priority queues) will have the
> most objects while the highest priority queues will have a small number of
> objects. Inside each priority queue, we order the updates by the time in
> which they entered the queue. Then we pull a fixed number of updates from
> each priority queue each iteration. The result is that local updates get a
> high fraction of the outgoing bandwidth but distant updates are guaranteed
> to get at least "some" of the bandwidth. No starvation. The current
> prioritization algorithm we implemented is a modification of the "best
> avatar responsiveness" and "front back": we use the root prim location
> for child prims, and the priority of updates "in back" of the avatar is lower
> than updates "in front". Our experiments show that the fair queuing does
> drain the update queue AND continues to provide a disproportionately high
> percentage of the bw to "close" updates.
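> 
> A rough sketch of the fair-queuing structure, assuming the bucket count,
> per-bucket quota, and log base 2 used below (all invented for illustration):
> updates are bucketed by the log of their distance from the camera, and each
> bucket is drained FIFO, a fixed number per pass.
> 
>     using System;
>     using System.Collections.Generic;
>     using System.Numerics;
> 
>     public class FairPriorityQueue<T>
>     {
>         private const int NumQueues = 12;      // assumed number of buckets
>         private const int TakePerQueue = 10;   // assumed per-bucket quota per pass
> 
>         private readonly Queue<T>[] m_queues;
> 
>         public FairPriorityQueue()
>         {
>             m_queues = new Queue<T>[NumQueues];
>             for (int i = 0; i < NumQueues; i++)
>                 m_queues[i] = new Queue<T>();
>         }
> 
>         // Close objects land in low-index (high priority) buckets; distant
>         // objects land in high-index buckets that cover far more area.
>         private static int QueueIndex(float distance)
>         {
>             int index = (int)Math.Log(Math.Max(distance, 1.0), 2.0);
>             return Math.Min(index, NumQueues - 1);
>         }
> 
>         public void Enqueue(T update, Vector3 cameraPos, Vector3 entityPos)
>         {
>             float distance = Vector3.Distance(cameraPos, entityPos);
>             m_queues[QueueIndex(distance)].Enqueue(update);   // FIFO within a bucket
>         }
> 
>         // Pull up to TakePerQueue items from every bucket, so distant updates
>         // always get some share of the bandwidth and are never starved.
>         public IEnumerable<T> Dequeue()
>         {
>             foreach (Queue<T> q in m_queues)
>                 for (int n = 0; n < TakePerQueue && q.Count > 0; n++)
>                     yield return q.Dequeue();
>         }
>     }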
> 
> One other note on this... we should be able to improve the performance of
> reprioritization with this approach. If we know the distance an avatar has
> moved, we only have to reprioritize objects that might have changed priority
> queues. Haven't implemented this yet but have some ideas for how to do it.
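> 
> One way that shortcut could look, sticking with the power-of-two bucket
> boundaries assumed in the sketch above (not implemented, names invented):
> only re-bucket objects whose distance could have crossed a boundary given
> how far the camera moved.
> 
>     using System;
> 
>     public static class ReprioritizationFilter
>     {
>         // True if moving the camera by cameraDelta metres could push this
>         // object's distance past either edge of its current bucket.
>         public static bool MightChangeBucket(float oldDistance, float cameraDelta)
>         {
>             double log = Math.Floor(Math.Log(Math.Max(oldDistance, 1.0), 2.0));
>             float lower = (float)Math.Pow(2.0, log);   // near edge of the bucket
>             float upper = lower * 2.0f;                // far edge of the bucket
>             return oldDistance - cameraDelta < lower || oldDistance + cameraDelta >= upper;
>         }
>     }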
> 
> * The resend queue is evil. When an update packet is sent (they are marked
> reliable) it is moved to a queue to await acknowledgement. If no
> acknowledgement is received (in time), the packet is retransmitted and the
> wait time is doubled, and so on... What that means is that resent packets
> in a rapidly changing scene will often contain updates that are
> outdated. That is, when we resend the packet, we are just resending old data
> (and if you're having a lot of resends that means you already have a bad
> connection & now you're filling it up with useless data).
> 
> Fix: this isn't implemented yet (help would be appreciated)... we think that
> instead of saving packets for resend... a better solution would be to keep
> the entity updates that went into the packet. If we don't receive an ack in
> time, then put the entity updates back into the entity update queue (with
> the entry time from their original enqueuing). That would ensure that we send an
> update for the object & that the data sent is the most recent.
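> 
> A sketch of how that could be structured (again, not implemented and the
> names are made up): each unacked packet keeps the entity updates it was built
> from, each with its original enqueue time, and the ack-timeout handler pushes
> them back into the entity update queue instead of retransmitting bytes.
> 
>     using System;
>     using System.Collections.Generic;
> 
>     public class UnackedPacket<TUpdate>
>     {
>         public uint SequenceNumber;
> 
>         // The updates folded into this packet, with their original enqueue times.
>         public readonly List<(TUpdate Update, DateTime EnqueuedAt)> SourceUpdates =
>             new List<(TUpdate, DateTime)>();
>     }
> 
>     public static class ResendHandler
>     {
>         // On ack timeout, don't retransmit stale data; requeue the source
>         // updates with their original timestamps so the next pass serializes
>         // the entity's current properties.
>         public static void OnAckTimeout<TUpdate>(UnackedPacket<TUpdate> packet,
>             Action<TUpdate, DateTime> requeue)
>         {
>             foreach (var (update, enqueuedAt) in packet.SourceUpdates)
>                 requeue(update, enqueuedAt);
>         }
>     }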
> 
> * One final note... per-client bandwidth throttles seem to work very well.
> However, our experiments with per-simulator throttles were not positive. It
> appeared that a small number of clients was consuming all of the bandwidth
> available to the simulator and the rest were starved. Haven't looked into
> this any more.
> 
> 
> So...
> 
> Feedback appreciated... there is some logging code (disabled) in the branch;
> real data would be great. And help with testing: there are a number of
> attachment operations, deletes, and so on that I'm not sure work correctly.
> 
> --mic