[Opensim-dev] networking issues
Teravus Ovares
teravus at gmail.com
Mon Mar 28 17:48:08 UTC 2011
Here are a few facts that I've personally discovered while working
with LLClientView.
1. It has been noted that people with poor connections to the
simulator consume more bandwidth and CPU, and have a generally worse
experience. This has been tested and profiled extensively. It may
seem like a small issue because what's happening is so basic...
however, the frequency with which it occurs makes it a real cause of
performance issues.
2. It's also noted that the CPU consumed in these cases reduces the
CPU available to the rest of the simulator, resulting in a lower
quality of service for everyone else on the simulator.
This has been seen in profiling and observed qualitatively: a large
number of users are connected and everything is OK, then a 'problem
connection' user connects and causes a wide range of issues.
3. It's also noted that lowering the outgoing UDP packet throttles
beyond a certain point results in perpetual queuing and resends.
This was tested last year using a throttle multiplier implemented by
justincc; I'm not sure if the multiplier is still there. It's most
easily seen with image packets. Again, I note that packets are not
rebuilt when they move from the regular outbound queue to the resend
queue. The resend queue is /supposed/ to be used to quickly get data
that is essential to the client after one send attempt has already
been made. The LLUDP spec declares the maximum number of resends to
be 2; however, there has been considerable debate on whether or not
OpenSimulator should follow that specific specification item, leading
to a configuration option, implemented by Melanie, that enables
perpetual resends (see the sketch after this list). The
configuration item was named something like 'reliable is important'.
I'm not sure if it survived the many revisions, however I suspect
that it did.
4. It's also noted that raising the packet throttles beyond what the
connection can support results in resending almost every packet the
maximum number of times before the limit is reached.
This is easily reproducible by setting the connection speed (in the
client) to the maximum and connecting, over a subpar connection, to a
region that you've never been to before. Before the client adjusts
and requests a lower throttle setting, there's massive data loss and
massive re-queuing.
5. The client tries to adjust the throttle settings based on network
conditions. This can be observed by monitoring the packet that sets
the throttles while dragging the bandwidth bar to maximum. After a
certain number of resends, the client will send the set-throttle
packet again with reduced settings (some argue that it doesn't do
this fast enough).
6. A user who has connected to the simulator previously will use
fewer resources than a user who has never connected (mostly because
of the image cache on the client). Any client that fetches images
via CAPS will use fewer resources than one that uses LLUDP.
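
To make point 3 concrete, here's a minimal sketch of the resend-cap
decision involved. The class, the flag name, and the shape of the
check are illustrative only, not the actual OpenSim code:

    // Illustrative sketch only, not the actual LLUDPServer code.
    // Either stop after a fixed number of resends, or resend forever
    // when a 'reliable is important' style option is enabled.
    class OutgoingPacket
    {
        public int ResendCount;   // times this packet has been resent
    }

    static class ResendPolicy
    {
        const int MaxResends = 2;                       // cap from the spec
        public static bool ReliableIsImportant = false; // hypothetical flag

        public static bool ShouldResend(OutgoingPacket p)
        {
            // Option on: resend perpetually.  Option off: honor the cap.
            return ReliableIsImportant || p.ResendCount < MaxResends;
        }
    }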
When working with the packet queues, it's essential to understand
those six observations. Even though the place where you tend to see
queuing issues is the image queue over LLUDP, the principles apply to
all of the UDP queues.
Regards
Teravus
On Mon, Mar 28, 2011 at 1:00 PM, Mic Bowman <cmickeyb at gmail.com> wrote:
> Over the last several weeks, Dan Lake and I have been looking at some of the
> networking performance issues in OpenSim. As always, our concerns are with
> the problems caused by very complex scenes with very large numbers of
> avatars. However, I think some of the issues we have found will improve
> networking for OpenSim in general. Since this represents a fairly
> significant change in behavior (though the number of lines of code is not
> great), I'm going to put it into a separate branch for testing (called
> queuetest) in the opensim git repository.
> We've found several problems with the current
> networking/prioritization code.
> * Reprioritization is completely broken for SceneObjectParts. On
> reprioritization, the current code uses the localid stored in the scene
> Entities list, but since the scene does not store the localid for SOPs, that
> lookup always fails. So the original priority of the SOP continues to be
> used. This could be the cause of some problems, since the initial
> prioritization assumes position 128,128. I don't understand all the possible
> ramifications, but suffice it to say, using the localid is causing
> problems.
> Fix: the scene entity is already stored in the update; just use that instead
> of the localid.
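> Roughly what the fix looks like, simplified (the types and names are
> approximate, not the exact OpenSim API):
>
>     // Simplified sketch of the fix; types and names are approximate.
>     interface ISceneEntity { }        // stand-in for the scene entity type
>
>     class EntityUpdate
>     {
>         public ISceneEntity Entity;   // the entity is already stored here
>     }
>
>     class Prioritizer
>     {
>         public double Reprioritize(EntityUpdate update)
>         {
>             // OLD (broken for SOPs): looking the entity up by localid in
>             // the scene Entities list fails, because the scene doesn't
>             // index SOPs by localid, so the stale priority is kept.
>             // NEW: use the entity reference the update already carries.
>             return ComputePriority(update.Entity);
>         }
>
>         double ComputePriority(ISceneEntity entity)
>         {
>             return 0.0;               // distance-based metric goes here
>         }
>     }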
> * We currently pull (by default) 100 entity updates from the entity update
> queue and convert them into packets. Once converted into packets, they are
> then queued again for transmission. This is a bad thing. Under any kind of
> load, we've measured the time in the packet queue to be up to many hundreds
> or thousands of milliseconds (and to be highly variable). When an object
> changes one property and then doesn't change it again, the time in the
> packet queue is largely irrelevant. However, if the object is continuously
> changing (an avatar changing position, a physical object moving, etc.),
> then the conversion from an entity update to a packet "freezes" the
> properties to be sent. If the object is continuously changing, then with
> fairly high probability the packet contains old data (the properties of the
> entity from the point at which it was converted into a packet).
> The real problem is that, in theory, to improve the efficiency of the
> packets (fill up each message), we grab big chunks of updates. Under load,
> that causes queuing at the packet layer, which makes updates stale.
> That is... queuing at the packet layer is BAD.
> Fix: We implemented an adaptive algorithm for the number of updates to grab
> with each pass. We set a target time of 200ms for each iteration. That
> means we are trying to bound the maximum age of any update in the packet
> queue to 200ms. The adaptive algorithm looks a lot like TCP slow start:
> every time we complete an iteration (flush the packet queue) in less than
> 200ms, we increase the number of updates we take in the next iteration
> linearly (add 5 to the count), and when we don't make it back in 200ms, we
> drop the number we take multiplicatively (cut the number in half). In our
> experiments with large numbers of moving avatars, this algorithm works
> *very* well. The number of updates taken per iteration stabilizes very
> quickly and the response time is dramatically improved (no "snap back" on
> avatars, for example). One difference from the traditional slow start:
> since the number of "static" items in the queue is very high when a client
> first enters a region, we start with the number of updates taken at 500.
> That gets the static items out of the queue quickly (where delay doesn't
> matter as much), and the number taken is generally stable before the
> login/teleport screen even goes away.
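> The adaptation loop looks roughly like this (this is the shape of the
> code, not the literal committed code; the constants are the ones
> described above):
>
>     // Rough sketch of the adaptive take-count: additive increase when
>     // we meet the 200ms target, halve when we miss it.
>     using System;
>     using System.Diagnostics;
>
>     class AdaptiveTaker
>     {
>         int m_updatesToTake = 500;    // start high to flush static items
>         const int TargetMs = 200;     // per-iteration time budget
>         const int Increase = 5;       // linear (additive) increase
>
>         public void SendPass()
>         {
>             Stopwatch timer = Stopwatch.StartNew();
>             DequeueAndSend(m_updatesToTake);  // packetize and flush
>             timer.Stop();
>
>             if (timer.ElapsedMilliseconds <= TargetMs)
>                 m_updatesToTake += Increase;  // made the deadline: add 5
>             else                              // missed it: cut in half
>                 m_updatesToTake = Math.Max(1, m_updatesToTake / 2);
>         }
>
>         void DequeueAndSend(int count)
>         {
>             // pull up to 'count' entity updates, build packets, send
>         }
>     }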
> * The current prioritization queue can lead to update starvation. The
> prioritization algorithm dumps all entity updates into a single ordered
> queue. Let's say you have several hundred avatars moving around in a scene.
> Since we take a limited number of updates from the queue in each iteration,
> we will take only the updates for the "closest" (highest priority) avatars.
> However, since those avatars continue to move, their new updates are
> re-inserted into the priority queue *ahead* of the updates that were
> already there. So... unless the queue can be completely emptied each
> iteration, or the priority of the "distant" (low priority) avatars changes,
> those avatars will never be updated.
> Fix: We converted the single priority queue into multiple priority queues
> and use fair queuing to retrieve updates from each (a sketch follows at the
> end of this item). Here's how it works, more or less: the current metrics
> (all of the current prioritization algorithms use distance at some point)
> compute a distance from the avatar/camera to an object. We take the log of
> that distance and use it as the index of the queue where we place the
> update. So close things go into the highest priority queue and distant
> things go into the lowest priority queue. Since the area covered by a
> priority queue grows as the square of the radius, the distant (lowest
> priority) queues will have the most objects while the highest priority
> queues will have a small number of objects. Inside each priority queue, we
> order the updates by the time at which they entered the queue. Then we pull
> a fixed number of updates from each priority queue each iteration. The
> result is that close updates get a high fraction of the outgoing bandwidth,
> but distant updates are guaranteed to get at least "some" of the bandwidth.
> No starvation. The prioritization algorithm we implemented is a
> modification of "best avatar responsiveness" and "front back", in that we
> use the root prim location for child prims, and the priority of updates "in
> back of" the avatar is lower than that of updates "in front". Our
> experiments show that the fair queuing does drain the update queue AND
> continues to provide a disproportionately high percentage of the bandwidth
> to "close" updates.
> One other note on this... we should be able to improve the performance of
> reprioritization with this approach. If we know the distance an avatar has
> moved, we only have to reprioritize objects that might have changed priority
> queues. Haven't implemented this yet but have some ideas for how to do it.
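> Here's the promised sketch of the bucketing and the fair dequeue
> (illustrative names and queue count, not the committed code):
>
>     // Sketch: log2 of distance picks the queue index, so each doubling
>     // of distance drops one priority level; a pass takes a fixed number
>     // from every queue so distant updates are never starved.
>     using System;
>     using System.Collections.Generic;
>
>     class FairQueues<T>
>     {
>         const int NumQueues = 12;     // illustrative count
>         readonly Queue<T>[] m_queues = new Queue<T>[NumQueues];
>
>         public FairQueues()
>         {
>             for (int i = 0; i < NumQueues; i++)
>                 m_queues[i] = new Queue<T>();
>         }
>
>         public void Enqueue(T update, double distance)
>         {
>             // close -> queue 0; each doubling of distance -> next queue
>             int index = (int)Math.Log(Math.Max(distance, 1.0), 2.0);
>             m_queues[Math.Min(index, NumQueues - 1)].Enqueue(update);
>         }
>
>         public IEnumerable<T> TakeFair(int perQueue)
>         {
>             // FIFO within each queue preserves arrival order
>             foreach (Queue<T> q in m_queues)
>                 for (int i = 0; i < perQueue && q.Count > 0; i++)
>                     yield return q.Dequeue();
>         }
>     }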
> * The resend queue is evil. When an update packet is sent (update packets
> are marked reliable), it is moved to a queue to await acknowledgement. If
> no acknowledgement is received in time, the packet is retransmitted and the
> wait time is doubled, and so on... What that means is that a resent packet
> in a rapidly changing scene will often contain updates that are already
> outdated. That is, when we resend the packet, we are just resending old
> data (and if you're having a lot of resends, that means you already have a
> bad connection, and now you're filling it up with useless data).
> Fix: This isn't implemented yet (help would be appreciated)... we think
> that instead of saving packets for resend, a better solution would be to
> keep the entity updates that went into the packet. If we don't receive an
> ack in time, we put the entity updates back into the entity update queue
> (with the entry time from their original enqueuing). That would ensure that
> we send an update for the object and that the data sent is the most recent.
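> Something like this is what we have in mind (hypothetical names;
> nothing here is implemented yet):
>
>     // Hypothetical sketch of the proposed fix: remember which entity
>     // updates went into a packet; on ack timeout, requeue the updates
>     // instead of resending the stale bytes.
>     using System.Collections.Generic;
>
>     class EntityUpdate
>     {
>         public double EnqueueTime;    // original entry time, preserved
>     }
>
>     class UnackedPacket
>     {
>         public List<EntityUpdate> SourceUpdates = new List<EntityUpdate>();
>     }
>
>     class Resender
>     {
>         // In the real code this would be the priority queue ordered by
>         // entry time, so requeued updates keep their original position.
>         readonly Queue<EntityUpdate> m_entityUpdateQueue =
>             new Queue<EntityUpdate>();
>
>         public void OnAckTimeout(UnackedPacket packet)
>         {
>             // Don't retransmit the packet; requeue its updates so the
>             // freshest properties get serialized on the next pass.
>             foreach (EntityUpdate update in packet.SourceUpdates)
>                 m_entityUpdateQueue.Enqueue(update);
>         }
>     }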
> * One final note... per-client bandwidth throttles seem to work very well.
> However, our experiments with per-simulator throttles were not positive: it
> appeared that a small number of clients consumed all of the bandwidth
> available to the simulator and the rest were starved. We haven't looked
> into this any more.
>
> So...
> Feedback appreciated... there is some logging code (disabled) in the branch;
> real data would be great. And help testing: there are a number of
> attachments, deletes, and so on that I'm not sure work correctly.
> --mic
>
> _______________________________________________
> Opensim-dev mailing list
> Opensim-dev at lists.berlios.de
> https://lists.berlios.de/mailman/listinfo/opensim-dev