[Opensim-dev] networking issues

Mon Mar 28 19:48:43 UTC 2011

I'm not sure how often the viewer adjusts, but I have seen messages on the
Linux viewer console that mentioned throttle adjustments. At the time they
seemed quite frequent, perhaps tens of seconds apart.

On Mon, Mar 28, 2011 at 12:34 PM, Mic Bowman <cmickeyb at gmail.com> wrote:

> comments below...
>
> On Mon, Mar 28, 2011 at 11:49 AM, Dahlia Trimble <dahliatrimble at gmail.com> wrote:
>
>> a couple thoughts..
>>
>> Perhaps the resend timeout period could be a function of the throttle
>> setting and/or the measured per-client packet acknowledgement time
>> (provided we measure it)? That might avoid unnecessary resend processing.
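[A minimal sketch of the per-client idea above, assuming we can time each acknowledgement. It borrows the smoothing used for TCP's retransmission timer (smoothed ack time plus four times the smoothed deviation); it is not what OpenSim does. The class name, the method name and the clamp values are hypothetical.]

```python
class AckTimeEstimator:
    """Per-client resend timeout derived from measured ack times.

    Uses TCP-style smoothing: timeout = smoothed_time + 4 * deviation.
    The clamp limits below are illustrative assumptions, not OpenSim values.
    """

    def __init__(self, initial_timeout=1.0, min_timeout=0.2, max_timeout=10.0):
        self.srtt = None          # smoothed ack time, in seconds
        self.rttvar = None        # smoothed deviation of the ack time
        self.timeout = initial_timeout
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout

    def on_ack(self, sample):
        """Feed one measured send-to-ack time and return the new timeout."""
        if self.srtt is None:
            # First measurement seeds both estimates.
            self.srtt = sample
            self.rttvar = sample / 2.0
        else:
            # Update the deviation first, using the previous smoothed value.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample)
            self.srtt = 0.875 * self.srtt + 0.125 * sample
        raw = self.srtt + 4.0 * self.rttvar
        self.timeout = min(self.max_timeout, max(self.min_timeout, raw))
        return self.timeout
```

[With this shape, a client that acks in about 100 ms gets a much shorter resend timeout than one that acks in about a second, so the slow client is not hit with resends it was about to acknowledge anyway.]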
>>
>>
> how/when does the viewer change its throttles? i have an "interesting"
> network connection to the public internet from intel (very long cable -->
> periodic high error rates)... i've seen 300% packet loss rates (explain that
> one!)... but never see a "throttle down" packet that would drop the update
> rates back to a range where the network can reasonably handle it.
>
> dan & i were talking today about how to adjust the throttles simulator-side
> when we start to see large numbers of packet retransmissions.
>
>
>> On the distance prioritization, could small changes in object translations
>> be discarded from the prioritization queues/resend buffers for distant
>> objects when new updates occur for those objects? Small changes may not be
>> noticeable from the viewer perspective anyway.
>>
>>
> dropping the update altogether is one approach but seems like it could mess
> up state. deprioritizing based on the magnitude of change seems like a
> better approach?? i think meru uses a version of angular distance. however,
> the computation of that is pretty expensive.
>
> with deprioritizing... we see updates accumulating. since the entity stays
> in the queue longer, you tend to accumulate greater and greater changes.
>
>
>>
>>
>> On Mon, Mar 28, 2011 at 10:48 AM, Teravus Ovares <teravus at gmail.com> wrote:
>>
>>> Here are a few facts that I've personally discovered while working
>>> with LLClientView.
>>>
>>> 1. It has been noted that people with poor connections to the
>>> simulator consume more bandwidth and CPU, and have a generally worse
>>> experience.   This has been tested and profiled extensively.    This
>>> may seem like a small issue because what it's doing is so basic...
>>> however, the frequency with which this occurs is a real cause of
>>> performance issues.
>>>
>>> 2. It's also noted that the CPU used in these cases reduces the CPU
>>> available to the rest of the simulator resulting in a lower quality of
>>> service for the rest of the people on the simulator.
>>> This has been seen in the profiling and has been qualitatively
>>> observed: a large number of users are connected and everything is OK,
>>> and then a 'problem connection' user connects and causes a wide range
>>> of issues.
>>>
>>> 3. It's also noted that lowering the outgoing UDP packet throttles
>>> beyond a certain point results in perpetual queuing and resends.
>>> This was tested by using a throttle multiplier last year that was
>>> implemented by justincc.  I'm not sure if the multiplier is still
>>> there.   It's most easily seen with image packets.   Again, I note
>>> that the packets are not rebuilt going from the regular outbound queue
>>> to the resend queue.    The resend queue is /supposed/ to be used to
>>> quickly get data that is essential to the client after attempting to
>>> send once already.   The LLUDP spec declares the maximum resend to be 2
>>> times, however there has been some considerable debate on whether or
>>> not OpenSimulator should follow that specific specification item
>>> leading to a configuration option to enable perpetual resends
>>> (implemented by Melanie).  The configuration item was named something
>>> like 'reliable is important'.   I'm not sure if the configuration item
>>> survived the many revisions, however I suspect that it did.
>>>
>>> 4. It's also noted that raising the packet throttles beyond what the
>>> connection can support results in resending almost every packet the
>>> maximum number of times before the limit is reached.
>>> This is easily reproducible by setting the connection (in the client)
>>> to the maximum and connecting to a region that you've never been to
>>> before on a sub par connection.   Before the client adjusts and
>>> requests a lower throttle setting there's massive data loss and
>>> massive re-queuing.
>>>
>>> 5. The client tries to adjust the throttle settings based on network
>>> conditions.   This can be observed by monitoring the packet that sets
>>> the throttles and dragging the bar to maximum.   After a certain
>>> amount of resends, the client will call the set throttle packet with
>>> reduced settings (some argue that it doesn't do that fast enough).
>>>
>>> 6. A user who has connected previously to the simulator will use fewer
>>> resources than a user who has never connected to the simulator (this
>>> is mostly because of the image cache on the client).    Any client
>>> that uses CAPS images will use fewer resources than one that uses
>>> LLUDP.
>>>
>>> When working with the packet queues, it's essential to understand
>>> those 6 observations.   Even though the place where you tend to see
>>> the issues with queuing is the image queue over LLUDP, the principles
>>> apply to all of the UDP queues.
>>>
>>> Regards
>>>
>>> Teravus
>>>
>>>
>>> On Mon, Mar 28, 2011 at 1:00 PM, Mic Bowman <cmickeyb at gmail.com> wrote:
>>> > Over the last several weeks, Dan Lake & I have been looking at some of
>>> > the networking performance issues in opensim. As always, our concerns
>>> > are with the problems caused by very complex scenes with very large
>>> > numbers of avatars. However, I think fixing some of the issues we have
>>> > found will generally improve networking with OpenSim. Since this
>>> > represents a fairly significant change in behavior (though the number
>>> > of lines of code is not great), I'm going to put this into a separate
>>> > branch for testing (called queuetest) in the opensim git repository.
>>> > We've found several problems with the current
>>> > networking/prioritization code.
>>> > * Reprioritization is completely broken for SceneObjectParts. On
>>> > reprioritization, the current code uses the localid stored in the scene
>>> > Entities list, but since the scene does not store the localid for SOPs,
>>> > that attempt always fails. So the original priority of the SOP
>>> > continues to be used. This could be the cause of some problems since
>>> > the initial prioritization assumes position 128,128. I don't understand
>>> > all the possible ramifications, but suffice it to say, using the
>>> > localid is causing problems.
>>> > Fix: The sceneentity is already stored in the update, just use that
>>> > instead of the localid.
>>> > * We currently pull (by default) 100 entity updates from the
>>> > entityupdate queue and convert them into packets. Once converted into
>>> > packets, they are then queued again for transmission. This is a bad
>>> > thing. Under any kind of load, we've measured the time in the packet
>>> > queue to be hundreds or even thousands of milliseconds (and highly
>>> > variable). When an object changes one property and then doesn't change
>>> > it again, the time in the packet queue is largely irrelevant. However,
>>> > if the object is continuously changing (an avatar changing position, a
>>> > physical object moving, etc.) then the conversion from an entity update
>>> > to a packet "freezes" the properties to be sent, so with fairly high
>>> > probability the packet contains old data (the properties of the entity
>>> > from the point at which it was converted into a packet).
>>> > The real problem is that, in theory to improve the efficiency of the
>>> > packets (fill up each message), we are grabbing big chunks of updates.
>>> > Under load, that causes queuing at the packet layer, which makes
>>> > updates stale. That is... queuing at the packet layer is BAD.
>>> > Fix: We implemented an adaptive algorithm for the number of updates to
>>> > grab with each pass. We set a target time of 200ms for each iteration.
>>> > That means we are trying to bound the maximum age of any update in the
>>> > packet queue to 200ms. The adaptive algorithm looks a lot like TCP
>>> > congestion avoidance (additive increase, multiplicative decrease):
>>> > every time we complete an iteration (flush the packet queue) in less
>>> > than 200ms we increase linearly the number of updates we take in the
>>> > next iteration (add 5 to the count) and when we don't make it back in
>>> > 200ms, we drop the number we take multiplicatively (cut the number in
>>> > half). In our experiments with large numbers of moving avatars, this
>>> > algorithm works *very* well. The number of updates taken per iteration
>>> > stabilizes very quickly and the response time is dramatically improved
>>> > (no "snap back" on avatars, for example). One difference from
>>> > traditional TCP, which starts with a small window... since the number
>>> > of "static" items in the queue is very high when a client first enters
>>> > a region, we start with the number of updates taken at 500. That gets
>>> > the static items out of the queue quickly (and delay doesn't matter as
>>> > much) and the number taken is generally stable before the
>>> > login/teleport screen even goes away.
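[A minimal sketch of the adaptive update count described above, not the code in the queuetest branch. The 200 ms target, the step of 5, the halving and the starting value of 500 come from the message; the function names and the floor of 1 are assumptions made for illustration.]

```python
TARGET_SECONDS = 0.200   # from the message: target time per iteration
INCREASE_STEP = 5        # from the message: add 5 after a fast iteration
INITIAL_BATCH = 500      # from the message: starting number of updates
MIN_BATCH = 1            # assumption: never drop below one update per pass


def next_batch_size(current, elapsed_seconds):
    """Return how many entity updates to take in the next iteration."""
    if elapsed_seconds < TARGET_SECONDS:
        # Packet queue flushed in time: grow linearly.
        return current + INCREASE_STEP
    # Missed the target: cut the count in half.
    return max(MIN_BATCH, current // 2)


def simulate(elapsed_times, start=INITIAL_BATCH):
    """Replay a series of measured iteration times; return the batch sizes."""
    sizes = [start]
    for elapsed in elapsed_times:
        sizes.append(next_batch_size(sizes[-1], elapsed))
    return sizes
```

[For example, two slow iterations followed by a fast one, `simulate([0.3, 0.3, 0.1])`, takes the count from 500 to 250, then 125, then back up to 130.]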
>>> > * The current prioritization queue can lead to update starvation. The
>>> > prioritization algorithm dumps all entity updates into a single ordered
>>> > queue. Let's say you have several hundred avatars moving around in a
>>> > scene. Since we take a limited number of updates from the queue in each
>>> > iteration, we will take only the updates for the "closest" (highest
>>> > priority) avatars. However, since those avatars continue to move, they
>>> > are re-inserted into the priority queue *ahead* of the updates that
>>> > were already there. So... unless the queue can be completely emptied
>>> > each iteration or the priority of the "distant" (low priority) avatars
>>> > changes, those avatars will never be updated.
>>> > Fix: We converted the single priority queue into multiple priority
>>> > queues and use fair queuing to retrieve updates from each. Here's how
>>> > it works (more or less)... the current metrics (all of the current
>>> > prioritization algorithms use distance at some point for
>>> > prioritization) compute a distance from the avatar/camera to an object.
>>> > We take the log of that distance and use that as the index for the
>>> > queue where we place the update. So close things go into the highest
>>> > priority queue and distant things go into the lowest priority queue.
>>> > Since the area covered by a priority queue grows as the square of the
>>> > radius, the distant (lowest priority) queues will have the most objects
>>> > while the highest priority queues will have a small number of objects.
>>> > Inside each priority queue, we order the updates by the time at which
>>> > they entered the queue. Then we pull a fixed number of updates from
>>> > each priority queue each iteration. The result is that local updates
>>> > get a high fraction of the outgoing bandwidth but distant updates are
>>> > guaranteed to get at least "some" of the bandwidth. No starvation. The
>>> > current prioritization algorithm we implemented is a modification of
>>> > the "best avatar responsiveness" and "front back" in that we use root
>>> > prim location for child prims and the priority of updates "in back" of
>>> > the avatar is lower than updates "in front". Our experiments show that
>>> > the fair queuing does drain the update queue AND continues to provide a
>>> > disproportionately high percentage of the bw to "close" updates.
>>> > One other note on this... we should be able to improve the performance
>>> > of reprioritization with this approach. If we know the distance an
>>> > avatar has moved, we only have to reprioritize objects that might have
>>> > changed priority queues. Haven't implemented this yet but have some
>>> > ideas for how to do it.
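[A minimal sketch of the multiple-queue scheme described above: the log of the distance picks the queue, each queue is first-in first-out, and a fixed number of updates is pulled from every queue per iteration. The log base (2), the number of queues, the per-queue quota and all names are assumptions; the message does not give them. Handling an entity that is already queued is left out.]

```python
import math
from collections import deque


class FairUpdateQueues:
    """Distance-bucketed update queues drained with a fixed per-queue quota."""

    def __init__(self, num_queues=8, per_queue_quota=4):
        # Index 0 holds the closest (highest priority) updates.
        self.queues = [deque() for _ in range(num_queues)]
        self.per_queue_quota = per_queue_quota

    def queue_index(self, distance):
        """Map a distance to a queue index using floor(log2(distance))."""
        if distance < 1.0:
            return 0
        return min(len(self.queues) - 1, int(math.log2(distance)))

    def enqueue(self, entity_id, distance):
        # Appending keeps each queue ordered by time of entry.
        self.queues[self.queue_index(distance)].append(entity_id)

    def dequeue_batch(self):
        """Take up to per_queue_quota updates from every queue, nearest first."""
        batch = []
        for queue in self.queues:
            for _ in range(min(self.per_queue_quota, len(queue))):
                batch.append(queue.popleft())
        return batch
```

[Because the near queues hold few entries and the far queues hold many, an equal quota per queue still gives nearby objects a large share of each batch, while every far queue is guaranteed a few slots per pass.]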
>>> > * The resend queue is evil. When an update packet is sent (they are
>>> > marked reliable) it is moved to a queue to await acknowledgement. If no
>>> > acknowledgement is received (in time), the packet is retransmitted and
>>> > the wait time is doubled and so on... What that means is that resent
>>> > packets in a scene that is rapidly changing will often contain updates
>>> > that are outdated. That is, when we resend the packet, we are just
>>> > resending old data (and if you're having a lot of resends that means
>>> > you already have a bad connection & now you're filling it up with
>>> > useless data).
>>> > Fix: this isn't implemented yet (help would be appreciated)... we think
>>> > that instead of saving packets for resend... a better solution would be
>>> > to keep the entity updates that went into the packet. If we don't
>>> > receive an ack in time, then put the entity updates back into the
>>> > entity update queue (with entry time from their original enqueuing).
>>> > That would ensure that we send an update for the object & that the data
>>> > sent is the most recent.
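[A minimal sketch of the proposed (and, per the message, not yet implemented) resend approach: remember which entity updates went into each packet, and on an ack timeout put those updates back into the update queue with their original entry time, so the next packet is built from current state. The class, its methods and the fixed timeout are hypothetical; none of this is OpenSim's API.]

```python
import heapq
import itertools


class UpdateResender:
    """Re-queue entity updates, rather than stale packets, on ack timeout."""

    def __init__(self, ack_timeout=1.0):
        self.ack_timeout = ack_timeout
        self.update_queue = []   # heap of (enqueue_time, tiebreak, entity_id)
        self.unacked = {}        # seq -> (send_time, [(enqueue_time, entity_id)])
        self._tiebreak = itertools.count()
        self._seq = itertools.count(1)

    def enqueue(self, entity_id, now):
        heapq.heappush(self.update_queue, (now, next(self._tiebreak), entity_id))

    def send_packet(self, count, now, read_state):
        """Build one packet from the oldest updates, reading state at send time."""
        entries, payload = [], []
        while self.update_queue and len(entries) < count:
            enqueue_time, _, entity_id = heapq.heappop(self.update_queue)
            entries.append((enqueue_time, entity_id))
            payload.append((entity_id, read_state(entity_id)))
        if not entries:
            return None, []
        seq = next(self._seq)
        self.unacked[seq] = (now, entries)
        return seq, payload

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def requeue_expired(self, now):
        """Move updates from timed-out packets back, keeping original entry time."""
        expired = [seq for seq, (sent, _) in self.unacked.items()
                   if now - sent >= self.ack_timeout]
        for seq in expired:
            _, entries = self.unacked.pop(seq)
            for enqueue_time, entity_id in entries:
                heapq.heappush(
                    self.update_queue,
                    (enqueue_time, next(self._tiebreak), entity_id))
        return len(expired)
```

[Since the payload is read only when a packet is built, an update that times out and is sent again carries whatever the entity looks like at the second send, not the values frozen in the lost packet.]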
>>> > * One final note... per-client bandwidth throttles seem to work very
>>> > well. However, our experiments with per-simulator throttles were not
>>> > positive. It appeared that a small number of clients was consuming all
>>> > of the bw available to the simulator and the rest were starved. Haven't
>>> > looked into this any more.
>>> >
>>> > So...
>>> > Feedback appreciated... there is some logging code (disabled) in the
>>> > branch; real data would be great. And help testing. There are a number
>>> > of things (attachments, deletes and so on) that i'm not sure work
>>> > correctly.
>>> > --mic
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > _______________________________________________
>>> > Opensim-dev mailing list
>>> > Opensim-dev at lists.berlios.de
>>> > https://lists.berlios.de/mailman/listinfo/opensim-dev