[Opensim-dev] Modifying the networking stack (UNCLASSIFIED)

Mon Nov 17 16:38:49 UTC 2014

Hi Doug and Michael,

Now that I'm also working on the viewer, and prompted by this thread, I 
found the code in the viewer that is responsible for the never-ending 
stream of AgentUpdate messages. From all I can tell, it looks like a 
bug. "Fixing" it results in no behavior changes, as far as I could see, 
and the "spam" stops.
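
To make the idea concrete, here is a minimal sketch in Python (not the viewer's own code, and not the actual patch) of sending an AgentUpdate only when something has changed since the last one sent. The names `AgentState`, `AgentUpdateFilter` and `should_send`, and the three fields chosen, are hypothetical and used purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AgentState:
    """Hypothetical subset of the fields an AgentUpdate carries."""
    body_rotation: Tuple[float, float, float, float]
    camera_center: Tuple[float, float, float]
    control_flags: int


class AgentUpdateFilter:
    """Suppresses updates that repeat the last state that was sent."""

    def __init__(self) -> None:
        self._last_sent: Optional[AgentState] = None

    def should_send(self, state: AgentState) -> bool:
        if state == self._last_sent:
            return False  # nothing changed, so skip this update
        self._last_sent = state
        return True
```

With a filter like this, an avatar that is standing still produces one update instead of a continuous stream. Whether the server relies on the periodic updates for some other purpose is exactly the open question.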

The viewer devs are willing to fix it, at least for opensim. If/when 
they do, this will be the single most important performance optimization 
for opensim in a while. They are a bit more reluctant to change the 
viewer's behavior in the Linden grid. So I filed a bug report with 
Linden Lab, to see if that is an oversight or if they have a good reason 
for having viewers constantly sending agent updates. If there's a good 
reason, it would be good to know what it is.

The bug report is here:
https://jira.secondlife.com/browse/BUG-7816

Hopefully they will look at it and let us all know.

Best,
Diva

On 11/17/2014 6:02 AM, Maxwell, Douglas CIV USARMY ARL (US) wrote:
>
> Good Morning Crista, I appreciate the thoughtful response. Since we 
> have different requirements, it is only natural that our approaches to 
> the testing methodologies will diverge.  It is easy for us to create a 
> sim that can handle 50 active users or 100 passive human users.  (i.e. 
> 50 people moving about or 100 people seated.)  This validates your 
> assertion and we stand behind it with independent testing.  You are 
> welcome to attend MOSES office hours, should you want to see our grid 
> for yourself.
>
> What we need is more than 100 active users in a scene.  For example a 
> platoon of soldiers is 32+ users and if given a patrol mission, they 
> need to interact with the local population which could easily be 100 
> or more people in the scenario.  This is our scalability issue and we 
> require a different approach to injection of NPCs/bots into the system 
> to accurately test the simulator modifications.
>
> As you are aware, it is risky to rely on hundreds of volunteers for
> this testing; further, it is not feasible or cost-effective to expect
> hundreds of paid people to participate.  We need to compose an
> automated way to exercise our designs in a reliable manner.  As I said
> before, a 90th percentile confidence would satisfy me.
>
> We apologize for not being clear earlier, Mike did not mean to imply 
> the MOSES grid had trouble with 2 users.  What we are saying is that 
> the open simulator is fragile.  We can easily create instabilities and 
> lag with *just* 2 users, meaning the grid is susceptible to even minor 
> perturbations.
>
> I believe we have parallel interests here in that we wish to make the 
> simulator usable with 10x more users than currently possible (hundreds 
> instead of just tens).  Our development roadmap for the next 11 months
> includes changes to data transport, physics, and possibly 
> databases.  We will continue to engage this mailing list and offer our 
> code back to the community to abide by all relevant open source 
> licensing terms.  This work is performed using a public GitHub site: 
> https://github.com/M-O-S-E-S
>
> Douglas Maxwell, MSME
> Science and Technology Manager
> Virtual World Strategic Applications
> U.S. Army Research Lab
> Simulation & Training Technology Center (STTC)
> (c) (407) 242-0209
> ------------------------------------------------------------------------
> *From:* opensim-dev-bounces at opensimulator.org 
> [opensim-dev-bounces at opensimulator.org] on behalf of Diva Canto 
> [diva at metaverseink.com]
> *Sent:* Friday, November 14, 2014 12:03 PM
> *To:* opensim-dev at opensimulator.org
> *Subject:* Re: [Opensim-dev] Modifying the networking stack (UNCLASSIFIED)
>
> On 11/14/2014 8:46 AM, Maxwell, Douglas CIV USARMY ARL (US) wrote:
>> Classification: UNCLASSIFIED
>> Caveats: NONE
>>
>> Dr. Lopes, thank you for sharing your paper.  Can you tell me where it was
>> peer reviewed and published?  I would like to reference it in my
>> dissertation.
>
> Gabrielova, E., Lopes, C. V. (2014). Impact of Event Filtering on 
> OpenSimulator Server Performance. In Proceedings of the 2014 Summer 
> Simulation Multi-Conference. Summer Simulation Multi-Conference 
> (SummerSim), USA. The Society for Modeling & Simulation International.
>
>> On the topic of bots, the MOSES team has not been able to compose an NPC
>> agent or bot that accurately replicates the footprint of a human agent on
>> the simulator.
>
> They don't, and that's not how you should look at them. They are 
> artificial, highly controlled versions of viewers. They are lab 
> experiments. You can make them behave *exactly* how you want them to 
> behave, down to the number of AgentUpdates per second, so all 
> experimental conditions are under your control -- as opposed to using 
> real viewers driven by real people, where it becomes impossible to
> pin down what exactly they are doing. Testing with real people gives
> you a *qualitative* view of how things are performing, and that's very 
> important, but it doesn't give you a sharp *quantitative* view of 
> what's going on.
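
As an illustration of that kind of control, here is a minimal Python sketch that computes the exact times at which a bot would send its AgentUpdates for a given rate. The function `update_schedule` and its parameters are hypothetical and are not taken from pCamBot or any other bot framework; the optional jitter is an added assumption for tests that want some variation while staying repeatable.

```python
import random
from typing import List


def update_schedule(rate_hz: float, duration_s: float,
                    jitter_s: float = 0.0, seed: int = 0) -> List[float]:
    """Return the times (in seconds) at which a bot sends its updates."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    interval = 1.0 / rate_hz
    count = int(round(rate_hz * duration_s))
    times = []
    for i in range(count):
        # Optional bounded noise around the nominal send time.
        offset = rng.uniform(-jitter_s, jitter_s) if jitter_s else 0.0
        times.append(max(0.0, i * interval + offset))
    return sorted(times)
```

Calling `update_schedule(10.0, 5.0)` gives 50 evenly spaced send times, and raising `rate_hz` to 60 models the upper rate mentioned elsewhere in this thread. Because the schedule is fixed by its arguments, two runs with the same settings put the same load on the server.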
>
> You should look at bot experiments as laboratory experiments in the 
> pharmaceutical industry. They are a very important first step with which you
> can test a multitude of things in a highly controlled environment 
> before doing clinical tests with humans. The lab experiments will 
> point you to the most critical bottlenecks, and they are quantifiable.
>
> You don't necessarily need to run the bots on the same network as the 
> server. You can run them from Amazon EC2 instances in different parts
> of the world.
>
> Bots have proven invaluable for improving the performance of OpenSim
> leading up to the conferences. I used them last year, and Justin has
> used them even more this year.
>
>
>> We believe this is for many reasons:
>>
>> 1)  Bots are usually composed on a server on the same network, not dispersed
>> across the internet.  The bots should be software throttled and noise
>> introduced into their connections to approximate random access.
>>
>> 2)  Bots aren't using full clients, so they are not filling caches and
>> making the same scene requests as humans in graphical clients.
>>
>> 3)  Bots are usually homogeneous.  They need to be randomly dressed, have
>> random attachments, and have random inventories.
>>
>> 4)  Bots need to move randomly and collide with objects in the scene and
>> with each other.
>>
>> 5)  Bots need to randomly chat with each other and broadcast locally.
>>
>> We think we can create an NPC solution that satisfies these issues.  It will
>> take some thought and development.  Has anyone come close to this?
>>
>> Goal:  Compose bots/NPCs that can approximate the loads of humans within 90%
>> certainty.  Meaning if we load 100 of these artificial agents into the
>> MOSES grid, we are certain that it will accurately behave as if at least 90
>> humans are logged in.
>>
>> IMHO, if you can't assign a reliability to a test, then you are just wasting
>> your time.  These are basic V&V tenets.
>>
>> v/r -douglas
>>
>> Douglas Maxwell, MSME
>> Science and Technology Manager
>> Virtual World Strategic Applications
>> U.S. Army Research Lab
>> Simulation & Training Technology Center (STTC)
>> (c) (407) 242-0209
>>
>>
>>
>> -----Original Message-----
>> From:opensim-dev-bounces at opensimulator.org
>> [mailto:opensim-dev-bounces at opensimulator.org] On Behalf Of Diva Canto
>> Sent: Friday, November 14, 2014 11:05 AM
>> To:opensim-dev at opensimulator.org
>> Subject: Re: [Opensim-dev] Modifying the networking stack
>>
>> On 11/14/2014 6:23 AM, Michael Heilmann wrote:
>>> Thanks for the responses.  I'll go into a little more detail:
>>>
>>> We have been running several profilers against OpenSimulator on the
>>> MOSES grid, and on my development machine.  The tests were to examine
>>> the loading on the server under several different loads, specifically
>>> mesh and physics loads.  What we found is that no matter
>>> what kind of load we placed on the region, even to the point of it
>>> becoming unresponsive due to physics and mesh, the time spent on
>>> scripting and physics was nowhere near the time spent in
>>> OpenSim.Region.ClientStack.LindenUDP once we had more than one or two
>>> avatars logged in.  We know from previous investigations at our
>>> firewall that network traffic for OpenSim is not that heavy,
>>> especially with low numbers of users.
>> If this is a problem, and you are running a recent-ish version of core
>> OpenSim, it sounds like some misconfiguration somewhere. Back in the summer
>> of 2013 we had a problem with the server running OSCC'13; the kernel was
>> configured to run in some sort of special mode that was making everything
>> run badly and unpredictably. We fixed the kernel configuration, and suddenly
>> things started running much more smoothly-- I don't remember the details,
>> but Nebadon may clarify things.
>>
>> OpenSim these days can handle 50 people on a single simulator without much
>> trouble. If you look at figure 7 of my paper
>> (http://www.ics.uci.edu/~lopes/documents/summersim14/gabrielova_lopes_preprint.pdf)
>> you will see the quantification of "without much trouble." I suggest that
>> you reproduce my experimental conditions with pCamBot and check whether your
>> numbers are very different from ours. If they are very different, then
>> there's definitely something odd in your setup, as we were able to reproduce
>> these numbers on several machines. Feel free to contact me directly for
>> details about pCamBot configuration.
>>
>> Bots aren't real viewers, but they are much better for measuring things
>> systematically and detecting problems and bottlenecks than relying on real
>> users driven by real people. The performance you get with pCamBot will be
>> correlated with the performance you get with real users.
>>
>>
>>> I ran several Wireshark captures against a Firestorm viewer logging
>>> into the MOSES public grid ABWIS region, where we hold our office
>>> hours.  I saw that with our current configuration, all traffic between
>>> the server and my client, with the exception of HTTP CAPS and fsapi
>>> calls, was UDP traffic.  This is not immediately concerning, as we
>>> have simian serve our mesh and textures directly. The messages are
>>> mostly binary information, so I could not examine them closely, but I did
>>> see a lot of messages containing identical ASCII strings, such as the
>>> name of my avatar.
>> Hard to say what you saw, but I bet those are the AgentUpdate messages that
>> I mentioned before. The viewer sends at least 10/sec. At times, the viewer
>> sends many more than 10/sec, up to 60/sec. Again, take a look at my paper
>> to understand what those are, and how OpenSim has dealt with them since
>> OSCC'13.
>>
>> As I said before, it would be nice to understand why the viewer is so eager
>> to blabber its status to the server when nothing is going on.
>>
>>
>>> My primary concern is the amount of time spent handling networking,
>>> not necessarily the networking itself.  But there is at least a
>>> portion of messages on the UDP pipeline that are either reliable, or
>>> perhaps should be; and re-implementing a reliable transport over UDP
>>> introduces load at the application layer, instead of letting a
>>> low-level reliable transport such as TCP handle it.  I went to
>>> university with a guy who implemented a Java networking library
>>> completely over UDP, believing that it was faster than a normal TCP
>>> socket; but he was neglecting that the networking hardware handles the
>>> ACK and retransmission transparently, without the messages needing
>>> to be handled manually by the application.
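
To show the kind of bookkeeping a reliable layer over UDP pushes into the application, here is a minimal Python sketch of tracking unacknowledged packets and resending them after a timeout. It is not the LindenUDP implementation; the class `ReliableSender`, its method names, and the one-second timeout are assumptions made for illustration. With TCP, the equivalent work is done by the operating system's TCP stack rather than by application code.

```python
from typing import Dict, List, Tuple


class ReliableSender:
    """Tracks unacknowledged packets so they can be resent after a timeout."""

    def __init__(self, resend_after_s: float = 1.0) -> None:
        self.resend_after_s = resend_after_s
        self._next_seq = 1
        # sequence number -> (payload, time of last transmission)
        self._pending: Dict[int, Tuple[bytes, float]] = {}

    def send(self, payload: bytes, now: float) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = (payload, now)
        return seq  # the caller would write (seq, payload) to the UDP socket

    def ack(self, seq: int) -> None:
        self._pending.pop(seq, None)  # duplicate or late ACKs are ignored

    def due_for_resend(self, now: float) -> List[int]:
        due = [s for s, (_, sent) in self._pending.items()
               if now - sent >= self.resend_after_s]
        for s in due:
            payload, _ = self._pending[s]
            self._pending[s] = (payload, now)  # restart the timer
        return due
```

Every reliable packet costs a table insert, a later lookup on ACK, and a periodic scan for timeouts, all in application code and for every connected client.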
>>>
>>> This may just be my opinion, but since I was going to be examining the
>>> network stack anyway, and since a client-server scenario typically
>>> benefits from a persistent reliable connection over which the server
>>> can push important events to the client, I thought it would be a good
>>> idea.  The points about network throttling and QoS are taken, but
>>> wouldn't they also typically affect the UDP stream? Working on MOSES I
>>> have plenty of problems dealing with external users who operate on
>>> restricted networks, and they cannot see traffic on ports other than
>>> 80 and 443 without dealing with their own IT personnel.  The fact that
>>> it is HTTP over TCP instead of raw TCP makes no difference once it is
>>> on a non-standard HTTP port.
>>>
>>> I agree that it would be more prudent to look at improving the
>>> WebSocket code and the HTTP server, rather than replacing them with a raw
>>> TCP socket, especially given that there are multiple plugins, such as
>>> jsonsimstats, that use the HTTP functionality directly.
>>>
>>> I hope that explains my position a little better.  I would love to
>>> hear if there are other plans/ideas in the community to address
>>> time sinks like this one; networking simply appears to us to be a good
>>> starting point for increasing performance and scalability of the system.
>>>
>> _______________________________________________
>> Opensim-dev mailing list
>> Opensim-dev at opensimulator.org
>> http://opensimulator.org/cgi-bin/mailman/listinfo/opensim-dev
