Mantis Bug Tracker

View Issue Details
ID: 0008873
Project: opensim
Category: [REGION] OpenSim Core
View Status: public
Date Submitted: 2021-03-07 17:16
Last Update: 2021-04-15 12:16
Reporter: mewtwo0641
Assigned To:
Priority: normal
Severity: major
Reproducibility: random
Status: new
Resolution: open
Platform:
Operating System:
Operating System Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0008873: Viewer stops receiving updates from sim at random
Description: On master, the viewer seems to randomly stop receiving updates from the sim. This is characterized by the avatar not being able to move (it can only spin on the spot), not seeing messages in general chat, no longer seeing other users log on or off, etc.

It doesn't seem to be a complete disconnect, because IMs still work, other people can see you move (even though it appears to you as if you're stuck on the spot), and the viewer doesn't give you the "you've been disconnected" message.

This started happening within the past 2 or 3 months of commits on the master branch. It wasn't an issue before this period, or at least it happened very infrequently, but now it happens many times a day, and the only fix is to relog, which only resolves the issue temporarily.

I have gone long periods without realizing I was in that semi-frozen state, until someone IMs me asking for my opinion on something they're working on and I'm left confused, because to me it looks like they're just standing there doing nothing.

The issue doesn't seem to be viewer-specific; I have tried the latest Firestorm and the latest Singularity viewer.
Steps To Reproduce: No known steps to reproduce at this time; the issue is random.
Additional Information: Commits tested:

915bc74ab40918294f4b0d52a9e77bf9eed2ec1a (2/16/2021) - Freeze

0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020) - Freeze

bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020) - No freeze after testing for a little over a week under various network conditions, CPU/RAM loads, and people randomly connecting to the grid.

1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) - No freeze after testing for a little over a week under various network conditions, CPU/RAM loads, and people randomly connecting to the grid.

--------------------------------------------------------------

It seems that the freezes started happening sometime between 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) and 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020).
Tags: No tags attached.
Git Revision or version number:
Run Mode: Grid (Multiple Regions per Sim)
Physics Engine: ubODE
Script Engine: YEngine
Environment: .NET / Windows64
Mono Version: None
Viewer:
Attached Files: 0071-Revert-01.patch (1,157 bytes) 2021-03-30 12:21
0072-Revert-02.patch (6,553 bytes) 2021-03-30 12:21
0073-Revert-03.patch (3,582 bytes) 2021-03-30 12:21

- Relationships

-  Notes
(0037605)
tampa (reporter)
2021-03-07 23:40

Is the region console reporting any resent packets?
Have you tried a different port in the viewer's networking settings?
Take a look at your local network and see if anything is spamming it, and maybe disconnect any other computers. Give the router/modem a kick and check the QoS setup.
(0037606)
BillBlight (developer)
2021-03-08 00:27

Also, some recent AV/firewall updates seem to think some OpenSim traffic is a type of network attack and block it.
(0037607)
mewtwo0641 (reporter)
2021-03-08 01:43

@tampa - The console does report duplicated packets; these usually only show up upon login and stop after a few moments. I am unsure why this happens. I have noticed it for a long time though, since before the freezing issue was a problem.

I have not tried a different port on the viewer yet.

Nothing seems abnormal on the network: router bandwidth graphs don't show any sudden major spikes, and Task Manager network graphs on all computers show low use most of the time (< 1%) unless I am purposefully doing something such as downloading a large file; I am usually not doing that when the freezes occur (a few times it has happened while I was reading a book and all my systems were idle network-wise, with the exception of OpenSim).

I did find out a few days ago that the NIC on the system running OpenSim was randomly spiking the CPU very hard for extended periods, so I replaced the card, thinking perhaps that was causing the freezing issue... The CPU-spiking issue went away after the replacement, but the freezes in OpenSim still occur.

QoS is disabled in router setup

The router and modem have been restarted a number of times over the past few months since noticing the issue, but the issue still persists. Everything else on the network works fine.

The system in question has also recently had its operating system reinstalled from scratch as a troubleshooting step; the issue was occurring before the reinstall and still occurs after it.

@Bill - I tried testing with the AV and firewall disabled (their processes not running at all), but the issue still occurs.

--------------------

As a side note, the system running OpenSim is on a LAN local to the computers connecting to it, with several users connecting over the WAN at any given time. I haven't had the WAN users report the freezing (maybe they haven't noticed, or maybe they have and just relogged without a second thought and didn't report it), but I personally, connecting over the LAN, see the issue quite often.
(0037608)
UbitUmarov (administrator)
2021-03-08 07:59

Another thing I can't currently repro,
both on Linux (remote sim) and Windows (local machine sim).
(0037609)
mewtwo0641 (reporter)
2021-03-08 10:22

It is very random. Some days it will be fine and I will never experience the issue, other days it may only happen once or twice, and still other days it will happen many times throughout the day.

It might just be coincidence, but one thing I have noticed is that one of the users who logs onto the grid has an unstable ISP (which can't be helped; it's the only ISP option they have right now), and when they are having "a bad ISP day" and constantly dropping packets is when I seem to notice the freezes a lot more often. But that doesn't make any sense to me... Why would their unstable connection cause problems for other people on the sim? And why would it cause issues with the sim now and not a few months back in the commit history?
(0037610)
UbitUmarov (administrator)
2021-03-08 10:27

A bad connection can mean more traffic from UDP (and low-level TCP) retries back and forth.
Viewers even do very silly (should I say stupid?) retries on HTTP!!!
The impact on others depends on how busy the server and its network are.
It should not be that high an impact (?)
(0037611)
tampa (reporter)
2021-03-08 12:55

I also still get reports of some, not all, regions being weird: refusing teleports, desync, or odd behavior in group chat (granted, that's probably not directly related).

On master, the changes since Feb 12th seemed to make the situation worse; I reverted most of them and that seemed to help a bit, but not entirely. I see no packet loss for myself, yet I get stuck not sending presence to the new simulator until after the timeout for said transfer is reached.

I haven't had the time to investigate this further and determine whether master has more or fewer issues.

The best approach now is to find a way to determine exactly when the issues began, in an attempt to pinpoint commits that, when reverted, reduce or eliminate the issue, and work out from there what the heck the compiler broke this time. Unfortunately it is one of those "it compiles but doesn't work" issues.

Another option would be to add additional NUnit tests to give the potential points of failure more scrutiny, if only to check more failure points.
(0037612)
tampa (reporter)
2021-03-09 11:26

I do have one thing to add: https://www.youtube.com/watch?v=RiBG07uA23Q

This has been going on for years though; sometimes it seems even such minor things as prim scale and position just don't get communicated to the viewer. I can replicate this fairly well when prims have sat on a region for a couple of days.
(0037614)
mewtwo0641 (reporter)
2021-03-09 13:27

I think it might be quite difficult to pin down an exact commit where the issue started, just because the issue is so unpredictable and random. As mentioned, I have had days where it doesn't happen at all, days where it is bad enough to happen every 10-15 minutes, and everything in between.

As for the disappearing-prims issue, I have also noticed it for years and just assumed it was a viewer issue, since using the viewer's Rebuild Vertex Buffers option or toggling wireframe mode and back (depending on the viewer used) usually brings the prims back, and I frequently see the same issue on SL as well after camming around for a bit and resetting the camera position back to default.
(0037659)
Ferd Frederix (reporter)
2021-03-14 03:22

What is the router brand/model/rev? Sounds similar to what happens on ActionTech routers from FIOS/Frontier. I believe they have only a 1K UDP buffer.
(0037660)
mewtwo0641 (reporter)
2021-03-14 10:33

@Ferd - It's a Linksys WRT 1900ACS v2
(0037661)
mewtwo0641 (reporter)
2021-03-14 16:31

For the time being, I have increased the maximum number of allowed connections as well as the UDP timeout in my router settings, just to see whether it helps with this issue.
(0037662)
mewtwo0641 (reporter)
2021-03-14 22:04

The increase in max allowed connections and UDP timeout did not seem to help; I had another freeze on the grid just a moment ago. The modem and router were also reset earlier today.
(0037676)
mewtwo0641 (reporter)
2021-03-18 23:10

Just to keep a list of commits tried and tested for this issue, I am starting at commit 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020), which does exhibit the issue, and working my way backwards until I find a commit that doesn't exhibit the issue.

This will take me a while, since it is entirely random when the issue shows up. I am going to test each commit I pick for at least a few days to see if the issue appears. If it does, I will go back some more commits and try again. I will repeat this process until I hit a commit that doesn't exhibit the issue for at least a week, so that I can be reasonably certain the issue is not present, since it can "hide" for a few days and then show up another day.

I don't think that git bisect will be practical here since it will be such a long-running experiment.
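
That said, for reference, git bisect can also be driven by hand over exactly this kind of multi-day cycle, since each checked-out commit only needs to be marked good or bad once its soak test is finished. A rough sketch (the hashes are placeholders, not confirmed good/bad commits from this ticket):

    # start a bisect session between a known-bad and a known-good commit
    git bisect start
    git bisect bad  <first-known-bad-commit>
    git bisect good <last-known-good-commit>

    # git checks out a commit roughly halfway in between; build it, run the
    # region on it for as many days as needed, then report the result:
    git bisect good    # if no freeze was observed during the soak test
    git bisect bad     # if a freeze occurred

    # repeat until git names the first bad commit, then clean up
    git bisect reset

git bisect log can save the session state between attempts, and git bisect replay restores it later.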
(0037677)
tampa (reporter)
2021-03-18 23:28

It's interesting that the issue is still there that far back, as I have not heard of issues around that time; for me it only started in late Feb this year. The changes back in Sep last year did exhibit some really funky behavior, in that the compiles did not yield the desired code. I remember it going back and forth for quite a few days, updating libomv to work properly, plus some other changes regarding unwanted behavior.

So I wonder if recent changes simply exposed the issue more than before. These kinds of issues are certainly never fun to debug.

I have binary builds from pretty much every commit going back two years or more, so if anyone is willing to help with testing, you can find them here: http://archive.opensim.me/files/index.php?dir=binaries%2FOpenSimulator

The respective commits are in the name for reference.
(0037679)
mewtwo0641 (reporter)
2021-03-19 17:16
edited on: 2021-03-19 17:18

@tampa - That is very odd... But that is a good point; I have also noticed other odd, possibly related issues such as mantis 0008695 (Animation states getting stuck) as well as teleport issues (refusal of teleports, freezes similar to this issue after arriving in a new region, etc.).

Another observation is that having more users connected tends to make the issue show up more frequently; but I don't have a ton of users connected when it shows up, usually just me and 2 others at most.

As a side note, I'm not sure whether it matters for these issues, but the system I'm encountering them on is running Win 7 x64 with .NET 4.7.2. OpenSim, as well as its MySQL database, runs from an NVMe drive (a separate physical drive from the one the operating system is installed on), so I would think disk I/O shouldn't be the problem?

Crystal Disk Mark results for the drive OpenSim/Database runs from:

   Sequential Read (Q= 32,T= 1) : 1623.859 MB/s
  Sequential Write (Q= 32,T= 1) : 1145.047 MB/s
  Random Read 4KiB (Q= 8,T= 8) : 789.648 MB/s [ 192785.2 IOPS]
 Random Write 4KiB (Q= 8,T= 8) : 448.486 MB/s [ 109493.7 IOPS]
  Random Read 4KiB (Q= 32,T= 1) : 442.563 MB/s [ 108047.6 IOPS]
 Random Write 4KiB (Q= 32,T= 1) : 378.621 MB/s [ 92436.8 IOPS]
  Random Read 4KiB (Q= 1,T= 1) : 52.760 MB/s [ 12880.9 IOPS]
 Random Write 4KiB (Q= 1,T= 1) : 193.153 MB/s [ 47156.5 IOPS]

  Test : 1024 MiB [H: 32.4% (77.2/238.5 GiB)] (x5) [Interval=5 sec]

(0037680)
tampa (reporter)
2021-03-19 18:26

I run almost everything on SSDs now, apart from a few regions that get little traffic. Disk I/O has been getting worse and worse for spinners, not sure why, but generally that should have minimal impact on teleports and the heartbeat.

I suppose the weird bit is the randomness of it all. Some days I have no issues and, with 40+ people constantly moving about, hear very few reports; then some other day there are 10 tickets about problems. Normally I'd blame that on the network, but with gigabit all round and no dropped packets or any other indication... Ubit calls it UDP voodoo, but I still believe there should be a way to at least attempt an exorcism ;D
(0037681)
mewtwo0641 (reporter)
2021-03-21 13:04
edited on: 2021-03-21 13:09

Currently testing commit 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020). So far so good for 2 days; I haven't experienced a freeze, but I am going to continue testing this commit for a while to be certain. I will also start posting the list of commits tested, their commit dates, and the results in the Additional Information section for easier tracking.

@tampa - Same here, I'm running gigabit across the board with all brand new Cat 6 cables. Everything is hardwired and I'm not attempting to connect over wireless when I see the issues.

From the system running OpenSim I've run several ping tests with a count of 50 to google.com, my gateway device, and localhost to check for packet loss, but all of them report no packet loss. Latency was < 1 ms on the gateway and localhost, and averaged 42 ms for the google.com test, with the lowest reply time being 40 ms and the highest 56 ms.

I connect to the grid via the LAN IP address of the system OpenSim is running on when I see the freezes, and everyone else connects via the WAN IP. It is mostly me who sees the freezes, but I have received several reports of other people on my grid experiencing them at random times as well.

(0037684)
mewtwo0641 (reporter)
2021-03-29 01:59

After testing commit 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) for a little over a week, under various network conditions, CPU/RAM loads, and people randomly coming and going on the grid, I have not experienced any freezes; so it seems like the freezes started happening sometime between 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) and 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020).
(0037685)
tampa (reporter)
2021-03-29 05:03

Oh boy, that's a big load of changes; that was the EEP merge, iirc.

That said, since it started after that point, there are a few candidates: a libomv update, which might be the cause, but also the commits

f32c0ead050efcc5b1d01ebf5d7d833b6a5ebf09
6a27f3fb207881c394bc6279fa941f9fd6a973ab
402186844c8048eb4cdffd7d5adcc0a0fbe81fd9
25582af3dc80251f166c92ed01b7b68cba7937e0
e0aff5e6403ebac8c6a347df92c90286236813cc
e08ca7402cd8fb81424d80405496d8064bec539d

which are definitely suspect here due to their changes to communications. That code is a bit over my head for me to fully grasp; all I can say is that it touches relevant areas that could potentially cause this soft disconnect.

The commit range, though, means there is now a point of attack to resolve the problem, or at the very least to improve the code and remove the potential for issues.
(0037687)
tampa (reporter)
2021-03-29 11:52
edited on: 2021-03-30 09:35

So I finally got a region I can reliably reproduce this on. I switched ports on it in the hope that would help, but nope.

What did help, however, was teleporting another user in. That seemingly unfreezes the stuck user, and they can both walk around just fine.

I think that is a clue to what is happening here: on teleport, the users in the region obviously need to be made aware of the presence of another user, so information is sent to them; potentially this restarts whatever packet stream was lost in the first place.

EDIT: Further investigations include:

- A HUD-attached AO script with a touch event causes a snap to the new location after being stuck in place; essentially a brief scene update is sent to the client.
- The stuck user is seen as online, but receiving IMs fails and they go to offline IM, while sending IMs to other users works fine. This is essentially as if the user were on a badly configured HG agent preferences service.
- Rezzed items cannot be seen, but they turn out to have rezzed fine once another user logs in and updates to the client resume.
- Changes to prims, like the mentioned HUD AO, also resume once another client logs in.

Somehow this points me at two potential things: the agent preferences connector or related systems being weird, or perhaps, because this only happens on specific regions, something related to YEngine blocking when only one user is present, since it does still register script events and updates the client when they happen.

EDIT 2:

I have begun reverting commits from the range that you said might contain the cause; if I find something that helps I'll post the relevant patch. At the moment I am focusing on anything ScenePresence or UserData related, since that seems to be where the failure is, but that is not to say there isn't other stuff as well, not to mention the various changes to oshttp, which I don't know how related to all this they might be. Hopefully one of the shots hits the target.

(0037688)
mewtwo0641 (reporter)
2021-03-30 10:48
edited on: 2021-03-30 10:48

@tampa - Glad to hear that we are starting to narrow in on this a little bit! :)

I am going to try testing at commit bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020), which, as far as I can tell, is the commit just before the start of a lot of the OsHttp changes. Since I know that commit 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020) results in a freeze, if I don't experience a freeze at bb5615 then it will leave just 4 suspect commits to test.

(0037689)
tampa (reporter)
2021-03-30 12:20
edited on: 2021-03-30 12:21

I made two prior reverts (01, 02) without much success in avoiding getting stuck. I just reverted 6fafb7462dae9c47174e54717f3f0ba109dc6fc0 (Revert 03) and went to dinner for over half an hour; previously I would get stuck pretty regularly after 15-20 minutes. I will update should I end up getting stuck yet again.

Said commit is only cosmetic for the most part, well, it seems that way anyway, but who knows what "voodoo" goes on.

I have attached patches for everything I have reverted thus far.
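
For anyone wanting to reproduce or apply these reverts locally, the numbered file names suggest they are ordinary git format-patch output; a rough sketch of how such a revert patch can be produced and applied, using the commit hash mentioned above and the attachment from this note:

    # create a revert commit for the suspect change and export it as a patch
    git revert --no-edit 6fafb7462dae9c47174e54717f3f0ba109dc6fc0
    git format-patch -1 HEAD

    # on another checkout, apply the attached patch as a commit
    git am 0073-Revert-03.patch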

(0037701)
mewtwo0641 (reporter)
2021-04-03 11:01

Currently still testing bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020). There has been no freeze so far after about 4 days, but I am continuing to keep an eye on it.
(0037702)
tampa (reporter)
2021-04-03 12:39

That commit was just after the one I suspected to be the culprit, strange.

There is some strange behavior that popped up just recently relating to oshttp, but on master code. Apparently it keeps opening sockets and not closing them, because on numerous occasions now I have had regions and Robust go crazy over "too many open files", even though I already set the fs ulimit to 99 million. I wonder if some of the fixes in oshttp may not have been proper either, given that going back past them in the history the problems lessen.
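
For anyone hitting the same "too many open files" symptom, a few generic ways to confirm a socket/file-descriptor leak on a Linux host (the process pattern below is only illustrative; under Mono the region process command line usually contains OpenSim.exe, and Robust.exe for the grid services):

    # per-process file-descriptor limit in the current shell
    ulimit -n

    # number of descriptors currently held by the region process
    # (adjust the pattern if more than one process matches)
    lsof -p "$(pgrep -f OpenSim.exe)" | wc -l

    # socket totals, plus connections stuck in CLOSE-WAIT, i.e. accepted
    # but never closed by the application, the usual sign of a leak
    ss -s
    ss -tn state close-wait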

I have yet to test reverting said patches on Robust side of things.

There is certainly something not quite right with some of the changes made during that period, even if they are only now starting to show. If I had more time I would run more tests on the region I saw this issue on, to try to pinpoint why, when all regions run the same binary, it seems to have been the only place to reliably reproduce the bug.

Said region does contain data, mesh, scripts, etc. Perhaps it is some combination there. I wanted to try removing things from it piece by piece to see when the problem subsides, potentially pointing at issues with physics or scripts, but with a couple hundred objects and scripts that's a task for a week, given I have to wait 20 minutes each time.

I would certainly hope the potentially problematic commits are being revisited, though, as this issue is pretty breaking when it occurs.
(0037703)
mewtwo0641 (reporter)
2021-04-07 17:04
edited on: 2021-04-07 17:05

After testing bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020) for about a week I have not noticed any freezes. I have noticed that at this commit, logins fail about 30-40% of the time and I have to retry the login. So it seems like at least some of these weird issues were just starting to pop up around this commit and got worse afterwards.

My conclusion on this is that the freezes started happening sometime between bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020) and 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020).

@tampa - I am unsure about your issues showing up prior to that commit; could we possibly be talking about slightly different issues?

I am going to start testing again on latest master now to see if the freezes return.

(0037704)
tampa (reporter)
2021-04-07 22:16

Well, the commit you are testing is a rather minimal change concerning the flotsam cache. After that commit come some commits I had originally deemed suspect of potentially being the cause of the issues. The commit I reverted is the one directly prior to the one you are testing.

The region that showed this issue with master code has been running Revert 03 for a week now as well, and I have not heard of further issues with it. I am not sure about logins; failures there are not known to me at this point.

I believe e0aff5e6403ebac8c6a347df92c90286236813cc was one of the reverts I made, since it concerns ScenePresence, which seemingly got lost teleporting between regions, but that commit is currently not reverted and that issue has not been reported.

When I find the time I will revert the region back to the master binary and see if the issue returns; then, if time really is generous for once, remove one object at a time to see if perhaps scripts or physics are the trigger for this.

Ideally, now that potentially bad commits are identified, they should be reviewed, with potential references or points of failure being tested against directly in code, meaning making absolutely sure they don't inadvertently break code elsewhere it might be referenced or used.
(0037708)
mewtwo0641 (reporter)
2021-04-12 18:21

I am a little stumped now: after about 4 days of testing on master, I haven't encountered a freeze yet. So I may have just gotten really lucky when testing after the commit you mentioned as suspect. We might have to look into some more aggressive testing when either of us finds the time, although at this moment I am unsure what that may entail.

As a heads-up, I will be unavailable to do a whole lot of testing for a little while, since I have a surgery that needs to be dealt with before I'm feeling well enough to start digging into this again. But I will be back as soon as I can :)
(0037709)
tampa (reporter)
2021-04-13 02:12

Good luck and a speedy recovery. When you are back, feel free to poke me on IRC to get the gritty testing going (I keep getting distracted and forgetting this ticket exists).
(0037710)
UbitUmarov (administrator)
2021-04-13 05:31

Yeah, good luck and a speedy recovery.
(0037715)
mewtwo0641 (reporter)
2021-04-15 12:16

@Ubit, @tampa - Thank you! Everything went well, I will be back in touch once I am feeling better.

- Issue History
Date Modified Username Field Change
2021-03-07 17:16 mewtwo0641 New Issue
2021-03-07 17:18 mewtwo0641 Description Updated
2021-03-07 23:40 tampa Note Added: 0037605
2021-03-08 00:27 BillBlight Note Added: 0037606
2021-03-08 01:43 mewtwo0641 Note Added: 0037607
2021-03-08 07:59 UbitUmarov Note Added: 0037608
2021-03-08 10:22 mewtwo0641 Note Added: 0037609
2021-03-08 10:27 UbitUmarov Note Added: 0037610
2021-03-08 12:55 tampa Note Added: 0037611
2021-03-09 11:26 tampa Note Added: 0037612
2021-03-09 13:27 mewtwo0641 Note Added: 0037614
2021-03-14 03:22 Ferd Frederix Note Added: 0037659
2021-03-14 10:33 mewtwo0641 Note Added: 0037660
2021-03-14 16:31 mewtwo0641 Note Added: 0037661
2021-03-14 22:04 mewtwo0641 Note Added: 0037662
2021-03-18 23:10 mewtwo0641 Note Added: 0037676
2021-03-18 23:28 tampa Note Added: 0037677
2021-03-19 17:16 mewtwo0641 Note Added: 0037679
2021-03-19 17:18 mewtwo0641 Note Edited: 0037679
2021-03-19 18:26 tampa Note Added: 0037680
2021-03-21 13:04 mewtwo0641 Note Added: 0037681
2021-03-21 13:07 mewtwo0641 Additional Information Updated
2021-03-21 13:09 mewtwo0641 Note Edited: 0037681
2021-03-29 01:59 mewtwo0641 Note Added: 0037684
2021-03-29 02:00 mewtwo0641 Additional Information Updated
2021-03-29 05:03 tampa Note Added: 0037685
2021-03-29 11:52 tampa Note Added: 0037687
2021-03-29 16:34 tampa Note Edited: 0037687
2021-03-30 09:35 tampa Note Edited: 0037687
2021-03-30 10:48 mewtwo0641 Note Added: 0037688
2021-03-30 10:48 mewtwo0641 Note Edited: 0037688
2021-03-30 12:20 tampa Note Added: 0037689
2021-03-30 12:21 tampa File Added: 0071-Revert-01.patch
2021-03-30 12:21 tampa File Added: 0072-Revert-02.patch
2021-03-30 12:21 tampa File Added: 0073-Revert-03.patch
2021-03-30 12:21 tampa Note Edited: 0037689
2021-04-03 11:01 mewtwo0641 Additional Information Updated
2021-04-03 11:01 mewtwo0641 Note Added: 0037701
2021-04-03 12:39 tampa Note Added: 0037702
2021-04-07 16:59 mewtwo0641 Additional Information Updated
2021-04-07 17:04 mewtwo0641 Note Added: 0037703
2021-04-07 17:05 mewtwo0641 Note Edited: 0037703
2021-04-07 22:16 tampa Note Added: 0037704
2021-04-12 18:21 mewtwo0641 Note Added: 0037708
2021-04-13 02:12 tampa Note Added: 0037709
2021-04-13 05:31 UbitUmarov Note Added: 0037710
2021-04-15 12:16 mewtwo0641 Note Added: 0037715

