Mantis Bug Tracker

View Issue Details
ID: 0008873
Project: opensim
Category: [REGION] OpenSim Core
View Status: public
Date Submitted: 2021-03-07 17:16
Last Update: 2021-10-15 14:24
Reporter: mewtwo0641
Assigned To:
Priority: normal
Severity: major
Reproducibility: random
Status: new
Resolution: open
Platform:
Operating System:
Operating System Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0008873: Viewer stops receiving updates from sim at random
Description: On master, the viewer seems to randomly stop receiving updates from the sim. This is characterized by the avatar not being able to move (it can only spin in place), no longer seeing messages in general chat, no longer seeing other users log on or off, etc.

It doesn't seem to be a complete disconnect, because IM still works, other people can see you move (even though it appears to you that you're stuck in place), and the viewer doesn't give you the "you've been disconnected" message.

This started happening within the past 2 or 3 months of commits on the master branch. This wasn't an issue before that period, or at least it happened very infrequently, but now it happens many times a day, and the only fix is to relog, which only resolves the issue temporarily.

I have gone long periods without realizing I was in that semi-frozen state, until someone IMs me and asks me to look at something they're working on and give my opinion, and I'm really confused because, to me, it looks like they're just standing there doing nothing.

The issue doesn't seem to be viewer specific; I have tried the latest Firestorm and the latest Singularity Viewer.
Steps To Reproduce: No known steps to reproduce at this time; the issue is random.
Additional Information: Commits tested:

915bc74ab40918294f4b0d52a9e77bf9eed2ec1a (2/16/2021) - Freeze

0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020) - Freeze

bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020) - No freeze after testing for a little over a week under various network conditions, CPU/RAM loads, and people randomly connecting to the grid.

1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) - No freeze after testing for a little over a week under various network conditions, CPU/RAM loads, and people randomly connecting to the grid.

--------------------------------------------------------------

It seems that freezes started happening sometime between 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) and 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020)
Tags: No tags attached.
Git Revision or version number:
Run Mode: Grid (Multiple Regions per Sim)
Physics Engine: ubODE
Script Engine: YEngine
Environment: .NET / Windows64
Mono Version: None
Viewer:
Attached Files: 0071-Revert-01.patch (1,157 bytes) 2021-03-30 12:21
0072-Revert-02.patch (6,553 bytes) 2021-03-30 12:21
0073-Revert-03.patch (3,582 bytes) 2021-03-30 12:21

Relationships

Notes
(0037605)
tampa (reporter)
2021-03-07 23:40

Is the region console reporting any resent packets?
Have you tried a different port in the viewer's networking settings?
Take a look at your local network to see whether anything is spamming it, and maybe disconnect any other computers. Give the router/modem a restart and check the QoS setup.
(0037606)
BillBlight (developer)
2021-03-08 00:27

Also, some recent AV/firewall updates seem to think some OpenSim traffic is a type of network attack and block it.
(0037607)
mewtwo0641 (reporter)
2021-03-08 01:43

@tampa - The console does report duplicated packets; these usually only show up upon login and stop after a few moments. I am unsure why this happens, but I have noticed it for a long time, since before the freezing issue was a problem.

I have not tried a different port on the viewer yet.

Nothing seems abnormal on the network: router bandwidth graphs don't show any sudden major spikes, and task manager network graphs on all computers show low use most of the time (< 1%) unless I am purposefully doing something such as downloading a large file; I am usually not doing that when the freezes occur (a few times it has happened while I was reading a book and all my systems were idle network-wise except for OpenSim).

I did find out a few days ago that the NIC on the system running OpenSim was causing odd issues, randomly spiking the CPU very hard for extended periods of time, so I replaced the card thinking perhaps that was causing the freezing issue... The CPU spiking went away after the replacement, but the freezes on OpenSim still occur.

QoS is disabled in the router setup.

The router and modem have been restarted a number of times over the past few months since noticing the issue, but the issue still persists. Everything else on the network works fine with no issue.

The system in question also recently had its operating system reinstalled from scratch as a troubleshooting step; the issue was occurring before the reinstall and still occurs after it.

@Bill - I tried testing with the AV and firewall disabled (their processes not running at all), but the issue still occurs.

--------------------

As a side note, the system running OpenSim is on a LAN local to the computers connecting to it, with several users connecting to it over WAN at any given time. The WAN users haven't reported the freezing (maybe they haven't noticed, or maybe they have and just relogged without a second thought and didn't report it), but I personally, connecting over LAN, see the issue quite often.
(0037608)
UbitUmarov (administrator)
2021-03-08 07:59

Another thing I can't currently repro,
both on Linux (remote sim) and Windows (local machine sim).
(0037609)
mewtwo0641 (reporter)
2021-03-08 10:22

It is very random. Some days are fine and I never experience the issue, other days it may only happen once or twice, and still other days it happens many times throughout the day.

It might just be coincidence, but one thing I have noticed is that one of the users who logs onto the grid has an unstable ISP (which can't be helped; it's the only ISP option they have right now), and when they are having "a bad ISP day" and constantly dropping packets is when I seem to notice the freezes happening a lot more often. But that doesn't make any sense to me... Why would their unstable connection cause problems for other people on the sim? And why would it cause issues with the sim now and not a few months back in the commit history?
(0037610)
UbitUmarov (administrator)
2021-03-08 10:27

A bad connection can mean more traffic from UDP (and low-level TCP) retries back and forth.
Viewers even do very silly (should I say stupid?) retries on HTTP!
The impact on others depends on how busy the server and its network are.
It should not be that high an impact, though (?)
(0037611)
tampa (reporter)
2021-03-08 12:55

I also still get reports of some, though not all, regions being weird: refusing teleports, desync, or odd group chat behavior (granted, that's probably not directly related).

On master, the changes since Feb 12th seemed to make the situation worse. I reverted most of them and that seemed to help a bit, but not entirely. I see no packet loss myself, yet I get stuck not sending presence to the new simulator until the timeout for that transfer is reached.

I haven't had the time to fully investigate this further and determine whether master has more or fewer issues.

The best approach now is to try to determine exactly when the issues began, in an attempt to pinpoint commits that, when reverted, reduce or eliminate the issue, and to work out from there what the heck the compiler broke this time. Unfortunately it's one of those "it compiles but doesn't work" issues.

Another option would be to add additional NUnit tests to give the potential points of failure more scrutiny, if only to cover more failure points.
(0037612)
tampa (reporter)
2021-03-09 11:26

I do have one thing to add: https://www.youtube.com/watch?v=RiBG07uA23Q

This has been going on for years, though; sometimes it seems even such minor things as prim scale and position just don't get communicated to the viewer. I can replicate this fairly reliably when prims have sat on a region for a couple of days.
(0037614)
mewtwo0641 (reporter)
2021-03-09 13:27

I think it might be quite difficult to pin down an exact commit where the issue started just because the issue is so unpredictable and random. As mentioned, I have had days where it doesn't happen at all, days where it is bad enough to happen every 10 - 15 minutes, and anywhere in between.

For the disappearing prims issue, I have also noticed that for years and just assumed it was a viewer issue, since using the viewer's Rebuild Vertex Buffers option or setting it to wireframe mode and back (depending on the viewer used) usually brings the prims back, and I see that issue on SL frequently as well after camming around for a bit and resetting the camera position back to default.
(0037659)
Ferd Frederix (reporter)
2021-03-14 03:22

What is the router brand/model/rev? Sounds similar to what happens on ActionTech routers from FIOS/Frontier. I believe they have only a 1K UDP buffer.
(0037660)
mewtwo0641 (reporter)
2021-03-14 10:33

@Ferd - It's a Linksys WRT 1900ACS v2
(0037661)
mewtwo0641 (reporter)
2021-03-14 16:31

For the time being, I have increased the maximum number of allowed connections as well as the UDP timeout in my router settings, just to see whether it helps this issue or not.
(0037662)
mewtwo0641 (reporter)
2021-03-14 22:04

The increase in max allowed connections and UDP timeout did not seem to help; I had another freeze on the grid just a moment ago. The modem and router were also reset earlier today.
(0037676)
mewtwo0641 (reporter)
2021-03-18 23:10

Just to keep a list of commits tried and tested for this issue, I am starting at commit 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020), which does exhibit the issue, and working my way backwards until I find a commit that doesn't exhibit the issue.

This will take me a while since the issue shows up entirely at random. I am going to test each commit I pick for at least a few days to see whether the issue appears. If it does, I will go back some more commits and try again. I will repeat this process until I hit a commit that doesn't exhibit the issue for at least a week, so that I can be reasonably certain the issue is not present, since it can "hide" for a few days and then show up another day.

I don't think git bisect will be practical here since this will be such a long-running experiment.
(0037677)
tampa (reporter)
2021-03-18 23:28

It's interesting that the issue is still there that far back, as I have not heard of issues around that time; it only started in late February this year. The changes back in September last year did exhibit some really funky behavior in that the compiles did not yield the desired code. I remember that for quite a few days it went back and forth updating libomv to work properly, along with some other changes addressing unwanted behavior.

So I wonder if recent changes simply exposed the issue to a greater degree than before. Certainly never fun to debug these kinds of issues.

I have binary builds from pretty much every commit going back two years or more, so if anyone is willing to help with testing you can find them here: http://archive.opensim.me/files/index.php?dir=binaries%2FOpenSimulator

The respective commits are in the name for reference.
(0037679)
mewtwo0641 (reporter)
2021-03-19 17:16
edited on: 2021-03-19 17:18

@tampa - That is very odd... But that is a good point; I have also noticed other odd, possibly related issues such as mantis 0008695 (animation states getting stuck) as well as teleport issues (refused teleports, freezes similar to this issue after arriving in a new region, etc.).

Another observation I have made on this issue is that having more users connected tends to make it show up more frequently, but I don't have a ton of users connected when it happens; usually it's just me and 2 others at most.

As a side note, I'm not sure whether it matters as it pertains to these issues, but the system I'm encountering them on is running Win 7 x64 with .NET 4.7.2. OpenSim, as well as its MySQL database, runs from an NVMe drive (a separate physical drive from the one the operating system is installed on), so I would think disk I/O shouldn't be the problem?

Crystal Disk Mark results for the drive OpenSim/Database runs from:

   Sequential Read (Q= 32,T= 1) : 1623.859 MB/s
  Sequential Write (Q= 32,T= 1) : 1145.047 MB/s
  Random Read 4KiB (Q= 8,T= 8) : 789.648 MB/s [ 192785.2 IOPS]
 Random Write 4KiB (Q= 8,T= 8) : 448.486 MB/s [ 109493.7 IOPS]
  Random Read 4KiB (Q= 32,T= 1) : 442.563 MB/s [ 108047.6 IOPS]
 Random Write 4KiB (Q= 32,T= 1) : 378.621 MB/s [ 92436.8 IOPS]
  Random Read 4KiB (Q= 1,T= 1) : 52.760 MB/s [ 12880.9 IOPS]
 Random Write 4KiB (Q= 1,T= 1) : 193.153 MB/s [ 47156.5 IOPS]

  Test : 1024 MiB [H: 32.4% (77.2/238.5 GiB)] (x5) [Interval=5 sec]

(0037680)
tampa (reporter)
2021-03-19 18:26

I run almost everything on SSDs now, apart from a few regions that get little traffic. Disk I/O has been getting worse and worse on spinning disks, not sure why, but generally that should have only a minor impact on teleports and the heartbeat.

I suppose the weird bit is the randomness of it all. Some days I have no issues and, with 40+ people constantly moving about, hear very few reports; then some other day there are 10 tickets about problems. Normally I'd blame that on the network, but with gigabit all around and no dropped packets or any other indication... Ubit calls it UDP voodoo, but I still believe there should be a way to at least attempt an exorcism ;D
(0037681)
mewtwo0641 (reporter)
2021-03-21 13:04
edited on: 2021-03-21 13:09

Currently testing commit 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020). So far so good for 2 days; I haven't experienced a freeze, but I am going to continue testing this commit for a while to be certain. I will also start posting the list of commits tested, their commit dates, and the results to the Additional Information section for easier tracking.

@tampa - Same here, I'm running gigabit across the board with all brand new Cat 6 cables. Everything is hardwired and I'm not attempting to connect over wireless when I see the issues.

From the system running OpenSim I've run several ping tests with a count of 50 to google.com, my gateway device, and localhost to see if there were any issues with packet loss, but all of them report no packet loss. Latency was < 1 ms on the gateway and localhost, and an average of 42 ms for the google.com test, with the lowest reply time being 40 ms and the highest 56 ms.

I connect to the grid via the LAN IP address of the system OpenSim is running on when I see the freezes happening, and everyone else connects via the WAN IP. It is mostly me who sees the freezes, but I have received several reports of other people on my grid experiencing them at random times as well.
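
For correlating intermittent loss with the times the freezes occur, a minimal logger along these lines could be left running on the affected machine (only a sketch: it assumes Windows-style "ping -n", so swap in "-c" on Linux, and the target address is just an example to replace with the gateway or sim host):

    # Minimal sketch: periodically ping a target and log timestamped results so
    # intermittent packet loss can be compared against the times freezes occur.
    # Assumes Windows-style "ping -n 1"; on Linux use "ping -c 1" instead.
    import subprocess
    import time
    from datetime import datetime

    TARGET = "8.8.8.8"   # example target; use your gateway or the sim host instead
    INTERVAL = 30        # seconds between samples

    while True:
        result = subprocess.run(["ping", "-n", "1", TARGET],
                                capture_output=True, text=True)
        status = "ok" if result.returncode == 0 else "LOST"
        print(f"{datetime.now().isoformat(timespec='seconds')} {TARGET} {status}")
        time.sleep(INTERVAL)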

(0037684)
mewtwo0641 (reporter)
2021-03-29 01:59

After testing commit 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) for a little over a week, under various network conditions, CPU/RAM loads, and people randomly coming and going on the grid, I have not experienced any freezes; so it seems the freezes started happening sometime between 1c4300ff916a12e8a35e0390365c36cc79e8339f (6/28/2020) and 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020).
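
With the range now bounded by a known-good and a known-bad commit, a small helper along these lines (just a sketch of the manual halving approach, run from inside an OpenSim git checkout; the hashes are the ones quoted in this ticket) can enumerate the candidates and suggest a midpoint to test next, even when each test round takes a week:

    # Sketch of the "halve the suspect range" step for the manual bisection
    # described in these notes. Run inside an OpenSim git checkout; the hashes
    # are the known-good and known-bad commits from this ticket.
    import subprocess

    GOOD = "1c4300ff916a12e8a35e0390365c36cc79e8339f"  # 6/28/2020 - no freeze observed
    BAD = "0c716cbd732608643b7f3ba5e83968e723c2efe6"   # 7/24/2020 - freeze observed

    def commits_between(good, bad):
        """Commits reachable from bad but not from good, newest first."""
        out = subprocess.run(["git", "rev-list", f"{good}..{bad}"],
                             capture_output=True, text=True, check=True)
        return out.stdout.split()

    candidates = commits_between(GOOD, BAD)
    print(f"{len(candidates)} commits in the suspect range")
    if candidates:
        # Testing the middle commit for a week halves the range each round.
        print("suggested next commit to test:", candidates[len(candidates) // 2])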
(0037685)
tampa (reporter)
2021-03-29 05:03

Oh boy, that's a big load of changes; that was the EEP merge, IIRC.

That said, since it started after that point there are a few candidates: a libomv update, which might be the cause, but also the commits

f32c0ead050efcc5b1d01ebf5d7d833b6a5ebf09
6a27f3fb207881c394bc6279fa941f9fd6a973ab
402186844c8048eb4cdffd7d5adcc0a0fbe81fd9
25582af3dc80251f166c92ed01b7b68cba7937e0
e0aff5e6403ebac8c6a347df92c90286236813cc
e08ca7402cd8fb81424d80405496d8064bec539d

which definitely are suspect here due to their changes to communications. That code is too far above my head to fully grasp; all I can say is that it touches relevant areas that could potentially cause this soft disconnect.

The commit range, though, means there is now a point of attack to resolve the problem, or at the very least to improve the code and remove the potential for issues.
(0037687)
tampa (reporter)
2021-03-29 11:52
edited on: 2021-03-30 09:35

So I finally have a region I can reliably reproduce this on. I switched ports on it in hopes that would help, but nope.

What did help, however, was teleporting another user in. That seemingly unfreezes the stuck user, and they can both walk around just fine.

I think that is a clue as to what is happening here: on teleport, the users already in the region obviously need to be made aware of the presence of the arriving user, so information is sent to them; potentially this restarts whatever packet stream was lost in the first place.

EDIT: Further investigations include:

- A HUD-attached AO script with a touch event causes the avatar to snap to its new location after being stuck in place; essentially a brief scene update is sent to the client.
- The stuck user is seen as online, but receiving IMs fails and they go to offline IM, while sending IMs to other users works fine. This is essentially as if the user were on a bad HG agent preferences service config.
- The stuck user cannot see rezzed items, but they show up fine once another user logs in and updates to the client resume.
- Changes to prims, like the mentioned HUD AO, also resume once the other client logs in.

Somehow this points me at two potential things: the agent preferences connector or related systems being weird, or perhaps, because this only happens on specific regions, YEngine blocking something when only one user is present, since it does still register script events and update the client when they happen.

EDIT 2:

I have begun reverting commits from the range that you said might contain the cause; if I find something that helps I'll post the relevant patch. At the moment I am focusing on anything ScenePresence or UserData related, since that seems to be where the failure is, but that is not to say there isn't other stuff as well, not to mention the various changes to oshttp, which I don't know how related to all this they might be. Hopefully one of the shots hits the target.

(0037688)
mewtwo0641 (reporter)
2021-03-30 10:48
edited on: 2021-03-30 10:48

@tampa - Glad to hear that we are starting to narrow in on this a little bit! :)

I am going to try testing at commit bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020), which, as far as I can tell, seems to be the commit just before the start of a lot of the oshttp changes. Since I know that commit 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020) results in a freeze, if I don't experience a freeze at bb5615 then that will leave just 4 commits as suspects to test.

(0037689)
tampa (reporter)
2021-03-30 12:20
edited on: 2021-03-30 12:21

I made two prior reverts (01, 02) without much improvement in getting stuck, then reverted 6fafb7462dae9c47174e54717f3f0ba109dc6fc0 (Revert 03) and went to dinner for over half an hour. Previously I would get stuck pretty regularly after 15-20 minutes. I will update should I end up getting stuck yet again.

Said commit is only cosmetic for the most part, or at least it seems that way, but who knows what "voodoo" goes on.

I have attached patches for everything I reverted thus far.

(0037701)
mewtwo0641 (reporter)
2021-04-03 11:01

Currently still testing bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020). There has been no freeze so far after about 4 days, but I am continuing to keep an eye on it.
(0037702)
tampa (reporter)
2021-04-03 12:39

That commit was just after the one I suspected to be the culprit, strange.

There is some strange behavior that popped up just recently relating to oshttp, but on master code. Apparently it keeps opening sockets and not closing them, because on numerous occasions now I have had regions and Robust go crazy over "too many open files", and I already set the fs ulimit to 99 million. I wonder whether some of the fixes in oshttp may not have been proper as well, given that going back past them in history the problems lessen.

I have yet to test reverting said patches on Robust side of things.
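
To put numbers on the "too many open files" observation above, a watcher like this sketch could be left running next to a region or Robust instance (Linux only; it simply counts the entries in /proc/<pid>/fd, and the PID argument is whatever the affected process reports):

    # Minimal sketch: watch the open file descriptor count of a process on Linux
    # (e.g. a region or Robust instance) to see whether sockets are leaking.
    import os
    import sys
    import time
    from datetime import datetime

    pid = int(sys.argv[1])           # pass the OpenSim/Robust PID on the command line
    fd_dir = f"/proc/{pid}/fd"

    while os.path.isdir(fd_dir):
        count = len(os.listdir(fd_dir))
        print(f"{datetime.now().isoformat(timespec='seconds')} open fds: {count}")
        time.sleep(60)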

Certainly something is not quite right with some of the changes made during that period, even if they are only now starting to show. If I had more time I would run more tests on the region where I saw this issue, to try to pinpoint why, when all regions run the same binary, it seems to be the only place that reliably produces the bug.

Said region does contain data, mesh, scripts, etc., so perhaps it is some combination there. I wanted to remove things from it piece by piece to see when the problem subsides, potentially pointing at issues with physics or scripts, but with a couple hundred objects and scripts that's a task for a week, given that I have to wait 20 minutes each time.

I would certainly hope the potentially problematic commits are being revisited, though, as this issue is pretty breaking when it occurs.
(0037703)
mewtwo0641 (reporter)
2021-04-07 17:04
edited on: 2021-04-07 17:05

After testing bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020) for about a week, I have not noticed any freezes. I have noticed that at this commit, logins fail about 30 - 40% of the time and I have to retry the login. So it seems like at least some of these weird issues were just starting to pop up around this commit and got worse afterwards.

My conclusion is that the freezes started happening sometime between bb56157c92376845a0e1b7a57fbad805bd45e72f (7/23/2020) and 0c716cbd732608643b7f3ba5e83968e723c2efe6 (7/24/2020).

@tampa - I am unsure about your issues showing up prior to that commit; perhaps we are talking about slightly different issues?

I am going to start testing again on latest master now to see if the freezes return.

(0037704)
tampa (reporter)
2021-04-07 22:16

Well, the commit you are testing is a rather minimal change concerning the flotsam cache. After that commit come some commits I had originally deemed suspect of potentially causing the issues. The commit I reverted is the one directly prior to the one you are testing.

The region that showed this issue with master code has been running that Revert 03 for a week now as well, and I have not heard of further issues with it. I am not sure about logins; failures there are not known to me at this point.

I believe e0aff5e6403ebac8c6a347df92c90286236813cc was one of the reverts I made, since it concerns ScenePresence, which seemingly got lost teleporting between regions; but that commit is currently not reverted and that issue has not been reported.

When I find the time I will revert the region back to the master binary and see if the issue returns; then, if time really is generous for once, remove one object at a time to see whether scripts or physics are the trigger for this.

Ideally, now that potentially bad commits are identified, they should be reviewed, with potential references or points of failure tested against directly in code, meaning making absolutely sure they don't inadvertently break code elsewhere it might be referenced or used.
(0037708)
mewtwo0641 (reporter)
2021-04-12 18:21

I am a little stumped now: after about 4 days of testing on master, I haven't encountered a freeze yet. So I may have just gotten really lucky testing after the commit you mentioned as suspect. We might have to look into some more aggressive testing when either of us finds the time, although at this moment I am unsure what that may entail.

As a heads up, I will be unavailable to do a whole lot of testing for a little while, since I have a surgery that needs to be dealt with before I'm feeling well enough to start digging into this. But I will be back as soon as I can :)
(0037709)
tampa (reporter)
2021-04-13 02:12

Good luck and a speedy recovery. When you are back, feel free to poke me on IRC to get the gritty testing going (I keep getting distracted and forgetting this ticket exists).
(0037710)
UbitUmarov (administrator)
2021-04-13 05:31

Yeah, good luck and a speedy recovery.
(0037715)
mewtwo0641 (reporter)
2021-04-15 12:16

@Ubit, @tampa - Thank you! Everything went well, I will be back in touch once I am feeling better.
(0037733)
Frank Hurt (reporter)
2021-05-07 01:32

I just wanted to chime in as I've been following this particular bug for some time. It's affecting my little Dreamgrid installation on a Contabo VPS and it happens frequently. Generally, we've observed it happening when there's more than one avatar present *and* at least one avatar is rezzing objects.

The avatar most likely to freeze up is mine (the grid owner), though when I logged out and then back in with a recently added alt, one of my residents with an older account became the person who kept freezing up (their viewer updates froze).

Almost as though it was a "hot potato" being passed by default to the most senior/oldest avatar account.

We've experimented with a variety of viewers and versions and this issue does seem to be independent of that.

Restarting the grid appeared to help for a short while.

If there is anything I can do to help identify this bug, please let me know. We experience it often enough that, while random, it's frequent.
(0037750)
mewtwo0641 (reporter)
2021-05-29 01:52
edited on: 2021-05-29 01:54

I am semi-back now. I can confirm with Frank that the freezes seem to correlate with others editing/rezzing objects (unsure which is the trigger).

There are also a couple of other suspect commits that I have reverted and been testing without, but doing so has not resolved this issue:

3e5813a0c3841233bfa195e3ca12eac373f54f1a
e86bd042bb06fcc09c38088f3c08bde4cc1c72a2

The reason I had previously suspected them is that they cause other network performance issues (which probably need mantises of their own):

3e5813a0c3841233bfa195e3ca12eac373f54f1a - Causes mesh attachments to not render for an extremely long time upon login on some viewers (I have not tested all viewers, but I know it happens on Singularity Viewer)

e86bd042bb06fcc09c38088f3c08bde4cc1c72a2 - Causes HTTP (and possibly other) operations to pause for approximately 500 ms about every 500 ms. It's not so bad for a few operations, but the delays add up quickly when there are a lot of them, causing things like IAR loads to take an extremely long time to finish compared to before this commit was introduced.
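
To make stalls like the roughly 500 ms pauses described above visible, one option is a timing loop along these lines (only a sketch; the URL below is a placeholder to be pointed at whatever HTTP resource the affected instance actually serves, for example on the region's HTTP port):

    # Minimal sketch: time repeated HTTP GETs against an endpoint served by the
    # process under test and flag requests that stall. The URL is a placeholder;
    # point it at an HTTP resource the affected OpenSim instance actually serves.
    import time
    import urllib.request

    URL = "http://127.0.0.1:9000/"   # placeholder; adjust host/port/path to your setup
    THRESHOLD = 0.4                  # seconds; flag anything slower than this

    for i in range(200):
        start = time.monotonic()
        try:
            urllib.request.urlopen(URL, timeout=5).read()
            elapsed = time.monotonic() - start
            if elapsed > THRESHOLD:
                print(f"request {i}: {elapsed * 1000:.0f} ms  <-- stall")
        except Exception as exc:
            print(f"request {i}: failed ({exc})")
        time.sleep(0.1)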

(0037752)
tampa (reporter)
2021-05-29 02:21

These all mention some form of HTTP, which I am beginning to be suspicious of. The errors I have encountered as of late, while different, all at some point interface with HTTP XML-RPC. I wonder if there is something fundamentally flawed there that only rears its ugly head in specific cases.

The common thread seems to be something regarding getting and handling responses.

Just the other day, agent prefs over HG failed because of GetboolResponse even though the connection worked and a response was given; it just wasn't "complete"? It's confusing, but the relationship between all of these points more and more in the same direction.

I sound like a broken record, but something clearly is no longer communicating as it is supposed to; the protocol is no longer fully compliant in some cases, and the changes in the last few months seem to be the cause, so the entire effort of bringing the codebase up to newer standards for HTTP handling etc. is suspect. Although I do still think there is something broken underneath, and each change we unearth just switches over to that broken behavior.

So either everything gets that treatment now and we'll see if it all blows up, or we roll back each suspect commit and compare the result to master. Both are labor intensive and likely won't yield much information, but what other option is there to finally get to the bottom of this? At least, I think we can agree on this: we don't want to push a 9.2 release that makes half the metaverse lose connections.
(0038002)
mewtwo0641 (reporter)
2021-10-10 05:03

I have noticed a couple of things recently related to this issue:

A friend and I decided to start building up a new region over the past couple of days, so we have been building and editing things on the region quite often. Both of us have experienced the viewer freeze described in this mantis about every 5 or 10 minutes. Once we stopped building and editing for the day and took time to relax and just chat, the freezes stopped happening. This seems to be further confirmation (and hopefully not just anecdotal) of the previous suspicion that building/editing is the trigger, or at least contributes to the freeze issue.

Something new I noticed today is that if you can manage to teleport to another region and then back to the region you froze in, it will "unstick" itself and be fine until the freeze happens again. So a relog isn't always strictly necessary, but teleports aren't always successful for whatever reason, and in that case a relog is necessary.
(0038003)
tampa (reporter)
2021-10-11 03:03

A second region on the same machine began getting avatars stuck on it, which led me to do some digging.

It seems there are quite a few differences in which packages are installed on the machines I run. More interestingly still, according to /proc maps, OpenSim itself loads different libraries.

The region that is affected loads /usr/lib/x86_64-linux-gnu/libnss_mdns4_minimal.so.2 but does not load /usr/lib/libmono-btls-shared.so, which is loaded by another machine that has yet to experience this issue.

I cloned the region via OAR to the other machine and could not reproduce the issue there.

I wonder whether minute differences in the system libraries beyond Mono may result in issues.
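
The loaded-library comparison described above can be made repeatable with a small sketch like this (Linux only): it extracts the set of mapped shared objects from /proc/<pid>/maps so the output from two machines can simply be diffed:

    # Minimal sketch: print the sorted set of shared libraries mapped by a
    # process, taken from /proc/<pid>/maps, so two machines can be diffed.
    import sys

    pid = sys.argv[1]  # pass the OpenSim PID on the command line

    libs = set()
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            fields = line.split()
            # The pathname, if present, is the last field; keep shared objects only.
            if fields and ".so" in fields[-1]:
                libs.add(fields[-1])

    for lib in sorted(libs):
        print(lib)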

I had the same issue with a region running on my own box, which runs Win10 now. So whatever bug this is seems to be at least system-agnostic enough to cause issues on both *nix and Windows. Annoyingly, Windows doesn't give you the loaded libraries of a program so easily, so there is no easy way to compare whether there might be equivalent "missing" libraries not being loaded.

It is unlikely that this is the cause, though, given that the libraries are available and can be loaded just fine; still, the difference in loaded libraries is strange, all else being equal. Also, aren't libraries loaded dynamically by OpenSim?

I did put the reverted code from Revert 03 back on the affected region to see if that, like the last time this happened, has any positive effect. If that doesn't work out, it adds to my suspicion that this bug is perhaps rooted much further down somewhere and that the recent code cleanup only unearthed the issue further.

As noted in the other recent ticket, there still seem to be "disconnects" between OpenSim and the viewer that lead to either updates not being sent or the viewer perhaps not interpreting them correctly; something is clearly amiss with the OpenSim-viewer communication. Fixing this, though... this will be fun.
(0038009)
Kubwa (reporter)
2021-10-14 09:30
edited on: 2021-10-14 09:38

I also noticed that when a freeze happens, the CPU usage for this region goes to 100%, so the opensim.exe process of that simulator (one region per opensim.exe) uses one entire core.
I can't see anything about it in the console. I am using Windows 10.
CPU usage goes back to normal when the avatar that is hanging logs off or leaves the region.

As you can see, the process's memory usage rises quickly too (green) when the CPU starts to work (blue): https://snap.kubwa.de/?f=_zkcht6zj9mxbk8u
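
To timestamp exactly when such a spike begins, a logger along these lines could run next to the simulator (a sketch only; it relies on the third-party psutil package and takes the affected region's opensim.exe PID as a command-line argument):

    # Minimal sketch: log CPU and memory usage of one opensim.exe process so the
    # moment a spike starts can be matched against in-world events.
    # Requires the third-party "psutil" package (pip install psutil).
    import sys
    from datetime import datetime

    import psutil

    proc = psutil.Process(int(sys.argv[1]))  # PID of the affected region's opensim.exe

    while proc.is_running():
        cpu = proc.cpu_percent(interval=5)            # process CPU %, may exceed 100 on multi-core
        rss = proc.memory_info().rss / (1024 * 1024)  # resident memory in MiB
        print(f"{datetime.now().isoformat(timespec='seconds')} cpu={cpu:.0f}% rss={rss:.0f} MiB")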

(0038018)
Kubwa (reporter)
2021-10-15 10:05
edited on: 2021-10-15 11:13

I just saw the problem happen again on my region after I moved an object while holding the shift key to make a copy.
I also noticed that the frozen person can interact with other people in chat and voice freely, can see changes to terrain and parcel settings, and can rez a prim (which I can see but the frozen person can't). The person can't see themselves or me moving, but I can see them moving around.

So it must be something with updates to solid objects in-world.

All freezes came in combination with high CPU usage from the affected region.

----
Another test:

(The avatar who froze in these tests was the latest avatar to enter the region, and only this avatar froze.)

OK, I continued testing. Adding objects to the scene and changing objects both force the same bug, so a general scene change is causing it.

I first created a bunch of regular box prims to check this: a freeze after about 4 minutes and about 50 created primitives.

- relog of the frozen avatar -

Secondly, I just moved these prims around, without adding or removing primitives. Same result: after about 4 minutes, frozen.

- relog of the frozen avatar -

Thirdly, I created a script that rezzes new prims every 2 seconds, to check whether it happens only with avatar interaction or also when a script is rezzing: no freeze after about 10 minutes, whereas in all the other tests the freeze happened in under 5 minutes.

After those 10 minutes, I started interacting again and removed the rezzed prims one by one. A freeze occurred after a few seconds.

---
Nothing happens when I am alone on the region; I never freeze. It only seems to happen with more than one avatar on the region.


== This bug is absolutely reproducible. Just invite another avatar to your region and start rezzing a lot of prims. After a few minutes the other avatar will freeze.

(0038020)
tampa (reporter)
2021-10-15 11:41

That seems to be a different issue. Perhaps what's being discussed here is actually multiple different issues.

Specifically, what was fixed recently was a bug that seemed to occur when threads were killed while they were being tasked with something to do, a timing issue fixed by properly locking things so they could not be messed with while already in the process of doing something.

That seemed to specifically resolve the issue of not being able to move after a couple of minutes of being the only avatar on a region. The trigger for this timing issue is a complete mystery: out of hundreds of regions only one might be affected, but that region, even if duplicated with an OAR, might still exhibit the issue. I suppose somewhere all the different bits align to cause this.

The other issue described here also seems to relate to broken or lost presence. This was originally fixed, but still seems not to be entirely resolved either.

It may still all be related, so make sure to test on current master. There are various places that also have debug logging which can be enabled via the log level to determine what goes on. Additionally, adding more debug output in the code might be necessary; it was in my case.
(0038021)
Kubwa (reporter)
2021-10-15 11:43
edited on: 2021-10-15 12:13

@tampa: So moving to the latest dev might fix my described issue?

---
OK, switching to the latest dev does not fix my described problem. So maybe it's related?

(0038022)
tampa (reporter)
2021-10-15 14:08

That means your issue is not touched by those changes. What confuses me a little is the part about this happening only when there is more than one avatar on the region. Let's get some baseline variables out of the way first to track this down.

Have you experienced this on a specific region or any/all regions?

Are you using the basic OpenSim configuration or did you make changes?

Is this reproducible only in a specific environment, meaning grid, operating system etc?

I have been building around on a region for the last few hours while someone has been watching and we both can still move about.

Since these issues all seem to revolve around the same thing, the viewer and OpenSim somehow losing sync or not operating concurrently together, the likelihood of a relation is there. I would not be surprised if there are other locks in the code that are just not encompassing enough to prevent certain operations from breaking due to minute timing changes, but diagnosis is difficult.

We originally got closer to the issue by finding the rough time it started happening and correlating that with the commits made, which I then reverted one by one until the issue went away. From there we started digging, ending up at a JobEngine locking issue: a line of code that needed a bit more encouragement than it should have, but such are the quirks of timing in async threads.

Can you attempt to figure out a timeframe or range of commits this started with?
(0038023)
Kubwa (reporter)
2021-10-15 14:17

Have you experienced this on a specific region or any/all regions?
> I saw this on different regions in my grid and on other grids. All of these
> grids were running a dev version from around April this year under Windows.

Are you using the basic OpenSim configuration or did you make changes?
> I made some small changes, just edits to the map section and YEngine.

Is this reproducible only in a specific environment, meaning grid, operating system etc?
> I have to check that. I have absolutely no experience with Unix-based systems; I need some time to get a test environment up and running.
(0038024)
tampa (reporter)
2021-10-15 14:24

I can supply you with a region to test on to see if you can reproduce it, or you can try to reproduce it on the Sandbox over in ZetaWorlds.

It being Windows would suggest the potential for .NET having a different kind of timing issue, with missing locks somewhere, since clocks work slightly differently.


