MantisBT - opensim
View Issue Details
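ID: 0006586
Project: opensim
Category: [REGION] OpenSim Core
View Status: public
Date Submitted: 2013-03-26 20:24
Last Update: 2013-05-13 15:52
Reporter: kenvc
Assigned To: kenvc
Priority: high
Severity: crash
Reproducibility: always
Status: closed
Resolution: fixed
Platform: Xeon Quad dual 3.33 CPUs 16gb
OS: Windows
OS Version: Server 2012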
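Product Version: master (dev code)
Fixed in Version: master (dev code)
Git Revision or version number: r22454
Run Mode: Grid (Multiple Regions per Sim)
Physics Engine: ODE
Environment: .NET / Windows32
Mono Version: None
Viewer: Imprudence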
0006586: CPU usage on a multi-sim instance jumps to 99% and stays there. Started noticing this issue about 3/15/2013.
This is possibly related to mantis 0006582, but I am not sure. TPs have been very unreliable the last few days using daily dev master updates. The source of the problem may not be related to teleporting, but that's when I usually notice it.

For the last week or so, CPU usage on one of my instances has suddenly jumped to 98-99% and stayed there, hogging all resources on the server. With my setup, CPU usage is rarely over 15-40%, even when starting up multiple instances with multiple sims. When this issue happens, you have to force close the instance that is hogging all the CPU time, and then all is back to normal. So far the user logs have not shown any clue as to why this is happening, but when it does, a lot of red errors start showing up in the log files because of the slowdown. This is not an easy issue to duplicate unless you are really pushing your ... I am still going through the git bisect process to determine exactly when this issue began.
Run a 35 sim mega without too many prims or scripts on a computer that has a lot of additional load on it, such as other opensim instances or anything else that might slightly slow down the CPU or hard drive response time. So far I have been unable to make this happen by running the 35 sim mega alone with nothing else running. It seems to do fine then. I have seen this issue happen after startup even with no AVs present. I have also seen it happen on non-megas before, but it seems less likely there because they take a lot less time to start up. It seems that load on the computer during instance startup is more likely to cause this issue than a heavy load applied after the sim is fully started. Starting the os instances, say, one every 4 minutes is much less likely to trigger this issue than starting one every 30 seconds, for example.
Log files attached. The following Info was collected while running r/22454, which was the latest Dev Master as of this moment on 03/29/2013. ODE is the physics engine being used and that log is also attached.

All 8 cores show to be maxed out at almost 100% CPU usage.
The instance with the problem is maxed at 97.67% total CPU time and the other processes are using what little CPU time remains, which makes them have issues too. Memory usage by this instance and others looks completely normal.

Windows has displayed a message that says:
Opensim.32BitLaunch has stopped working

In the details it says:

Description:
  Stopped working

Problem signature:
  Problem Event Name: CLR20r3
  Problem Signature 01: opensim.32bitlaunch.exe
  Problem Signature 02: 1.0.0.0
  Problem Signature 03: 49392088
  Problem Signature 04: mscorlib
  Problem Signature 05: 2.0.0.0
  Problem Signature 06: 4fee7067
  Problem Signature 07: 41c8
  Problem Signature 08: a3
  Problem Signature 09: System.InvalidOperationException
  OS Version: 6.2.9200.2.0.0.400.8
  Locale ID: 1033

At console, running "scripts stop" from the root sim of the instance ended all scripts in the instance, but had no effect on CPU usage.

No tags attached.
txt CPU Usage Problem Notes.txt (1,163) 2013-03-29 01:34
http://opensimulator.org/mantis/file_download.php?file_id=3503&type=bug
log OpenSim.32BitLaunch.log (308,436) 2013-03-29 01:34
http://opensimulator.org/mantis/file_download.php?file_id=3504&type=bug
log XEngine.log (18,282) 2013-03-29 01:34
http://opensimulator.org/mantis/file_download.php?file_id=3505&type=bug
Issue History
Date Modified | Username | Field | Change
2013-03-26 20:24 | kenvc | New Issue
2013-03-26 20:24 | kenvc | File Added: Opensim.32BitLaunch.log
2013-03-26 20:27 | kenvc | Description Updated | bug_revision_view_page.php?rev_id=1074#r1074
2013-03-27 15:55 | kenvc | Summary | Cancelling non-responsive TP closed browser and made instance CPU usage jump to 99% and stay there. => CPU usage on a multi-sim instance jumps to 99% and stays there. This started about a week or so ago.
2013-03-27 15:55 | kenvc | Description Updated | bug_revision_view_page.php?rev_id=1075#r1075
2013-03-27 15:55 | kenvc | Steps to Reproduce Updated | bug_revision_view_page.php?rev_id=1077#r1077
2013-03-27 15:55 | kenvc | Additional Information Updated | bug_revision_view_page.php?rev_id=1079#r1079
2013-03-27 23:02 | kenvc | File Deleted: Opensim.32BitLaunch.log
2013-03-29 01:34 | kenvc | File Added: CPU Usage Problem Notes.txt
2013-03-29 01:34 | kenvc | File Added: OpenSim.32BitLaunch.log
2013-03-29 01:34 | kenvc | File Added: XEngine.log
2013-03-29 01:46 | kenvc | Git Revision or version number | TBA => r22454
2013-03-29 01:46 | kenvc | Steps to Reproduce Updated | bug_revision_view_page.php?rev_id=1082#r1082
2013-03-29 01:46 | kenvc | Additional Information Updated | bug_revision_view_page.php?rev_id=1083#r1083
2013-03-29 01:47 | kenvc | Summary | CPU usage on a multi-sim instance jumps to 99% and stays there. This started about a week or so ago. => CPU usage on a multi-sim instance jumps to 99% and stays there. Started seeing this issue about 10 days ago.
2013-03-30 01:22 | kenvc | Note Added: 0023730
2013-03-30 01:25 | kenvc | Note Edited: 0023730 | bug_revision_view_page.php?bugnote_id=23730#r1089
2013-03-30 09:27 | kenvc | Note Edited: 0023730 | bug_revision_view_page.php?bugnote_id=23730#r1090
2013-03-30 15:04 | kenvc | Note Edited: 0023730 | bug_revision_view_page.php?bugnote_id=23730#r1091
2013-03-31 20:15 | kenvc | Note Added: 0023736
2013-03-31 21:09 | kenvc | Note Edited: 0023736 | bug_revision_view_page.php?bugnote_id=23736#r1093
2013-04-02 16:23 | kenvc | Note Added: 0023740
2013-04-02 16:37 | kenvc | Note Edited: 0023740 | bug_revision_view_page.php?bugnote_id=23740#r1097
2013-04-02 16:37 | kenvc | Reproducibility | sometimes => always
2013-04-02 16:39 | kenvc | Note Edited: 0023740 | bug_revision_view_page.php?bugnote_id=23740#r1098
2013-04-02 18:24 | Allen Kerensky | Note Added: 0023742
2013-04-02 18:35 | kenvc | Note Added: 0023743
2013-04-03 16:42 | justincc | Note Added: 0023745
2013-04-03 17:55 | kenvc | Note Added: 0023747
2013-04-03 18:24 | kenvc | Note Added: 0023749
2013-04-04 16:08 | justincc | Note Added: 0023753
2013-04-04 16:25 | justincc | Note Added: 0023755
2013-04-18 10:32 | kenvc | Note Added: 0023783
2013-05-03 19:23 | kenvc | Note Added: 0023832
2013-05-03 19:31 | kenvc | Note Edited: 0023832 | bug_revision_view_page.php?bugnote_id=23832#r1147
2013-05-03 19:33 | kenvc | Summary | CPU usage on a multi-sim instance jumps to 99% and stays there. Started seeing this issue about 10 days ago. => CPU usage on a multi-sim instance jumps to 99% and stays there. Started noticing this issue about 3/15/2013.
2013-05-13 15:51 | kenvc | Note Added: 0023880
2013-05-13 15:51 | kenvc | Status | new => resolved
2013-05-13 15:51 | kenvc | Fixed in Version | => master (dev code)
2013-05-13 15:51 | kenvc | Resolution | open => fixed
2013-05-13 15:51 | kenvc | Assigned To | => kenvc
2013-05-13 15:52 | kenvc | Status | resolved => closed

Notes
(0023730)
kenvc   
2013-03-30 01:22   
(edited on: 2013-03-30 15:04)
Justin,
I may be onto something. I made the following changes to opensim.ini and restarted all instances at the same time to put maximum load on the system. So far no issues at all and memory consumption even seems a little less! I am thinking doubling the ThreadStackSize from the default setting may be the main thing that helped. Only way to know would be to put one setting at a time back to default until the CPU maxes out again. Let me know what you think.

[Startup]
    async_call_method = UnsafeQueueUserWorkItem
    
[XEngine]
    MaxThreads = 100
    MaxScriptEventQueue = 300
    ThreadStackSize = 524288
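
The "one setting at a time back to default" test mentioned above could be organized along these lines. This is only a rough Python sketch, not something that was run for this report. The helper name revert_one_at_a_time is made up for illustration, the SmartThreadPool default for async_call_method is an assumption, and the 262144 default for ThreadStackSize is simply half of the doubled value. Only two of the four changed settings are shown because the defaults for the others are not stated here.

```python
def revert_one_at_a_time(changed, defaults):
    """Yield (setting, trial_config) pairs, each with exactly one setting
    put back to its default while all other changes stay in place."""
    for key in changed:
        if key not in defaults:
            continue  # no known default to fall back to, so skip it
        trial = dict(changed)
        trial[key] = defaults[key]
        yield key, trial


# Settings as changed in this note, keyed by (section, name).
changed = {
    ("Startup", "async_call_method"): "UnsafeQueueUserWorkItem",
    ("XEngine", "ThreadStackSize"): 524288,
}

# Assumed defaults (see the caveats above).
defaults = {
    ("Startup", "async_call_method"): "SmartThreadPool",  # assumption
    ("XEngine", "ThreadStackSize"): 262144,  # half of the doubled value
}

for key, trial in revert_one_at_a_time(changed, defaults):
    section, name = key
    print(f"Trial: revert [{section}] {name} to {trial[key]}")
```

Each trial would be run under the same heavy startup load. The first trial that brings the 99% CPU condition back points at the setting that matters.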

(0023736)
kenvc   
2013-03-31 20:15   
(edited on: 2013-03-31 21:09)
Other than the unusually unreliable teleports, the changes in the previous note have completely resolved the 99% CPU usage issue. I am still not certain which change fixed the problem, but this may be a clue as to what area still needs work to prevent this issue in the future... without having to experiment with settings changes to fix it.

I am still at a loss as to why teleports seem to be a lot less reliable than they were a few weeks (maybe as long as a month) ago.

(0023740)
kenvc   
2013-04-02 16:23   
(edited on: 2013-04-02 16:39)
The problem is apparently with the async_call_method = SmartThreadPool setting. With that setting, the 100% CPU problem came right back within 10 minutes of restarting a large mega. This test was on 64 bit Windows Server 2012 using the 32 bit opensim. MaxPoolThreads was set to 1200 during this test, if this info helps.

I changed it back to UnsafeQueueUserWorkItem, and it is running as smooth as silk.

(0023742)
Allen Kerensky   
2013-04-02 18:24   
Using SmartThreadPool with MaxPoolThreads = 1200 may be what's causing your CPU usage to spike on an 8 core.

This high of a setting will "sabotage" how .NET fires off work threads, in this case, allowing .NET to fire up too many, causing the CPU to get swamped.

As a rule of thumb for tuning, I'd suggest roughly 30-45 threads per core.
So, when you try the SmartThreadPool again, try MaxPoolThreads = 240-360 rather than 1200.

Also, I usually see the ThreadStackSize set to 262144.
So, if cutting back the MaxPoolThreads works, then you might also try cutting back the ThreadStackSize as well.
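
For reference, the arithmetic behind that rule of thumb can be written out as a small Python sketch. The function name suggested_max_pool_threads is made up for illustration and is not part of OpenSimulator. The 30 and 45 per-core figures are just the rule of thumb above, not measured values.

```python
def suggested_max_pool_threads(cores, per_core_low=30, per_core_high=45):
    """Return the (low, high) MaxPoolThreads range for a given core count,
    using a simple threads-per-core rule of thumb."""
    return cores * per_core_low, cores * per_core_high


low, high = suggested_max_pool_threads(8)
print(f"MaxPoolThreads = {low}-{high}")  # prints "MaxPoolThreads = 240-360"
```

On an 8 core machine this gives the 240-360 range, well below 1200.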
(0023743)
kenvc   
2013-04-02 18:35   
I used to have MaxPoolThreads at 45, but observed chat on Opensim-dev that suggested a way to compute the best number based on the number of cores etc, and the number it came up with for my computer was actually more than 1200.
(0023745)
justincc   
2013-04-03 16:42   
@kenvc Did you try QueueUserWorkItem by the way? If not, could you? If you have, are the effects any different?
(0023747)
kenvc   
2013-04-03 17:55   
Justincc,
Just tried QueueUserWorkItem on 2 large megas. The 2nd one was started about 30 seconds behind the first one and within a minute or 2 the CPU went to 100% on the 2nd one. I forcefully closed the 2nd one and the CPU went back to normal with the first one still starting up, then within a minute or so the first one went to 100% CPU also.

It appears this issue is still present when using QueueUserWorkItem, so I am switching them back to UnsafeQueueUserWorkItem.
(0023749)
kenvc   
2013-04-03 18:24   
I just restarted everything with it set back to UnsafeQueueUserWorkItem.
I watched the Resource Monitor very closely and found this setting for sure uses significantly less CPU time during startup. The CPU never jumped higher than 80% and it settled down nicely to about 20% once all instances were loaded.

With the other settings, if the CPU ever hits 100% during startup, you can almost count on one of the instances going to 100% very shortly afterwards and staying there.
(0023753)
justincc   
2013-04-04 16:08   
Regarding the teleport issues, could you try replacing the current bin/HttpServer_OpenSim.dll with the one that was in the OpenSimulator 0.7.5 package on both source and destination simulators (or just on the single simulator if appropriate)? I've tried this and it appears to be compatible. I'm interested in seeing whether changes to this DLL have anything to do with the unreliable teleports.

I'd also be interested in knowing whether you're seeing unreliable teleports within a simulator or just between simulators (and whether these simulators are running on different machines, on the same network, on different networks, etc.).
(0023755)
justincc   
2013-04-04 16:25   
And/or please could you try the very latest git master d236796 since this contains an HTTP related fix (network comms are a critical part of the teleport process when going between simulators).
(0023783)
kenvc   
2013-04-18 10:32   
Justin,
This is still easily reproducible now that I know what to do to avoid the problem. I can start several normal (non-mega) instances up and then start a large mega up while the others are still starting up, and within a few minutes the mega causes the CPU usage to go high and stay there. The mega can be forcefully closed and then CPU goes back to normal. The only thing I see in the log that looks the least bit unusual is Watchdog timeout messages. This high load startup condition is for sure creating some type of race condition that seems to be much more likely to happen in a mega while it is starting up.
(0023832)
kenvc   
2013-05-03 19:23   
(edited on: 2013-05-03 19:31)
Justin,

It appears a recent change (using r/22679 now) is now preventing the CPU from going to 99-100% and staying there, but something odd is still going on.

Using the same conditions to reproduce this problem as described in the last note, the CPU usage on the last instance loaded never settles back even close to normal even hours later. The CPU usage on all the other sims is about 1-2% per instance max, but the CPU usage on the last one loaded doesn't go below 20-25%. I can exit the last instance loaded, and restart it (with all other instances still running), and after it loads the CPU settles in to about 1-2% as it should be. There is nothing in the log files to give any clue why this is happening. No errors, and not even any warnings.

In Summary: When multiple instances are started and loaded together, the CPU usage stays much higher than normal and never goes back down to normal, even hours later. If I start them up making sure the previous instance is totally loaded before starting the next instance, the CPU usage that results after they are all loaded is much less and appears to stay at normal levels.
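
The "previous instance totally loaded before starting the next" startup order could be scripted roughly as follows. This is only a Python sketch under assumptions. The launch and is_ready hooks are hypothetical placeholders (for example, starting the process and watching its console log), and nothing here is an existing OpenSimulator tool.

```python
import time


def start_sequentially(names, launch, is_ready, poll_interval=5.0, timeout=600.0):
    """Launch each instance only after the previous one reports ready.

    launch(name) starts an instance; is_ready(name) returns True once it
    has finished loading. Both are hypothetical hooks supplied by the caller.
    Returns the list of started names, or raises TimeoutError if an
    instance does not become ready within `timeout` seconds.
    """
    started = []
    for name in names:
        launch(name)
        deadline = time.monotonic() + timeout
        while not is_ready(name):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{name} did not finish loading")
            time.sleep(poll_interval)
        started.append(name)
    return started
```

How "ready" is detected is left open on purpose, since that depends on how each instance is run and monitored.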

(0023880)
kenvc   
2013-05-13 15:51   
This issue appears to be at least 90% fixed at this point. If it starts happening again, a new mantis can be issued or this one can be reopened.