Mantis Bug Tracker

ID: 0006586
Project: opensim
Category: [REGION] OpenSim Core
View Status: public
Date Submitted: 2013-03-26 20:24
Last Update: 2013-05-13 15:52
Reporter: kenvc
Assigned To: kenvc
Priority: high
Severity: crash
Reproducibility: always
Status: closed
Resolution: fixed
Platform: Xeon Quad dual 3.33 CPUs 16gb
OS: Windows
OS Version: Server 2012
Product Version: master (dev code)
Target Version:
Fixed in Version: master (dev code)
Summary: 0006586: CPU usage on a multi-sim instance jumps to 99% and stays there. Started noticing this issue about 3/15/2013.

Description: This is possibly related to mantis 0006582, but I am not sure. Teleports (TPs) have been very unreliable the last few days using daily dev master updates. The source of the problem may not be related to teleporting, but that is when I usually notice it.

For the last week or so, CPU usage on one of my instances will suddenly jump to 98-99% and stay there, hogging all resources on the server. With my setup, CPU usage is rarely over 15-40%, even when starting up multiple instances with multiple sims. When this issue happens, you have to force close the instance that is hogging all the CPU time, and then all is back to normal. So far the user logs have not shown any clue as to why this is happening, but when it does, a lot of red errors start showing up in the log files because of the slowdown. This is not an easy issue to duplicate unless you are really pushing your... I am still going through the git bisect process to determine exactly when this issue began.

Steps To Reproduce: Run a 35 sim mega without too many prims or scripts on a computer that has a lot of additional load on it, such as other opensim instances or anything else that might slightly slow down the CPU or hard drive response time. I have been unable to make this happen so far by running the 35 sim mega alone with nothing else running. It seems to do fine then. I have seen this issue happen after startup even with no AVs present. I have also seen it happen on non-megas, but it seems less likely there because they take a lot less time to start up. Load on the computer during instance startup seems more likely to cause this issue than a heavy load that arrives after the sim has fully started. Starting the os instances, say, one every 4 minutes is much less likely to trigger this issue than starting one every 30 seconds.

Additional Information: Log files attached. The following info was collected while running r/22454, which was the latest dev master as of this moment on 03/29/2013. ODE is the physics engine being used, and that log is also attached.

All 8 cores show as maxed out at almost 100% CPU usage.
The instance with the problem is maxed at 97.67% total CPU time and the other processes are using what little CPU time remains, which makes them have issues too. Memory usage by this instance and others looks completely normal.

Windows has displayed a message that says:
Opensim.32BitLaunch has stopped working

In the details it says:

Description:
  Stopped working

Problem signature:
  Problem Event Name: CLR20r3
  Problem Signature 01: opensim.32bitlaunch.exe
  Problem Signature 02: 1.0.0.0
  Problem Signature 03: 49392088
  Problem Signature 04: mscorlib
  Problem Signature 05: 2.0.0.0
  Problem Signature 06: 4fee7067
  Problem Signature 07: 41c8
  Problem Signature 08: a3
  Problem Signature 09: System.InvalidOperationException
  OS Version: 6.2.9200.2.0.0.400.8
  Locale ID: 1033

At the console, running "scripts stop" from the root sim of the instance ended all scripts in the instance, but had no effect on CPU usage.

Tags: No tags attached.
Git Revision or version number: r22454
Run Mode: Grid (Multiple Regions per Sim)
Physics Engine: ODE
Script Engine:
Environment: .NET / Windows32
Mono Version: None
Viewer: Imprudence
Attached Files:
CPU Usage Problem Notes.txt (1,163 bytes) 2013-03-29 01:34
OpenSim.32BitLaunch.log (308,436 bytes) 2013-03-29 01:34
XEngine.log (18,282 bytes) 2013-03-29 01:34

- Relationships

-  Notes
(0023730)
kenvc (reporter)
2013-03-30 01:22
edited on: 2013-03-30 15:04

Justin,
I may be onto something. I made the following changes to opensim.ini and restarted all instances at the same time to put maximum load on the system. So far no issues at all and memory consumption even seems a little less! I am thinking doubling the ThreadStackSize from the default setting may be the main thing that helped. Only way to know would be to put one setting at a time back to default until the CPU maxes out again. Let me know what you think.

[Startup]
    async_call_method = UnsafeQueueUserWorkItem
    
[XEngine]
    MaxThreads = 100
    MaxScriptEventQueue = 300
    ThreadStackSize = 524288
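
The sizes in the fragment above can be sanity-checked with a short Python sketch. It assumes the fragment parses as ordinary INI, and it takes 262144 bytes as the stack size being doubled; that figure comes from a later note in this thread, not from OpenSim's own documentation.

```python
import configparser

# Settings copied from the note above.
FRAGMENT = """
[Startup]
    async_call_method = UnsafeQueueUserWorkItem

[XEngine]
    MaxThreads = 100
    MaxScriptEventQueue = 300
    ThreadStackSize = 524288
"""

# Assumption: 262144 bytes is the value that was doubled (see lead-in).
ASSUMED_DEFAULT_STACK = 262144

parser = configparser.ConfigParser()
parser.read_string(FRAGMENT)

# Option names are matched case-insensitively by configparser.
stack_bytes = parser.getint("XEngine", "ThreadStackSize")
stack_kib = stack_bytes // 1024
ratio = stack_bytes / ASSUMED_DEFAULT_STACK

print(f"ThreadStackSize = {stack_kib} KiB ({ratio:g}x the assumed default)")
```

Under that assumption the new value works out to 512 KiB per script thread, twice the 256 KiB baseline.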

(0023736)
kenvc (reporter)
2013-03-31 20:15
edited on: 2013-03-31 21:09

Other than the unusually unreliable teleports, the changes in the previous note have completely resolved the 99% CPU usage issue. I am still not certain which change fixed the problem, but this may be a clue as to what area still needs work to prevent this issue in the future... without having to experiment with settings changes to fix it.

I am still at a loss as to why teleports seem to be a lot less reliable than they were a few weeks (maybe as long as a month) ago.

(0023740)
kenvc (reporter)
2013-04-02 16:23
edited on: 2013-04-02 16:39

The problem is apparently with the async_call_method = SmartThreadPool setting. With that setting, the 100% CPU problem came right back within 10 minutes of restarting a large mega. This test was on 64-bit Windows Server 2012 using the 32-bit OpenSim. MaxPoolThreads was set to 1200 during this test, if this info helps.

I changed it back to UnsafeQueueUserWorkItem, and it is running as smooth as silk.

(0023742)
Allen Kerensky (reporter)
2013-04-02 18:24

Using SmartThreadPool with MaxPoolThreads = 1200 may be what's causing your CPU usage to spike on an 8 core.

This high of a setting will "sabotage" how .NET fires off work threads, in this case, allowing .NET to fire up too many, causing the CPU to get swamped.

As a rule of thumb for tuning, I'd suggest ~30-45 threads per core.
So, when you try the SmartThreadPool again, try MaxPoolThreads = 240-360 rather than 1200.

Also, I usually see the ThreadStackSize set to 262144.
So, if cutting back the MaxPoolThreads works, then you might also try cutting back the ThreadStackSize as well.
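
The arithmetic behind that suggestion, as a minimal Python sketch. The function name `suggested_pool_range` is hypothetical, and the 30-45 threads-per-core figure is the rule of thumb from this note rather than an official OpenSim recommendation.

```python
def suggested_pool_range(cores, low_per_core=30, high_per_core=45):
    """Return (min, max) pool sizes for a per-core rule of thumb."""
    return cores * low_per_core, cores * high_per_core


# The reporter's machine has 8 cores.
low, high = suggested_pool_range(8)
print(f"MaxPoolThreads between {low} and {high}")

# Compare against the value that was in use when the CPU spiked.
configured = 1200
print(f"{configured} is {configured / high:.1f}x the upper bound")
```

For 8 cores this gives the 240-360 range quoted above, so 1200 is more than three times the upper end.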
(0023743)
kenvc (reporter)
2013-04-02 18:35

I used to have MaxPoolThreads at 45, but observed chat on Opensim-dev that suggested a way to compute the best number based on the number of cores etc, and the number it came up with for my computer was actually more than 1200.
(0023745)
justincc (administrator)
2013-04-03 16:42

@kenvc Did you try QueueUserWorkItem by the way? If not, could you? If you have, is the effect any different?
(0023747)
kenvc (reporter)
2013-04-03 17:55

Justincc,
Just tried QueueUserWorkItem on 2 large megas. The 2nd one was started about 30 seconds behind the first one and within a minute or 2 the CPU went to 100% on the 2nd one. I forcefully closed the 2nd one and the CPU went back to normal with the first one still starting up, then within a minute or so the first one went to 100% CPU also.

It appears this issue is still present when using QueueUserWorkItem so switching them back to UnsafeQueueUserWorkItem.
(0023749)
kenvc (reporter)
2013-04-03 18:24

I just restarted everything with it set back to UnsafeQueueUserWorkItem.
I watched the Resource Monitor very closely and found this setting for sure uses significantly less CPU time during startup. The CPU never jumped higher than 80% and it settled down nicely to about 20% once all instances were loaded.

With the other settings, if the CPU ever hits 100% during startup, you can almost count on one of the instances going to 100% very shortly afterwards and staying there.
(0023753)
justincc (administrator)
2013-04-04 16:08

Regarding the teleport issues, could you try replacing the current bin/HttpServer_OpenSim.dll with the one that was in the OpenSimulator 0.7.5 package on both source and destination simulators (or just on the single simulator if appropriate)? I've tried this and it appears to be compatible. I'm interested in seeing whether changes to this DLL have anything to do with the unreliable teleports.

I'd also be interested in knowing whether you're seeing unreliable teleports within a simulator or just between simulators (and whether these simulators are running on different machines, on the same network, on different networks, etc.).
(0023755)
justincc (administrator)
2013-04-04 16:25

And/or please could you try the very latest git master d236796 since this contains an HTTP related fix (network comms are a critical part of the teleport process when going between simulators).
(0023783)
kenvc (reporter)
2013-04-18 10:32

Justin,
This is still easily reproducible now that I know what to do to avoid the problem. I can start several normal (non-mega) instances up and then start a large mega up while the others are still starting up, and within a few minutes the mega causes the CPU usage to go high and stay there. The mega can be forcefully closed and then CPU goes back to normal. The only thing I see in the log that looks the least bit unusual is Watchdog timeout messages. This high load startup condition is for sure creating some type of race condition that seems to be much more likely to happen in a mega while it is starting up.
(0023832)
kenvc (reporter)
2013-05-03 19:23
edited on: 2013-05-03 19:31

Justin,

It appears a recent change (using r/22679 now) is now preventing the CPU from going to 99-100% and staying there, but something odd is still going on.

Using the same conditions to reproduce this problem as described in the last note, the CPU usage on the last instance loaded never settles back even close to normal even hours later. The CPU usage on all the other sims is about 1-2% per instance max, but the CPU usage on the last one loaded doesn't go below 20-25%. I can exit the last instance loaded, and restart it (with all other instances still running), and after it loads the CPU settles in to about 1-2% as it should be. There is nothing in the log files to give any clue why this is happening. No errors, and not even any warnings.

In summary: When multiple instances are loaded together, the CPU usage stays much higher than normal and never goes back down to normal, even hours later. If I start them up making sure the previous instance is totally loaded before starting the next instance, the CPU usage that results after they are all loaded is much less and appears to stay at normal levels.
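
The workaround described in the summary above, starting each instance only once the previous one has finished loading, could be scripted roughly as follows. This is only a sketch: `launch` and `is_ready` are hypothetical callables supplied by the caller, and how readiness is actually detected (for example by watching the instance log) is left as an assumption.

```python
import time


def start_sequentially(instances, launch, is_ready,
                       poll_seconds=5.0, sleep=time.sleep):
    """Start each instance only after the previous one reports ready."""
    started = []
    for name in instances:
        launch(name)
        # Block until this instance is fully loaded before moving on.
        while not is_ready(name):
            sleep(poll_seconds)
        started.append(name)
    return started


# Demo with stand-ins: each fake instance becomes ready on its third poll.
polls = {}


def fake_launch(name):
    polls[name] = 0


def fake_is_ready(name):
    polls[name] += 1
    return polls[name] >= 3


order = start_sequentially(["mega", "region-a"], fake_launch, fake_is_ready,
                           sleep=lambda seconds: None)
print(order)
```

Passing `sleep` in as a parameter keeps the demo instant; a real run would leave it at `time.sleep`.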

(0023880)
kenvc (reporter)
2013-05-13 15:51

This issue appears to be at least 90% fixed at this point. If it starts happening again, a new mantis can be issued or this one can be reopened.

- Issue History
Date Modified Username Field Change
2013-03-26 20:24 kenvc New Issue
2013-03-26 20:24 kenvc File Added: Opensim.32BitLaunch.log
2013-03-26 20:27 kenvc Description Updated
2013-03-27 15:55 kenvc Summary Cancelling non-responsive TP closed browser and made instance CPU usage jump to 99% and stay there. => CPU usage on a multi-sim instance jumps to 99% and stays there. This started about a week or so ago.
2013-03-27 15:55 kenvc Description Updated
2013-03-27 15:55 kenvc Steps to Reproduce Updated
2013-03-27 15:55 kenvc Additional Information Updated
2013-03-27 23:02 kenvc File Deleted: Opensim.32BitLaunch.log
2013-03-29 01:34 kenvc File Added: CPU Usage Problem Notes.txt
2013-03-29 01:34 kenvc File Added: OpenSim.32BitLaunch.log
2013-03-29 01:34 kenvc File Added: XEngine.log
2013-03-29 01:46 kenvc Git Revision or version number TBA => r22454
2013-03-29 01:46 kenvc Steps to Reproduce Updated
2013-03-29 01:46 kenvc Additional Information Updated
2013-03-29 01:47 kenvc Summary CPU usage on a multi-sim instance jumps to 99% and stays there. This started about a week or so ago. => CPU usage on a multi-sim instance jumps to 99% and stays there. Started seeing this issue about 10 days ago.
2013-03-30 01:22 kenvc Note Added: 0023730
2013-03-30 01:25 kenvc Note Edited: 0023730
2013-03-30 09:27 kenvc Note Edited: 0023730
2013-03-30 15:04 kenvc Note Edited: 0023730
2013-03-31 20:15 kenvc Note Added: 0023736
2013-03-31 21:09 kenvc Note Edited: 0023736
2013-04-02 16:23 kenvc Note Added: 0023740
2013-04-02 16:37 kenvc Note Edited: 0023740
2013-04-02 16:37 kenvc Reproducibility sometimes => always
2013-04-02 16:39 kenvc Note Edited: 0023740
2013-04-02 18:24 Allen Kerensky Note Added: 0023742
2013-04-02 18:35 kenvc Note Added: 0023743
2013-04-03 16:42 justincc Note Added: 0023745
2013-04-03 17:55 kenvc Note Added: 0023747
2013-04-03 18:24 kenvc Note Added: 0023749
2013-04-04 16:08 justincc Note Added: 0023753
2013-04-04 16:25 justincc Note Added: 0023755
2013-04-18 10:32 kenvc Note Added: 0023783
2013-05-03 19:23 kenvc Note Added: 0023832
2013-05-03 19:31 kenvc Note Edited: 0023832
2013-05-03 19:33 kenvc Summary CPU usage on a multi-sim instance jumps to 99% and stays there. Started seeing this issue about 10 days ago. => CPU usage on a multi-sim instance jumps to 99% and stays there. Started noticing this issue about 3/15/2013.
2013-05-13 15:51 kenvc Note Added: 0023880
2013-05-13 15:51 kenvc Status new => resolved
2013-05-13 15:51 kenvc Fixed in Version => master (dev code)
2013-05-13 15:51 kenvc Resolution open => fixed
2013-05-13 15:51 kenvc Assigned To => kenvc
2013-05-13 15:52 kenvc Status resolved => closed

