Mantis Bug Tracker

View Issue Details
ID: 0008832
Project: opensim
Category: [REGION] OpenSim Core
View Status: public
Date Submitted: 2020-12-12 09:32
Last Update: 2020-12-18 02:23
Reporter: Abaddon
Assigned To:
Priority: normal
Severity: trivial
Reproducibility: always
Status: new
Resolution: open
Platform: Windows 2008 R2
Operating System:
Operating System Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0008832: Watchdog timeouts freeze and lag server and console
Description:
19:21:03 - [ENTITY TRANSFER MODULE]: Closing agent X Y in region after teleport
19:21:03 - [CLIENT]: Close has been called for X Y attached to scene region
19:21:03 - [JobEngine]: Stopping AsyncInUDP-23bf2840-cbcc-4435-98d9-1d2b5895a3e9
19:21:03 - [SCENE]: Removing child agent X Y 23bf4840-cbcc-4435-98d9-1d2b5895a3e9 from region
19:21:03 - [CAPS]: Remove caps for agent 23bf2840-cbcc-4435-98d9-1d2b5895a3e9 in region region
19:21:03 - [Scene]: The avatar has left the building
19:21:06 - [WATCHDOG]: Timeout detected for thread "Incoming Packets (Region1)". ThreadState=Background. Last tick was 6318ms ago.
19:21:07 - [WATCHDOG]: Timeout detected for thread "Incoming Packets (Region2)". ThreadState=Background, WaitSleepJoin. Last tick was 7722ms ago.
19:21:07 - [WATCHDOG]: Timeout detected for thread "Incoming Packets (Region3)". ThreadState=Background, WaitSleepJoin. Last tick was 7644ms ago.
Steps To Reproduce: This happens very often when users TP out of the server, and it always happens when I run "generate map" from the server console. The server seems to freeze, the console does not respond, and the logged-in user experiences lag.

Changing the threads to unsafe helped but did not eliminate the problem.
Additional Information: I also run a Linux-based grid server and it does not seem to have any watchdog timeout problems.
Tags: No tags attached.
Git Revision or version number: 5bb21ff982
Run Mode: Grid (Multiple Regions per Sim)
Physics Engine: BulletSim
Script Engine: XEngine
Environment: .NET / Windows64
Mono Version: None
Viewer: Any viewer
Attached Files:
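
For anyone triaging a similar console log, the [WATCHDOG] lines quoted in the description can be pulled out and ranked by stall length with a short script. This is only a sketch: the pattern is modelled on the three lines shown in this report, and the function names (parse_watchdog_line, summarize) are made up for illustration, not part of OpenSim.

```python
import re

# Pattern modelled on the [WATCHDOG] lines quoted in this report (assumption:
# other OpenSim versions may format the message differently).
WATCHDOG_RE = re.compile(
    r'^(?P<time>\d{2}:\d{2}:\d{2}) - \[WATCHDOG\]: Timeout detected for thread '
    r'"(?P<thread>[^"]+)"\. ThreadState=(?P<state>.+?)\. '
    r'Last tick was (?P<ms>\d+)ms ago\.$'
)


def parse_watchdog_line(line):
    """Return a dict for a watchdog timeout line, or None for any other line."""
    match = WATCHDOG_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": match.group("time"),
        "thread": match.group("thread"),
        "states": [s.strip() for s in match.group("state").split(",")],
        "stall_ms": int(match.group("ms")),
    }


def summarize(lines):
    """Collect the parsed timeouts and report the longest stall in milliseconds."""
    hits = [h for h in map(parse_watchdog_line, lines) if h is not None]
    worst = max((h["stall_ms"] for h in hits), default=0)
    return hits, worst


SAMPLE = [
    '19:21:03 - [Scene]: The avatar has left the building',
    '19:21:06 - [WATCHDOG]: Timeout detected for thread "Incoming Packets (Region1)". '
    'ThreadState=Background. Last tick was 6318ms ago.',
    '19:21:07 - [WATCHDOG]: Timeout detected for thread "Incoming Packets (Region2)". '
    'ThreadState=Background, WaitSleepJoin. Last tick was 7722ms ago.',
]

if __name__ == "__main__":
    hits, worst = summarize(SAMPLE)
    for h in hits:
        print(h["time"], h["thread"], h["stall_ms"])
    print("longest stall (ms):", worst)
```

Feeding it a whole OpenSim.log instead of SAMPLE would show whether the stalls cluster around teleports or map generation.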

Relationships:

Notes
(0037358)
tampa (reporter)
2020-12-13 00:32

If you are experiencing "lag" while the simulator is generating a maptile then you are either overloading the hardware or have a particularly complex maptile and Warp3d enabled, because under normal circumstances that should not cause anything noticeable on the region.
(0037360)
UbitUmarov (administrator)
2020-12-13 01:13

Those timeouts are a consequence, not a cause.
I guess the CPU just got overloaded, with all cores busy.
OpenSim is a demanding app. It likes CPUs with many cores and high clocks, and that is a hard and expensive requirement. Also, the more cores are in use, the slower the CPUs run so that they don't melt down.
Still, high clocks should be a better option than many cores.
Normal datacenter servers usually provide a reasonably high number of cores but at moderate clocks, since that is good for typical server applications like web servers.
OpenSim is a somewhat atypical app. It behaves as a server, but it does need high execution performance for scripts, physics and insane data conversions.
It also depends on region usage: a social region may do better on a server with many cores at moderate clocks, while a more physics- or script-intensive one may do better on a server with fewer cores at higher clocks.
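
To illustrate the "consequence, not cause" point: a heartbeat-style watchdog only knows that a thread has not checked in for a while, not why. The following is a toy sketch in Python, not OpenSim's actual Watchdog code. The class, the 5000 ms threshold and the thread names are assumptions chosen for the example.

```python
import threading
import time


class Watchdog:
    """Toy heartbeat monitor (hypothetical; not OpenSim's implementation)."""

    def __init__(self, timeout_ms, clock=time.monotonic):
        self.timeout_ms = timeout_ms
        self._clock = clock
        self._last_tick = {}          # thread name -> time of last heartbeat
        self._lock = threading.Lock()

    def tick(self, name):
        """Called by a monitored thread each time it completes a loop pass."""
        with self._lock:
            self._last_tick[name] = self._clock()

    def check(self):
        """Return (name, ms since last tick) for every thread past the timeout."""
        now = self._clock()
        with self._lock:
            stalled = [
                (name, int((now - last) * 1000))
                for name, last in self._last_tick.items()
                if (now - last) * 1000 > self.timeout_ms
            ]
        return sorted(stalled)


if __name__ == "__main__":
    # A fake clock keeps the demonstration deterministic.
    fake_now = [0.0]
    dog = Watchdog(timeout_ms=5000, clock=lambda: fake_now[0])
    dog.tick("Incoming Packets (Region1)")
    dog.tick("Heartbeat (Region1)")
    fake_now[0] = 4.0
    dog.tick("Heartbeat (Region1)")      # this thread keeps checking in
    fake_now[0] = 6.5                    # the other one has been silent for 6.5 s
    for name, ms in dog.check():
        print(f'Timeout detected for thread "{name}". Last tick was {ms}ms ago.')
```

Whatever kept the thread from calling tick() (a busy CPU, a blocked lock, a runtime pause) produces the same message, so the log line alone does not identify the culprit.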
(0037361)
Abaddon (reporter)
2020-12-13 03:52

8 CPUs at 3GHz on a datacenter server should be more than enough computing power to handle this. Task manager shows 0% CPU usage, with a peak of maybe 5% every 20 seconds or so. How can the CPUs be busy with such usage? No Warp3d maptiles are enabled and only one user is logged in.
(0037362)
UbitUmarov (administrator)
2020-12-13 03:55

It is not clear which OpenSim version this is.
Those were generic considerations.
Garbage collection may also stop execution, causing that.
Anyway, it is hard to debug.
(0037378)
Ferd Frederix (reporter)
2020-12-14 07:26

More likely a script is causing this. XEngine hates llSleep(). I would go to top scripts and kill off the ones that float to the top when sorted by time. Almost always I find an event, such as a timer or a touch event, with llSleep() in it.

Or try YEngine.
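
One way to picture why a sleeping event handler can hurt: if handlers share a small pool of worker threads and a sleep holds on to its worker, every other event queues up behind it. The sketch below models only that general effect with Python's ThreadPoolExecutor. It assumes a sleep blocks its worker thread and does not reproduce XEngine's real scheduler; all names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sleepy_handler(seconds):
    """Stands in for an event handler that sleeps while holding its worker."""
    time.sleep(seconds)
    return "sleepy done"


def quick_handler(queued_at):
    """Reports how long this event waited before a worker picked it up."""
    return time.monotonic() - queued_at


def measure_wait(workers, sleep_seconds):
    """Queue one sleeping handler per worker, then one quick event behind them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(sleepy_handler, sleep_seconds)
        quick = pool.submit(quick_handler, time.monotonic())
        return quick.result()


if __name__ == "__main__":
    waited = measure_wait(workers=2, sleep_seconds=0.3)
    print(f"quick event waited {waited:.2f}s behind sleeping handlers")
```

The quick event does no work of its own, yet it waits roughly as long as the sleep. During that wait the CPU is idle, which would be consistent with the low CPU usage reported above.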
(0037380)
tampa (reporter)
2020-12-14 07:34

You may want to check the clocksource if you are having threading issues, though granted, on Windows this is less of an issue.

Server 2008 is a bit outdated. What .NET version is on it?
(0037385)
Abaddon (reporter)
2020-12-14 12:11
edited on: 2020-12-14 12:17

I am using the latest OpenSim development code, compiled straight from the git sources for Windows and Linux. Windows runs with .NET 4.6, the recommended one.

Ferd, I think you may have a good point. This may be a script related issue.

I tried to run the same grids on Debian 10 Linux with Mono 6.12 and it crashed. No matter what I did (I increased ulimit to 4MB and threads to 1024), Mono seemed to crash at exactly the same point, while loading the scripts of a specific region. Maybe this is related and may make debugging a bit easier.

(0037386)
Ferd Frederix (reporter)
2020-12-14 16:01

Since this happens when you quit, or when you load the CPU by making maps, and since others read this, let me point out that a slow disk could be involved. It may not be your situation, and it does not happen with SSDs.

Hard disks can get slower and slower over time, yet will not use a spare track. Nothing will show up in the SMART status. It doesn't matter how old the drive is: I have seen HDDs that are 2 months old develop this. I have scanned maybe 100 OpenSim servers with HDTUNE.exe from hdtune.com (free for a 15-day evaluation, though I bought a copy and use it on my machines at work and at home). As a result, I no longer use HDDs at home or work except for backup.

I would guess that 1 out of 5 systems I have scanned has a very slow HDD. Just saw a nice Dell last night doing 5 to 10 MB/second. Should be 100 to 150. It certainly can cause these timeout issues.

One clue is when you quit or type backup. This forces OpenSim to flush changed states in scripts to disk. A rapidly changing script has to constantly save state to disk, and a slow HDD can cause it to fall very far behind. A backup on a decent server with many well-made scripts should take a split second to a second or two.

Look at the hdtune web site and you will see a graph of block size vs speed when reading. That's the best test. The rate (Y) droops as the block size (X) goes up, due to more data. A bad HDD has spikes that drop way down. Some I have seen go to 1.5 Kbytes/second, when they should be doing 100 MB. It tries to read, over and over, and eventually succeeds. Then the lying little spinner goes on its happy way.
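
For a rough first look without a dedicated tool, a block-size sweep over one file can be scripted. This is a sketch, not a substitute for HD Tune: it reads a single file rather than scanning the disk surface, and the operating system's cache can inflate the numbers for a file that was just written. The function names and the temporary test file are assumptions made for the example.

```python
import os
import tempfile
import time


def read_throughput(path, block_size):
    """Read the whole file in block_size chunks and return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total / (1024 * 1024) / elapsed


def sweep(path, block_sizes):
    """Measure read throughput for each block size (block size vs speed)."""
    return {size: read_throughput(path, size) for size in block_sizes}


if __name__ == "__main__":
    # Hypothetical test file; point `path` at a large existing file on the
    # disk under suspicion for a more meaningful reading.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
        path = tmp.name
    try:
        for size, rate in sweep(path, [4096, 65536, 1024 * 1024]).items():
            print(f"{size:>8} byte blocks: {rate:10.1f} MB/s")
    finally:
        os.remove(path)
```

A figure far below what the drive should deliver, on a file much larger than RAM, would be the kind of hint described above.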

Also try typing this command in the region: ' debug http all N '. N is a number from 0 to 6, 0 being off. I have seen one region with bad scripts in an endless loop writing to disk over and over, using up all the MySQL bandwidth, yet making no progress on the backups when quitting. CPU was low, but disk and MySQL were going wild.
(0037387)
Abaddon (reporter)
2020-12-14 16:22

Regions run on SSD :)
(0037388)
UbitUmarov (administrator)
2020-12-14 18:58

Note also that running several regions on the same instance makes issues worse.
A problem with one directly impacts all of them. Memory use can put more stress on the nasty garbage collector, etc.
Of course, load on one instance does impact load on the machine, but the operating system is there trying to keep that under control.
(0037389)
Abaddon (reporter)
2020-12-15 00:46

Memory is monitored and is at 50% of total VM memory assigned. It is a code issue and very possibly a script one.
(0037391)
Abaddon (reporter)
2020-12-15 14:15

This is the Mono crash I get when I try to load the same grids on a Debian/Mono system:


=================================================================
        Native Crash Reporting
=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

=================================================================
        Native stacktrace:
=================================================================
        0x5654bbae4ffb - mono :
        0x5654bbae5399 - mono :
        0x5654bba917f4 - mono :
        0x5654bbae4596 - mono :
        0x7faa1c0e7730 - /lib/x86_64-linux-gnu/libpthread.so.0 :
        0x7faa1bc287bb - /lib/x86_64-linux-gnu/libc.so.6 : gsignal
        0x7faa1bc13535 - /lib/x86_64-linux-gnu/libc.so.6 : abort
        0x5654bba54fc5 - mono :
        0x5654bbd321a6 - mono :
        0x5654bbd4ef9b - mono :
        0x5654bbd4f61d - mono : monoeg_assertion_message
        0x5654bbd4f65a - mono :
        0x5654bbae8696 - mono :
        0x5654bbaeb9f3 - mono :
        0x5654bbaebd42 - mono :
        0x5654bba5d1d1 - mono :
        0x5654bba951a6 - mono :
        0x5654bba95ca0 - mono :
        0x41c2a393 - Unknown

=================================================================
        Telemetry Dumper:
=================================================================
Pkilling 0x140361647257344x from 0x140361670309632x
Pkilling 0x140361653499648x from 0x140361670309632x
Pkilling 0x140356133717760x from 0x140361670309632x
Pkilling 0x140356167337728x from 0x140361670309632x
Pkilling 0x140367462565632x from 0x140361670309632x
Pkilling 0x140364941874944x from 0x140361670309632x
Pkilling 0x140367454111488x from 0x140361670309632x
Pkilling 0x140365417764608x from 0x140361670309632x
Pkilling 0x140356114806528x from 0x140361670309632x
Pkilling 0x140361668208384x from 0x140361670309632x
Pkilling 0x140356148426496x from 0x140361670309632x
Pkilling 0x140366789740288x from 0x140361670309632x
Pkilling 0x140356182046464x from 0x140361670309632x
Pkilling 0x140366826649344x from 0x140361670309632x
Pkilling 0x140360332343040x from 0x140361670309632x
Pkilling 0x140364043925248x from 0x140361670309632x
Pkilling 0x140360603907840x from 0x140361670309632x
Pkilling 0x140366819608320x from 0x140361670309632x
Pkilling 0x140356129515264x from 0x140361670309632x
Pkilling 0x140364037625600x from 0x140361670309632x
Pkilling 0x140358710195968x from 0x140361670309632x
Pkilling 0x140361252992768x from 0x140361670309632x
Pkilling 0x140356163135232x from 0x140361670309632x
Pkilling 0x140362943297280x from 0x140361670309632x
Pkilling 0x140366841358080x from 0x140361670309632x
Pkilling 0x140362471438080x from 0x140361670309632x
Pkilling 0x140364937672448x from 0x140361670309632x
Pkilling 0x140358260352768x from 0x140361670309632x
Pkilling 0x140359601485568x from 0x140361670309632x
Pkilling 0x140359105505024x from 0x140361670309632x
Pkilling 0x140361664005888x from 0x140361670309632x
Pkilling 0x140356144224000x from 0x140361670309632x
Pkilling 0x140361639917312x from 0x140361670309632x
Pkilling 0x140356177843968x from 0x140361670309632x
Pkilling 0x140360536762112x from 0x140361670309632x
Pkilling 0x140360599705344x from 0x140361670309632x
Pkilling 0x140361889478400x from 0x140361670309632x
Pkilling 0x140356125312768x from 0x140361670309632x
Pkilling 0x140356041570048x from 0x140361670309632x
Pkilling 0x140361678714624x from 0x140361670309632x
Pkilling 0x140356158932736x from 0x140361670309632x
Pkilling 0x140360979310336x from 0x140361670309632x
Pkilling 0x140366837155584x from 0x140361670309632x
Pkilling 0x140360551470848x from 0x140361670309632x
Pkilling 0x140364933469952x from 0x140361670309632x
Pkilling 0x140365379028736x from 0x140361670309632x
Pkilling 0x140358192199424x from 0x140361670309632x
Pkilling 0x140360437200640x from 0x140361670309632x
Pkilling 0x140362484016896x from 0x140361670309632x
Pkilling 0x140361499092736x from 0x140361670309632x
Pkilling 0x140357866088192x from 0x140361670309632x
Pkilling 0x140361659803392x from 0x140361670309632x
Pkilling 0x140356140021504x from 0x140361670309632x
Pkilling 0x140361635714816x from 0x140361670309632x
Pkilling 0x140364053702400x from 0x140361670309632x
Pkilling 0x140360313468672x from 0x140361670309632x
Pkilling 0x140356173641472x from 0x140361670309632x
Pkilling 0x140357360678656x from 0x140361670309632x
Pkilling 0x140363494848256x from 0x140361670309632x
Pkilling 0x140360532559616x from 0x140361670309632x
Pkilling 0x140360807343872x from 0x140361670309632x
Pkilling 0x140360595502848x from 0x140361670309632x
Pkilling 0x140367468869376x from 0x140361670309632x
Pkilling 0x140366846097152x from 0x140361670309632x
Pkilling 0x140361885275904x from 0x140361670309632x
Pkilling 0x140356121110272x from 0x140361670309632x
Pkilling 0x140361674512128x from 0x140361670309632x
Pkilling 0x140366814295808x from 0x140361670309632x
Pkilling 0x140362949588736x from 0x140361670309632x
Pkilling 0x140366832953088x from 0x140361670309632x
Pkilling 0x140363421923072x from 0x140361670309632x
Pkilling 0x140361309615872x from 0x140361670309632x
Pkilling 0x140365392811776x from 0x140361670309632x
Pkilling 0x140362528061184x from 0x140361670309632x
Pkilling 0x140361655600896x from 0x140361670309632x
Pkilling 0x140356135819008x from 0x140361670309632x
Pkilling 0x140361631512320x from 0x140361670309632x
Pkilling 0x140356169438976x from 0x140361670309632x
Pkilling 0x140357356476160x from 0x140361670309632x
Pkilling 0x140367464666880x from 0x140361670309632x
Pkilling 0x140364943976192x from 0x140361670309632x
Pkilling 0x140368585504512x from 0x140361670309632x
Pkilling 0x140362479826688x from 0x140361670309632x
Pkilling 0x140359559542528x from 0x140361670309632x
Pkilling 0x140356116907776x from 0x140361670309632x
Pkilling 0x140366791841536x from 0x140361670309632x
Pkilling 0x140361362044672x from 0x140361670309632x
Pkilling 0x140366828750592x from 0x140361670309632x
Pkilling 0x140360629090048x from 0x140361670309632x
Pkilling 0x140364340000512x from 0x140361670309632x
Pkilling 0x140360543065856x from 0x140361670309632x
Pkilling 0x140362937001728x from 0x140361670309632x
Pkilling 0x140366812178176x from 0x140361670309632x
Pkilling 0x140360606009088x from 0x140361670309632x
Pkilling 0x140358695515904x from 0x140361670309632x
Pkilling 0x140361645156096x from 0x140361670309632x
Pkilling 0x140361651398400x from 0x140361670309632x
Pkilling 0x140356131616512x from 0x140361670309632x
Pkilling 0x140356165236480x from 0x140361670309632x
Pkilling 0x140361313810176x from 0x140361670309632x
Pkilling 0x140363414107904x from 0x140361670309632x
Pkilling 0x140366843459328x from 0x140361670309632x
Pkilling 0x140360587097856x from 0x140361670309632x
Pkilling 0x140367460464384x from 0x140361670309632x
Pkilling 0x140364939773696x from 0x140361670309632x
Pkilling 0x140360557774592x from 0x140361670309632x
Pkilling 0x140358262454016x from 0x140361670309632x
Pkilling 0x140359107606272x from 0x140361670309632x
Pkilling 0x140356112705280x from 0x140361670309632x
Pkilling 0x140361666107136x from 0x140361670309632x
Pkilling 0x140356146325248x from 0x140361670309632x
Pkilling 0x140366824548096x from 0x140361670309632x
Pkilling 0x140358250919680x from 0x140361670309632x
Pkilling 0x140360538863360x from 0x140361670309632x
Pkilling 0x140366810060544x from 0x140361670309632x
Pkilling 0x140356127414016x from 0x140361670309632x
Pkilling 0x140364035524352x from 0x140361670309632x
Pkilling 0x140356161033984x from 0x140361670309632x
Pkilling 0x140360981411584x from 0x140361670309632x
Pkilling 0x140366839256832x from 0x140361670309632x
Pkilling 0x140362469336832x from 0x140361670309632x
Pkilling 0x140360582895360x from 0x140361670309632x
Pkilling 0x140360553572096x from 0x140361670309632x
Pkilling 0x140364935571200x from 0x140361670309632x
Pkilling 0x140358258251520x from 0x140361670309632x
Pkilling 0x140367458346752x from 0x140361670309632x
Pkilling 0x140362486118144x from 0x140361670309632x
Could not exec mono-hang-watchdog, expected on path '/etc/../bin/mono-hang-watchdog' (errno 2)
Entering thread summarizer pause from 0x140361670309632x
Finished thread summarizer pause from 0x140361670309632x.
Failed to create breadcrumb file (null)/crash_hash_0x236a32d885

Waiting for dumping threads to resume

=================================================================
        External Debugger Dump:
=================================================================
mono_gdb_render_native_backtraces not supported on this platform, unable to find gdb or lldb

=================================================================
        Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x7faa1bc287bb):
0x7faa1bc287ab d2 4c 89 ce bf 02 00 00 00 b8 0e 00 00 00 0f 05 .L..............
0x7faa1bc287bb 48 8b 8c 24 08 01 00 00 64 48 33 0c 25 28 00 00 H..$....dH3.%(..
0x7faa1bc287cb 00 44 89 c0 75 1d 48 81 c4 10 01 00 00 5b c3 66 .D..u.H......[.f
0x7faa1bc287db 0f 1f 44 00 00 48 8b 15 89 36 18 00 f7 d8 64 89 ..D..H...6....d.

=================================================================
        Managed Stacktrace:
=================================================================
          at <unknown> <0xffffffff>
          at System.OrdinalIgnoreCaseComparer:GetHashCode <0x00033>
          at System.StringComparer:GetHashCode <0x00079>
          at System.Collections.Hashtable:GetHash <0x0004e>
          at System.Collections.Hashtable:InitHash <0x0004c>
          at System.Collections.Hashtable:Insert <0x000bb>
          at System.Collections.Hashtable:set_Item <0x00043>
          at log4net.Core.LevelMap:Add <0x000b6>
          at log4net.Repository.LoggerRepositorySkeleton:AddBuiltinLevels <0x0003f>
          at log4net.Repository.LoggerRepositorySkeleton:.ctor <0x00187>
          at log4net.Repository.Hierarchy.Hierarchy:.ctor <0x0003f>
          at log4net.Repository.Hierarchy.Hierarchy:.ctor <0x0005b>
          at log4net.Repository.Hierarchy.Hierarchy:.ctor <0x0003f>
          at System.Object:runtime_invoke_void__this__ <0x00091>
          at <unknown> <0xffffffff>
          at System.Reflection.RuntimeConstructorInfo:InternalInvoke <0x000ad>
          at System.Reflection.RuntimeConstructorInfo:InternalInvoke <0x00063>
          at System.RuntimeType:CreateInstanceMono <0x00163>
          at System.RuntimeType:CreateInstanceSlow <0x00063>
          at System.RuntimeType:CreateInstanceDefaultCtor <0x0007b>
          at System.Activator:CreateInstance <0x000cf>
          at System.Activator:CreateInstance <0x00037>
          at System.Activator:CreateInstance <0x0002b>
          at log4net.Core.DefaultRepositorySelector:CreateRepository <0x00727>
          at log4net.Core.DefaultRepositorySelector:CreateRepository <0x00497>
          at log4net.Core.DefaultRepositorySelector:CreateRepository <0x0004f>
          at log4net.Core.DefaultRepositorySelector:GetRepository <0x00053>
          at log4net.Core.LoggerManager:GetLogger <0x00069>
          at log4net.LogManager:GetLogger <0x00033>
          at log4net.LogManager:GetLogger <0x0004b>
          at OpenSim.Region.ScriptEngine.Shared.Api.LSL_Api:.cctor <0x00037>
          at System.Object:runtime_invoke_void <0x00086>
          at <unknown> <0xffffffff>
          at System.Runtime.Remoting.Proxies.RealProxy:InternalGetTransparentProxy <0x00089>
          at System.Runtime.Remoting.Proxies.RealProxy:GetTransparentProxy <0x00142>
          at System.Runtime.Remoting.RemotingServices:GetOrCreateClientIdentity <0x003b0>
          at System.Runtime.Remoting.RemotingServices:GetRemoteObject <0x00043>
          at System.Runtime.Remoting.RemotingServices:GetProxyForRemoteObject <0x0009b>
          at System.Runtime.Remoting.RemotingServices:Unmarshal <0x0016b>
          at System.Runtime.Remoting.RemotingServices:Unmarshal <0x0002f>
          at System.Runtime.Remoting.ObjRef:GetRealObject <0x00043>
          at System.Runtime.Serialization.ObjectManager:ResolveObjectReference <0x0010c>
          at System.Runtime.Serialization.ObjectManager:DoFixups <0x0015f>
          at System.Runtime.Serialization.Formatters.Binary.ObjectReader:Deserialize <0x000f6>
          at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Deserialize <0x001a7>
          at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Deserialize <0x00053>
          at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Deserialize <0x0004b>
          at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Deserialize <0x00043>
          at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter:Deserialize <0x00033>
          at System.Runtime.Remoting.RemotingServices:DeserializeCallData <0x000ef>
          at OpenSim.Region.ScriptEngine.Shared.ScriptBase.ScriptBaseClass:InitApi <0x00107>
          at OpenSim.Region.ScriptEngine.Shared.ScriptBase.ScriptBaseClass:InitApi <0x0018a>
          at OpenSim.Region.ScriptEngine.Shared.Instance.ScriptInstance:Load <0x003c5>
          at OpenSim.Region.ScriptEngine.Shared.Instance.ScriptInstance:Load <0x0010f>
          at OpenSim.Region.ScriptEngine.XEngine.XEngine:DoOnRezScript <0x033bb>
          at OpenSim.Region.ScriptEngine.XEngine.XEngine:DoOnRezScriptQueue <0x00123>
          at Amib.Threading.Internal.WorkItem:ExecuteWorkItem <0x000fe>
          at Amib.Threading.Internal.WorkItem:Execute <0x0004f>
          at Amib.Threading.SmartThreadPool:ExecuteWorkItem <0x000b7>
          at Amib.Threading.SmartThreadPool:ProcessQueuedItems <0x00803>
          at System.Threading.ThreadHelper:ThreadStart_Context <0x000b2>
          at System.Threading.ExecutionContext:RunInternal <0x001ce>
          at System.Threading.ExecutionContext:Run <0x00047>
          at System.Threading.ExecutionContext:Run <0x0006b>
          at System.Threading.ThreadHelper:ThreadStart <0x0004b>
          at System.Object:runtime_invoke_void__this__ <0x00091>
=================================================================
(0037395)
Abaddon (reporter)
2020-12-18 02:23

UPDATE

Switching to YEngine seems to solve both of my issues: the Windows timeouts and the Mono crash.

I now rarely see watchdog timeouts in the console. They still happen, but far less often than before.

Linux also loaded the same regions with the same configuration correctly, without crashing Mono.

Just for the record, while I was doing the testing I also switched from the current OpenSim dev code I am running to a third-party one, and that also solved the watchdog timeout issue on Windows.

Issue History
Date Modified Username Field Change
2020-12-12 09:32 Abaddon New Issue
2020-12-13 00:32 tampa Note Added: 0037358
2020-12-13 01:13 UbitUmarov Note Added: 0037360
2020-12-13 03:52 Abaddon Note Added: 0037361
2020-12-13 03:55 UbitUmarov Note Added: 0037362
2020-12-13 05:04 UbitUmarov Note Added: 0037365
2020-12-13 05:20 UbitUmarov Note Deleted: 0037365
2020-12-14 07:26 Ferd Frederix Note Added: 0037378
2020-12-14 07:34 tampa Note Added: 0037380
2020-12-14 12:11 Abaddon Note Added: 0037385
2020-12-14 12:17 Abaddon Note Edited: 0037385
2020-12-14 16:01 Ferd Frederix Note Added: 0037386
2020-12-14 16:22 Abaddon Note Added: 0037387
2020-12-14 18:58 UbitUmarov Note Added: 0037388
2020-12-15 00:46 Abaddon Note Added: 0037389
2020-12-15 14:15 Abaddon Note Added: 0037391
2020-12-18 02:23 Abaddon Note Added: 0037395

