Mantis Bug Tracker

ID: 0006031
Project: opensim
Category: [REGION] OpenSim Core
View Status: public
Date Submitted: 2012-05-23 12:14
Last Update: 2014-07-29 13:42
Reporter: kenvc
Assigned To: kenvc
Priority: urgent
Severity: major
Reproducibility: random
Status: closed
Resolution: fixed
Platform: Quad core 8 gig ram
OS: Windows 7
OS Version: Windows 7 64 bit
Product Version: master (dev code)
Target Version:
Fixed in Version: master (dev code)
Summary: 0006031: 'System.OutOfMemoryException' - Simulators are consuming much more memory than in the past.
Description: I have been running multiple instances with multiple sims in each instance for over a year now, and over the last several months, I've had to reduce the number of instances and sims to almost half of what I could run a year ago. I am still using the opensim32bit launcher even though my machine is 64-bit because the 64-bit version consumes even more memory.

They seem to be consuming more and more memory, and now I've noticed out of memory errors again with only half as many sims and instances running. I suspect a memory leak, but this is just a guess. I almost always run the latest dev master updated on almost a daily basis.
Steps To Reproduce: Start up sim instance(s) and wait.
Additional Information:
2012-05-23 11:39:06,516 INFO - OpenSim.Region.Framework.Scenes.Scene [SCENE]: Region Night Spot 2 authenticated and authorized incoming child agent Grelor Laval ca3e33fe-645e-41e3-9255-effea29364d6 (circuit code 2073419154)
2012-05-23 11:39:10,008 ERROR - OpenSim.Region.ClientStack.LindenUDP.LLUDPServer [LLUDPSERVER]: OutgoingPacketHandler iteration for Grelor Laval threw an exception: Exception of type 'System.OutOfMemoryException' was thrown.
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal(IPrincipal principal, StackCrawlMark& stackMark)
   at System.Threading.Thread.Start()
   at Amib.Threading.SmartThreadPool.StartThreads(Int32 threadsCount) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 512
   at Amib.Threading.SmartThreadPool.Enqueue(WorkItem workItem, Boolean incrementWorkItems) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 412
   at Amib.Threading.SmartThreadPool.Enqueue(WorkItem workItem) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 389
   at Amib.Threading.SmartThreadPool.QueueWorkItem(WorkItemCallback callback, Object state) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 764
   at OpenSim.Framework.Util.FireAndForget(WaitCallback callback, Object obj) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Framework\Util.cs:line 1665
   at OpenSim.Region.ClientStack.LindenUDP.LLUDPClient.BeginFireQueueEmpty(ThrottleOutPacketTypeFlags categories) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\ClientStack\Linden\UDP\LLUDPClient.cs:line 627
   at OpenSim.Region.ClientStack.LindenUDP.LLUDPClient.DequeueOutgoing() in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\ClientStack\Linden\UDP\LLUDPClient.cs:line 555
   at OpenSim.Region.ClientStack.LindenUDP.LLUDPServer.ClientOutgoingPacketHandler(IClientAPI client) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\ClientStack\Linden\UDP\LLUDPServer.cs:line 1200
2012-05-23 11:39:10,168 ERROR - OpenSim.Region.ClientStack.LindenUDP.LLClientView [LLCLIENTVIEW]: Caught exception while processing OpenMetaverse.Packets.RegionHandshakeReplyPacket for Grelor Laval, Exception of type 'System.OutOfMemoryException' was thrown. at System.Threading.Thread.StartInternal(IPrincipal principal, StackCrawlMark& stackMark)
   at System.Threading.Thread.Start()
   at Amib.Threading.SmartThreadPool.StartThreads(Int32 threadsCount) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 512
   at Amib.Threading.SmartThreadPool.Enqueue(WorkItem workItem, Boolean incrementWorkItems) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 412
   at Amib.Threading.SmartThreadPool.Enqueue(WorkItem workItem) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 389
   at Amib.Threading.SmartThreadPool.QueueWorkItem(WorkItemCallback callback, Object state) in C:\Users\Ken\Desktop\Opensim Git\ThirdParty\SmartThreadPool\SmartThreadPool.cs:line 764
   at OpenSim.Framework.Util.FireAndForget(WaitCallback callback, Object obj) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Framework\Util.cs:line 1665
   at OpenSim.Region.ClientStack.LindenUDP.LLClientView.SendLayerData(Single[] map) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\ClientStack\Linden\UDP\LLClientView.cs:line 1084
   at OpenSim.Region.Framework.Scenes.SceneBase.SendLayerData(IClientAPI RemoteClient) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\Framework\Scenes\SceneBase.cs:line 175
   at OpenSim.Region.ClientStack.LindenUDP.LLClientView.HandlerRegionHandshakeReply(IClientAPI sender, Packet Pack) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\ClientStack\Linden\UDP\LLClientView.cs:line 5980
   at OpenSim.Region.ClientStack.LindenUDP.LLClientView.ProcessSpecificPacketAsync(Object state) in C:\Users\Ken\Desktop\Opensim Git\OpenSim\Region\ClientStack\Linden\UDP\LLClientView.cs:line 681
2012-05-23 11:41:19,618 ERROR - OpenSim.Region.CoreModules.Scripting.LoadImageURL.LoadImageURLModule [LOADIMAGEURLMODULE]: OpenJpeg Conversion Failed. Empty byte data returned!
2012-05-23 11:41:55,604 ERROR - OpenSim.Region.CoreModules.Scripting.LoadImageURL.LoadImageURLModule [LOADIMAGEURLMODULE]: OpenJpeg Conversion Failed. Empty byte data returned!
2012-05-23 11:42:01,893 ERROR - OpenSim.Region.CoreModules.Scripting.LoadImageURL.LoadImageURLModule [LOADIMAGEURLMODULE]: OpenJpeg Conversion Failed. Empty byte data returned!
Tags: No tags attached.
Git Revision or version number: 19094
Run Mode: Grid (Multiple Regions per Sim)
Physics Engine: ODE
Environment: Unknown
Mono Version: None
Viewer: N/A
Attached Files: Opensim.log (14,920 bytes) 2012-07-23 12:15

- Relationships
related to 0006030 (new): Mono resource allocation grows rapidly while running opensim until CPU cannot function
related to 0003353 (feedback, Jamenai): 100% CPU usage from the mono process
related to 0007002 (closed, melanie): Use signaling instead of CPU-blocking sleep for LSL events (listen, timer, sensor and dataserver).

-  Notes
(0021495)
usalabs (reporter)
2012-05-23 12:19

It looks like it happens when you login to a region.

Does it use up memory when standing idle?

In windows open task manager and click on 'Performance' tab
(0021498)
kenvc (reporter)
2012-05-23 12:21

In general, significantly more memory is needed to run the simulators, even when they are sitting idle, than a year ago, but the memory does for sure spike higher when someone logs in or teleports in.
(0021500)
Pixel Tomsen (manager)
2012-05-23 12:28
edited on: 2012-05-23 12:31

Please try OpenSim.exe rather than the 32-bit launcher; OpenSim supports launching the 64-bit libraries/environment on Windows ;-)
...64-bit uses more memory on Windows, but the end result is no exceptions (for me) ;-)

(0021518)
justincc (administrator)
2012-05-23 16:51

Also see http://opensimulator.org/mantis/view.php?id=6030
(0021522)
kenvc (reporter)
2012-05-23 20:19

Justin, I don't use Warp3D at all. I tried it before and it consumed so much more memory it was really unusable with as many sims as I run.
(0021528)
aiaustin (developer)
2012-05-24 14:31

I also noticed an essentially idling instance of OpenSim.exe got an out of memory error in a recent update. I do use Warp3D map tile rendering... it's the only one that looks decent.
(0021529)
WhiteStar (reporter)
2012-05-25 16:26

Confirmed as well on 0db60ee-r/19080.

A memory leak is causing a bloat situation. At startup, my SA uses a total of 712 MB, and it slowly climbs to 1024 MB over 4 days of running. Minimal traffic, as only 2 of us use the SA. 7 regions, approx. 1400 scripts, roughly 60K prims.

Prior to this version (4032455-r/18915), at startup it would be running with 524 MB RAM.
(0021530)
justincc (administrator)
2012-05-25 16:51

Could you narrow it down further?
(0021537)
usalabs (reporter)
2012-05-25 18:17
edited on: 2012-05-25 18:51

@ kenvc

Could you paste the outputs of 'show stats' on each region console, especially the bottom part of it showing object memory allocation and processing memory, during idle and when someone TP's/logs in?

After starting a region, type show stats in the console, then have someone TP or log in, then type show stats again to compare the outputs from each state.

To clarify my observations.

One of my 9 regions has 2K+ prims and 345 scripts, and at idle memory usage is:-

MEMORY STATISTICS
Allocated to OpenSim objects: 61 MB
Process memory : 100 MB

and when I login to that region and after 2 hud scripts have been loaded, the memory usage is:-

Allocated to OpenSim objects: 81 MB
Process memory : 132 MB

I only see a small increase, but then again, I'm using Linux and mono 2.10.6 to run opensim, which, after what I've been seeing on this report, suggests it only affects Windows. IMO Windows has a very bad memory management system.

What appears to happen is that when someone logs out, the memory allocated during the login/TP is not given back, but instead usage increases with each login/TP. The only way to return the previously allocated memory is to shut down the region(s) and reboot, or, if using Linux, to run a small script to clear the system cache.

EG:-
Let's say, for argument's sake, someone logs in with scripted attachments and a loaded HUD, which takes maybe 100K of memory. When that user logs out, that 100K is not returned to the system, but instead is cached. When that user logs in again, the memory allocation isn't taken from the cache, but again taken from system memory, then cached again at logout. This cycle continues until the system RAM is about 80% used, and then the swap file (page file in Windows) is activated and used. That is what restricts running multiple regions on 4GB memory or less.

IMO when a user logs in, opensim should track the memory used during the login, and when that user logs out, that memory should be returned to the system and not cached, only to build up and build up until some systems throw an 'Insufficient Memory' error.

(0021557)
kenvc (reporter)
2012-05-25 22:28

I will try to do additional tests as suggested later as time allows, but I did find some old info from about a year ago.

The instances running sims that have had no additional prims added during the last year are consuming almost double the memory they consumed after startup about a year ago with no one logged in. These are private regions and I have added no additional prims during the last year. I did not consider regions where I did add prims this last year because that would not be comparing apples to apples.

This seems to indicate there is a significantly larger memory requirement even without AVs logging in compared to about a year ago.
(0021564)
kenvc (reporter)
2012-05-26 17:34
edited on: 2012-05-27 21:05

I've been running only half as many sims and instances as I was a year ago and I'm still getting out of memory exceptions, so I'm now forced to reduce the number of sims and instances even further in order to keep them running.

It appears WhiteStar found that this issue seemed to become noticeable for him around rev 18915. I started seeing this problem to some extent prior to that, but around that time is when it seemed to become much more noticeable. Most likely it was more noticeable to me because I have always pushed my systems to the limit for testing purposes in order to help expose bugs.

(0021566)
Gwyneth Llewelyn (reporter)
2012-05-28 03:44

I'm running 0.7.3.1 on Ubuntu 10.4, compiled with Mono 2.10.8.1, on a low-end, old HP desktop, which has only a 2-core CPU and 2 GBytes of RAM. 0.7.2 worked flawlessly without memory problems, and CPU usage idled around 30%, which is more than acceptable and consistent with what I've seen in the past.

Starting with 0.7.3.1, however, memory consumption for the exact same number of sims and prims (just an upgrade) pretty much doubled, forcing the system to start aggressively using swap space, and, consequently, running at 150-300% CPU load (!), with an overall performance decrease. The memory statistics output changed from 0.7.2 to 0.7.3.1, so I'm not sure how the two sets of figures relate:

On 0.7.2 I had:

MEMORY STATISTICS
Allocated to OpenSim : 622 MB

On 0.7.3.1, with exactly the same database etc., I get:

MEMORY STATISTICS
Allocated to OpenSim objects: 625 MB
Process memory : 1056 MB

This, however, corresponds to about 2.4 GB of virtual memory, thus forcing the kernel to swap OpenSim to disk with the consequent overall slowness.

What I'm going to do is to split my grid into several instances, since this allegedly will fix things by spreading memory consumption across the multiple instances. A pity, however, that this is not a very easy thing to do (e.g. requires separate directories for each instance, meaning that upgrading them all will be a pain in the future).

I'll be happy to report any significant performance increases when I finish that configuration :)
(0021568)
usalabs (reporter)
2012-05-28 05:02
edited on: 2012-05-28 05:11

I recently got hold of a Celeron 2.6 GHz single core system with 2GB RAM for testing, on which I installed OpenSuSe 11.4 and mono 2.10.8, running 5 instances: 4 single regions, and the 5th a 2x2 megaregion.

Region 1 is configured for max 20K prims and contains 17,378 prims and 439 scripts using 124MB memory.

Region 2 is a default max prims (15K) containing 6,139 prims and 36 scripts using 60MB memory.

Region 3 default 15K max prims containing 38 prims and 25 scripts using 29MB memory

Region 4 (2x2 megaregion) default 15K max prims, containing 99 prims and 33 scripts using 21MB memory

Region 5 default max prims (15K) containing 233 prims and 156 scripts using 246MB memory.

Total memory available from 2GB = 1.132GB, swap file not being used, all regions at idle.

And all that with a single core P4 with 2GB RAM. So, basically, I really haven't noticed any excessive memory usage, except the caching of previously allocated memory, which is not being returned to the system. To get a true reading of available system memory, I had to clear the system cache before executing top.

Oh, and all instances are running Version: OpenSim 0.7.4 Dev OSgrid 0.7.4 (Dev) 4186fa1: 2012-05-07 (interface version 7)

(0021569)
melanie (administrator)
2012-05-28 05:09

How did you cause mono to release the held memory back to the system?
(0021570)
usalabs (reporter)
2012-05-28 05:16
edited on: 2012-05-28 05:35

This only works in Linux.

Create a bash script called clear-cache.sh using this code and place it in the root directory as it can ONLY be run as root using sudo:-

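#!/bin/bash

# Flush dirty pages to disk first so the caches can actually be dropped
sync
# 3 = free the page cache plus dentries and inodes (requires root)
echo 3 > /proc/sys/vm/drop_caches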

Make the script file executable:- chmod +x clear-cache.sh

It can be run manually using sudo /root/clear-cache.sh, or added to root's crontab to run every 3 days.

When it's run as a cron job and an av is inworld when the cache is cleared, there is no noticeable degradation in performance.

Only unused memory is returned to the system, which means any av's logged in will not be affected. As av's come and go and the script is executed, all the memory previously allocated to those av's is returned to the system.
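
For anyone trying the script above, here is a minimal sketch of how its effect could be measured, by reading the kernel's own counters from /proc/meminfo before and after a run. The helper name meminfo_kb is made up for illustration. Note that, per the kernel documentation, drop_caches only frees the page cache, dentries and inodes; it does not shrink memory held by a running process such as mono, so the per-process figures in top or show stats should not be expected to change.

```bash
#!/bin/bash
# meminfo_kb is a hypothetical helper, not part of OpenSim or any standard tool.
# It prints the value (in kB) of one field from a /proc/meminfo-style file.
meminfo_kb() {
    local key="$1" file="${2:-/proc/meminfo}"
    awk -v k="$key:" '$1 == k { print $2 }' "$file"
}

# Show free memory and page cache size; run once before and once after
# clear-cache.sh to see what was released.
if [ -r /proc/meminfo ]; then
    echo "MemFree: $(meminfo_kb MemFree) kB, Cached: $(meminfo_kb Cached) kB"
fi
```

If the script is doing its job, Cached should drop and MemFree should rise by a similar amount between the two runs.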

(0021571)
usalabs (reporter)
2012-05-28 05:54
edited on: 2012-05-28 05:55

Another step, for Linux users ONLY: if lots of av's are logged in on many regions and the Linux swap file kicks in, a new bash script can be created using the code below:-

#!/bin/bash

swapoff -a   # Disable all swap; swapped-out pages are moved back into RAM
sync
echo 3 > /proc/sys/vm/drop_caches
swapon -a    # Re-enable swap, now empty

This can be used when all regions are av free during a routine maintenance.
It is NOT advised to use this script when av's are logged in and the swap file is actually being used, it could crash the simulator(s).

I tested it on my server when no one was logged in (checked by issuing show users at each instance console), then ran the script; it did not affect any currently running instances.

The previous script file (clear-cache.sh) is safe to run with or without anyone logged in.

(0021575)
Gwyneth Llewelyn (reporter)
2012-05-28 11:06

Heh usalabs... that sounds rather dramatic. It certainly doesn't hurt, though. I've tried it, and opensim most certainly didn't crash. The server didn't improve its performance, though: CPU usage is still very high (looks like even higher than before; load went from 1.8 to 2.4!). Memory consumption as reported by OpenSim dropped from 625 MB to 613, then back to 639, up to 650 and increasing... however, top still just reports 2440 MBytes of virtual memory, same as before.

This is still at 0.7.3.1, of course; I'll be only switching to the "latest and greatest" nightly build if there is a good reason to presume that 0.7.4 actually fixed something!

In the meantime I have to assume that it's one of the myriad of settings that is causing the whole load. The hint about map tiles was intriguing. But there are many more things to check for...
(0021576)
usalabs (reporter)
2012-05-28 11:50
edited on: 2012-05-28 12:07

@ Gwyneth

When you run the server, are you using an X11 window system with a desktop such as Gnome, KDE, etc.? If so, then it's expected the CPU load will be higher when running a desktop alongside opensim, as Xorg will use a lot of CPU resources.

I run my server using OpenSuSe 11.4 on runlevel 3, no X11, and access only through local SSH via PuTTY, totally CLI, also, if you are running multiple instances, and run top, have a look at the CPU usage along side each mono entry.

What version of mono are you using? Did you compile it from tarball source, or is it from the distro repos?

I ask this because I wait until the stable release for OpenSuSe appears, then download the rpm. I've always had problems after compiling from source.

One possibility is that the problem is not with opensim at all, and mono is the culprit.

(0021587)
Gwyneth Llewelyn (reporter)
2012-05-28 17:51

@usalabs — no, no X11, no GUI, nothing that wastes any memory :) It's a straight Ubuntu *server* installation which doesn't install any of that. So, beyond MySQL, the machine only runs Apache, ssh, and nothing else. When filtering top for CPU %, Apache and MySQL don't even appear on the top 40 processes... also, I notice that the CPUs are mostly idle (80%) which is actually suspicious for an overloaded machine. I've stopped Ubuntu's constant cleaning up of PHP5 sessions (happened every half hour) to see if that wasn't affecting the overall performance, but the point is, this was already running under 0.7.2, which had absolutely no memory/CPU overload problems.

There aren't any "obvious" errors on either the system logs or the OpenSim logs. Unlike @kenvc, I don't get any "System.OutOfMemoryException" errors. The server itself also seems to be happy. It just has an abnormal load.

As said, OpenSim 0.7.3.1 was compiled from the source tarballs with Mono 2.10.8.1 under Ubuntu 10.4.

I've patiently spread the load of the 22 regions among several instances (all running on the same server). Now, individual memory consumption obviously went down: no single process consumes more than 2 GB of virtual memory (in fact, the biggest one is at around 1700 MBytes).

Unfortunately, like @kenvc reported, this seems to make matters even worse :-( To be more specific: overall CPU load went to twice as before (e.g. now 4-5 load is usual, when an avatar logs in, and it's consuming even more swap space than with the single-instance configuration). The advantage is possibly that now avatars logging to different parts of the grid will just overload the simulator processes they're in and not affect the others, but that's little comfort...

Note that although this grid is mostly for internal development (even though it's HG-enabled), it runs just a handful of scripts, but it has *lots* of prims. Overall, I think that there are more than 100K prims spread over the 22 regions, some of which have 15K prims. This obviously will consume memory. The issue here is: why the sudden performance change from 0.7.2 to 0.7.3.1?

At this stage, there are several possible answers. As said, I've recompiled 0.7.3.1 with Mono 2.10.8.1; 0.7.2 was compiled with Mono 2.6.4, I think. That's one possible issue. Then there are slightly different configuration changes from 0.7.2 to 0.7.3.1: while I usually don't test new features which I won't be using, there might be new defaults which I haven't changed. A more thorough approach would be to turn off each feature, one by one, until 0.7.3.1 behaves just like 0.7.2 :( The difficulty of this approach is that it's very time-consuming...
(0021591)
WhiteStar (reporter)
2012-05-29 10:18

With regards to Warp3D being a contributor to the issue, I have performed some extra tests and cannot find a relation to the mem bloating currently happening. The mem bloat occurs regardless of whether Warp3D or standard map tiles are being used.

For more info see http://opensimulator.org/mantis/view.php?id=6034

These two issues should likely be related.

I have not seen any other possible contributor to the mem bloat.

Platform testing against:
Win Vista/32bit with Net 3.51 updated to full Net 4.0
Source compiled on that system with Net 3.5\msbuild

Side note: the Windows GC process does, and always has, worked properly. Prior to Mono 2.10, Mono had several issues with GC (garbage collection), which is well known to Mono devs and OpenSim devs.

@ Justin: If there is anything specific you would like me to test or try to further narrow it down, just ask and I'll do my best. Of course the current set of tests I can do are only on 32bit Windows as my other systems are busy with other things for the time being.
(0021593)
Gwyneth Llewelyn (reporter)
2012-05-29 12:13

Testing with Windows should be more than fine, @WhiteStar. After all, it's clear that the OS doesn't seem to make a difference in this case (although usalabs reports few or no problems with OpenSuSe...). But the Mono version might make a difference!

I have this feeling that it might just be an option that is now "default" on recent versions of OpenSim which did not exist (or was turned off) on 0.7.2 — or was rewritten from scratch!

In the meantime, I did a few more tests. Since the CPUs themselves are idle around 80% of the time BUT top reports a very high load, this *could* be related to excessive disk activity. That's why my first suspicion was that swap space was getting scarce. Since I've partitioned the grid among (six) OpenSim processes, each fits neatly inside RAM and doesn't need any swap space, BUT the load didn't decrease (in fact, it went up).

So I've been running iotop, which should give a list of processes that are aggressively writing to disk or at least waiting for the disk to get data. To my utter surprise, the disk is pretty much unused (which is actually consistent with my visual observation of the PC: the disk LED doesn't blink and I don't hear the disk spinning). So I'm at a loss to understand, from the perspective of the operating system, where exactly the CPU is spending so much time.

I've also tried to change MONO_THREADS_PER_CPU to a way higher number, but since Mono now uses the new ThreadPool mechanism, it shouldn't make any difference. And, indeed, it doesn't :(
(0021594)
usalabs (reporter)
2012-05-29 13:05
edited on: 2012-05-29 13:15

I recently upgraded opensim on the Celeron single core testing machine, to Version: OpenSim 0.7.4 Dev OSgrid 0.7.4 (Dev) 0db60ee: 2012-05-20 (interface version 7) and it's now been running for 15 and a half hours since initial starting of upgrade.

Now, this time, in top, I noticed one of the mono processes using 76% cpu, so, matching its PID with one of the screen processes, I found it's my 2x2 megaregion, with only 99 prims and 33 scripts. All other mono processes are using between 5.5 and 6.5%, and among those, one of the regions has 17,378 prims and 439 scripts and only uses 5.5% cpu.

All scripts on the megaregion are accounted for, and the megaregion stats:-

MEMORY STATISTICS
Allocated to OpenSim objects: 39 MB
Process memory : 95 MB

and still it uses 76% cpu,

If that isn't confusing I don't know what is.

I don't notice any excessive memory consumption, it's the cpu usage that's worrying me now.

(0021595)
usalabs (reporter)
2012-05-30 03:32
edited on: 2012-05-30 03:34

I can definitely confirm it is mega regions that increase CPU usage. I shut down all regions and created a new empty 2x2 mega region, so that the only mono instance running was the mega region. It ran for 4 hours; when I checked top 3 hours ago, that single mono instance was using 74% CPU. I then tested again by shutting down the mega region, disabling the mega region setting in OpenSim.ini (CombineContiguousRegions = false), and restarting opensim. Now, 3 hours later, the CPU usage is showing 3.5%, which does confirm that for some reason mega regions increase CPU usage dramatically.
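
To double-check which value an instance will actually pick up before repeating this test, a small sketch like the following prints the active (uncommented) setting from an ini file. The helper name ini_value is an assumption for illustration only. It ignores section headers, so it assumes the key is set in only one place in the file.

```bash
#!/bin/bash
# ini_value is a hypothetical helper, not an OpenSim tool.
# It prints the value of the last uncommented "key = value" line in a file.
ini_value() {
    local key="$1" file="$2"
    awk -F'=' -v k="$key" '
        /^[ \t]*[;#]/ { next }                      # skip comment lines
        { name = $1; gsub(/[ \t]/, "", name) }      # key with whitespace removed
        name == k {
            val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", val)        # trim the value
            found = val
        }
        END { if (found != "") print found }
    ' "$file"
}

# Example: report the mega region setting if OpenSim.ini is in the current directory.
if [ -r OpenSim.ini ]; then
    echo "CombineContiguousRegions = $(ini_value CombineContiguousRegions OpenSim.ini)"
fi
```

An empty result means the key is commented out or absent, in which case whatever default the running OpenSim version ships with applies.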

Times are based on MST.

(0021596)
Gwyneth Llewelyn (reporter)
2012-05-30 06:22

Hmm. I have no mega regions, so there is something else happening here.

Looking at top for extended periods of time, what I notice is that, on a multiple instance setup, individual Mono processes aren't misbehaving that badly: their CPU percentages, individually taken, are below 12% (for the instance with the most prims) or even below 10% (for the remaining instances, which have fewer prims). This would not be very bad in itself.

However, during all that time, the (2-core) CPU's load average is always above 3 and spikes to 4. This is clearly related to Mono/OpenSim: no other process is demanding the CPU(s) attention. If I kill OpenSim, the remaining processes will basically contribute zero load to the CPU. A single OpenSim process, however, will immediately push the load average close to or above 2, and more OpenSim processes will only make things worse.

What does this mean? Load average shows how many processes are waiting on the queue (to be read against the number of CPUs). With one CPU, a load average of 1.0 means that this CPU is able to fully process all requests on the queue optimally: no process is waiting for CPU, and no CPU cycles are wasted (i.e. idle). A load average of 2.0 (on a single-core system) would mean that half the processes would be waiting for a chance to run on the CPU :-( A load average of 0.33 would mean that there is so little load on the system that 2/3 of the time the CPU is basically doing nothing but waiting for more processes to run (this was the kind of load I had under 0.7.2).

The CPU percentage shown by top is a measure of how often a single process is loaded by the kernel to run on the CPU, averaged over a period of time. So 350% on a 4-core system means that, for a period of time, a process would be taking over three full CPUs and half the time of the fourth CPU — i.e. it has so many threads to run that it pretty much swamps the 4 cores with requests. CPU percentage doesn't really take into account the waiting queues (while load average does): it's an after-the-fact statistic of what happened in a certain frame of time. But it certainly shows which processes have been using the CPU(s) recently.

A thorough article on this subject: http://www.linuxjournal.com/article/9001
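
As a rough illustration of the reading above, the sketch below divides the 1-minute load average by the core count. per_core_load is a made-up helper name. As far as I know, the numbers in /proc/loadavg on Linux are not themselves divided by the number of cores, so a load of 3 to 4 on a 2-core machine works out to 1.5 to 2 per core.

```bash
#!/bin/bash
# per_core_load is a hypothetical helper, not part of top or any standard tool.
# It divides a load average by a core count and prints two decimals.
per_core_load() {
    local load="$1" cores="$2"
    # awk does the floating-point division that bash arithmetic lacks;
    # LC_ALL=C keeps the decimal separator a dot.
    LC_ALL=C awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# On Linux the 1-minute load average is the first field of /proc/loadavg,
# and nproc reports the number of available cores.
if [ -r /proc/loadavg ]; then
    read -r one_min _ < /proc/loadavg
    echo "1-min load: $one_min, per core: $(per_core_load "$one_min" "$(nproc)")"
fi
```

A per-core figure persistently above 1.00 suggests more runnable (or, on Linux, disk-blocked) tasks than cores, even when each individual process shows a low CPU percentage in top.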

I had noticed that one of the instances was consuming around 25% of all available memory (the others consume about 10%), so I split the regions on that instance across another instance. This did decrease memory consumption slightly, but made no difference on the load average.

So... what makes Mono or OpenSim flag all CPU queues for running, but, once being run, actually consumes little CPU, adequate (not overwhelming) memory, and pretty much doesn't touch the disk? This is a tricky question! It is as if Mono or OpenSim are telling the CPU that they have an urgent need to run, but, once selected by the task scheduler to actually run, they find that they don't have the resources they need, enter an idle state and are swapped out for another process. This happens continuously and the scheduler is confused: why are there so many processes actually using so few resources constantly "demanding" the scheduler's attention?

One thing that comes to mind is that the system is starved for *one* resource, which *all* Mono processes require, and, once being selected by the scheduler to run, the process finds out that resource is not available and gets swapped out for idleness. What resource might that be? Since I'm tracking memory, virtual memory, swap space, and CPU, and all of those seem to be fine (and no errors on the logs either!), it can't be any of the above. I suspected that perhaps the Mono new SmartThreadPool was being unable to handle all requests and increased the default from 15 to 25 — no difference.

So... stumped. Unless the hard disk was failing (but without giving any errors!), which would account for a lot of disk I/O requests being basically ignored (which would increase the load average as all those processes would be waiting for a faulty disk). But usually you get *some* feedback (on the logs) due to a faulty disk...

Finally, I remembered that I was behind NAT, but, because I wish to allow my grid to be HG-accessible (as well as accessible for some colleagues who are outside the LAN), I have all configurations with the actual DNS name (e.g. opensim.betatechnologies.info). This means that OpenSim has to do quite a lot of requests to external IP addresses and open connections to itself that go twice through the router! I remember that very good routers (from Cisco, for example) will notice the "twice through" path, and start resolving addresses to the *internal* LAN IP address and not the *external* one. But possibly my old, low-end Linksys (even though Linksys is now a Cisco brand) is not doing that.

And there is really a LOT of traffic between the OpenSim instances, the ROBUST instance, and MySQL, especially when an avatar pops in.

So, I thought, let's fix that. I just need an entry on /etc/hosts and place my LAN address pointing to opensim.betatechnologies.info. I've restarted the grid and... the load average started to fall. Dramatically so!

For good measure, I just added a few more configuration tweaks on the whole network subsystem: http://www.cyberciti.biz/faq/linux-tcp-tuning/

I felt like kicking myself, this should have been obvious to me... when CPU, memory and disk are all fine, the culprit is most often the network! I guess I'm getting old and out of touch with system administration :-(

So, well, I'm not yet at 0.7.2 load levels. Not yet! But at least load average fell WAY below 2 (meaning that on my two-core system none of the CPUs are at full load), even when avatars are logged in. I don't have a large safety margin to deal with spikes, but at least all Mono instances are not starved for resources (at least not *network* resources!) and sometimes a few of the Mono instances don't even show up on top (my screen only lists the first 44 processes or so). This is very encouraging!

I wonder if some of you could replicate this and see if it works.

To recap:

- if you're using an external DNS name for your grid and use DynDNS or something similar to redirect it through your NAT-enabled router/firewall, see if adding the LAN IP address pointing to the DNS name on /etc/hosts (under Windows XP and above it's under %SystemRoot%\system32\drivers\etc\hosts) makes a difference
- tweak sysctl to improve network performance. See the above article for changes under Linux. Mac OS X uses sysctl too (see http://simplestation.com/locomotion/speed-up-mac-os-x-leopard/). Windows XP users have the sysctl settings implemented in the registry (http://www.speedguide.net/articles/windows-2kxp-registry-tweaks-157) and obviously can benefit from larger buffers as well
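
For the first item, a quick way to confirm which address a name maps to in a hosts file is sketched below. hosts_ip_for is a hypothetical helper, and the 192.168.1.10 address in the usage note is a placeholder LAN address, not one taken from this thread. On Linux, getent hosts followed by the name shows what the resolver as a whole returns.

```bash
#!/bin/bash
# hosts_ip_for is a hypothetical helper: print the address that a
# hosts-format file assigns to a given name (first match wins).
hosts_ip_for() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v n="$name" '
        { sub(/#.*/, "") }                      # strip comments
        { for (i = 2; i <= NF; i++)             # field 1 is the address,
              if ($i == n) { print $1; exit } } # the rest are names
    ' "$file"
}

# Example: check the grid name mentioned in this thread.
if [ -r /etc/hosts ]; then
    hosts_ip_for opensim.betatechnologies.info
fi
```

With an entry such as "192.168.1.10 opensim.betatechnologies.info" in the file, the call prints 192.168.1.10; with no matching entry it prints nothing, meaning lookups fall through to DNS.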

Now I need to do some tests external to the LAN to see if everything is working properly; if so, from my point of view, this issue is closed :-)
(0021597)
usalabs (reporter)
2012-05-30 07:08

@ Gwyneth

The original report of this incident by kenvc said he was running multiple instances with multiple regions, and as I don't see any reference to ROBUST in his report, you may have found a new issue concerning ROBUST.

I run my regions as a type of 'branch off to OSGrid' and I don't run ROBUST, but I'm applying those tweaks to my Linux server, and if there's a significant difference from before, the Windows tweaks may work for kenvc's issue.

FYI, for Windows users there is a small program that can set optimal TCP window sizes for you. All that has to be done is move the slider to your ISP's advertised speed, then hit optimize, and it'll change all the registry settings for you. It's called 'TCP Optimizer' and can be downloaded at SpeedGuide.net: http://www.speedguide.net/downloads.php [^]
(0021599)
usalabs (reporter)
2012-05-30 14:47

Update:

The Linux TCP tweaks did nothing. It is definitely mega regions that really spike CPU usage, to as much as 76%. So, for now, I have shut down my mega region, pending the outcome of further testing.
(0021601)
Gwyneth Llewelyn (reporter)
2012-05-31 02:44

Update:

Bad news: changing /etc/hosts is a BAD idea if people log in outside the LAN, because in some cases they will get the internal LAN IP address instead of the DNS address and thus fail to connect :-( So there goes my frustrated attempt at eliminating networking issues... back to the drawing board.

The Linux TCP tweaks *did* make a small difference, but clearly it's not constantly below 2 as it was with the /etc/hosts change :-( As @usalabs pointed out, this MIGHT be related to some increased networking activity between the regions and ROBUST which wasn't present in 0.7.2 but started popping up in 0.7.3.1 and beyond.

In my case I have no mega regions :-( @usalabs, are you able to launch a ROBUST instance just to see how that impacts your performance?

I don't see that much activity between the OpenSim instances and ROBUST logged to the logs (they're at DEBUG level).

Since I've split things between instances, and increased the ThreadPool limit to 25, memory consumption on each instance as reported by the console has diminished drastically: Allocated to OpenSim objects is now around 200 MB - 250 MB for instances with over 15K prims, and Process memory some 100 MB above that. An "empty instance" (two regions without a single prim) reports 23 and 74 MB respectively. I've finally figured out that Process Memory corresponds nicely to the non-swapped physical memory used ("RES" or "resident size" on 'top'); in my case, the total for all instances + ROBUST is around 1.6 GBytes. Virtual memory for each instance (including ROBUST) is way higher, at 1-1.4 GBytes (!), pretty much the same as MySQL. This tends to show that splitting sims among several instances might not be such a good strategy at all (even though there is a big advantage in allowing all instances with zero avatars and no script activity to be scheduled out and remain, theoretically, idle, while keeping CPU + memory just for the "active" regions, i.e. where avatars are actually doing something).
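
To compare the resident and virtual figures per instance without reading them off 'top', something like the following sketch could be used on Linux. It reads the `VmRSS` and `VmSize` fields from `/proc/<pid>/status`, which are standard kernel fields; the function names `parse_status` and `memory_of` are made up for illustration.

```python
import re


def parse_status(text: str) -> dict:
    """Extract VmRSS and VmSize (both in kB) from /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"(VmRSS|VmSize):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def memory_of(pid: int) -> dict:
    """Read resident (VmRSS) and virtual (VmSize) size of a process. Linux only."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status(handle.read())
```

Calling `memory_of(pid)` for each Mono process and summing the `VmRSS` values gives the kind of total quoted above; summing `VmSize` instead shows how far virtual memory exceeds it.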

I'm now collecting some data from sar to see if I can pinpoint what exactly is going on. So far, no luck. There are a lot of context switches and the OpenSim threads are constantly waiting on each other for futexes, but that's not unusual for a multi-tasking, multi-threading environment. Except for the high load average (and consequent poor overall performance), the rest seems "normal" to me...
(0021602)
usalabs (reporter)
2012-05-31 06:16
edited on: 2012-05-31 06:21

@ Gwyneth

After mountain climbing in my large closet, I pulled down an AMD Athlon X2 X64 system with 4GB RAM. I installed a 32-bit OS (OpenSuSe 11.04) and mono (complete) 2.10.8, then proceeded to compile opensim 0.7.3.1 and configured ROBUST to have its own DB. Then I set up 4 instances, each having its own DB and each running 4 regions (no mega regions), and loaded a few OARs into 9 of the 16 regions.

Now, here's the result: with ROBUST, MySQL, mono and opensim all running on the same system, the CPU usage hit the roof and locked at 200% (100% per core), while memory consumption was between 10.5 and 11.2%. So I did a bit more testing and pulled down (from the closet) an older P4 1GHz single-core system with 1GB RAM, on which I installed the same OS, but this time I set it up for MySQL.

Then I re-configured ROBUST and opensim to use the other PC as the DB server, which did bring the load down a lot.

IMO, trying to run ROBUST, MySQL, mono and opensim all on the same system is like trying to get a computer to run Need for Speed, Diablo and SL while telling it to work out pi to the millionth decimal place, all at the same time, or even trying to pull a Boeing 747 with a Mini Coupe.

By transferring some of the load to another system, the CPU usage fell quite a lot, from the stuck 200% to between 10 and 13%

I may even test with ROBUST on yet another system, so that, e.g., server1 = ROBUST, server2 = regions, server3 = DB. That may reduce the load even more, but there may be a problem with a network bottleneck. We'll see after I've completed further testing.

(0021603)
kenvc (reporter)
2012-05-31 11:20
edited on: 2012-05-31 11:24

The increased memory usage on my system is not related to mega regions because that computer has no mega regions running on it. CPU usage is not really an issue for me although it might be worse too. It's the significant increase in memory required over the last year or so that is eating me up.

The best way to isolate this might be to do a git bisect, but it would probably be over a large time period, and database changes and config changes that have been made during that time period could easily cause issues when reverting to a version that was very old.

(0021604)
Gwyneth Llewelyn (reporter)
2012-05-31 15:32

@usalabs You're right — I'm pushing the server to its limits :) But it worked so well under 0.7.2...

@kenvc What I can see is that all instances, put together, are using over 6 GBytes of virtual memory... while I just have 2 GB of physical RAM. This necessarily means a lot of memory swapping around, which *can* account for the increase in CPU load (as the CPU is busy waiting for all those swaps to disk...). Nevertheless, the point is that this didn't happen under 0.7.2: everything would fit neatly way below the 2 GBytes of physical RAM, there was no swapping, and load average was at a comfortable 0.3, peaking to 0.7 when avatars would log in. That's the experience I have had for months and months until 0.7.2.

0.7.3.1 (and apparently everything above that) just changed everything. Now the instances became superbloated for some reason — I'm still trying to figure out why, but it's NOT due to megaregions in my case — and excessive RAM consumption will start to break apart everything else.

I wonder how many revisions existed between 0.7.2 and 0.7.3.1... probably quite a lot!
(0021606)
Gwyneth Llewelyn (reporter)
2012-05-31 16:12

I've tried disabling the MapImageModule as suggested in http://opensimulator.org/mantis/view.php?id=6030. [^]

It seemed to have improved performance... for a while, at least. It's not dramatic really. However, I've noticed that @kenvc's original report shows an out of memory error at LoadImageURLModule, when converting images. What does the LoadImageURLModule do? Is it related to the map functions or do the map functions call LoadImageURLModule to send map tiles to the SL Viewer? If so, issues 0006030 and 0006031 might be related: somehow there is something consuming far too much memory to generate map tiles in some cases. Does this make any sense?

FYI my map tile config:


    ;; Map tile options. You can choose to generate no map tiles at all,
    ;; generate normal maptiles, or nominate an uploaded texture to
    ;; be the map tile
    GenerateMaptiles = true

    ;; If desired, a running region can update the map tiles periodically
    ;; to reflect building activity. This makes no sense if you don't have
    ;; prims on maptiles. Value is in seconds.
    ; MaptileRefresh = 0

    ;; If not generating maptiles, use this static texture asset ID
    ; MaptileStaticUUID = "00000000-0000-0000-0000-000000000000"

    ;# {TextureOnMapTile} {} {Use terrain textures for map tiles?} {true false} true
    ;; Use terrain texture for maptiles if true, use shaded green if false
    TextureOnMapTile = true

    ;# {DrawPrimOnMapTile} {} {Draw prim shapes on map tiles?} {true false} false
    ;; Draw objects on maptile. This step might take a long time if you've
    ;; got a large number of objects, so you can turn it off here if you'd like.
    DrawPrimOnMapTile = true

and:

    WorldMapModule = "WorldMap"
# MapImageModule = "MapImageModule" # commented out as suggested on issue 0006030

and on config-include/GridHypergrid.ini: WorldMapModule = "HGWorldMap"

Should I try getting rid of the map altogether? :-) So far, turning MapImageModule off *has* brought a performance increase. It's still a far cry from 0.7.2, but it's better than nothing.
(0021608)
Gwyneth Llewelyn (reporter)
2012-05-31 17:48
edited on: 2012-05-31 17:48

Update:

Turning MapImageModule off does REALLY make a huge difference! Now I start to see a pattern here:

- Megaregions use the map differently, so somehow this impacts memory/CPU consumption
- LoadImageURLModule is indirectly connected to the generation of tiles for map display, so it also impacts memory consumption (see @kenvc's original report)
- Warp3DImageModule for Warp3D map tiles have a huge impact on memory (reported elsewhere) and it's better to keep it turned off (see issue 0006030)
- MapImageModule is an "old" default somehow which can be (?) safely removed. It also has a huge impact on memory and on CPU load when on.

I'm now consistently at _half_ the CPU load I have with MapImageModule turned on. This means that both CPUs are now constantly below 100% usage — even with some avatars logged in. Of course there are the occasional spikes, but that's ok. There is still a bit of CPU to spare to deal with them.

So whatever is causing this, it's related to the map module(s) somehow. Either it's the OpenJPEG library (hinted on @kenvc's original report) or something that calls it from the map module.

(0021610)
justincc (administrator)
2012-05-31 21:06

I've only had a chance to read part of the comments but I can say

1) It looks like map image might be in some way responsible. In theory it should only run once at startup but I'm not sure if older settings made it run continuously.

2) From all I've heard it might be the case that Warp3D does especially leak memory in some way

3) If you get an OutOfMemoryException then it won't be related to whatever bit of code it came from - anything could go to the heap and find nothing left due to other memory usage.

4) Megaregions do throw a spanner into the mix - it's always best to try first with these turned off.

5) A git bisect is absolutely the best way to help find these issues. It's the easy way to find the needle in the haystack.

6) There are a very large number of revisions between 0.7.2 and 0.7.3.

7) Excessive CPU consumption is very likely script related. Try the usual fault finding stuff of running with scripts off, etc. Possibly there should be a wiki page of instructions for such fault finding.
(0021611)
usalabs (reporter)
2012-06-01 04:32
edited on: 2012-06-01 04:51

If anyone wants to try a backward check, to see which version initially caused the increase in resource usage, previous versions of the opensim source are available at http://opensimulator.org/dist [^] in order from earliest to most recent.

@Gwyneth

If you have the time, maybe starting from the most recent and working back through the versions may find which version started the issue.

Each one would have to be downloaded, compiled, and configured to the same configuration you're using now, and tested against 1 version back.

I would do it, but I'm putting everything on hold right now, as my wife will be having heart bypass surgery on Monday.

@justincc

Quote "7) Excessive CPU consumption is very likely script related. Try the usual fault finding stuff of running with scripts off, etc. Possibly there should be a wiki page of instructions for such fault finding."

In my case of testing ROBUST, I did comparisons between empty regions and ones loaded with OARs. Even with 4 empty regions on one instance and no other instances running, the CPU usage went sky high, at around 65%, so in this case scripts were not to blame. Mega regions make it even worse: a 2x2 near-empty mega region shot the CPU to 76%.

Now I'm running 3 instances of 2 regions per instance (no mega regions). Some regions have 200+ scripts and over 6K prims, and others barely anything, and yet each instance only consumes an average of 3.5% CPU, and when an av logs in, CPU only gets as far as about 5%. That's using Version: OpenSim 0.7.4 Dev OSgrid 0.7.4 (Dev) 0db60ee: 2012-05-20 (interface version 7). But starting up ROBUST and configuring the regions to use it shoots the CPU usage into a 200% lock (100% per core) and forces me to perform a hard reset.

(0021612)
BlueWall (administrator)
2012-06-01 09:32

Actually, if you clone the git repo, you have every commit of OpenSim all the way to the initial commit. Then the best way to locate a change point is with git bisect. It will checkout versions in the best way possible to find the commit that introduces the change and it keeps a log as it goes. Here is a little tutorial on that: http://webchick.net/node/99 [^] .

If you want to build/run from your local repo directory...

Between each version, do these steps (for mono platform) :

xbuild /t:clean
echo y|mono bin/Prebuild.exe /clean
rm -R bin
git checkout bin
./runprebuild.sh
xbuild

You can also build/run in another directory by copying the files in your local repo to another directory.

Doing this locally will be faster for you and will help keep bandwidth usage down at opensimulator.org.
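
To illustrate why bisecting is so much quicker than testing release by release, here is a minimal Python sketch of the kind of search git bisect performs. It is not git's implementation: `first_bad` and the `c0`...`c999` commit names are made up for illustration, and the sketch assumes a linear history in which the first commit is good, the last is bad, and every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first bad commit and how many builds had to be tested.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2      # pick the midpoint of the unknown range
        tests += 1                   # one build-and-run cycle per probe
        if is_bad(commits[mid]):
            bad = mid                # regression is at mid or earlier
        else:
            good = mid               # regression is after mid
    return commits[bad], tests


if __name__ == "__main__":
    history = [f"c{i}" for i in range(1000)]
    # Hypothetical regression introduced at c637.
    culprit, builds = first_bad(history, lambda c: int(c[1:]) >= 637)
    print(culprit, "found after", builds, "builds")
```

With 1000 commits in the range, at most 10 build-and-test cycles are needed, which is why the clean/rebuild steps above are worth scripting.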
(0021625)
justincc (administrator)
2012-06-05 20:20

If you think you have issues here related to map image generation, you may want to try git master code as of commit 514dd85. This makes sure all Bitmaps used by map image generation are explicitly Dispose()d but I don't know if this will help.

I also suspect there are multiple causes of mem leak.
(0021669)
kenvc (reporter)
2012-06-18 17:23

I do not have MapImageModule enabled in any of my configs so the increased memory issue is coming from somewhere else for me. The memory issue is much more noticeable to me than a CPU issue. I think these are 2 different issues.

I'd love to do a git bisect, but I've only been home about 3 weeks this year because of work with a month trip starting tomorrow.

Wouldn't it be difficult/impossible to do a git bisect that goes back too far because of database field changes and config changes that have been implemented during that time?
(0021670)
Gwyneth Llewelyn (reporter)
2012-06-18 17:42

I wish I had time to do the requested git bisect, too. It seems that the issue is really worth pursuing.

In my case, the memory issue is closely CPU-related just because I have an underpowered server: as soon as physical memory is exhausted, the system has no option but to swap processes out much faster and start paging virtual memory to disk, which will also increase I/O (and get processes waiting on the queue with a mostly idle CPU which is only swapping processes in and out, but not getting any work done).

I also wish I had at least some 32 or 64GB of RAM :) I think that I would come to the same conclusion as @kenvc — the problem is much more memory-related than CPU-bound. But it's impossible to reach a conclusion on a server starved for memory...
(0021672)
antont (reporter)
2012-06-21 01:28

in reply to kenvc's "difficult/impossible to do a git bisect that goes back too far because of database field changes and config changes"

Perhaps it would be possible to bisect this with a setup where just default configs are used and data is loaded from an OAR, as OARs are not (so) version sensitive.
(0021818)
-rjs- (reporter)
2012-07-19 13:39
edited on: 2012-07-19 13:43

--
Debian 6 Stable - Mono Source 2.11.2 - opensimulator.org and/or aurora-sim.org source trunk
Server: Quad core intel(r) xeon(r) 3+Gz, 16G ECC Reg Ram, 2TB disks
--

Same issue here as well: it ran out of memory during an IAR export of approx. 4.8G.

Other debug messages:
1) "Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS"
2) [LocalAssetDatabase]: Failed to fetch asset 0195f4c3-7f86-4ceb-8b32-42f9acd601b4, System.OutOfMemoryException: Out of memory

Altered libgc/include/private/gc_priv.h to allow for larger MAX_HEAP_SECTS MAXHINCR

recompiled mono 2.11.2 with: sudo ./configure --with-large-heap=yes --with-sgen=yes && sudo make && sudo make install

/mono-2.11.2$ mono -V
Mono JIT compiler version 2.11.2 (tarball Thu Jul 19 14:55:25 CDT 2012)
Copyright (C) 2002-2012 Novell, Inc, Xamarin Inc and Contributors. www.mono-project.com
    TLS: __thread
    SIGSEGV: altstack
    Notifications: epoll
    Architecture: x86
    Disabled: none
    Misc: softdebug
    LLVM: supported, not enabled.
    GC: Included Boehm (with typed GC and Parallel Mark)

$ mono --debug Aurora.exe
went to do an iar save of approx 4.8G of data, gets to around 21K of assets saved to archive and you get a couple of minutes worth of ...

15:17:59 - [LocalAssetDatabase]: Failed to fetch asset 0195f4c3-7f86-4ceb-8b32-42f9acd601b4, System.OutOfMemoryException: Out of memory
  at (wrapper managed-to-native) object:__icall_wrapper_mono_array_new_specific (intptr,int)
  at MySql.Data.Types.MySqlBinary.MySql.Data.Types.IMySqlValue.ReadValue (MySql.Data.MySqlClient.MySqlPacket packet, Int64 length, Boolean nullVal) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.NativeDriver.ReadColumnValue (Int32 index, MySql.Data.MySqlClient.MySqlField field, IMySqlValue valObject) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.Driver.ReadColumnValue (Int32 index, MySql.Data.MySqlClient.MySqlField field, IMySqlValue value) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.ResultSet.ReadColumnData (Boolean outputParms) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.ResultSet.NextRow (CommandBehavior behavior) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.MySqlDataReader.Read () [0x00000] in <filename unknown>:0
  at Aurora.Services.DataService.Connectors.Database.Asset.LocalAssetMainConnector.GetAsset (UUID uuid, Boolean displaywarning) [0x00046] in /Services/DataService/Connectors/Database/Asset/LocalAssetMainConnector.cs:193
15:17:59 - [LocalAssetDatabase]: Failed to fetch asset 31918be9-be9b-45bc-9d9f-20e262d88576, System.OutOfMemoryException: Out of memory
  at (wrapper managed-to-native) object:__icall_wrapper_mono_array_new_specific (intptr,int)
  at MySql.Data.Types.MySqlBinary.MySql.Data.Types.IMySqlValue.ReadValue (MySql.Data.MySqlClient.MySqlPacket packet, Int64 length, Boolean nullVal) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.NativeDriver.ReadColumnValue (Int32 index, MySql.Data.MySqlClient.MySqlField field, IMySqlValue valObject) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.Driver.ReadColumnValue (Int32 index, MySql.Data.MySqlClient.MySqlField field, IMySqlValue value) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.ResultSet.ReadColumnData (Boolean outputParms) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.ResultSet.NextRow (CommandBehavior behavior) [0x00000] in <filename unknown>:0
  at MySql.Data.MySqlClient.MySqlDataReader.Read () [0x00000] in <filename unknown>:0
  at Aurora.Services.DataService.Connectors.Database.Asset.LocalAssetMainConnector.GetAsset (UUID uuid, Boolean displaywarning) [0x00046] in /Services/DataService/Connectors/Database/Asset/LocalAssetMainConnector.cs:193

Stacktrace:
Native stacktrace:

    mono() [0x80e53c9]
    mono() [0x8131c94]
    mono() [0x805c071]
    [0xb77c140c]

Debug info from gdb:

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
Aborted


Doesn't make any difference if it's opensim or aurora-sim, although aurora makes it a bit further before it crashes with memory allocation issues.
In 5 or so years, I have not experienced this type of issue with large DBs.

--rjs

(0021819)
-rjs- (reporter)
2012-07-19 14:41

Just another quick one. I don't believe this has anything to do with physical memory issues, but rather memory heap fragmentation.

--rjs
(0021841)
usalabs (reporter)
2012-07-19 16:31

@ -rjs-

I did find a big difference between Windows and Linux. Windows uses the entire drive space for its tmp folder, whereas Linux uses 3 partitions: root and home primary partitions and a swap partition. When Windows installs or uses any program that transfers data, it uses the tmp folder, and the same goes for Linux, with one big difference: the Linux tmp folder resides in the root partition, which is not that big, so when that folder gets filled you don't see "out of disk space" but instead "out of memory", because Linux regards that folder as extra memory space and not actual drive storage. I would suggest, when installing Linux on a drive larger than 500GB, having a root partition of at least 10GB.

Also, mysql and its databases reside on the root partition too, so it would be better to have the mysql data folder on its own partition of about 100GB. These numbers are based on using a 1TB hard drive.

But, it would even be better to have a separate mysql box specifically for running mysql and holding all the databases.

My region server is a quad core Xeon with 12GB and 4x1TB hard drives using raid, but I also have a separate mysql server using an AMD Athlon X2 X64 with 4GB and 500GB hd. Both servers running OpenSuSe 11.4 on runlevel 3, and the region server is running mono 2.10.8.

With this setup, I haven't noticed any memory leaks or stack heap problems at all, and using a separate mysql server, takes the load off the region server, and also helps in reducing memory consumption.
(0021842)
Gwyneth Llewelyn (reporter)
2012-07-19 17:37

Hm. The issue about MySQL "fighting" for memory on the Linux box is worth considering. I have the chance to use an external MySQL server; in fact, I tended to use that in the past. The problem is that this MySQL server will be on a remote connection accessible via ADSL, which will add another layer of debugging: the network layer. Nevertheless, it's worth making a test: after all, thanks to the latest code optimisations, and the relatively good performance of the Flotsam cache, there is far less writing on the database these days, and, due to the asymmetric nature of ADSL, the reading part is not so critical as the writing one.

I might give it a try.
(0021843)
-rjs- (reporter)
2012-07-19 19:50

@ usalabs,

I do my own LVM on the server drives. I did check into your /tmp theory; however, when backing up an IAR of 4.8G, /tmp doesn't get used whatsoever.

Doing a quick df -h every sec or so, there was hardly any difference of any kind during the iar backup process concerning the partitions.
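
The same kind of check can be scripted so the samples are kept for later comparison. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names `free_gib` and `watch`, and the choice of paths, are made up for illustration.

```python
import shutil
import time


def free_gib(path: str) -> float:
    """Free space on the filesystem holding path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def watch(paths, samples=5, interval=1.0):
    """Sample free space on each path; return {path: [free GiB per sample]}."""
    history = {path: [] for path in paths}
    for i in range(samples):
        for path in paths:
            history[path].append(free_gib(path))
        if i < samples - 1:
            time.sleep(interval)   # wait between samples, not after the last
    return history
```

For example, `watch(["/", "/tmp"], samples=60)` run alongside the IAR save would show whether either filesystem shrinks noticeably while the archive is being written.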

As far as free mem goes during the time the software is crashing, my server never runs out of physical ram space, nor does it disk thrash. Doesn't come close to either.

I'm not a fan of throwing hardware at software problems. Not when the hardware should be more than sufficient for what I do X100.
(0021863)
kenvc (reporter)
2012-07-22 16:22

Back to the original issue: simulators that contain virtually the same number of prims they did a year ago are now using almost twice the memory the exact same setup consumed then. And take mega-regions and warp3d out of the picture, because those aren't being used on this computer.

Does anyone have any other suggestions on how to reduce memory consumption without having to continue reducing the number of sims running on this computer to avoid "out of memory" related errors?
(0021864)
aiaustin (developer)
2012-07-23 00:48
edited on: 2012-07-23 00:51

Using r/19813 from 14-Jul-2012 (5d3723a...)

I am getting system out of memory errors on a simple setup to add just two regions to a grid (Openvue). The regions run on a local SQLite database, though the main grid and all other regions are run on a MySQL database on another server. Warp3D tile rendering is on with a refresh every 2 hours.

If I leave the 2 region SQLite OpenSim.exe instance running a while I get a stream of errors. This morning when I looked, the log file was 2GB and so large I could not even open it with my text editors to see what the detailed errors looked like at the end. I will try to spot this problem before the .log file gets too large to explore and report back.
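
For a log that is too big for an editor, reading only the end of the file is usually enough to see the final errors. Below is a minimal Python sketch of that idea; the function name `tail` and the 64 KB default are arbitrary choices for illustration, and it assumes the log is mostly UTF-8 text.

```python
import os


def tail(path: str, max_bytes: int = 64 * 1024) -> str:
    """Return roughly the last max_bytes of a file without reading the rest."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - max_bytes))
        data = handle.read()
    if size > max_bytes:
        # We started mid-file, so drop the first (possibly partial) line.
        data = data.split(b"\n", 1)[-1]
    return data.decode("utf-8", errors="replace")
```

Something like `print(tail("OpenSim.log"))` would then show the last few hundred lines regardless of how large the file has grown.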

(0021866)
justincc (administrator)
2012-07-23 06:47

I have deleted 'Mr Peeved's note because attacks upon other people or companies are not welcome on this bug report tool or anywhere in the OpenSimulator project.
(0021868)
kenvc (reporter)
2012-07-23 07:34

Mr Peeved,

Mantis is not the place to slam other people or companies. In defense of Melanie and Avination, if you had read the full text of the article you referred to you would have seen this:

"Avination will share the improvements with other grids."
“This is currently proprietary but in keeping with Avination’s policy of releasing code this will eventually come to open source as well,” Thielker said.

“As Avination has its roots in OpenSim, we do have a policy to give developments back to the community,” added Avination spokesperson Leonie Gaertner. “This code will be made open source, when it has been checked and reviewed carefully.”
(0021869)
BlueWall (administrator)
2012-07-23 07:52

This is the wrong place for this. But I will throw my .02 in here...

OpenSimulator is distributed under a liberal license and you may do anything you want without obligation to contribute anything back to the core project. You are also welcome to contribute patches to the project and they will be evaluated by the core team and considered for inclusion in the codebase if they meet our criteria.

Melanie and Avination have contributed many things back to the OpenSimulator project and have a policy of doing so after a period of time. And this is acceptable to the rest of the core team. Melanie is also helpful and instrumental when adding new facilities to OpenSimulator. A few months ago we worked together to bring Telehubs and the parcel sales layer to the map. She made these much better and was a valuable player in these projects. We are now working to add remote loading of modules and looking to make third party modules easier to distribute and more reliable via Robust, and we will probably take this work into OpenSimulator.exe as well.

Please take this rant elsewhere, because it doesn't belong here and is based on wrong assumptions. This has nothing to do with the mantis subject.
(0021871)
melanie (administrator)
2012-07-23 11:31

I have pushed some fixes we have been working on - please test
(0021872)
kenvc (reporter)
2012-07-23 11:40

Will do right now. Thanks so much Melanie!!!
(0021873)
kenvc (reporter)
2012-07-23 12:18

Just updated with newest Dev master. After startup of all instances, a little more CPU time appears to be used than before but a little less memory is also being used.

Some instances are starting OK, but others are displaying a series of red error messages that I don't normally see. The portion of the log file from one of these will be attached shortly. The instances are started at staggered times to reduce CPU and disk load. Oddly enough, the ones that are displaying these messages were the first instances to start, so CPU and disk load should have been reduced, but they are also the ones that contain the most prims and scripts. See attached log file.
(0021874)
kenvc (reporter)
2012-07-23 13:16

OK, pulled the round 3 series of Melanie's memory leak fixes, recompiled, updated, and this time all instances started without the errors mentioned in the previous note.

CPU load seems back to normal. Consumed memory does not appear to be reduced, but if it stays stable at the current level maybe it will be OK. Will continue watching it.

A thought just hit me... Vivox was introduced some time back, but I don't recall the memory issue problem starting at the time Vivox was introduced. Does anyone have an idea how much memory Vivox consumes per instance or if this could be contributing in a small way to the increased memory requirements?
(0021875)
kenvc (reporter)
2012-07-23 14:14

I teleported around and hit every sim in every instance and things seemed stable with no out of memory errors. The memory requirements are still higher than about a year ago, but at least the requirement doesn't seem to be growing with time after startup like it was before.

Others who have had this issue need to try these changes and give detailed feedback for Melanie. So far, I'd have to say it's possibly better, and certainly no worse, so that's good news to me.
(0021876)
melanie (administrator)
2012-07-23 14:16

The higher base memory need is normal - we store increasingly complex information about objects and objects also have become more complex on average. I'm not sure about OSGrid or your grid, but Avination is seeing objects of the same complexity as SL. Additional memory consumption to store those is to be expected.
(0021883)
kenvc (reporter)
2012-07-24 21:33

I reduced the number of instances even further and am still seeing out of memory errors when there is still plenty of free memory. Most often the out of memory error is preceded by the following error, which leads me to think the problem may be related to this area.

2012-07-24 21:09:53,987 ERROR -OpenSim.Region.CoreModules.Scripting.LoadImageURL.LoadImageURLModule [LOADIMAGEURLMODULE]: OpenJpeg Conversion Failed. Empty byte data returned!

Then shortly after that error, the following out of memory error often occurs:

2012-07-24 21:10:43,733 ERROR - OpenSim.Application [APPLICATION]:
APPLICATION EXCEPTION DETECTED: System.UnhandledExceptionEventArgs

Exception: System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at System.Threading.Thread.StartInternal(IPrincipal principal, StackCrawlMark& stackMark)
   at System.Threading.Thread.Start()
   at Amib.Threading.SmartThreadPool.StartThreads(Int32 threadsCount)
   at Amib.Threading.SmartThreadPool.Enqueue(WorkItem workItem, Boolean incrementWorkItems)
   at Amib.Threading.SmartThreadPool.QueueWorkItem(WorkItemCallback callback, Object state)
   at OpenSim.Framework.Util.FireAndForget(WaitCallback callback, Object obj)
   at OpenSim.Region.ClientStack.LindenUDP.LLUDPServer.PacketReceived(UDPPacketBuffer buffer)
   at OpenMetaverse.OpenSimUDPBase.AsyncEndReceive(IAsyncResult iar)
   at System.Net.LazyAsyncResult.Complete(IntPtr userToken)
   at System.Net.ContextAwareResult.CompleteCallback(Object state)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Net.ContextAwareResult.Complete(IntPtr userToken)
   at System.Net.LazyAsyncResult.ProtectedInvokeCallback(Object result, IntPtr userToken)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)

Application is terminating: True
(0021884)
usalabs (reporter)
2012-07-24 23:33

After reading carefully through all these notes, I noticed something that hasn't been addressed in kenvc's problem, could it be the system memory itself?

How long has the system memory been in the PC? Since you're using Windows 7 and have 8 gig of RAM, Windows itself would not come anywhere near using 4 gigs, so this could be a memory chip related error. I'll explain.

When mono or any other software starts filling up memory and reaches a part of the system memory that could be bad, it will throw a memory error. Perhaps mono has no means of knowing whether it's a memory fault or whether the memory is actually full (which I doubt with 8 gigs). Even though the PC POST memory test shows 8 gigs, it will show that because, nowadays, the new DDR memory doesn't set the memory error bit on a check, so according to POST and Windows the memory is fine. But it could be bad anywhere from 4 gigs up to 8 gigs; if it were bad below 4GB, Windows itself would suffer.

I suggest powering down the PC, removing 1 memory stick, then powering up and trying opensim again. If it shows a memory error, quit opensim, power down, put the stick back and remove the next one closest to the one previously removed, power up, and start opensim again. Repeat until no memory errors show up; the stick that is removed at that point would be the bad one.

I've had people say to me, "I can't install Windows because during the install it keeps getting as far as 'Starting Windows...' and then I get the BSOD". In one such case I checked the memory using the above procedure and found it was the 2nd stick of 1GB (total 2GB) which was bad; he replaced it and it's been working ever since.
(0021895)
aiaustin (developer)
2012-07-25 07:15
edited on: 2012-07-25 07:23

As noted before... with r/19813 from 14-Jul-2012 (5d3723a...).. added error message here...

I have one instance of OpenSim.exe for the Openvue grid that seems to run out of memory every few days... with just two light regions on it, using SQLite on the host and connected to our main Robust services on another machine. Other OpenSim.exe instances on other hosts seem to run fine with no out of memory errors even with more regions.

I keep getting enormous log files 2GB long in a few days... so cannot open them with an editor to see extended error messages. But the OpenSim.exe console gives this for example... (many of these)...

14:00:33 - Command error: System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at OpenSim.Framework.Console.Commands.Resolve(String[] cmd)
   at OpenSim.Framework.Console.LocalConsole.ReadLine(String p, Boolean isCommand, Boolean e)
   at OpenSim.Framework.Console.CommandConsole.Prompt()
   at OpenSim.Application.Main(String[] args)


14:00:33 - [ODE SCENE]: Attempted to read or write protected memory.
This is often an indication that other memory is corrupt., Void SpaceRemove(IntPtr, IntPtr),
System.AccessViolationException: Attempted to read or write protected memory.
This is often an indication that other memory is corrupt.

(0021896)
usalabs (reporter)
2012-07-25 08:29

@ aiaustin

Quote from aiaustin "Other OpenSim.exe instances on other hosts seem to run fine with no out of memory errors even with more regions." and "Attempted to read or write protected memory. This is often an indication that other memory is corrupt"

I would still check your system memory. Just because the memory looks fine, don't take it for granted that it is; any area of the memory can go bad and the system will still report the correct memory size as installed.

Checking the system memory first, before turning to software, ensures that if the memory checks out fine, you at least know it's not a memory problem. If you google out of memory errors and memory related errors, about 70% of the time they are directly related to the system RAM. Once that has been checked and cleared, consideration would then go towards the software.
(0021934)
aiaustin (developer)
2012-07-30 03:15

I have not seen the out of memory errors on my specific problematic OpenSim.exe instance since upgrading to a release that included Melanie's memory leak fixes. And the instance has been running quite a bit longer than the time by which they were noticed before. I will leave the instance running as long as I can to see if the problem occurs at a later time.
(0021953)
aiaustin (developer)
2012-08-01 03:01
edited on: 2012-08-01 03:05

The Openvue Opensim.exe instance did eventually produce the out of memory errors, but may have run longer. So I will try to perform a memory hardware test when I can physically access this server which runs on Windows Vista 32 bit.

But I also have out of memory errors on a long running OSGrid region hosting OpenSim.exe server on a much more recent Xeon processor Windows 7 machine, so I really do not suspect hardware.

(0021958)
kenvc (reporter)
2012-08-01 08:20
edited on: 2012-08-01 08:27

The last time I had memory issues I was using the stock opensim.32bit.launch.exe and I never saw an instance using more than about 1.2 gig of RAM, but still saw the out of memory errors sometimes. From what I understand, this executable allows addressing up to 2gb of memory per instance.

I then used a different opensim.32.bit.launch.exe that was compiled to address up to 4 gig of memory per instance and most out of memory issues have vanished. Since none of the instances ever showed to be using more than about 1.2gb of memory at the most, it really seems odd that this would help.

It makes me think that something is happening that is causing memory use to suddenly spike way up and exceed this 2gb limit and then go back down, but I have never seen a memory usage spike above about 1.2gb on this instance during the course of monitoring the memory usage.
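
One reading that would fit both observations is that the limit being hit is virtual address space rather than memory actually in use: the stack traces in this issue fail inside Thread.Start, and each new thread needs its own stack reservation. A back-of-the-envelope sketch in Python, assuming a 1 MB stack reservation per thread and treating the observed 1.2 GB as address space already taken (both are simplifying assumptions; fragmentation and native libraries are ignored):

```python
# Rough address-space budget for a 32-bit process (illustrative only).
MB = 1024 * 1024
GB = 1024 * MB


def max_extra_threads(address_space_bytes, in_use_bytes, stack_reserve_bytes=1 * MB):
    """Upper bound on additional threads whose stack reservations still fit.

    stack_reserve_bytes is an assumed per-thread reservation; the real
    figure depends on the executable header and runtime settings.
    """
    free = max(address_space_bytes - in_use_bytes, 0)
    return free // stack_reserve_bytes


in_use = int(1.2 * GB)  # peak usage reported in the note above
print(max_extra_threads(2 * GB, in_use))  # stock 32-bit launcher (2 GB user space)
print(max_extra_threads(4 * GB, in_use))  # launcher built to address up to 4 GB
```

In practice the usable figure would be lower, because a stack needs a contiguous free range, so this is only an upper bound and not a diagnosis.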

(0021983)
-rjs- (reporter)
2012-08-06 08:49
edited on: 2012-08-06 09:05

Patched os trunk:

Simple unix test, fired it up, left it run, checked mem.

Go to run other progs after letting it run, in this case tripwire --check --interactive, wouldn't run and gave errors.

Had to reboot to regain lost mem, back to normal with mono/os not running.

When I went to restart the os service, I noticed the 4 CPU cores being utilized at 400 percent. At the same time that it reached this CPU utilization, the out of memory errors I listed above returned, while doing a backup iar of 4.8G of data. (The 400 percent CPU usage occurred during this iar backup, which is also when the memory errors were most pronounced.)

Over night, or a couple days, running mono/os seems to provide substantial evidence of a severe memory leak.

--rjs

(0021984)
Gwyneth Llewelyn (reporter)
2012-08-07 09:26

Some further feedback from OS 0.7.3.1...

After the comment from @usalabs and others, I thought that maybe my memory was faulty somehow. Even though I would expect some kind of logs on the system log, it was worth a try. So after my home server's power supply blew up (another potential source of hardware problems!... before they blow up, they might be sending irregular power supply to the motherboard, which gives all kinds of weird problems that come and go away), I also let the techs upgrade memory from 2 GB to 6 GB.

At first, things seemed to be a bit better. Even though Linux tends to allocate as much memory as it can, it hit something like 2.5 GB and stabilised, which meant that it seemed to have all the memory it liked and plenty of room of spare. But CPU usage was still rather high — both cores were at 150% load each, which, as said, is unusual for OpenSim, at least for extended periods.

A co-worker didn't notice any real "improvements". And in fact, after 4 days, memory usage continued to climb and climb. It now consumes 5.2 GB, although it took these 4 days to reach that value. No instance has been rebooted, so this is the "normal" consumption level taking into account any memory leaks. Note that the grid is "empty" of avatars most of the time, and has very few active scripts, probably not more than a handful, and the "heaviest" is the one for the HyperGate. While overall server load goes sometimes as low as 70-80% of each core, this is the exception; 150% load each core is the average, spiking very high as soon as a few avatars log in.

Mono is Debian 2.10.8.1-1ubuntu2.2, which allegedly is the latest stable version available.

I will now proceed shortly to try out the recently announced RC for OpenSim 0.7.4, and, as per previous suggestions, move both MySQL and Apache out of the server, even though this will mean struggling to get all requests through an ADSL connection instead of local sockets. I'm rather curious to see if this will make a difference in terms of memory & CPU consumption. Currently neither MySQL nor Apache show up on 'top' for CPU, although they most certainly take a large chunk of memory!
(0021985)
kenvc (reporter)
2012-08-07 10:04

I'm using Windows 7 64 bit with 8 gig ram and using the 32 bit launcher to help keep the memory usage down. I haven't really seen any CPU usage problems, I only see increased memory consumption way over what it used to be. The event that seems to make it jump more than others is when someone teleports in through hypergrid or when a new map is being generated for the sim. I do not use warp3d for map generation.
(0021993)
-rjs- (reporter)
2012-08-08 07:15

To eliminate mono from the equation, I've compiled several mono revisions as far back as 2.6.7, or the old Debian equivalent revs, to see if there were any changes in what we are seeing here. Suspicions were correct, and it can safely be said that the issues are not the direct result of any mono changes in and of themselves.

The issues are witnessed with all mono revs tested. Running latest stable source 2.11.2 didn't alter the results.

http://download.mono-project.com/sources/mono/

--rjs
(0021998)
usalabs (reporter)
2012-08-08 08:46

@ Gwyneth Llewelyn

quote from Gwyneth:- "as per previous suggestions, move both MySQL and Apache out of the server, even though this will mean struggling to get all requests through an ADSL connection instead of local sockets"

If you are running mysql, opensim, and apache on home servers, you don't need to connect to each one using ADSL. Just give each server its own static LAN IP. If each server has its own firewall, only open port 3306 on the mysql server for internal access and leave it closed on the router; the rest of the ports can be opened on each server's firewall and the router. Then configure the necessary opensim ini files to use those servers' internal LAN IPs, i.e. instead of using localhost for the mysql connection, substitute the mysql server's internal LAN IP. That's how I have mine set up: 1 server for opensim regions, 1 server for mysql and the other for apache. But bear in mind the network bottleneck when large amounts of data are being transferred across the network, even if utilising 10/100 NICs; I use gigabit NICs on all the servers, and 10/100 for workstations.
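
The substitution described above amounts to changing one field of the database connection string. A small Python sketch of that edit, assuming a semicolon-separated string with a "Data Source" field; the sample string, credentials and the 192.168.1.20 address are placeholders, not values from anyone's actual configuration:

```python
def point_at_lan_host(connection_string, lan_ip):
    """Replace the 'Data Source' value in a semicolon-separated connection string."""
    parts = []
    for field in connection_string.split(";"):
        key, sep, _value = field.partition("=")
        if sep and key.strip().lower() == "data source":
            field = f"{key}={lan_ip}"
        parts.append(field)
    return ";".join(parts)


# Hypothetical values: the address and credentials are placeholders.
original = "Data Source=localhost;Database=opensim;User ID=opensim;Password=secret;"
print(point_at_lan_host(original, "192.168.1.20"))
```

All other fields are passed through untouched, so the same helper works whatever else the string contains.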
(0022004)
Gwyneth Llewelyn (reporter)
2012-08-08 18:04
edited on: 2012-08-08 18:35

First impressions from 0.7.4 RC1: memory and CPU consumption seem to be exactly as before (i.e. very high). This is sadly not a perfect test: I'm now tracking down why groups, profiles, offline messaging, etc. don't work any longer. Groups at least are a requirement for a production environment, so for now I have a few days for testing, but if I can't get them working again, I'll have to revert back to 0.7.3.1, to my great disappointment :(

Log rotation also seemed to have stopped working, who knows why. Oh well. I've filed a Mantis for it, too.

@usalabs I wish I had the kind of environment you're suggesting :-) I just have a single computer that I can use as a main server — an old and battered HP Pavillion — to run the OpenSim instances, so I have no option but to push the remaining services (MySQL, Apache) onto a very cheaply hosted external server (which sadly cannot run OpenSim :( or my problems would all be solved). Thus the ADSL bottleneck. The advantage is that, except for building, most of the traffic will be downstream, so at least MySQL should work fine. The remaining services aren't working right now, not even locally, so I can't foresee what will happen when they're hosted remotely. Messages pertaining to group permissions seem to be plentiful, so, well, we'll see what happens.

[Edited: my optimism when writing this message led me to believe that OpenSim was consuming less resources, until I found out that only one of the six instances had actually launched. After launching the remaining ones, memory & CPU consumption skyrocketed to the same levels as under 0.7.3.1, and I still have to fix groups, profiles, offline messaging, and now, rotating logs. So it looks like 0.7.4 RC1 is even worse than 0.7.3.1 :-( ]

(0022005)
kenvc (reporter)
2012-08-08 22:44

Gwyneth,
All the items you mentioned as not working are working fine for me and are not directly related to this mantis.

Things have changed in some of the external modules such as profiles, search, offline messaging, groups, etc. and they need to be updated. Things have changed in some of the config files too. I update mine almost every day, so I deal with each change immediately as it happens and don't get hit all at once with many changes.
(0022018)
Gwyneth Llewelyn (reporter)
2012-08-10 11:52

@kenvc agreed. I've managed to get all extra services working again, after a week of figuring out what was wrong. I can reasonably claim now that I have pretty much the same environment under 0.7.4RC1 as I had under 0.7.3.1. Now it's time to move MySQL and the external web-based services out of the server and see what happens to memory & CPU consumption. That's work for the weekend :) Getting everything up and running on 0.7.4.RC1 took way, way too long.
(0022025)
Gwyneth Llewelyn (reporter)
2012-08-11 18:21

Unfortunately, my test environment with external MySQL/Apache simply didn't work out. The connection is not fast enough; it times out before it gets a successful connection. I have to move everything back to a single server :-(

However, I've noticed an amazing thing. When switching over Apache, I had to change the DNS entries for the external modules (profile, groups, etc.). This took some time to propagate, and, tired of waiting, I actually launched nscd (name service caching daemon), in order to be able to refresh DNS more quickly. This actually didn't refresh DNS as expected, BUT it made a HUGE impact on CPU performance!!

In fact, CPU load levels are suddenly back to 0.7.2 levels — e.g. about 60-70% overall load per CPU core, sometimes even lower than that, even with a few logged in users! Memory consumption is hard to estimate yet, since the whole grid launched from scratch after a server reboot, so I will only be able to report on eventual memory leaking in a few days.

Here is what I think that might be happening: due to the setup exposing my grid from behind a firewall which does some NAT translation, a lot of DNS requests are being made for all those services. As OpenSim moves more and more to using capabilities — plus the extra checks for groups and permissions! — more and more DNS lookups are being made. It might be possible that my Linksys router has reached a threshold on the amount of DNS requests it can actually make, and this somehow blocked external calls from being run until the DNS requests could complete. Now, however, nscd caches the requests — which means the processes do not block for so long, as they usually get a quick reply from a locally running DNS cache. And as a result, since the Linux scheduler isn't blocking so many tasks waiting for DNS call completion, it reports a lower load.
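
The effect being described, where repeated lookups are answered locally instead of each one blocking on an upstream resolver, can be illustrated with a tiny cache. This is a generic Python sketch and not how nscd or OpenSim is implemented; the class name, the 300 second lifetime and the host name are all made up for illustration:

```python
import socket
import time


class CachingResolver:
    """Tiny time-limited cache in front of a resolver function (illustrative only)."""

    def __init__(self, resolve=socket.gethostbyname, ttl_seconds=300.0, clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # hostname -> (expiry_time, address)
        self.upstream_lookups = 0

    def lookup(self, hostname):
        now = self._clock()
        hit = self._cache.get(hostname)
        if hit is not None and hit[0] > now:
            return hit[1]  # answered locally, no blocking upstream call
        self.upstream_lookups += 1
        address = self._resolve(hostname)
        self._cache[hostname] = (now + self._ttl, address)
        return address


# Stand-in resolver so the demo needs no network access.
resolver = CachingResolver(resolve=lambda host: "203.0.113.7")
for _ in range(1000):
    resolver.lookup("grid.example.org")
print(resolver.upstream_lookups)
```

With the cache in place, a thousand lookups of the same name cost one upstream request until the entry expires, which is the kind of reduction in blocked calls the hypothesis above relies on.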

This actually makes some sense. It could be a coincidence. It might also throw some light in the way OpenSim, by itself, caches DNS requests — perhaps 0.7.3 does it differently from 0.7.2? — or simply show that from 0.7.3 onwards, a substantial amount of calls moved to HTTP-based caps calls, and, as such, require far more DNS requests.

If this is the case, running a local DNS caching server might be just the solution for this "problem", and the reason many people aren't able to reproduce this "problem" is just because they either host their grid without NAT, or already have a DNS caching server active.

I'm now curious to see what happens after leaving the grid running for a few days.
(0022029)
usalabs (reporter)
2012-08-12 01:29

I just found out the excessive memory consumption IS actually during inventory load.

I did a test on my region, which has 3729 prims and 545 active scripts, using Firestorm 4.1.1 as the viewer, which doesn't initiate inventory loading until something is searched for. While my region was running and I had been logged in for 1 hour with no inventory search, the memory status showed 268 MB for objects and 352 MB process memory, which fluctuated a bit.

As soon as I did an inventory search, and the inventory continued loading from 27K of 32K items, I noticed my av was acting weird, walking forward then jerking back. I looked at the region console and noticed very long delays while fetching inventory from the osgrid inventory server:

Slow request to <218> POST http://inventory.osgrid.org/xinventory took 7348ms, 3ms writing

While it was showing line after line of excessively slow requests, I started looking at the stats and noticed the memory consumption increasing very quickly. It took about 10s to get from:

Allocated to OpenSim objects: 618 MB Process memory: 898 MB

to:

Allocated to OpenSim objects: 834 MB Process memory: 1116 MB

and a total of 30s to get from 618 MB objects / 898 MB process memory to:

Allocated to OpenSim objects: 1144 MB Process memory: 1329 MB

before I shut down the region server. If I had let it continue, it would have run out of memory within a few minutes.
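
The three process-memory readings quoted above are enough for a rough growth estimate. A Python sketch of that arithmetic; the 2048 MB ceiling is a hypothetical limit chosen for illustration and is not a figure from this note:

```python
# (seconds since the first reading, process memory in MB), taken from the note above
samples = [(0, 898), (10, 1116), (30, 1329)]


def growth_rate_mb_per_s(samples):
    """Average growth between the first and last sample."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    return (m1 - m0) / (t1 - t0)


def seconds_until(limit_mb, samples):
    """Time for the last reading to reach limit_mb at the average rate, or None."""
    rate = growth_rate_mb_per_s(samples)
    if rate <= 0:
        return None
    return max(limit_mb - samples[-1][1], 0) / rate


print(f"{growth_rate_mb_per_s(samples):.1f} MB/s average growth")
print(f"{seconds_until(2048, samples):.0f} s to reach a hypothetical 2048 MB ceiling")
```

A straight-line extrapolation like this only holds while the inventory fetch keeps going at the same pace, so it is an order-of-magnitude check rather than a prediction.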
(0022046)
Gwyneth Llewelyn (reporter)
2012-08-14 05:17

@usalabs that could very well be the case. On the grid I'm running, two users (one of them myself!) have relatively large inventories, i.e. around a thousand items or so. However, after we each have logged in several times from different computers and with different viewers, I don't see a huge memory increase *inside the console*. We're logging in always to the same instance, which has two regions, one with 13,000+ prims, the other with just a single building.

This is what I got after a few days:

Allocated to OpenSim objects: 155 MB
OpenSim object memory churn : 0.293 MB/s
Process memory : 279 MB

It doesn't seem too much.

HOWEVER, from the *outside*, it's a different story! Memory consumption steadily grew to 5.5 GB and is still growing; it will start to eat into swap space soon. So in my case I believe that the major problem is Linux not releasing memory from Mono, and it just grows and grows.

On the other hand, the local DNS cache has solved all my CPU troubles, once and for all. Now I can get as low as 10% CPU load per core with two users logged in. This is way better than I got under 0.7.2. So, at least for me, and as suggested often on this long thread, CPU and memory issues are not related — CPU troubles come from "elsewhere".

I can confirm I had a LOT of those Slow Requests for xinventory. Since everything is being run from a single server behind the firewall, I shouldn't be getting so many "Slow Requests" — it's a Gigabit LAN, and all computers + the small server are on the same switch.

However, since I started to cache DNS, the amount of Slow Requests has reduced dramatically. On Aug 11, I got 416 per day (before I started caching DNS). Now I get a dozen at most (yesterday I got none). They're mostly xinventory requests and the occasional OSD REQUEST.
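
A per-day count like the one above can be pulled from the log with a few lines of Python. This is a sketch that assumes each log line starts with a YYYY-MM-DD date, as the excerpts earlier in this issue do, and that slow requests contain the text "Slow request"; the sample lines are made up in that assumed shape:

```python
import re
from collections import Counter

DATE_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2}) ")


def slow_requests_per_day(lines, marker="Slow request"):
    """Count lines containing marker, grouped by the date at the start of the line."""
    counts = Counter()
    for line in lines:
        match = DATE_PREFIX.match(line)
        if match and marker in line:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines in the assumed format.
sample = [
    "2012-08-11 10:00:01,123 INFO - Slow request to <218> POST http://inventory.osgrid.org/xinventory took 7348ms, 3ms writing",
    "2012-08-11 10:00:09,456 INFO - Slow request to <219> POST http://inventory.osgrid.org/xinventory took 5120ms, 2ms writing",
    "2012-08-13 09:15:00,001 INFO - Region ready",
]
for day, count in sorted(slow_requests_per_day(sample).items()):
    print(day, count)
```

To run it against a real file, pass an open file object instead of the sample list, for example `slow_requests_per_day(open("OpenSim.log"))`.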

@usalabs, I remember you posted earlier on that you're also using Linux. Can you install nscd and see what happens in your case? It would be nice if we could see a decrease on the Slow Requests and CPU load on your server, too. Memory COULD be being allocated to handle pending DNS requests, although in your case, this would be dramatic, so I think it's something else altogether. But it's worth installing nscd and see if it makes a difference to you as well. In my case the difference was really dramatic!
(0022051)
usalabs (reporter)
2012-08-14 10:37

@Gwyneth

I checked on my server and hadn't realised it, but nscd has been installed since I installed openSUSE a few years ago. sudo chkconfig nscd shows it as on and running at runlevels 3 & 5, and sudo /etc/init.d/nscd status shows it as running, but I still get those slow requests when loading inventory. I even changed my DNS servers from my ISP to a public DNS such as Google's DNS service, and the requests get even slower.

There is a way to release mono memory, which I use in a cron job every 15 minutes. Create a bash script as root called clear-cache.sh, place it in root's home directory, then make the file executable with sudo chmod +x /root/clear-cache.sh. The script contents are:-

-----start code-----

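#!/bin/bash
# Must run as root. Flush dirty pages to disk first, then ask the kernel
# to drop its page cache, dentries and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches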

------end code-------

Add the lines between the ----- markers to the bash file, and add it to your cron job to execute as root every 15 minutes; this will release mono memory back to the system.

If you don't know how to add a cron job, use crontab -e and insert this in the file:-

0,15,30,45 * * * * /root/clear-cache.sh #Clears system Cache

then type :wq and press Enter to save the file (this assumes crontab -e opened vi).

The bash script does work. If anyone is logged inworld, it will only release the memory that isn't being used, and when they log out, the memory that was being used by the av's is returned to the system. It doesn't reduce the memory already allocated by opensim's processing threads, though, which increases during login/logout, object additions, inventory loading, and running scripts; the only way to return that memory is to restart the opensim instances.
(0022053)
Gwyneth Llewelyn (reporter)
2012-08-14 11:18

Ah, thanks! Aye, I've used that "trick" in the past: it's great to deal with Mono's faulty garbage collector :) but I had forgotten the actual commands...

In that case, I have to report that my problems have been solved and it doesn't seem likely that I can contribute further to exploring this issue. In my case:

1) Adding nscd made CPU consumption drop to 0.7.2 levels or even below;
2) Xinventory and other "Slow Requests" practically disappeared with nscd turned on;
3) Memory consumption reported internally by OpenSim under 0.7.4RC1 is now at the 0.7.2 levels;
4) External memory consumption as reported by top, free, etc. is dealt with nicely by regularly running the "drop_caches" trick.

With all the above my own issue is solved.
(0022055)
kenvc (reporter)
2012-08-14 12:57
edited on: 2012-11-18 23:19

That sounds encouraging for Linux users, but how does this relate to what can be done to help with this memory issue in Windows as originally reported in this Mantis?

(0022056)
justincc (administrator)
2012-08-14 13:01

Gwyneth, out of curiosity, how many regions/objects/scripts are you running?

If mono (via OpenSimulator) is reporting a process memory of 279 MB but top is showing 5.5GB, then that would indicate to me either a mono issue or even an interesting kernel interaction. This would particularly be the case if clearing Linux kernel caches via echo 3 > /proc/sys/vm/drop_caches helps. This would mean it isn't directly an OpenSimulator issue.
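
Writing 3 to drop_caches frees the kernel's page cache, dentries and inodes, so one way to check the distinction drawn above is to see how much of the "used" figure is cache. A Python sketch that parses /proc/meminfo-style text; the sample numbers are made up, and the simple total minus free minus buffers minus cached formula is an approximation:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def used_excluding_cache_kb(info):
    """Memory in use once free memory, buffers and page cache are subtracted."""
    return info["MemTotal"] - info["MemFree"] - info.get("Buffers", 0) - info.get("Cached", 0)


# Made-up numbers in the /proc/meminfo layout.
sample = """MemTotal:        6000000 kB
MemFree:          300000 kB
Buffers:          200000 kB
Cached:          3300000 kB
"""
info = parse_meminfo(sample)
print(used_excluding_cache_kb(info), "kB in use excluding buffers and cache")
```

On a Linux host the same functions can be fed the live file with `parse_meminfo(open("/proc/meminfo").read())`. If the figure excluding cache stays flat while the total climbs, the growth is kernel caching rather than the simulator processes.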
(0022080)
Gwyneth Llewelyn (reporter)
2012-08-15 04:53

@justincc — I'm running 23 regions, spread among 7 instances (as evenly as possible, but it's not perfect). I have no idea of how many prims are used total — my best guess would be over 100,000. A few regions have 15,000 prims or close, many have over 6000, and a few are practically empty. Scripts are very, very few — I'd be surprised if there were more than a dozen, but it's likely that there are even fewer than that.

The instance taking the most memory consumes 235 MB (objects) / 577 MB (process memory). The one taking the least 15 MB/ 151 MB (it's an instance with a single estate with two next-to-empty regions)

And yes, clearing caches works rather well — top/free show a reduction from 5.5 GB to just very slightly above 2 GB.

The DNS issue (lack of caching) is actually a good hint that the "problem" might not be OpenSim-related. It is reasonable to believe that a lot of buffers might be opened for pending DNS requests which are coming in very slowly. Similar things might be happening with establishing connections with the clients (HTTP or UDP) or even among simulators/instances/ROBUST. I now tend to believe this MIGHT be the case, because all the increases in performance I got didn't come from tweaking OpenSim, but at the operating system level...

I'm now trying to add Warp3D map tiles, which was reported to be a culprit in memory consumption elsewhere. Not all instances are using Warp3D but I'm rather curious about what happens. If you're right, this should show little increase in memory as shown by OpenSim, but might have "strange" effects at the OS level. It's worth experimenting!
(0023885)
kenvc (reporter)
2013-05-13 16:28

After a few memory leak issues have been fixed over the last month or so, I have not been seeing the slow increase in memory consumption that I was seeing before. It may be OK to close this issue and reopen if it starts up again... or create a new issue with new clues.
(0024347)
kenvc (reporter)
2013-09-14 08:21

Looks like when Bulletsim is enabled in the latest code, the problem is back, see mantis: http://opensimulator.org/mantis/view.php?id=6766
(0024546)
kenvc (reporter)
2013-10-20 06:33

If Robert was able to make such a drastic improvement in BulletSim's memory use by simply compressing the terrain data, it makes you wonder how much the memory consumption of OpenSim itself could be reduced by using similar techniques with data where appropriate in Opensim.

It might even be possible to get OpenSim's memory consumption back to what it was a couple years ago.
(0024551)
Robert Adams (administrator)
2013-10-20 22:07

The 'compression' I did was to enable the quantized collision search feature in Bullet's btBvhTriangleMeshShape based meshes. Bullet builds trees to make searching of large mesh shapes for collisions quicker. The old way built gigantic flat tables rather than hierarchical search tables. The quantized table feature is enabled in every Bullet use example I find so why I had it turned off is lost in the fog of the past.
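
In general terms, quantizing a search tree means storing bounding-box coordinates as small integers relative to the overall bounds instead of as full floating point values. A generic Python sketch of that idea; this is not Bullet's or BulletSim's code, and the 16-bit width and the 256 m extent are assumptions made for illustration:

```python
def quantize(value, lo, hi, bits=16):
    """Map value in [lo, hi] to an unsigned integer of the given bit width."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)


def dequantize(q, lo, hi, bits=16):
    """Recover an approximate value from its quantized form."""
    levels = (1 << bits) - 1
    return lo + q / levels * (hi - lo)


# Hypothetical extent of 256 m along one axis.
q = quantize(128.37, 0.0, 256.0)
print(q, dequantize(q, 0.0, 256.0))
```

Each coordinate then fits in two bytes at the cost of a small, bounded rounding error, which is where the memory saving in a scheme of this kind comes from.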

There are probably places in OpenSimulator that can shrink with careful massaging, but the mentioned fix to BulletSim is not an example of what can be done elsewhere.
(0025057)
kenvc (reporter)
2014-01-25 10:34

BulletSim is still using more memory than ODE, especially when used with the 32 bit launcher, but it is not that much more (about 25%) after Robert's last changes. I am going to consider this resolved, at least as well as it can be.

- Issue History
Date Modified Username Field Change
2012-05-23 12:14 kenvc New Issue
2012-05-23 12:19 usalabs Note Added: 0021495
2012-05-23 12:21 kenvc Note Added: 0021498
2012-05-23 12:28 Pixel Tomsen Note Added: 0021500
2012-05-23 12:29 kenvc Priority normal => high
2012-05-23 12:29 kenvc Severity minor => major
2012-05-23 12:29 kenvc Status new => confirmed
2012-05-23 12:31 Pixel Tomsen Note Edited: 0021500 View Revisions
2012-05-23 16:51 justincc Note Added: 0021518
2012-05-23 19:42 usalabs Note Added: 0021519
2012-05-23 19:44 usalabs Note Edited: 0021519 View Revisions
2012-05-23 19:45 usalabs Note Deleted: 0021519
2012-05-23 20:19 kenvc Note Added: 0021522
2012-05-24 14:31 aiaustin Note Added: 0021528
2012-05-25 16:26 WhiteStar Note Added: 0021529
2012-05-25 16:51 justincc Note Added: 0021530
2012-05-25 18:17 usalabs Note Added: 0021537
2012-05-25 18:25 usalabs Note Edited: 0021537 View Revisions
2012-05-25 18:33 usalabs Note Edited: 0021537 View Revisions
2012-05-25 18:51 usalabs Note Edited: 0021537 View Revisions
2012-05-25 18:51 usalabs Note Edited: 0021537 View Revisions
2012-05-25 22:28 kenvc Note Added: 0021557
2012-05-26 17:34 kenvc Note Added: 0021564
2012-05-27 21:05 kenvc Note Edited: 0021564 View Revisions
2012-05-28 03:44 Gwyneth Llewelyn Note Added: 0021566
2012-05-28 05:02 usalabs Note Added: 0021568
2012-05-28 05:03 usalabs Note Edited: 0021568 View Revisions
2012-05-28 05:06 usalabs Note Edited: 0021568 View Revisions
2012-05-28 05:07 usalabs Note Edited: 0021568 View Revisions
2012-05-28 05:09 melanie Note Added: 0021569
2012-05-28 05:11 usalabs Note Edited: 0021568 View Revisions
2012-05-28 05:16 usalabs Note Added: 0021570
2012-05-28 05:23 usalabs Note Edited: 0021570 View Revisions
2012-05-28 05:29 kenvc Steps to Reproduce Updated View Revisions
2012-05-28 05:35 usalabs Note Edited: 0021570 View Revisions
2012-05-28 05:35 usalabs Note Edited: 0021570 View Revisions
2012-05-28 05:54 usalabs Note Added: 0021571
2012-05-28 05:55 usalabs Note Edited: 0021571 View Revisions
2012-05-28 11:06 Gwyneth Llewelyn Note Added: 0021575
2012-05-28 11:25 kenvc Summary 'System.OutOfMemoryException' was thrown. Simulators appear to be using up much more memory than in the past. => 'System.OutOfMemoryException' - Simulators are consuming much more memory than in the past.
2012-05-28 11:50 usalabs Note Added: 0021576
2012-05-28 11:57 usalabs Note Edited: 0021576 View Revisions
2012-05-28 12:01 usalabs Note Edited: 0021576 View Revisions
2012-05-28 12:01 usalabs Note Edited: 0021576 View Revisions
2012-05-28 12:02 usalabs Note Edited: 0021576 View Revisions
2012-05-28 12:03 usalabs Note Edited: 0021576 View Revisions
2012-05-28 12:07 usalabs Note Edited: 0021576 View Revisions
2012-05-28 17:51 Gwyneth Llewelyn Note Added: 0021587
2012-05-28 18:33 Gwyneth Llewelyn Relationship added related to 0003353
2012-05-29 10:18 WhiteStar Note Added: 0021591
2012-05-29 12:13 Gwyneth Llewelyn Note Added: 0021593
2012-05-29 13:05 usalabs Note Added: 0021594
2012-05-29 13:05 usalabs Note Edited: 0021594 View Revisions
2012-05-29 13:06 usalabs Note Edited: 0021594 View Revisions
2012-05-29 13:15 usalabs Note Edited: 0021594 View Revisions
2012-05-30 03:32 usalabs Note Added: 0021595
2012-05-30 03:34 usalabs Note Edited: 0021595 View Revisions
2012-05-30 06:22 Gwyneth Llewelyn Note Added: 0021596
2012-05-30 07:08 usalabs Note Added: 0021597
2012-05-30 14:47 usalabs Note Added: 0021599
2012-05-31 02:44 Gwyneth Llewelyn Note Added: 0021601
2012-05-31 06:16 usalabs Note Added: 0021602
2012-05-31 06:21 usalabs Note Edited: 0021602 View Revisions
2012-05-31 11:20 kenvc Note Added: 0021603
2012-05-31 11:24 kenvc Note Edited: 0021603 View Revisions
2012-05-31 15:32 Gwyneth Llewelyn Note Added: 0021604
2012-05-31 16:12 Gwyneth Llewelyn Note Added: 0021606
2012-05-31 16:13 Gwyneth Llewelyn Relationship added related to 0006030
2012-05-31 17:48 Gwyneth Llewelyn Note Added: 0021608
2012-05-31 17:48 Gwyneth Llewelyn Note Edited: 0021608 View Revisions
2012-05-31 21:06 justincc Note Added: 0021610
2012-06-01 04:32 usalabs Note Added: 0021611
2012-06-01 04:37 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:38 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:39 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:39 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:41 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:42 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:46 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:48 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:48 usalabs Note Edited: 0021611 View Revisions
2012-06-01 04:51 usalabs Note Edited: 0021611 View Revisions
2012-06-01 09:32 BlueWall Note Added: 0021612
2012-06-05 20:20 justincc Note Added: 0021625
2012-06-18 17:23 kenvc Note Added: 0021669
2012-06-18 17:23 kenvc Priority high => urgent
2012-06-18 17:42 Gwyneth Llewelyn Note Added: 0021670
2012-06-21 01:28 antont Note Added: 0021672
2012-07-19 13:39 -rjs- Note Added: 0021818
2012-07-19 13:43 -rjs- Note Edited: 0021818 View Revisions
2012-07-19 14:41 -rjs- Note Added: 0021819
2012-07-19 16:31 usalabs Note Added: 0021841
2012-07-19 17:37 Gwyneth Llewelyn Note Added: 0021842
2012-07-19 19:50 -rjs- Note Added: 0021843
2012-07-22 16:22 kenvc Note Added: 0021863
2012-07-23 00:48 aiaustin Note Added: 0021864
2012-07-23 00:51 aiaustin Note Edited: 0021864 View Revisions
2012-07-23 00:51 aiaustin Note Edited: 0021864 View Revisions
2012-07-23 06:40 Mr_Peeved Note Added: 0021865
2012-07-23 06:44 justincc Note View State: 0021865: private
2012-07-23 06:46 justincc Note Deleted: 0021865
2012-07-23 06:47 justincc Note Added: 0021866
2012-07-23 07:34 kenvc Note Added: 0021868
2012-07-23 07:52 BlueWall Note Added: 0021869
2012-07-23 11:31 melanie Note Added: 0021871
2012-07-23 11:40 kenvc Note Added: 0021872
2012-07-23 12:15 kenvc File Added: Opensim.log
2012-07-23 12:18 kenvc Note Added: 0021873
2012-07-23 13:16 kenvc Note Added: 0021874
2012-07-23 14:14 kenvc Note Added: 0021875
2012-07-23 14:16 melanie Note Added: 0021876
2012-07-24 21:33 kenvc Note Added: 0021883
2012-07-24 23:33 usalabs Note Added: 0021884
2012-07-25 07:15 aiaustin Note Added: 0021895
2012-07-25 07:23 aiaustin Note Edited: 0021895 View Revisions
2012-07-25 08:29 usalabs Note Added: 0021896
2012-07-30 03:15 aiaustin Note Added: 0021934
2012-08-01 03:01 aiaustin Note Added: 0021953
2012-08-01 03:05 aiaustin Note Edited: 0021953 View Revisions
2012-08-01 08:20 kenvc Note Added: 0021958
2012-08-01 08:27 kenvc Note Edited: 0021958 View Revisions
2012-08-06 08:49 -rjs- Note Added: 0021983
2012-08-06 08:50 -rjs- Note Edited: 0021983 View Revisions
2012-08-06 08:53 -rjs- Note Edited: 0021983 View Revisions
2012-08-06 08:55 -rjs- Note Edited: 0021983 View Revisions
2012-08-06 08:59 -rjs- Note Edited: 0021983 View Revisions
2012-08-06 09:05 -rjs- Note Edited: 0021983 View Revisions
2012-08-07 09:26 Gwyneth Llewelyn Note Added: 0021984
2012-08-07 10:04 kenvc Note Added: 0021985
2012-08-08 07:15 -rjs- Note Added: 0021993
2012-08-08 08:46 usalabs Note Added: 0021998
2012-08-08 18:04 Gwyneth Llewelyn Note Added: 0022004
2012-08-08 18:35 Gwyneth Llewelyn Note Edited: 0022004 View Revisions
2012-08-08 22:44 kenvc Note Added: 0022005
2012-08-10 11:52 Gwyneth Llewelyn Note Added: 0022018
2012-08-11 18:21 Gwyneth Llewelyn Note Added: 0022025
2012-08-12 01:29 usalabs Note Added: 0022029
2012-08-14 05:17 Gwyneth Llewelyn Note Added: 0022046
2012-08-14 10:37 usalabs Note Added: 0022051
2012-08-14 11:18 Gwyneth Llewelyn Note Added: 0022053
2012-08-14 12:57 kenvc Note Added: 0022055
2012-08-14 13:01 justincc Note Added: 0022056
2012-08-15 04:53 Gwyneth Llewelyn Note Added: 0022080
2012-08-18 04:32 DMX04 Issue cloned: 0006198
2012-08-27 12:15 kenvc Relationship added related to 0006198
2012-11-18 23:19 kenvc Note Edited: 0022055 View Revisions
2013-05-13 16:28 kenvc Note Added: 0023885
2013-09-14 08:21 kenvc Note Added: 0024347
2013-10-20 06:33 kenvc Note Added: 0024546
2013-10-20 22:07 Robert Adams Note Added: 0024551
2014-01-25 10:34 kenvc Note Added: 0025057
2014-01-25 10:34 kenvc Status confirmed => resolved
2014-01-25 10:34 kenvc Fixed in Version => master (dev code)
2014-01-25 10:34 kenvc Resolution open => fixed
2014-01-25 10:34 kenvc Assigned To => kenvc
2014-02-07 15:21 user2213 Relationship added related to 0007002
2014-07-29 13:42 chi11ken Status resolved => closed

