[Opensim-dev] Large page kernel tweaks for reduced memory prefetch penalties in high performance computing applications on linux

Sun Jun 15 23:52:49 UTC 2008

Hi all,

I am happy that issues related to performance, efficiency and stability 
of OpenSim on various platforms are getting more attention recently and 
that many folks are looking into it.

Together with 3Di (Japan) we are looking into those issues too, and will
be more than happy to share our results and solutions in due time. We
are working on publishing a summary of our findings online and on adding
monitoring of certain aspects of OpenSim performance and memory
management to our nightly build system, so it will be easier for the
community to see what is working where, on which platforms and with
what results. I'll keep you updated on that.

As for large page support: this is a very architecture-specific issue.
Different Intel, AMD and SPARC MMUs support different page sizes, and
then various OSes can in turn be compiled with support for a specific
page size - so there is no single solution for every platform/OS
combination. The common denominator seems to be 4k and 8k pages, and
most architecture/OS combinations use one of those sizes by default.
The Solaris kernel can support multiple page sizes, although it is a
bit more tricky than it sounds, and for example on SPARC not
everything can be easily negotiated with the kernel. It all depends on
which architecture a given OS runs on.
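
To see what base page size a given machine is actually using, a small
check like the one below can help. This is only a minimal sketch in
Python (not OpenSim or Mono code); the function name describe_page_size
is made up for illustration, and it reports the default page size only,
not any large page sizes the MMU or kernel may additionally support.

```python
import mmap
import os
import platform


def describe_page_size():
    """Return the base page size in bytes plus the machine/OS it applies to."""
    try:
        # Available on Unix-like systems (Linux, Solaris, ...).
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # Portable fallback provided by the standard library.
        page_size = mmap.PAGESIZE
    return {
        "machine": platform.machine(),
        "system": platform.system(),
        "page_size_bytes": page_size,
    }


if __name__ == "__main__":
    info = describe_page_size()
    print("{system} on {machine}: base page size {page_size_bytes} bytes".format(**info))
```

Posting this value next to test results makes it clearer which
architecture/OS combination a given observation belongs to.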

Poor memory management and a large memory footprint are not going to be
solved by recompiling something with large page support - on the
contrary - the footprint and memory usage would most likely get much
worse. Large page support is usually for apps that do their own
memory management, and it usually increases the overall memory
footprint while improving performance. To really benefit from large
pages the software must take exclusive care of memory management itself.
And in the case of OpenSim that is a bit tricky. Let me explain. Normally
memory management (say, when programming in C) is difficult - one needs
to make lots of decisions and trade-offs between 3 general issues: a)
memory footprint b) performance c) maintainability. To really make
things fast and small one needs to re-implement the memory management
herself, and take care of things - this will put a strain on
maintainability but will keep both memory footprint and performance at
their best. What usually happens is that one uses standard libc to keep
maintainability high, and makes trade-offs - to be fast and big, or
slow and small.

With software running on VMs, there are many more levels to be
considered when managing memory:
a) application level
b) VM level
c) OS level
In the case of OpenSim, there is the following to consider:

1. Application level - object instance allocations and de-allocations,
array and collection management, hashes, large memory management,
database queries etc. On the application level lots of good things can
be done by "normal" C# programmers. This can dramatically boost
performance and reduce the memory footprint. For example,
re-implementing some of the collection classes usually gives good
results. Reducing the number of new objects created, and "recycling"
objects inside the application instead of creating new instances and
letting the system garbage collect the unused ones, can also
dramatically improve both performance and memory footprint. And so on -
good programming practices and taking care of memory usage and memory
management can make things really better, especially on systems running
on VMs. Even little things like avoiding boxing and using native data
types efficiently all contribute.

2. VM level (be it Mono or any other VM) - performance here can be
tweaked by many parameters, but the biggest contributor is the garbage
collector itself. Different garbage collectors have different ways of
managing memory, and these can substantially change the way
applications behave. From our limited experience to date, mono behaves
completely differently with different GCs - stability, performance and
footprint are all highly sensitive to the GC used. We are getting quite
good results when using the latest Boehm GC - but things can be tweaked
even further.

3. OS level - what I mean here is:
- the memory management left out and not handled by the app itself or
the VM, including large page support,
- I/O,
- thread management,
- IPC (especially shared memory),
- and networking.
These can all be tweaked. Normally they are designed to be generic and
to handle a wide range of cases and apps. For a system dedicated to
OpenSim, they can be tuned for that one purpose.
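
The object "recycling" mentioned under point 1 can be sketched roughly
as follows. This is an illustration in Python rather than C#, and it is
not taken from the OpenSim code base: the Vector3 and ObjectPool classes
and the max_idle parameter are hypothetical names chosen for the
example. The idea is simply that released objects are reset and handed
out again instead of being left for the garbage collector.

```python
class Vector3:
    """Small mutable object of the kind a simulator creates very often."""

    __slots__ = ("x", "y", "z")

    def __init__(self):
        self.reset()

    def reset(self):
        # Return the object to a clean state before it is reused.
        self.x = self.y = self.z = 0.0


class ObjectPool:
    """Hands out recycled instances instead of allocating new ones."""

    def __init__(self, factory, max_idle=1024):
        self._factory = factory    # callable that builds a fresh object
        self._idle = []            # released objects waiting for reuse
        self._max_idle = max_idle  # cap so the pool itself cannot grow forever
        self.created = 0           # number of real allocations performed

    def acquire(self):
        if self._idle:
            return self._idle.pop()
        self.created += 1
        return self._factory()

    def release(self, obj):
        obj.reset()
        if len(self._idle) < self._max_idle:
            self._idle.append(obj)
        # Otherwise the object is dropped and left to the garbage collector.


if __name__ == "__main__":
    pool = ObjectPool(Vector3)
    for _ in range(10000):
        v = pool.acquire()
        v.x = 1.0
        pool.release(v)
    print("objects created:", pool.created)
```

In this loop only one Vector3 is ever allocated, because each instance
is released before the next one is requested. The trade-off is the one
described above: the pool is extra code to maintain, and the idle list
holds memory that the collector could otherwise have reclaimed.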

This is all pretty complex. There is no silver bullet that will
magically make OpenSim run faster with a small memory footprint. But
there are many areas where improvements can be made, and it would be
desirable to have more targeted efforts towards that. Over here we are
trying to draw up a roadmap of all those various aspects, and I am
grateful for the good discussion and contributions from many people
that put things in perspective.

For any of you doing any testing - please take note of the exact
kernel, mono version and GC used, and post these together with your
results/observations. This will help in replicating some scenarios and
digging into the causes of various behaviours. For one thing, we were
unable to replicate most of the problems with Mono on our own Linux
setups. As for Solaris on SPARC, results are highly sensitive to the
exact version and GC used - we have seen everything from complete mono
crashes to the system running well, depending on various tweaks to the
compile parameters. As said earlier, we want to put a report together
to gather all of this in a single place, so others can compare it to
what they observe and so on. I'll keep you posted on that.
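
To make it easier to post those details in a consistent form, something
like the following could be run on the test machine. It is only a
sketch: the function names and the report layout are made up for this
example, it assumes the runtime is on the PATH as "mono" and that
"mono --version" prints a version banner, and it does not try to detect
the GC, which has to be filled in by hand.

```python
import platform
import shutil
import subprocess


def mono_version_line(executable="mono"):
    """Return the first line of `mono --version`, or a placeholder."""
    path = shutil.which(executable)
    if path is None:
        return "not found"
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        return "unknown"
    lines = result.stdout.splitlines()
    return lines[0].strip() if lines else "unknown"


def environment_report(gc_used="unknown"):
    """Collect the details worth posting alongside test results."""
    uname = platform.uname()
    return {
        "os": uname.system,
        "kernel": uname.release,
        "architecture": uname.machine,
        "mono": mono_version_line(),
        "gc": gc_used,  # supplied by the tester, not detected
    }


if __name__ == "__main__":
    for key, value in environment_report(gc_used="boehm").items():
        print("{}: {}".format(key, value))
```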

-- 
cheers
Mariusz

James Stallings II wrote:
> Greetings,
> 
> Included below is a transcript of a recent sunday morning discussion in re:
> the mono/large pages stuff that's recently appeared on the radar.
> 
> as you will see, it is really more of a kernel-tweaking issue, although the
> application does come into play in the way it requests memory. For our
> purposes, 'application' in that last sentence is mono, not opensim.
> 
> Hope this provides some insights :)
> 
> Cheers
> daTwitch
> 
> Oh, still researching how to take advantage of this end-to-end wrt our
> application. Will update as I uncover more information.
> 
> 
> 
> <daTwitch> this is somewhat relevant:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6664521
> <daTwitch> although I finf the placating sycophantic tone of the bug
> submitter makes me want to find him an emotional support group
> <nebadon> lol
> <daTwitch> the universe has surely reversed it's polarity; computer science
> (which is where I learned the term "egoless programming") is now saturated
> with sensitivity; and Fine Arts, once considered the most subjective subject
> under quanititative and qualititative analysis, is consumed with issues
> relating to process, review, and open, formal  critique
> <daTwitch> aat least, it was where I went to school lols
> <daTwitch> this is also relevant, if somewhat more out of date.
> Unfortunately, this looks almost identical to the things we're seeing, and
> given the age of the issue, and that we're still seeing it now, doesnt give
> me a lot of hope for getting the mono folk to take the problem on.
> <daTwitch>
> http://lists.ximian.com/pipermail/mono-list/2006-April/031312.html
> <daTwitch> although it is encouraging that the OpenSolaris folk claim to
> have fixed the problem with a patch to their O/S
> <daTwitch> maybe someone should investigate how this performs under
> opensolaris
> <daTwitch> The discussion of TLBs (translation buffers, which are crucial to
> page addressing in these memory models), in this article:
> http://lwn.net/Articles/173882/ suggests that some kernel optimizations on
> the server hardware in question can significantly improve the performance of
> memory accesses in general for a given program - if I read it right, it
> would indicate that we would need to build the correct optimizations into
> the kernel, then compile mono locally and link it as described
> <daTwitch> however, it may be that these effects would only be significant
> on 64bit O/S
> <daTwitch> that's about all I'm turning up of any significane
> <daTwitch> *significance
> <nebadon> hmm
> <nebadon> do you recall anything about compiling mono with --large page
> <nebadon> or large pages
> <nebadon> something like that
> <nebadon> someone was talking about it on -dev a while back
> <nebadon> they said it helped with memory stuff with mono
> <nebadon> i looked yesterday but couldnt find anything
> <nebadon> it wasnt  one of the regulars  on -dev channel though
> <daTwitch> that's what all the foregoing stuff is about
> <nebadon> they claimed it really helped alot
> <ckrinke> I dont see --large, but
> http://www.mono-project.com/Compiling_Mono has mention of a special Xen
> switch.
> <nebadon> at the time i was less interetsed in the topic though
> <nebadon> hmm
> <daTwitch> we were discussing it with JustinCC at the office hours y/d too
> <nebadon> yea
> <nebadon> i brought it up then
> <nebadon> i looked into it after the meeting
> <nebadon> and couldnt find anything
> <daTwitch> basically it comes down to this: the windows kernel allocates
> memory far differently than a unix kernel
> <daTwitch> and c#, as a result of being native to the platform, can take
> advantage of that to compress data as it does garbage collection
> <daTwitch> mono doesn't even try
> <nebadon>
> http://developer.amd.com/documentation/Articles/Pages/322006145.aspx
> <daTwitch> compress is the term used, but is not technically correct
> <nebadon> heres talk about its use in Java
> <daTwitch> imagine your large page as a hard disk sector in need of
> defragging
> <daTwitch> in fact, that is an incredibly accurate metaphor
> <daTwitch> windows defragments the data in memory
> <daTwitch> mono doesnt
> <nebadon> yea
> <nebadon> i recall them saying that mono
> <daTwitch> for the same reasons as a hard disk defrag and wit hsimilar
> benefits
> <nebadon> wastes the space  if because it requires more blocks that needed
> or something
> <nebadon> and lots of memory is wasted
> <daTwitch> yes
> <nebadon> unless large  pages is specified
> <daTwitch> precisely
> <daTwitch> ok, so we are long overdue making a mono with large pages then -
> would that be a valid assertion?
> <nebadon> yea
> <nebadon> id like it see it tested
> <nebadon> if we can figure out how
> <daTwitch> I'm sooooo on it
> <nebadon> sweet
> <daTwitch> I can build any thing
> <nebadon> great
> <daTwitch> as long as I have enough ram
> <nebadon> i think it will be a big help to see where it takes us
> <daTwitch> ok, I'll be busy for a bit
> <nebadon> k
> <nebadon> thanks man
> <daTwitch> I'll keep y'all posted
> <nebadon> great
> <ckrinke> maybe its a ./configure option and is something like
> --memory=large
> <daTwitch> quite possibly
> <nebadon> yea its something like that
> <nebadon> i wish i took notes
> <nebadon> but like i said at the time
> <nebadon> i was less interested
> <ckrinke> do the mono guys have an irc channel on FreeNode?
> <daTwitch> no idea
> <daTwitch> pulling source now
> <daTwitch> will see if I can locate their IRC channel
> <nebadon> cool
> <daTwitch> gimpnet servers only at irc.gnome.org and irc.gimp.net
> <daTwitch> #mono
> <daTwitch> #monodev
> <daTwitch> #mono-winforms
> <daTwitch> #monodevelop
> <daTwitch> #cocoa
> <daTwitch> #mono-hispano
> <daTwitch> #monouml
> <daTwitch> #gendarme
> <daTwitch> #mono-ally
> <daTwitch> #moonlight
> <daTwitch> moonlight == silverlight for mono
> <nebadon> nice
> <daTwitch> ok source is down, back to work
> <Ter_Afk> moonlight == loonmight?
> <daTwitch> heh
> <daTwitch> I dont even know what silverlight is, but I've heard discussion
> of it, so it was a point of interest
> <Ter_Afk> Microsoft's answer to Adobe Flash
> <daTwitch> ok, no mention whatsoever of a --large-pages option to the
> configuration
> <daTwitch> we have --large-heap
> <daTwitch> large_code
> <Ter_Afk> large_fire?
> <Ter_Afk> k, nuf with the word jokes.
> <daTwitch> does anyone know if it was large-pages, or large_pages?
> <nebadon> i dont recall
> <nebadon> i just remember the term large  pages being used some how
> <daTwitch> lol googling large pages turns up everything from beano to kirk
> douglas
> <nebadon> lol
> <nebadon> yea
> <nebadon> i had no luck on google
> <nebadon> nor the mono website
> <daTwitch> actually, I'm starting to think large_pages refers to a kernel
> setting
> <nebadon> well they said Compile Mono from source
> <nebadon> with the large pages switch
> <nebadon> i do remember that
> <nebadon> its probably related more to the compiler
> <nebadon> than mono
> <nebadon> so maybe we are looking in the wrong  places
> <daTwitch> hmmm
> <daTwitch> that's a clue
> <daTwitch> ok, I got configure to execute to completion very cleanly
> <daTwitch> gotta take 5 tho
> <daTwitch> bbiaf
> <nebadon> ok
> <daTwitch> ah needs mah gcc 4.2 doc
> <daTwitch> The Virtual Memory (VM) Subsystem
> <daTwitch> Most modern computer architectures support more than one memory
> page size. To illustrate, the IA-32 architecture supports either 4KB or 4MB
> pages. The 2.4 Linux kernel used to only utilize large pages for mapping the
> kernel image. In general, large page usage is primarily intended to provide
> performance improvements for high performance computing applications, as
> well as database applications that have large working sets. Any
> memory access intensive application that utilizes large
> amounts of virtual memory may obtain performance improvements by using large
> pages. Linux 2.6 can utilize 2MB or 4MB large pages, AIX uses 16MB large
> pages, whereas Solaris large pages are 4MB in size. The large page
> performance improvements are attributable to reduced translation lookaside
> buffer (TLB) misses. Large pages further improve the process of memory
> prefetching, by eliminating the necessity to restart prefetch
> operations on 4KB boundaries.
> <daTwitch> from: http://aplawrence.com/Linux/linux26_features.html
> <daTwitch> it's a feature that must have support in the kernel, at the very
> least
> <daTwitch> though I can find neither build-time nor runtime configuration
> points that take advantage of it in either gcc nor mono at this point
> <nebadon> hmm
> <daTwitch> still looking though ;)
> <nebadon> sounds like the problem though
> <daTwitch> yes, think we are in the process of pinning it down
> <nebadon> nice
> <daTwitch> even if we arent doing things to precisely duplicate how things
> go under c#, this should yield a performancve gain that compensates
> <daTwitch> I keep seeing the figure 10%
> <nebadon> yea thats a good start
> <daTwitch> that is significant when we consider how much we pay in memory
> per-av
> <daTwitch> here is some additional good background info, but still does not
> complete the picture:
> http://findarticles.com/p/articles/mi_m0ISJ/is_2_44/ai_n14793331/pg_10
> <daTwitch> mysql can also benefit heavily from the use of large pages
> <daTwitch> combining the benefits of mysql on large pages with our various
> servers on large pages (actually, the UGAIM could possibly take a
> performance *hit* from large pages) might yield even greater than 10%
> performance increase
> <daTwitch> probably the large pages switch to start with is a kernel
> boot-time config point
> <nebadon> nice
> <nebadon> i would think though a program like mysql would already be
> compiled to such a thing
> <daTwitch> well, no, not necesarily
> <nebadon> so the goal i assume
> <nebadon> is 4mb page size?
> <nebadon> vs 4k
> <daTwitch> the underlying kernel has to be configured to support it, and if
> the application isn't sufficiently demanding, it actually will take a
> performance hit
> <daTwitch> yes
> <daTwitch> 4m. I think 16mb is also supported in 2.6+ kernels, but I doubt we
> need it yet
> <nebadon> yea it sounds to me like any kernel thats 2.6 its already enabled?
> <nebadon> but the app needs to be told to use it?
> <nebadon> its amazing how useless google is for this topic
> <nebadon> hehe
> <daTwitch> well, it's a bit obscure, unless you know what you're looking for
> <daTwitch> this is really about kernel tweaking, not so much mono
> <daTwitch> the kernel needs to be told to support it at boot time - perhaps
> even needs to be compiled for it
> <daTwitch> but the support is in the source
> <daTwitch> plus, not too many folks need to do this
> <daTwitch> only high perf types with really demanding software
> <daTwitch> (that would be us lols)
> <daTwitch> the app does have to be told to utilise it somehow though
> <nebadon> yea
> <nebadon> opensim is definatly more demanding than say apache
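
As the discussion above concludes, large pages are first of all a kernel
matter, so a reasonable first step is to check what the running Linux
kernel reports. The sketch below reads the huge page counters that
Linux exposes in /proc/meminfo (the HugePages_* and Hugepagesize
lines), and also shows the arithmetic behind the TLB argument in the
quoted article: the memory one TLB can cover is its number of entries
times the page size. The function names are invented for this example,
and the figure of 64 TLB entries is a hypothetical value, not a
measurement of any particular CPU.

```python
import re


def parse_hugepage_info(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text."""
    info = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"^(HugePages_\w+|Hugepagesize):\s+(\d+)", line)
        if match:
            # Hugepagesize is reported in kB; the others are page counts.
            info[match.group(1)] = int(match.group(2))
    return info


def read_hugepage_info(path="/proc/meminfo"):
    """Return the counters, or an empty dict if the file is not readable."""
    try:
        with open(path) as handle:
            return parse_hugepage_info(handle.read())
    except OSError:
        return {}


def tlb_reach_bytes(tlb_entries, page_size_bytes):
    """Memory addressable without a TLB miss: entries times page size."""
    return tlb_entries * page_size_bytes


if __name__ == "__main__":
    print("huge page info:", read_hugepage_info())
    entries = 64  # hypothetical TLB size, for illustration only
    for page_size in (4 * 1024, 4 * 1024 * 1024):
        reach = tlb_reach_bytes(entries, page_size)
        print("page size {} bytes -> TLB reach {} bytes".format(page_size, reach))
```

An empty result, or HugePages_Total equal to 0, suggests that no huge
pages are currently reserved on that kernel. With the assumed 64
entries, 4KB pages cover 256KB and 4MB pages cover 256MB.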
> 
> 