Feature Proposals/BulletSim OpenCL

= Introduction and Problem Statement =

Right now, physical object movement and collision consume significant CPU time in OpenSimulator. Several popular sims have either disabled physics completely or restricted its use to the bare minimum. Use cases that call for lots of physical objects colliding in interesting ways will quickly peg the CPU, which lowers simulator FPS, introduces time dilation, and increases network latency. This is observable even on high-end, current-generation single-processor systems (e.g. Core i7 4770K), on any platform. Even on hardware that can easily handle a large number of objects, this problem negatively impacts region "density" (how many regions you can fit on a single server).

Physics-based simulations and interactions can continue to be optimized on the CPU by switching from the OpenDynamicsEngine physics backend to the Bullet Physics Engine, because Bullet is significantly faster and better optimized than ODE. However, even Bullet has an upper limit on the CPU. Furthermore, there is a tradeoff between accuracy/precision and speed, as depicted in the following table. The point is that by raising our computational ceiling, we can achieve better precision, larger scale (more physical objects, or more complex ones), or some tradeoff that improves both to a lesser degree. The other point is that GPU acceleration is currently the most cost-effective way to raise the performance ceiling (can you afford a supercomputer?).

Caveat: I am well aware that the estimates of scale below for the number of prims are potentially inaccurate by several orders of magnitude up or down. These are rough estimates. I am also aware that different types of physics interactions and different shapes have vastly different computational costs; for instance, it is much easier to calculate a collision between two cubes than between two tori. The general point, however, remains valid.

= Proposal =

Since we already have a physics backend that uses the Bullet Physics Engine, and since Bullet upstream is itself developing a GPU-accelerated physics pipeline, "all" we have to do is take advantage of that pipeline in our code. A successful implementation should bring reduced CPU usage, the possibility of increased region density, or the ability to remove restrictions on tenants such as "make sure you have no more than 10 physical objects at any time" (for example).

== General Observations ==

 * In recent years, both Intel and AMD have started shipping desktop and server CPUs which contain an IGPU (Integrated Graphics Processing Unit).
 * The capabilities of IGPUs have been improving rapidly; in fact, they have been improving at a much higher rate than the CPU part of the chip. GPU performance is still roughly following Moore's Law, while CPU performance is leveling off.
 * Many computers running OpenSimulator, whether a spare desktop in someone's home or a dedicated server in a datacenter, have either an IGPU or a DGPU (Discrete Graphics Processing Unit) which is, to a greater or lesser extent, underutilized -- meaning that the resources are available, but are sitting entirely or mostly idle.
 * Graphics drivers have advanced significantly, to the point that IGPUs and DGPUs from AMD, Intel and Nvidia have OpenCL 1.1 implementations available on the major platforms (Windows, Mac, and Linux).
 * The Bullet Physics Engine development community is gradually shifting its focus towards improving and optimizing Bullet for GPUs, and towards accelerating more physics operations on the GPU rather than the CPU.
 * While Bullet supports DirectX and CUDA to various extents (these are also APIs to access the GPU), it also supports OpenCL. OpenCL is one of the few, if not the only, industry-standard general-purpose GPU computing languages available on all the platforms that OpenSimulator officially supports (Windows, Mac and Linux) and on all the major GPU vendors' hardware (Intel, AMD, and Nvidia).

== Conclusion ==
Enabling GPU-accelerated Bullet physics via OpenCL is the obvious path forward to unlocking the next level of scalability and/or precision of physics simulations in OpenSimulator.

= Scope of Work =

== Brainstorming Questions ==
If you can answer any of these questions, please edit this page and fill in an answer beneath each question!


 * 1) Let's assume that we're going to target the latest Bullet code from version control, and evolve our code to match as Bullet evolves. We should do this at least until Bullet 3.0 is released as stable, because major improvements to the OpenCL rigid body pipeline are available in git that are not in the latest stable release as of this writing.
 * 2) What parts of Bullet are currently GPU-accelerated in the 3.x preview codebase?
 * 3) Of the parts of Bullet that are GPU-accelerated, does OpenSimulator use any of them?
 * 4) What configuration or API usage changes are required of OpenSimulator's use of Bullet in order to use the GPU-accelerated features?
 * 5) Can we simply enable Bullet's GPU acceleration by changing a configuration setting, or an initialization flag?
 * 6) Do we have to use entire new classes in the Bullet C++ API to use the GPU-accelerated features?
 * 7) Do we have to change our entire approach to using Bullet in the BulletSim physics backend to use the GPU-accelerated features?
 * 8) What hardware and software configurations should we test on?
 * 9) What is the minimum hardware specification that would actually yield a performance improvement over the CPU pipeline?
 * 10) Even if we successfully accelerate Bullet to a high degree, are we still bottlenecked by disk, memory bandwidth, locking primitives in OpenSimulator, or the .NET runtime, inhibiting our scalability past a certain point? If so, how far can we go before we hit this wall? Does GPU acceleration buy us anything, or are we already against that wall with the capabilities of CPU-based physics?
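
One way to frame question 10 is Amdahl's law: if physics takes a fraction p of each simulator frame and the GPU makes that part s times faster, the whole frame speeds up by 1 / ((1 - p) + p / s). The sketch below just evaluates that formula. The function names and the fractions in the loop are illustrative assumptions, not measurements of OpenSimulator.

```python
def overall_speedup(physics_fraction, physics_speedup):
    """Amdahl's law: whole-frame speedup when only the physics part gets faster."""
    if not 0.0 <= physics_fraction <= 1.0:
        raise ValueError("physics_fraction must be between 0 and 1")
    if physics_speedup <= 0.0:
        raise ValueError("physics_speedup must be positive")
    remaining = (1.0 - physics_fraction) + physics_fraction / physics_speedup
    return 1.0 / remaining


def speedup_ceiling(physics_fraction):
    """Best possible whole-frame speedup, i.e. an infinitely fast GPU."""
    if physics_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - physics_fraction)


if __name__ == "__main__":
    # Hypothetical shares of frame time spent in physics, and a made-up
    # 10x GPU speedup for that share.
    for fraction in (0.2, 0.5, 0.9):
        gain = overall_speedup(fraction, 10.0)
        limit = speedup_ceiling(fraction)
        print(f"physics={fraction:.0%} gain={gain:.2f}x ceiling={limit:.2f}x")
```

If physics is only half of the frame time, even an infinitely fast GPU can at most double the frame rate, so profiling where frame time actually goes is a prerequisite for answering this question.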

== Advantages and Challenges ==
Advantages:
 * Improved performance.
 * Reduced CPU usage.
 * Ability to use more physical objects with the same performance goals.
 * Unlock a new level of complexity and dynamicity in OpenSimulator (although other components, such as users' graphics cards, may take time to catch up).

Challenges:
 * A lot of existing hardware hosting OpenSimulator does not have any GPU at all.
 * Acquiring a dedicated GPU for a server that does not have one can be very expensive, especially if you have to custom request one from your hosting provider.
 * OpenCL support in open-source Linux graphics drivers is still at a very early stage (as of September 2013).

== Implementation Ideas ==

 * 1) We should have a config file option in OpenSim.ini under a BulletSim heading for whether or not to enable OpenCL. Trying to be smart and autodetect might lead to problems, such as a user with an unstable graphics driver accidentally crashing their system (and then if OpenSim starts on boot, that's REALLY bad because their system crashes during the boot procedure!). We absolutely must have a way for the user to enable or disable OpenCL support, and leave it disabled by default out of the box.
 * 2) We should not throw away our existing code that uses the CPU paths of Bullet. The Bullet non-OpenCL paths are probably faster running on the CPU than a CPU implementation of OpenCL running the OpenCL paths of Bullet, simply because a layer of complexity is removed.
 * 3) If Bullet doesn't already do this, we should refuse to run OpenCL on the CPU except for testing and debugging purposes (maybe a compile flag).
 * 4) We should look at Bullet developer and user blogs, wikis, code comments, etc. and attempt to figure out which parts of the GPU physics pipeline provide the greatest performance improvement over the CPU and run best on GPUs, exploit those to the best of our ability, and avoid or minimize the use of features that the GPU does not handle as well.
 * 5) We should investigate the possibility of continuing to use the CPU for physics, especially if there are certain interactions or edge cases that are horribly slow on a GPU, or not implemented in Bullet's GPU pipeline. Better yet, we could develop some sort of physics scheduler that first attempts to max out the GPU and, if there are still physics calculations to be done, pushes the remainder onto the CPU until the CPU is also saturated. On high-end servers, the CPUs do indeed allow a non-trivial amount of physics to be done, so we needn't leave it all up to the GPU.
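
The ideas above can be sketched as three small pieces of logic: an opt-in setting that defaults to off, a pipeline choice that never uses a CPU OpenCL device unless explicitly allowed, and a scheduler that fills the GPU before the CPU. This is only a rough Python illustration, not BulletSim code. The option name `UseOpenCL`, the pipeline labels, the capacity numbers, and the use of Python's `configparser` as a stand-in for OpenSimulator's own INI handling are all assumptions made for the example.

```python
import configparser

# Hypothetical option name. The proposal only asks for "a config file option
# in OpenSim.ini under a BulletSim heading"; it is commented out here, so
# OpenCL stays disabled out of the box.
SAMPLE_INI = """
[BulletSim]
; UseOpenCL = true
"""


def opencl_enabled(ini_text):
    """Return the opt-in flag; a missing section or option means disabled."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getboolean("BulletSim", "UseOpenCL", fallback=False)


def choose_pipeline(enabled, device_types, allow_cpu_opencl=False):
    """Pick a pipeline label from the setting and the available devices.

    device_types is a list of strings such as "gpu" or "cpu", standing in
    for whatever the OpenCL runtime would report.
    """
    if not enabled:
        return "cpu-native"       # existing Bullet CPU path
    if "gpu" in device_types:
        return "gpu-opencl"
    if allow_cpu_opencl and "cpu" in device_types:
        return "cpu-opencl"       # testing and debugging only
    return "cpu-native"           # fall back rather than run OpenCL on the CPU


def split_bodies(body_count, gpu_capacity, cpu_capacity):
    """Fill the GPU first, then the CPU; also return what fits on neither."""
    on_gpu = min(body_count, gpu_capacity)
    on_cpu = min(body_count - on_gpu, cpu_capacity)
    leftover = body_count - on_gpu - on_cpu
    return on_gpu, on_cpu, leftover


if __name__ == "__main__":
    enabled = opencl_enabled(SAMPLE_INI)
    print(choose_pipeline(enabled, ["gpu", "cpu"]))
    # Made-up per-frame capacities, purely to show the overflow behaviour.
    print(split_bodies(5000, gpu_capacity=4000, cpu_capacity=500))
```

A real scheduler would have to split work by interaction groups rather than by a simple body count, since objects that can collide with each other presumably need to be simulated in the same pipeline.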

= GPU OpenCL Support =

Here is a table describing the current state of GPU-accelerated OpenCL support on various platforms and hardware. I arbitrarily define the following terms:


 * "Windows" as "Windows Vista SP2 or later, 32 or 64-bit";
 * "Mac" as "Mac OS X 10.7 or later";
 * "Linux Proprietary" as "Linux 2.6.32 or later on RHEL 6+, Debian 6+, Ubuntu 12.04+, Fedora 17+, OpenSUSE 12.1+, with a closed-source vendor-supplied GPU kernel module";
 * "Linux FOSS" as "Linux 3.10 or later on the latest stable Fedora, Ubuntu, or OpenSUSE, or Debian Testing, with an open-source Mesa or Gallium3D graphics driver".