Feature Proposals/BulletSim OpenCL
| Basic Information | |
|---|---|
| Proposal Name | BulletSim GPU Acceleration Using OpenCL |
| Date Proposed | September 10, 2013 |
| Status | Draft |
| Proposer | User:Allquixotic |
Introduction and Problem Statement
Right now, physical object movement and collision consume significant CPU time in OpenSimulator. Several popular sims have either disabled physics completely or restricted its use to the bare minimum. Deployments that want to use many physical objects, collide them together in interesting ways, etc. will quickly peg the CPU, which lowers simulator FPS, introduces time dilation, and increases network latency. This is observable even on high-end, current-generation single-processor systems (e.g. a Core i7 4770K), on any platform. Even on hardware that can easily handle a large number of objects, this problem negatively impacts region "density" (how many regions you can fit on a single server).
Physics-based simulations and interactions may continue to be optimized on the CPU by switching from the OpenDynamicsEngine (ODE) physics backend to the Bullet Physics Engine, because Bullet is significantly faster and better optimized than ODE. However, even Bullet has an upper limit of capabilities on the CPU. Furthermore, there is a tradeoff to be made between accuracy/precision and speed, as depicted in the following table. The point is that by raising our computational ceiling, we can achieve better precision, larger scale (more, or more complex, physical objects), or some tradeoff improving both aspects to a lesser degree. The other point is that GPU acceleration is currently the most cost-effective way to raise the performance ceiling (can you afford a supercomputer?).
Caveat: I am well aware that the estimates of scale below for the number of prims are potentially inaccurate by several orders of magnitude in either direction. These are rough estimates. I am also aware that different types of physics interactions and different shapes have vastly different computational costs; for instance, it is much easier to calculate the collision between two cubes than between two tori. The general point, however, remains valid.
Physics Tradeoffs

| Precision | # Physical Prims | Performance On Commodity Hardware | Hardware Class For Acceptable Performance |
|---|---|---|---|
| Baseline: Yay, we have physics! ...somewhat. | | | |
| Poor | Few (10s) | Excellent | Laptop, netbook, or embedded |
| Typical (CPU-only BulletSim or ODE on desktops, single-CPU servers, etc.) | | | |
| Good | Few (10s) | Acceptable | Commodity desktop or small server |
| Poor | Many (hundreds or thousands) | Acceptable | Commodity desktop or small server |
| Not currently possible for most individuals and small businesses | | | |
| Good | Many (hundreds or thousands) | Poor | Large server (multi-CPU) OR GPU-accelerated |
| Poor | Hundreds of thousands or millions | Poor | Large server (multi-CPU) OR GPU-accelerated |
| Possible in the future with GPU acceleration | | | |
| Extreme | Many (hundreds or thousands) | Awful | High-end GPU or multiple GPUs; impractical with non-supercomputer CPUs |
| Good | Hundreds of thousands or millions | Awful | High-end GPU or multiple GPUs; impractical with non-supercomputer CPUs |
Proposal
Since we already have a physics backend that uses the Bullet Physics Engine, and since Bullet upstream is itself developing a GPU-accelerated physics pipeline, "all" we have to do is take advantage of that pipeline in our code. A successful implementation should see reduced CPU usage, the possibility of increased region density, or the ability to lift restrictions on tenants such as "make sure you have no more than 10 physical objects at any time."
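As a rough illustration of what the integration point might look like on the native (C++) side of BulletSim, here is a minimal sketch that uses only the standard OpenCL 1.1 host API to pick a GPU device and create the context and command queue that a GPU physics pipeline would consume. The Bullet 3.x preview classes that would actually receive these handles (for example, b3GpuNarrowPhase and b3GpuRigidBodyPipeline in the bullet3 repository) are only mentioned in comments, because their exact constructor signatures should be taken from the Bullet GPU demos rather than assumed here.

```cpp
// Sketch: select an OpenCL GPU device and create the context and command queue
// that a GPU-accelerated Bullet pipeline would consume. Error handling is minimal.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <cstdio>

int main() {
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    if (clGetPlatformIDs(8, platforms, &numPlatforms) != CL_SUCCESS || numPlatforms == 0) {
        std::fprintf(stderr, "No OpenCL platforms found; stay on the CPU physics pipeline.\n");
        return 1;
    }
    if (numPlatforms > 8) numPlatforms = 8;

    // Take the first platform that exposes a GPU device.
    cl_device_id device = nullptr;
    for (cl_uint i = 0; i < numPlatforms && !device; ++i)
        if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS)
            device = nullptr;
    if (!device) {
        std::fprintf(stderr, "No OpenCL GPU device found; stay on the CPU physics pipeline.\n");
        return 1;
    }

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // In a real BulletSim integration, ctx/device/queue would be handed to the
    // Bullet 3.x preview GPU classes (e.g. b3GpuNarrowPhase and
    // b3GpuRigidBodyPipeline in the bullet3 repository). Their constructor
    // arguments should be copied from the Bullet GPU demos, not assumed.

    char name[256] = {0};
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    std::printf("GPU physics would run on: %s\n", name);

    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```

Whether this device selection lives in the unmanaged BulletSim glue library, and whether it is exposed through an OpenSim.ini setting, are exactly the kinds of questions raised under Brainstorming below.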
General Observations
- In recent years, both Intel and AMD have started shipping desktop and server CPUs which contain an IGPU (Integrated Graphics Processing Unit).
- The capabilities of IGPUs have been improving rapidly -- in fact, much faster than the CPU portion of the same chip. GPU performance is still roughly following Moore's Law, while CPU performance gains are leveling off dramatically.
- Many computers running OpenSimulator, whether a spare desktop in someone's home or a dedicated server in a datacenter, have either an IGPU or a DGPU (Discrete Graphics Processing Unit) which is, to a greater or lesser extent, underutilized -- meaning that the resources are available, but are sitting entirely or mostly idle.
- The state of the art of graphics drivers has advanced significantly, to the point that IGPUs and DGPUs from AMD, Intel, and Nvidia have OpenCL 1.1 implementations available on all major platforms (Windows, Mac, and Linux).
- The Bullet Physics Engine development community is gradually shifting its focus toward improving and optimizing Bullet for GPUs, with more and more physics operations accelerated on the GPU rather than the CPU.
- While Bullet supports DirectX and CUDA to various extents (these are also APIs for accessing the GPU), it also supports OpenCL. OpenCL is arguably the only industry-standard, general-purpose GPU computing framework available on all the platforms OpenSimulator officially supports (Windows, Mac, and Linux) and on hardware from all the major GPU vendors (Intel, AMD, and Nvidia).
Conclusion
Enabling GPU-accelerated Bullet physics via OpenCL is the obvious path forward to unlocking the next level of scalability and/or precision of physics simulations in OpenSimulator.
Scope of Work
Brainstorming
If you can answer any of these questions, please edit this page and fill in an answer beneath each question!
- Let's assume that we're going to target the latest Bullet code from version control, and evolve our code to match as Bullet evolves. We should do this at least until Bullet 3.0 is released as stable, because major improvements to the OpenCL rigid body pipeline are available in git that are not in the latest stable release as of this writing.
- What parts of Bullet are currently GPU-accelerated in the 3.x preview codebase?
- Of the parts of Bullet that are GPU-accelerated, does OpenSimulator use any of them?
- What configuration or API usage changes are required of OpenSimulator's use of Bullet in order to use the GPU-accelerated features?
    - Can we simply enable Bullet's GPU acceleration by changing a configuration setting, or an initialization flag?
    - Do we have to use entirely new classes in the Bullet C++ API to use the GPU-accelerated features?
    - Do we have to change our entire approach to using Bullet in the BulletSim physics backend to use the GPU-accelerated features?
- What hardware and software configurations should we test on?
- What is the minimum hardware specification that would actually yield a performance improvement over the CPU pipeline?
- Even if we successfully accelerate Bullet to a high degree, are we still bottlenecked by disk, memory bandwidth, locking primitives in OpenSimulator, or the .NET runtime, inhibiting our scalability past a certain point? If so, how far can we go before we hit this wall? Does GPU acceleration buy us anything, or are we already against that wall with the capabilities of CPU-based physics? (A baseline timing sketch follows this list.)
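For the last question in particular, it is worth measuring a CPU-only baseline with the stable Bullet 2.x API before any GPU work begins. The sketch below is one way to do that; the scene, body count, and step count are arbitrary choices for illustration, not recommendations. Comparing the measured per-step cost on target hardware against the simulator's frame budget gives a first indication of how close CPU-based physics already is to the wall.

```cpp
// Sketch: time a fixed number of simulation steps with the stable Bullet 2.x
// CPU pipeline, to establish a baseline before comparing against a GPU build.
#include <btBulletDynamicsCommon.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Standard CPU-side Bullet world setup.
    btDefaultCollisionConfiguration config;
    btCollisionDispatcher dispatcher(&config);
    btDbvtBroadphase broadphase;
    btSequentialImpulseConstraintSolver solver;
    btDiscreteDynamicsWorld world(&dispatcher, &broadphase, &solver, &config);
    world.setGravity(btVector3(0, -9.8f, 0));

    // Static ground plane at y = 0.
    btStaticPlaneShape groundShape(btVector3(0, 1, 0), 0);
    btDefaultMotionState groundState;
    btRigidBody ground(0.0f, &groundState, &groundShape);
    world.addRigidBody(&ground);

    // Drop N unit boxes in a loose grid; N is an arbitrary test size.
    const int N = 2000;
    btBoxShape boxShape(btVector3(0.5f, 0.5f, 0.5f));
    btVector3 inertia(0, 0, 0);
    boxShape.calculateLocalInertia(1.0f, inertia);
    std::vector<btDefaultMotionState*> states;
    std::vector<btRigidBody*> bodies;
    for (int i = 0; i < N; ++i) {
        btTransform t;
        t.setIdentity();
        t.setOrigin(btVector3((i % 50) * 1.1f, 2.0f + (i / 50) * 1.1f, (i % 7) * 1.1f));
        states.push_back(new btDefaultMotionState(t));
        bodies.push_back(new btRigidBody(1.0f, states.back(), &boxShape, inertia));
        world.addRigidBody(bodies.back());
    }

    // Time 600 fixed steps at 60 Hz (10 simulated seconds).
    const int steps = 600;
    auto start = std::chrono::steady_clock::now();
    for (int s = 0; s < steps; ++s)
        world.stepSimulation(1.0f / 60.0f, 1);
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start).count();
    std::printf("%d bodies, %d steps: %.1f ms total, %.3f ms/step\n",
                N, steps, ms, ms / steps);

    // Remove the stack-allocated body before it goes out of scope; teardown of
    // the heap-allocated bodies is omitted to keep the sketch short.
    world.removeRigidBody(&ground);
    return 0;
}
```

If the per-step time already exceeds the frame budget at prim counts a region realistically needs, the CPU wall is real and GPU acceleration has something to buy us; if not, the bottleneck probably lies elsewhere.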
GPU OpenCL Support
The table below describes the current state of GPU-accelerated OpenCL support on various platforms and hardware (a runtime check sketch follows the table). For its purposes, I arbitrarily define the following:
- "Windows" as "Windows Vista SP2 or later, 32 or 64-bit";
- "Mac" as "Mac OS X 10.7 or later";
- "Linux Proprietary" as "Linux 2.6.32 or later on RHEL 6+, Debian 6+, Ubuntu 12.04+, Fedora 17+, OpenSUSE 12.1+, with a closed-source vendor-supplied GPU kernel module";
- "Linux FOSS" as "Linux 3.10 or later on the latest stable Fedora, Ubuntu, or OpenSUSE, or Debian Testing, with an open-source, Mesa or Gallium3d graphics driver".
| Vendor | Minimum Hardware Generation | Windows | Mac | Linux Proprietary | Linux FOSS |
|---|---|---|---|---|---|
| AMD | Radeon HD5000 or later; FirePro W600 or later | Fully Supported | Fully Supported | Fully Supported | Experimental |
| Nvidia | GeForce 400 Series or later; Quadro 400 Series or later; Tesla | Fully Supported | Fully Supported | Fully Supported | Experimental |
| Intel | "Ivy Bridge" Desktop/Mobile (Core i3/i5/i7 3xxx) or later; "Ivy Bridge" Server (Xeon 1275v2) or later; "Knights Corner" (Xeon Phi) or later; excluding enthusiast and multi-CPU chips (Ivy-E, etc.) | Fully Supported | Fully Supported | N/A | Experimental |
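The table above reflects driver support at the time of writing; the simplest way to confirm what a particular machine actually exposes is to ask the OpenCL runtime directly. Below is a minimal sketch (standard OpenCL 1.1 host API only, no Bullet dependencies, compiled and linked against a vendor OpenCL SDK) that lists every installed platform and device along with the OpenCL version each reports.

```cpp
// Sketch: survey every OpenCL platform and device on a machine and report the
// version strings, to check which row of the support table it actually matches.
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <cstdio>

int main() {
    cl_platform_id platforms[8];
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(8, platforms, &numPlatforms);
    if (numPlatforms > 8) numPlatforms = 8;
    if (numPlatforms == 0) {
        std::printf("No OpenCL platforms installed.\n");
        return 0;
    }
    for (cl_uint p = 0; p < numPlatforms; ++p) {
        char pname[256] = {0}, pver[256] = {0};
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);
        clGetPlatformInfo(platforms[p], CL_PLATFORM_VERSION, sizeof(pver), pver, nullptr);
        std::printf("Platform: %s (%s)\n", pname, pver);

        cl_device_id devices[8];
        cl_uint numDevices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);
        if (numDevices > 8) numDevices = 8;
        for (cl_uint d = 0; d < numDevices; ++d) {
            char dname[256] = {0}, dver[256] = {0};
            cl_device_type type = 0;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
            clGetDeviceInfo(devices[d], CL_DEVICE_VERSION, sizeof(dver), dver, nullptr);
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
            std::printf("  %s device: %s (%s)\n",
                        (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "non-GPU", dname, dver);
        }
    }
    return 0;
}
```

If no GPU device shows up, or the reported version is below OpenCL 1.1, the machine falls outside the support matrix above and BulletSim would need to stay on the CPU pipeline there.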