Golden Hammer Software: 100k animated triangles at 30fps on iPhone

I've been optimizing OverPowered all week. I've managed to more than double the number of bad guys we can support, and I removed a 10 MB memory spike on load that was triggering low-memory warnings. Here is some of what I picked up in the process.

The strategy: Find the bottlenecks, kill the bottlenecks.

The iPhone effectively has two processors, the CPU and the GPU. If one of those is eating up all the time, then it's not worth optimizing the other. OverPowered is an action game, so it's important to keep the framerate at 30fps or above. At the start of the week we were overwhelmingly GPU bottlenecked at around 20k animated triangles per frame. I found enough fixes that it's now more worthwhile to optimize the CPU than to keep shrinking the GPU cost.

I used 4 tools this week.

An onscreen FPS counter with number of triangles drawn. This is the only really definitive way to know how fast your game is running, but it's extremely low resolution. The iPhone is vsync locked, so if you are at 30fps it will take a huge change to make it display anything else.
An in-game timer. I wrap timer calls around various functions to measure their real cost. This is a very useful way to get a high level view of where your frame time is going with a high degree of confidence. It can be run in release without much profiling overhead to skew the results. I have it spitting out the frame breakdown every time I exit a level.
Instruments: CPU sampler. This is a fairly lightweight sampling profiler. As long as you sanity check the results with the in-game timer it can be used to get a higher resolution view of bottlenecks.
Instruments: Allocations. This tool is absolutely awesome for telling you where your memory is being spent. All platforms should have a tool like this.
I did not use Shark. This can give you a better view than the CPU sampler, but it's much heavier weight. It takes longer to get results and try out changes. It's good if you have a specific set of functions that you really want to optimize at a low level.
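The in-game timer is the easiest of these to reproduce. Here is a minimal C++ sketch of the idea, not our actual code: `FrameProfiler` and `ScopedTimer` are hypothetical names, and the sketch assumes wall-clock time from `std::chrono::steady_clock` is good enough for a high level breakdown.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: accumulates wall-clock time and call counts per named section.
class FrameProfiler {
public:
    void add(const std::string& name, double seconds) {
        assert(seconds >= 0.0);
        totals_[name] += seconds;
        ++calls_[name];
    }
    double totalSeconds(const std::string& name) const {
        auto it = totals_.find(name);
        return it == totals_.end() ? 0.0 : it->second;
    }
    int callCount(const std::string& name) const {
        auto it = calls_.find(name);
        return it == calls_.end() ? 0 : it->second;
    }
private:
    std::map<std::string, double> totals_;
    std::map<std::string, int> calls_;
};

// RAII timer: starts on construction, reports to the profiler on destruction,
// so wrapping a function body in a scope is enough to measure it.
class ScopedTimer {
public:
    ScopedTimer(FrameProfiler& profiler, std::string name)
        : profiler_(profiler), name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        profiler_.add(name_, elapsed.count());
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;
private:
    FrameProfiler& profiler_;
    std::string name_;
    std::chrono::steady_clock::time_point start_;
};
```

Usage would look like `{ ScopedTimer t(profiler, "physics"); updatePhysics(); }`, where `updatePhysics` stands in for whatever you want to measure. Dumping the totals when a level exits gives the per-frame breakdown described above.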

Pixel fill rate:

The number of pixels drawn seems to be the biggest deal on this platform. I read somewhere that you can draw the full screen about 5 times per frame at 30 fps on the 3GS if nothing else is going on, and my own tests are about the same. If you draw a background image, then the ground, then a GUI and a bunch of little objects, you can easily be drawing the screen three times already if you set it up wrong.
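To put rough numbers on that, here is a small C++ sketch of the arithmetic. It assumes the 3GS screen of 320x480 pixels and takes the "about 5 full screens per frame" estimate above at face value; the function names are made up for illustration, and the real budget will vary with blending and everything else going on in the frame.

```cpp
#include <cassert>

// Screen size of the iPhone 3GS in pixels.
const int kScreenWidth = 320;
const int kScreenHeight = 480;

// Assumption: the rough estimate of 5 full-screen layers per frame at 30 fps.
const double kFullScreenLayersPerFrame = 5.0;

// Approximate number of pixels that can be drawn in one frame.
double pixelBudgetPerFrame() {
    return kScreenWidth * kScreenHeight * kFullScreenLayersPerFrame;
}

// Fraction of the per-frame budget used by drawing a width x height
// rectangle `times` times (for example, stacked fullscreen layers).
double budgetFraction(int width, int height, int times) {
    assert(width >= 0 && height >= 0 && times >= 0);
    double pixels = static_cast<double>(width) * height * times;
    return pixels / pixelBudgetPerFrame();
}
```

With these assumptions the budget is 768,000 pixels per frame, and three fullscreen layers (background, ground, GUI) already use 60% of it.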

The iPhone supports fast hidden surface removal through its tile-based deferred renderer. Opaque objects drawn on top of each other largely avoid the overdraw issue, because objects that will be fully hidden behind other objects within a tile are culled early. So…make your GUI out of opaque rectangular textures? This isn't really an option.

The only advice I can give on this one is to be aware of the fill rate limits and design around them. If your game design requires 10 fullscreen blended textures to be drawn on top of each other every frame, it's not going to work no matter how much optimization you do. Try to avoid drawing large blended textures if possible. A large alpha-tested object is one of the worst things you can do for rendering performance.

I was trying to do an effect that draws the entire world to an offscreen buffer, then overlays that on the screen for pixel shader effects. I got it working, but ended up abandoning the approach because the pixel fill rate cost was too high.

Vertex upload speed:

When you use vertex arrays, the entire vertex buffer is uploaded to the GPU every frame. This causes the whole pipeline to stall while the GPU waits for the memory transfer. VBOs can be used to eliminate this lag for static data, meaning things that you rarely change. All else being equal, the difference is huge: about 20k verts with vertex arrays versus 100k verts (!) with VBOs. The max number of verts you can push is probably much higher, but I have a game running and lots of pixel overdraw.

Our scene is now made entirely of static buffers. We use a vertex shader to do all of the animation on the GPU by passing in the static buffers for the two frames to interpolate between, plus a float argument for the position in between the frames. This change alone let me double the number of onscreen bad guys.

It's also the reason why we won't be able to support the original iPhone and the 3G for OverPowered. The difference in power is too much for a small company to max out the newer phones while still trying to run on the older ones. When I pause the game I can fill the screen completely with Quake 3 models without dropping below 30fps, so the bottleneck has moved away from the rendering code and into the game/physics/render-setup code.

The vertex processor seems to be very powerful compared to the rest of the pipeline. I have not seen any slowdown from making the vertex shader more complicated so far, so I plan on abusing this as much as possible. Here is the relevant part of my shader code.

uniform mat4 ModelViewProj;

// the position of the low frame
attribute vec4 position;

// the position of the high frame from a different buffer
attribute vec4 diffuse;

// the pct of progress the animation has run between the two frames
uniform mediump float PctLow;

void main()
{
    // mix(a, b, t) returns a * (1 - t) + b * t
    vec4 interpolatedPos = mix(position, diffuse, PctLow);
    gl_Position = ModelViewProj * interpolatedPos;
}
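On the CPU side, something has to pick the two frame buffers and the blend factor each frame. The C++ sketch below shows one way that could work. It is an illustration rather than our engine code: `FrameBlend`, `selectFrames` and `mixRef` are hypothetical names, and it assumes a looping animation sampled at a fixed number of frames per second.

```cpp
#include <cassert>
#include <cmath>

// Which two animation frames to bind, and how far between them we are.
struct FrameBlend {
    int lowFrame;   // index of the buffer bound as the low frame
    int highFrame;  // index of the buffer bound as the high frame
    float pctLow;   // value for the PctLow uniform: 0 = low frame, 1 = high frame
};

// Assumption: the animation loops, so the frame after the last one is frame 0.
FrameBlend selectFrames(float timeSeconds, float framesPerSecond, int frameCount) {
    assert(frameCount > 0 && framesPerSecond > 0.0f && timeSeconds >= 0.0f);
    float framePos = timeSeconds * framesPerSecond;
    float whole = std::floor(framePos);
    FrameBlend blend;
    blend.lowFrame = static_cast<int>(whole) % frameCount;
    blend.highFrame = (blend.lowFrame + 1) % frameCount;
    blend.pctLow = framePos - whole;  // fractional part, in [0, 1)
    return blend;
}

// CPU reference for GLSL's mix(): linear interpolation between a and b.
float mixRef(float a, float b, float t) {
    return a * (1.0f - t) + b * t;
}
```

For example, at 1.25 seconds into a 20-frame animation running at 10 frames per second, this selects frames 12 and 13 with a blend factor of 0.5. Only the uniform and the buffer bindings change per frame, so the vertex data itself never has to be re-uploaded.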

Texture size is important:

I reduced a GUI texture from a 256x256 PNG to 64x64 and saw good results. This texture is drawn many times every frame. All of our opaque textures are PVRTC 4bpp (pvr4) compressed, and that was also a big win over uncompressed textures.
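To get a feel for the sizes involved, here is a quick C++ sketch of the memory arithmetic. It assumes the PNG is decoded to 32-bit RGBA in memory (the actual in-memory format isn't stated above), and it ignores mipmaps and the minimum-size padding that PVRTC applies to very small textures. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Uncompressed 32-bit RGBA: 4 bytes per pixel.
std::size_t rgba8888Bytes(int width, int height) {
    assert(width >= 0 && height >= 0);
    return static_cast<std::size_t>(width) * height * 4;
}

// PVRTC 4bpp: 4 bits per pixel, so half a byte per pixel.
std::size_t pvrtc4Bytes(int width, int height) {
    assert(width >= 0 && height >= 0);
    return static_cast<std::size_t>(width) * height / 2;
}
```

Under those assumptions a 256x256 RGBA texture is 256 KB, the 64x64 version is 16 KB, and a 256x256 PVRTC 4bpp texture is 32 KB, an eighth of the uncompressed size.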

Memory spikes in Objective-C:

The 10 MB spike that I mentioned was in our platform layer, and came from bad use of the autorelease pool, which is effectively our garbage collector on this platform. We generally drain the pool at the end of every frame. Usually we can get away with this because most of our allocations are in C++, and they go away as soon as we tell them to. Texture loading is an exception because it happens in platform code.

During the course of loading a level, we'd read in a texture buffer, create the OpenGL texture, and then release the texture buffer. The catch is that the buffer's memory doesn't actually come back until the pool drains at the end of the frame. This works OK if only one texture is loaded that frame, but not if we are loading a whole level at once, because every buffer stays alive until the frame ends. If you identify a place like this in Objective-C code, an easy fix is to put an NSAutoreleasePool around it.

NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
// do stuff
[pool release];
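To see why the extra pool helps, here is a toy C++ model of the spike. This is not Objective-C and not how NSAutoreleasePool is implemented: `DeferredPool` is a hypothetical stand-in that only tracks how many bytes are waiting to be freed, and the texture sizes in the usage note are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy stand-in for an autorelease pool: bytes handed to it are not
// freed until drain() is called.
class DeferredPool {
public:
    void deferRelease(std::size_t bytes) {
        pending_ += bytes;
        peak_ = std::max(peak_, pending_);
    }
    void drain() { pending_ = 0; }
    std::size_t peakBytes() const { return peak_; }
private:
    std::size_t pending_ = 0;
    std::size_t peak_ = 0;
};

// Simulates loading a level in one frame and returns the peak number of
// bytes held by temporary texture buffers.
std::size_t peakWhileLoading(const std::vector<std::size_t>& textureSizes,
                             bool drainPerTexture) {
    DeferredPool pool;
    for (std::size_t bytes : textureSizes) {
        pool.deferRelease(bytes);  // temp buffer released after the GL upload
        if (drainPerTexture) {
            pool.drain();          // models a local pool around each load
        }
    }
    pool.drain();                  // models the end-of-frame drain
    return pool.peakBytes();
}
```

With ten hypothetical 1 MB buffers, draining only at the end of the frame peaks at 10 MB, while draining after each texture peaks at 1 MB.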

OpenGL State Changes:

These don't seem to be causing me any trouble right now. Our renderer has always been pretty good at batching materials, so it's not an issue I had to touch this week.
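For anyone who hasn't set this up, batching by material roughly means sorting the draw list so that objects sharing a material are drawn back to back. The C++ sketch below illustrates the idea with hypothetical `DrawCall`, `batchByMaterial` and `countMaterialChanges` names. It is not our renderer, and it assumes draw order doesn't matter, which is not true for blended objects.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw list entry: which material to bind and which mesh to draw.
struct DrawCall {
    int materialId;
    int meshId;
};

// Counts how many times the material would have to be (re)bound if the
// calls were issued in this order.
int countMaterialChanges(const std::vector<DrawCall>& calls) {
    int changes = 0;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        if (i == 0 || calls[i].materialId != calls[i - 1].materialId) {
            ++changes;
        }
    }
    return changes;
}

// Returns a copy of the list grouped by material. stable_sort keeps the
// original relative order of calls that share a material.
std::vector<DrawCall> batchByMaterial(std::vector<DrawCall> calls) {
    std::stable_sort(calls.begin(), calls.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.materialId < b.materialId;
                     });
    return calls;
}
```

A list that alternates between two materials five times needs five material binds as submitted, but only two once it's batched.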

Tuesday, September 28, 2010