Tuesday, October 12, 2010

Converting from OpenGL ES1 to ES2 on the iPhone

I recently finished upgrading our engine to support ES2 and GLSL shaders. It took about a week to get the game looking the same as it did before, just rendering with shaders instead. I'm sharing some info that might be worthwhile to anyone else trying to update their iPhone renderers. This is not a how-to on using GLSL to achieve different effects; you can find plenty of that elsewhere.


ES2 is not an incremental improvement to ES1, it is a total paradigm shift in how pixels get rendered. You can't just take an ES1 renderer and add a couple shaders here and there like you can in DirectX. In ES2, you write vertex and pixel/fragment shaders in GLSL, and then pass values to the shaders at runtime.


The vertex shader reads in any values from VBOs or vertex arrays, and outputs any values that are useful for the pixel shader. Any values created in the vertex shader are interpolated along the triangle edges and raster lines before being passed to the pixel shader. The pixel/fragment shader has only one job: to output a color value.
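
As a minimal sketch of that flow, here is a vertex shader that passes a texture coordinate through to the pixel shader. The v_ prefix is my own convention, though it matches the v_uv1 name used in the texture example later in this post:

uniform mat4 ModelToScreen;
attribute vec4 position;
attribute vec2 uv1;         // read per-vertex from the vertex buffer
varying mediump vec2 v_uv1; // interpolated across the triangle for the pixel shader

void main()
{
    v_uv1 = uv1; // hand the uv off; the hardware interpolates it per-pixel
    gl_Position = ModelToScreen * position;
}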


I've found this site to be a nice reference for GLSL functions, starting at section 8.1: http://www.khronos.org/files/opengl-quick-reference-card.pdf

3GS/iPhone4/iPad or bust:

ES2 is not supported on the iPhone 3G, the first-generation iPhone, or the first-generation iPod touch. It's possible to support both ES1 and ES2 in the same codebase, but you will need two entirely different render paths. There's no mixing and matching allowed, so within a run you are either entirely ES1 or entirely ES2.

EAGLContext* eaglContext = 0;

eaglContext = [[EAGLContext alloc] initWithAPI:kEAGLRenderingAPIOpenGLES2];
if (eaglContext && [EAGLContext setCurrentContext:eaglContext])
{
    // initialize a renderer that uses ES2 imports
}
else
{
    eaglContext = [[EAGLContext alloc] initWithAPI:kEAGLRenderingAPIOpenGLES1];
    if (!eaglContext || ![EAGLContext setCurrentContext:eaglContext])
    {
        // total failure!
    }

    // initialize a renderer that uses ES1 imports
}

Loading and assigning shaders:

The shader compiler deals in character buffers. You will need to either build the GLSL source in code or, more sanely, load a file containing the shader and pass its contents to the compiler.
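
As a sketch, a C helper like this can pull a shader file into a null-terminated buffer. The function name and the minimal error handling are my own, and a real iPhone build would resolve the path through the app bundle first:

#include <stdio.h>
#include <stdlib.h>

// read a whole file into a null-terminated buffer; the caller must free() it
char* loadFileContents(const char* path)
{
    FILE* f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    char* buf = (char*)malloc(size + 1);
    fread(buf, 1, size, f);
    buf[size] = 0; // the shader compiler wants a null-terminated string
    fclose(f);
    return buf;
}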

int loadShader(GLenum type, const char* glslSourceBuf)
{
    int ret = glCreateShader(type);
    if (ret == 0) return ret;

    glShaderSource(ret, 1, (const GLchar**)&glslSourceBuf, NULL);
    glCompileShader(ret);

    int success;
    glGetShaderiv(ret, GL_COMPILE_STATUS, &success);
    if (success == 0)
    {
        char errorMsg[2048];
        glGetShaderInfoLog(ret, sizeof(errorMsg), NULL, errorMsg);
        outputDebugString("shader compile error: %s\n", errorMsg);
        glDeleteShader(ret);
        ret = 0;
    }
    return ret;
}

int loadShaderProgram(const char* vertSource, const char* pixelSource)
{
    // load in the two individual shaders
    int vertShader = loadShader(GL_VERTEX_SHADER, vertSource);
    int pixelShader = loadShader(GL_FRAGMENT_SHADER, pixelSource);
    if (vertShader == 0 || pixelShader == 0) return 0;

    // create a "program", which is a vertex/pixel shader pair
    int ret = glCreateProgram();
    if (ret == 0) return ret;
    glAttachShader(ret, vertShader);
    glAttachShader(ret, pixelShader);

    // bind vertex attributes to the same positions used later
    // in glVertexAttribPointer calls
    glBindAttribLocation(ret, AP_POS, "position");
    glBindAttribLocation(ret, AP_NORMAL, "normal");
    glBindAttribLocation(ret, AP_DIFFUSE, "diffuse");
    glBindAttribLocation(ret, AP_SPECULAR, "specular");
    glBindAttribLocation(ret, AP_UV1, "uv1");

    glLinkProgram(ret);
    int linked;
    glGetProgramiv(ret, GL_LINK_STATUS, &linked);
    if (linked == 0)
    {
        glDeleteProgram(ret);
        outputDebugString("Failed to link shader program.");
        return 0;
    }
    return ret;
}

void drawSomething(void)
{
    // tell OpenGL which shaders to use for rendering
    glUseProgram(mShaderProgram);

    // set any values on the shader that you want to use,
    // set up the vertex buffer using glVertexAttribPointer
    // calls and the same positions used during linking,
    // then draw like usual. Note that the count argument
    // is the number of indices, i.e. 3 per triangle.
    glDrawElements(GL_TRIANGLES, numTris * 3, GL_UNSIGNED_SHORT, 0);
}

No matrix stack:

All transformations are done in the shader, so anything using glMatrixMode is automatically out. glFrustumf and glOrthof are also gone, so you will need to write replacements. You can find examples of these two functions in the Android codebase at http://www.google.com/codesearch/p?hl=en#uX1GffpyOZk/opengl/libagl/matrix.cpp&q=glfrustumf%20lang:c++&sa=N&cd=1&ct=rc&l=7.
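
Writing a replacement is only a few lines. Here is a sketch of a glFrustumf equivalent (the function name is mine), assuming a column-major float[16] like the one passed to glUniformMatrix4fv below:

// builds the same projection matrix glFrustumf did, in column-major order
void buildFrustum(float m[16], float l, float r, float b, float t, float n, float f)
{
    m[0] = 2*n/(r-l); m[4] = 0;         m[8]  = (r+l)/(r-l);  m[12] = 0;
    m[1] = 0;         m[5] = 2*n/(t-b); m[9]  = (t+b)/(t-b);  m[13] = 0;
    m[2] = 0;         m[6] = 0;         m[10] = -(f+n)/(f-n); m[14] = -2*f*n/(f-n);
    m[3] = 0;         m[7] = 0;         m[11] = -1;           m[15] = 0;
}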


For the transforms used by shaders, I have callbacks to grab values like ModelToView and ViewToProj from a structure that I calculate once per render pass.


In C++:

unsigned int transformShaderHandle = glGetUniformLocation(shaderId, "ModelToScreen");
glUniformMatrix4fv(transformShaderHandle, 1, GL_FALSE, (GLfloat*)mModelViewProj);

In the vertex shader:

uniform mat4 ModelToScreen;
attribute vec4 position;

void main()
{
    gl_Position = ModelToScreen * position;
}

More textures!

You only get two texture channels to use under ES1. ES2 gives you 8. Setting up a texture in ES2 is similar to ES1, but you don't get the various glTexEnvi functions to define how multiple texture channels blend together. You do that part in GLSL instead.


In C++:

unsigned int textureShaderHandle = glGetUniformLocation(shaderId, "Texture0");

// tell the shader that Texture0 will be on texture channel 0
glUniform1i(textureShaderHandle, 0);

// then bind the texture like you would in ES1; note that
// glEnable(GL_TEXTURE_2D) no longer exists in ES2, since the
// shader decides which textures actually get sampled
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, mTextureId);

In the pixel shader:

uniform lowp sampler2D Texture0;
varying mediump vec2 v_uv1; // interpolated uv from the vertex shader

void main()
{
    gl_FragColor = texture2D(Texture0, v_uv1);
}
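
As an example of doing the blending in GLSL, an ES1-style GL_MODULATE combine of two texture channels becomes a one-liner. Texture1 here is a hypothetical second sampler, set up the same way as Texture0 but on channel 1:

uniform lowp sampler2D Texture0;
uniform lowp sampler2D Texture1;
varying mediump vec2 v_uv1;

void main()
{
    // multiply the two texels together, like GL_MODULATE did in ES1
    gl_FragColor = texture2D(Texture0, v_uv1) * texture2D(Texture1, v_uv1);
}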

No such thing as glEnableClientState(GL_NORMAL_ARRAY)

GL_NORMAL_ARRAY, GL_COLOR_ARRAY, etc. have all gone away. Instead you use the unified glVertexAttribPointer interface to push vertex buffer info to the shaders. This is a pretty simple change.

glEnableVertexAttribArray(AP_NORMAL);
glVertexAttribPointer(AP_NORMAL, 3, GL_FLOAT, GL_FALSE, vertDef.getVertSize(),
                      (GLvoid*)(vertDef.getNormalOffset() * 4));

Tuesday, September 28, 2010

100k animated triangles at 30fps on iPhone

I've been optimizing OverPowered all week, and have managed to more than double the number of bad guys we can support and to remove a 10 meg memory spike on load that was sending us low-memory warnings. Here is some info that I've picked up in the process.


The strategy: Find the bottlenecks, kill the bottlenecks.


The iPhone actually contains two processors, the CPU and the GPU. If one of those is eating up all the time, then it's not worth optimizing the other. OverPowered is an action game, so it's important to keep the framerate at or above 30fps. At the start of the week we were overwhelmingly GPU-bottlenecked at around 20k animated triangles per frame. I managed to land enough fixes that it's now more worthwhile to optimize the CPU than to keep shrinking the GPU cost.


I used 4 tools this week.

  • An onscreen FPS counter with number of triangles drawn. This is the only really definitive way to know how fast your game is running, but it's extremely low resolution. The iPhone is vsync locked, so if you are at 30fps it will take a huge change to make it display anything else.
  • An in-game timer. I wrap timer calls around various functions to measure their real cost. This is a very useful way to get a high-level view of where your frame time is going with a high degree of confidence. It can be run in release without much profiling overhead to skew the results. I have it spitting out the frame breakdown every time I exit a level. (A rough sketch of this timer follows the list.)
  • Instruments: CPU sampler. This is a fairly lightweight sampling profiler. As long as you sanity check the results with the in-game timer it can be used to get a higher resolution view of bottlenecks.
  • Instruments: Allocations. This tool is absolutely awesome for telling you where your memory is being spent. All platforms should have a tool like this.
  • I did not use Shark. This can give you a better view than the CPU sampler, but it's much heavier weight. It takes longer to get results and try out changes. It's good if you have a specific set of functions that you really want to optimize at a low level.
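
For what it's worth, here's a rough sketch of that in-game timer. The names are mine, and a shipping version would probably want mach_absolute_time for lower overhead:

#include <sys/time.h>

// seconds as a double; plenty of resolution for a per-frame breakdown
static double nowSeconds()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

// accumulates the time spent in a scope into a named bucket
struct ScopedTimer
{
    double* mBucket;
    double mStart;
    ScopedTimer(double* bucket) : mBucket(bucket), mStart(nowSeconds()) {}
    ~ScopedTimer() { *mBucket += nowSeconds() - mStart; }
};

// usage:
//   static double gPhysicsSeconds = 0;
//   { ScopedTimer timer(&gPhysicsSeconds); runPhysics(); }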

Pixel fill rate:


The number of pixels drawn seems to be the biggest factor on this platform. I read somewhere that you can draw the full screen about 5 times at 30fps on the 3GS if nothing else is going on, and my own tests agree. If you draw a background image, then the ground, then a GUI and a bunch of little objects, you can easily be drawing the screen three times over if you set it up wrong.


The iPhone supports fast hidden-surface removal with its tile-based deferred renderer. Opaque objects drawn on top of each other largely avoid the overdraw issue, because objects that will be fully hidden behind other objects within a tile are culled early. So…make your GUI out of opaque rectangular textures? This isn't really an option.


All the advice I can give on this one is to be aware of the fill-rate limitations and design appropriately. If your game design requires 10 fullscreen blended textures drawn on top of each other every frame, it's not going to work no matter how much you optimize. Try to avoid drawing large blended textures if possible, and note that a large alpha-tested object is one of the worst things you can do for rendering performance.


I was trying to do an effect that draws the entire world to an offscreen buffer, then overlays that on the screen for pixel shader effects. I got it working, but ended up abandoning the approach because the pixel fill rate got in the way.


Vertex upload speed:


When you use vertex arrays, the entire vertex buffer is uploaded to the GPU every frame. This stalls the whole pipeline while the GPU waits for the memory transfer. VBOs eliminate this lag for static data, things that you rarely change. All things being equal, the difference between vertex arrays (20k verts) and VBOs (100k verts!) is huge. The max number of verts you can push is probably much higher still, but I have a game running and lots of pixel overdraw.
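
A static VBO takes only a few calls to set up. This sketch assumes vertData, numVerts, and vertSize describe an already-built vertex buffer:

// upload the vertex data to the GPU once, at load time
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, numVerts * vertSize, vertData, GL_STATIC_DRAW);

// at draw time, bind the buffer and point attributes at offsets into it,
// instead of re-uploading a client-side array every frame
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(AP_POS, 3, GL_FLOAT, GL_FALSE, vertSize, (GLvoid*)0);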


Our scene is now entirely static buffers. We use a vertex shader to do all of the animation on the GPU, passing static buffers that represent the keyframes to interpolate between and a float argument for the position between the frames. This change alone let me double the number of onscreen bad guys.


It's also the reason why we won't be able to support the original iPhone and 3G for OverPowered. For a small company, the difference in power is too great to max out the newer phones while still running on the older ones. When I pause the game I can fill the screen completely with Quake 3 models without dropping below 30fps, so the bottleneck has moved away from the rendering code and into the game/physics/render-setup code.


The vertex processor seems to be very powerful compared to the rest of the pipeline. I have not seen any slowdown from making the vertex shader more complicated so far, so I plan on abusing this as much as possible. Here is the relevant part of my shader code.

uniform mat4 ModelViewProj;

// the position from the low frame
attribute vec4 position;

// the position from the high frame, fed from a different buffer
attribute vec4 diffuse;

// the pct of progress the animation has run between the two frames
uniform mediump float PctLow;

void main()
{
    vec4 interpolatedPos = mix(position, diffuse, PctLow);
    gl_Position = ModelViewProj * interpolatedPos;
}
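
The C++ side of this binds one keyframe's buffer to position and the next keyframe's buffer to the attribute the shader reuses as diffuse, then sets the blend factor. A sketch, where frameVbo, lowFrame, highFrame, vertSize, and pct are my own names:

// low keyframe feeds the "position" attribute
glBindBuffer(GL_ARRAY_BUFFER, frameVbo[lowFrame]);
glVertexAttribPointer(AP_POS, 3, GL_FLOAT, GL_FALSE, vertSize, (GLvoid*)0);

// high keyframe feeds the attribute the shader calls "diffuse"
glBindBuffer(GL_ARRAY_BUFFER, frameVbo[highFrame]);
glVertexAttribPointer(AP_DIFFUSE, 3, GL_FLOAT, GL_FALSE, vertSize, (GLvoid*)0);

// 0..1 progress between the two keyframes
glUniform1f(glGetUniformLocation(shaderId, "PctLow"), pct);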


Texture size is important:


I reduced a GUI texture from a 256x256 PNG to 64x64 and saw good results; it's a texture that's drawn many times every frame. All of our opaque textures are PVR4-compressed, and that was also a big win over uncompressed textures.
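
Uploading a PVR-compressed texture replaces glTexImage2D with glCompressedTexImage2D. A sketch for 4-bit PVRTC, assuming width, height, dataSize, and pvrData were parsed out of the .pvr file:

glBindTexture(GL_TEXTURE_2D, mTextureId);
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGB_PVRTC_4BPPV1_IMG,
                       width, height, 0, dataSize, pvrData);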


Memory spikes in ObjectiveC:


The 10 meg spike that I mentioned was in our platform layer due to bad use of the garbage collector. We generally garbage collect at the end of every frame. Usually we can get away with this because most of our allocations are in C++, and they go away as soon as we tell them to. Texture loading is an exception because it happens in platform code.


During the course of loading a level, we'd read in a texture buffer, create the OpenGL texture, and then release the texture buffer. This works okay if only one texture is loaded that frame, but not if we are loading a whole level at once. If you identify a place like this in Objective-C code, an easy fix is to put an NSAutoreleasePool around it:

NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
// load the texture buffer, create the GL texture, release the buffer
[pool release];

OpenGL State Changes:


These don't seem to be causing me any trouble right now. Our renderer has always been pretty good at batching materials, so it's not an issue I had to touch this week.