Adventures in Deferred Rendering: Tiled Deferred Shading and Compute Shaders

Tiled deferred shading is one of my favorite techniques because it introduces many concepts that are easy to take for granted when we rely on rasterization and the flow of the graphics pipeline. In basic deferred shading, we evaluate each light separately and add the results to the output buffer, which, if you remember from my previous entry on deferred rendering, serves as the starting point for post-processing.

Shading a single quad is certainly much more scalable than iterating through each piece of geometry to apply a given material. However, there is one bottleneck that needs to be addressed: when we’re dealing with thousands of lights, rendering becomes expensive because we need to process every light for every pixel, and each evaluation requires either a separate shader invocation or an additional loop iteration inside a fragment shader.
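To make that cost concrete, here’s a minimal sketch of the per-pixel loop in a naive deferred lighting pass (the sampler names, light struct, and attenuation model are illustrative, not my actual shader); at 2560×1440 with 4,096 lights, that’s roughly 15 billion light evaluations per frame:

#version 460 core
// Naive deferred lighting: every fragment walks the entire light list
in vec2 vTexCoords;
out vec4 fragColor;

uniform sampler2D uPositionMap; // G-buffer world positions
uniform sampler2D uNormalMap;   // G-buffer normals

struct PointLight
{
    vec3 _color;
    float _radius;
    vec3 _position;
};
layout (std430, binding = 0) buffer LightsBuffer
{
    PointLight uLights[];
};

void main()
{
    vec3 P = texture(uPositionMap, vTexCoords).xyz;
    vec3 N = normalize(texture(uNormalMap, vTexCoords).xyz);

    vec3 result = vec3(0.0f);
    // O(pixels * lights): this loop runs for every fragment on the screen
    for (int i = 0; i < uLights.length(); ++i)
    {
        vec3 L = uLights[i]._position - P;
        float attenuation = max(1.0f - length(L) / uLights[i]._radius, 0.0f);
        result += max(dot(N, normalize(L)), 0.0f) * attenuation * uLights[i]._color;
    }
    fragColor = vec4(result, 1.0f);
}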

Enter tiled shading, an early form of which appeared in Uncharted: Drake’s Fortune. The basic idea of tiled shading is to map light sources to tiles of pixels, thereby capping the number of lights that need to be evaluated at each pixel.

We can think of tiled rendering as something like a low-resolution rendering of the scene that can be performed on the CPU; however, I wanted to leverage the power of the GPU and build my understanding of the hardware ramifications surrounding compute shaders. Thus, the rest of this post will be a deep dive into the inner workings of thread work distribution, memory access control, and of course, our chosen light culling routine.

There are several principles to keep in mind regarding tiled shading. First, we store the lighting data on a per-tile basis instead of per pixel. This is somewhat conservative because a given light source may not affect every single pixel in the work group (i.e., tile). However, the reductions in processing time and storage are well worth the trade-off of dealing with a conservative data structure.

Second, tiled shading (at least in my implementation) reconfigures the order in which lighting calculations and final colors are distilled. Specifically, I chose to drive everything beyond establishing the G-buffer with my compute shader so that the subsequent render pass only needs to read from a quad to efficiently render the scene.

To reiterate, we want to map each light to every compute shader work group with which it intersects and then drive shader evaluation for each pixel once we finalize the list of relevant lights for any given work group. When this list is applied in conjunction with a single read of the G-buffer, the result is quite a bit more amenable to better frame rates.

Everything will become clear when I review the code.

Compute Shaders

Compute shaders operate on a per-work-group basis, which lends itself nicely to the tile abstraction I talked about earlier. Concretely, any compute shader invocation understands only the space in which it operates, which is defined by the index of its work group, the position of the specific shader invocation (i.e., pixel) within that work group, and the position of that invocation across the entire compute dispatch, which we usually take to mean the pixel’s position on a texture.
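A minimal sketch makes those coordinate spaces concrete (the debug image is a stand-in output, and the work group size is arbitrary here):

#version 460 core
layout (local_size_x = 16, local_size_y = 16) in;

layout (binding = 0, rgba32f) uniform writeonly image2D uDebugOutput;

void main()
{
    // Which tile (work group) this invocation belongs to
    uvec2 tile = gl_WorkGroupID.xy;
    // Where this invocation sits inside its 16x16 work group
    uvec2 localPos = gl_LocalInvocationID.xy;
    // Where this invocation sits across the whole dispatch:
    // gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID,
    // which we usually take to mean the pixel's position on a texture
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);

    // Visualize the tile grid by checkerboarding the work groups
    imageStore(uDebugOutput, pixel, vec4(vec2(tile % 2u), 0.0f, 1.0f));
}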

There are several mechanisms in play that make this concept of a “work group” coherent. Each work group can possess a set of shared variables that can be operated on by a number of atomic operations. Additionally, when we talk about the existence of global memory that is presumably filled with our lighting data via the mechanisms of a shader storage buffer object (unordered access view in DirectX), as well as texture data via the image load/store mechanism, it becomes clear that compute shaders exist to minimize the latency of threads operating on adjacent memory.
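To make those mechanisms concrete, here’s a minimal sketch that uses a shared counter, an atomic operation, and image load/store to count the bright texels in each tile (the images and luminance threshold are made up for illustration):

#version 460 core
layout (local_size_x = 16, local_size_y = 16) in;

layout (binding = 0, rgba32f) uniform readonly image2D uInput;
layout (binding = 1, r32f) uniform writeonly image2D uTileStats;

// Visible only to the 256 invocations of this work group
shared uint sBrightTexels;

void main()
{
    // One invocation resets the shared counter, and then everyone waits
    if (gl_LocalInvocationIndex == 0)
    {
        sBrightTexels = 0;
    }
    barrier();

    // Global memory read (image load), then a shared memory write (atomic)
    vec4 texel = imageLoad(uInput, ivec2(gl_GlobalInvocationID.xy));
    if (dot(texel.rgb, vec3(0.299f, 0.587f, 0.114f)) > 0.8f)
    {
        atomicAdd(sBrightTexels, 1u);
    }
    barrier();

    // One invocation writes the per-tile result back out to global memory
    if (gl_LocalInvocationIndex == 0)
    {
        imageStore(uTileStats, ivec2(gl_WorkGroupID.xy), vec4(float(sBrightTexels)));
    }
}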

Physical distances matter, and one of the challenges of hardware design is to empower impressive processing throughput via intelligent placement of processing and memory units so that there is a sensible distance between a given execution context and some memory allocation on the hierarchy. The smaller the distance, the faster a thread or context can access the memory, and the faster the overall execution (provided that the other processing units are also optimally placed and in reasonable numbers).

The name of the game is simple: Have the execution context operate with respect to the more favorable shared memory as much as possible and only dive intermittently into global memory when we need to do something important, such as image storing the result of a computation. Concretely, compute shaders operate on different ranges that we explicitly define and that we can visualize as different parts or tiles of an image (with some caveats, of course, because another important aspect of writing compute shaders is defining a sensible distribution of work across a given dataset given the thread group ranges that we explicitly define). The ranges specify a thread group that operates on the same shared memory, in addition to the globally accessible memory that’s coherent to each work group.

[Figure: how work groups map onto the GPU’s streaming multiprocessors and their shared memory. L2 cache (i.e., global memory) not shown.]

Essentially, each work group processes its own range of, say, pixels, and at the end of a dispatch call, the results from all of the work groups combine to give us a complete image.

The important thing to keep in mind is the relationship between compute shaders from a programmer’s perspective and their actual hardware manifestation, in which streaming multiprocessors (SMs) consist of a power-of-two number of cores. In analyzing the figure above, it makes sense to constrain the work group size to something that distributes well over the GPU architecture (e.g., 16×16) because shared memory is really only coherent across an explicit number of cores (e.g., 64 INT units and 64 FP units per SM in the case of Turing).

That said, having an extremely well-defined work group partition is not the top priority because the GPU scheduler normally does a good job of mapping work groups to SMs. Programmers should focus on minimizing register pressure, shared memory usage, and conditional branches. Still, it never hurts to mirror the underlying hardware as faithfully as possible.

The fact that compute shaders lend themselves to sections of rendered quads, and therefore to post-processing, got me experimenting with transitioning some of my rasterization-based subroutines to more compute-oriented workflows. In particular, I recently replaced the ping-pong FBO that facilitated my variance shadow mapping program’s Gaussian filtering with compute shaders. The setup looks like this:

blurShader.use();

blurShader.setIVec2(G_DIR, glm::ivec2(1, 0));
glBindImageTexture(0, shadowMaps[0], 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);
glBindImageTexture(1, shadowMaps[1], 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
// 256-thread work groups tiled across the 4096x4096 shadow map
glDispatchCompute(16, 4096, 1);
// Make the horizontal pass's image writes visible to the vertical pass
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

blurShader.setIVec2(G_DIR, glm::ivec2(0, 1));
glBindImageTexture(0, shadowMaps[1], 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);
glBindImageTexture(1, shadowMaps[0], 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
glDispatchCompute(16, 4096, 1);
// Make the blurred result visible to subsequent texture fetches
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT | GL_TEXTURE_FETCH_BARRIER_BIT);

Here’s the corresponding shader:

#version 460 core
// A one-dimensional work group laid out along the blur direction so that
// consecutive invocations load consecutive texels into shared memory
layout (local_size_x = 256, local_size_y = 1) in;

const float WEIGHTS[9] = float[]
    (0.0162162162f,
     0.0540540541f,
     0.1216216216f,
     0.1945945946f,
     0.2270270270f,
     0.1945945946f,
     0.1216216216f,
     0.0540540541f,
     0.0162162162f);
const int NUM_THREADS = 256;
const int NUM_TAPS = 8; // # taps minus the sample in the middle
const int RADIUS = 3;

layout (binding = 0, rgba32f) uniform readonly image2D gInput;
layout (binding = 1, rgba32f) uniform writeonly image2D gOutput;

uniform ivec2 gDir;

shared vec4 sTexels[NUM_THREADS + NUM_TAPS * RADIUS];

// 9-tap Gaussian filter
void main()
{
    uint texelIndex = gl_LocalInvocationIndex;
    // Run the work group along the blur direction: across rows for the
    // horizontal pass, down columns for the vertical pass
    ivec2 tc = gDir.x == 1 ? ivec2(gl_GlobalInvocationID.xy) : ivec2(gl_GlobalInvocationID.yx);

    // The leftmost or bottommost texel at which we begin to sample
    ivec2 start = tc - NUM_TAPS / 2 * RADIUS * gDir;
    sTexels[texelIndex] = imageLoad(gInput, start);
    // Since we want the last threads in the workgroup to have coherent sample data
    // (i.e. its range of texels to sample from fully accounted for), we'll get
    // the first 8 * RADIUS (i.e. twice the number of samples extending from each
    // side of any given texel) of our threads to deal with those trailing edge cases
    if (texelIndex < NUM_TAPS * RADIUS)
    {
        sTexels[texelIndex + NUM_THREADS] = imageLoad(gInput, start + NUM_THREADS * gDir);
    }

    // Fundamentally, we are extracting all the relevant texel data for our workgroup,
    // and then getting each texel to operate with respect to the data in a highly
    // parallelized fashion
    barrier();

    vec4 sum = vec4(0.0f);
    for (int i = 0; i <= NUM_TAPS; ++i)
    {
        sum += sTexels[texelIndex + i * RADIUS] * WEIGHTS[i];
    }

    imageStore(gOutput, tc, vec4(sum.rgb, 1.0f));
}

I won’t go into the details of the setup or shader here; instead, I’ll defer explanation of the major ideas to the next section, which returns to our initial discussion of tiled shading.

Tiled Deferred Shading

Now that we’ve established an understanding of compute shaders, all that remains is piecing together the various mechanisms that lend themselves to a coherent interaction between compute dispatch calls and normal framebuffer rendering.

First, we allocate space on the GPU for our lighting data through something called a shader storage buffer. The shader storage buffer object (SSBO) differs from uniforms in various ways, but most notably, it can store a whole lot more data than what we normally send over via uniform binding points. Generally, it’s useful to think of uniform data as variables that have their primary persistence on the CPU side of things and that can also potentially change over the course of a simulation. Conversely, SSBOs are ideal for things such as…thousands of light sources and their data.
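In GLSL terms, the contrast looks something like this (a minimal sketch; the PointLight struct and binding slot deliberately match the culling shader we’ll see later):

#version 460 core
layout (local_size_x = 1) in;

// Uniform data: small, fixed-size, and updated from the CPU between dispatches
uniform mat4 gViewMatrix;

// SSBO: a large, explicitly allocated chunk of GPU memory whose trailing
// array can be left unsized and is dimensioned by whatever buffer we bind
struct PointLight
{
    vec3 _color;
    float _radius;
    vec3 _position;
};
layout (std430, binding = 3) buffer LightsBuffer
{
    PointLight gPointLights[]; // e.g., all 4,096 lights at once
};

void main()
{
    // .length() reflects the size of the buffer bound at dispatch time
    uint numLights = uint(gPointLights.length());
}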

Whereas vertex buffers (i.e., GL_ARRAY_BUFFER) carry vertex data and an arrangement of attribute bindings that feed the vertex shader, shader storage buffers make no assumptions about the semantics of your data. You have to impose your own structure on an explicitly allocated chunk of GPU memory, like so:

void copyLightDataToGPU()
{
    glGenBuffers(1, &lightsBuffer);
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, lightsBuffer);
    glBufferData(GL_SHADER_STORAGE_BUFFER, pointLights.size() * sizeof(PointLight), nullptr, GL_STATIC_DRAW);

    // Map the SSBO through the same target we allocated it with and fill in the light data
    PointLight* ptr = reinterpret_cast<PointLight*>(glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, pointLights.size() * sizeof(PointLight), GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT));
    for (size_t i = 0; i < pointLights.size(); ++i)
    {
        ptr[i]._color = pointLights[i]->_color;
        ptr[i]._radius = pointLights[i]->_radius;
        ptr[i]._position = pointLights[i]->_position;
    }
    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
}

For reference, I set up my PointLight structure so that its memory layout matches the std430 alignment rules exactly:

struct PointLight
{
    glm::vec3 _color;    // Bytes 0-11
    float _radius;       // Bytes 12-15, filling out the first 16-byte slot
    glm::vec3 _position; // Bytes 16-27
    unsigned int : 32;   // Explicit padding to a 32-byte stride
};

Again, for the sake of memory access performance, it’s never a bad idea to lay out your data to match the way the hardware (and, here, the std430 rules) expects it to be organized.

Next, let’s assume that we’ve fully executed our G-buffer pass and have generated a bunch of textures (attached to specific binding points of a secondary framebuffer) that we’d like to access in our compute shader. We bind the first mip level of each texture as an image (which is a distinct concept from a texture), like so:

cullLightsShader.use();
cullLightsShader.setMat4(phoenix::G_PROJECTION_MATRIX, utils->_projection);
cullLightsShader.setMat4(phoenix::G_VIEW_MATRIX, utils->_view);
cullLightsShader.setMat4("gInverseVP", glm::inverse(utils->_projection * utils->_view));
cullLightsShader.setVec3(phoenix::G_VIEW_POS, camera->_position);
// Bind the G-buffer's normal map texture itself (here, a hypothetical normalMap handle), not the FBO it's attached to
glBindImageTexture(0, normalMap, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);
glBindImageTexture(1, albedoSpecularMap, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);
glBindImageTexture(2, output, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, lightsBuffer);
glDispatchCompute(160, 90, 1);
// Make the compute shader's image writes visible to the final render pass
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);

Note that we also needed to specify an explicit binding point for our previously generated lighting data that’s been placed on the GPU. Once that’s taken care of, we can dispatch an arrangement of 160×90 work groups that maps cleanly to the resolution of our G-buffer textures, as well as my screen’s resolution: given a 16×16 thread group (i.e., work group, or tile), 160 work groups in the x-direction cover 2,560 pixels, and 90 work groups in the y-direction cover 1,440 pixels.

Over on the GPU side, looking at things on a per-invocation (i.e., per-pixel) basis, we see some familiar corresponding variables:

#version 460 core
layout (local_size_x = 16, local_size_y = 16) in;

const int NUM_LIGHTS = 4096;
const int MAX_LIGHTS_PER_TILE = 160;
const int WORK_GROUP_SIZE = 16;
const float SPECULAR_FACTOR = 16.0f;
const float SCREEN_WIDTH = 2560.0f;
const float SCREEN_HEIGHT = 1440.0f;

struct PointLight
{
    vec3 _color;
    float _radius;
    vec3 _position;
};

layout (binding = 0, rgba32f) uniform readonly image2D gNormalMap;
layout (binding = 1, rgba32f) uniform readonly image2D gAlbedoSpecularMap;
layout (binding = 2, rgba32f) uniform writeonly image2D gOutput;
layout (std430, binding = 3) buffer LightsBuffer
{
    PointLight gPointLights[];
};

uniform mat4 gProjectionMatrix;
uniform mat4 gViewMatrix;
uniform mat4 gInverseVP;
uniform vec3 gViewPos;

shared uint sVisibleLightIndices[NUM_LIGHTS];
// Shared variables can't have initializers; one invocation zeroes this in main()
shared int sNumVisibleLights;

On the second line of our shader is a specification on the size of our work group, which consists of 16×16 or 256 threads. Across the entire space of dispatched work groups, this is going to map to a full quad of resolution 2560×1440.

std430 defines a specific memory layout for our lights buffer. You’ll also notice a couple of shared variables that are coherent only in the space of the current work group. Thus, no other work group has any idea of the exact light sources that we will have culled for our given work group because work groups can’t communicate with one another. The idea is to constrain each context to a specific, predefined tile so that the dispatch can process tiles in a highly parallelized fashion (which themselves will have threads execute in a highly parallelized, albeit controlled fashion).

Inside the main function, we have quite a few interesting points to discuss:

void main()
{
    // One invocation resets the shared counter before any thread reads or writes it
    if (gl_LocalInvocationIndex == 0)
    {
        sNumVisibleLights = 0;
    }
    barrier();

    vec2 center = vec2(SCREEN_WIDTH, SCREEN_HEIGHT) / float(2 * WORK_GROUP_SIZE); // Location of the middle work group
    vec2 offset = center - vec2(gl_WorkGroupID.xy);

    // Extract the viewing frustum planes (normals)
    // https://gamedev.stackexchange.com/questions/156743/finding-the-normals-of-the-planes-of-a-view-frustum
    vec4 column0 = vec4(-gProjectionMatrix[0][0] * center.x, gProjectionMatrix[0][1], offset.x, gProjectionMatrix[0][3]); // (-2 * n' / (r - l) * 80, 0, offset.x, 0)
    vec4 column1 = vec4(gProjectionMatrix[1][0], -gProjectionMatrix[1][1] * center.y, offset.y, gProjectionMatrix[1][3]); // (0, -2 * n' / (t - b) * 45, offset.y, 0)
    vec4 column3 = vec4(gProjectionMatrix[3][0], gProjectionMatrix[3][1], -1.0f, gProjectionMatrix[3][3]); // (0, 0, -1, 0)

    vec4 frustumPlanes[4];
    // Left
    frustumPlanes[0] = column3 + column0;
    // Right
    frustumPlanes[1] = column3 - column0;
    // Top
    frustumPlanes[2] = column3 - column1;
    // Bottom
    frustumPlanes[3] = column3 + column1;
    for (int i = 0; i < 4; ++i)
    {
        frustumPlanes[i] /= length(frustumPlanes[i].xyz);
    }

    int numThreads = WORK_GROUP_SIZE * WORK_GROUP_SIZE; // 16x16x1 sized thread group; 16x16 tile
    int numPasses = (NUM_LIGHTS + numThreads - 1) / numThreads; // (4096 + 16^2 - 1) / 16^2 = 16
    for (int i = 0; i < numPasses; ++i)
    {
        // [0-15] * 16^2 + gl_LocalInvocationIndex
        // Each thread/shader processes using the same relative offset (that is unique to it inside the work group) in the tile,
        // so that all of the threads in the work group collectively process all of the lights for their relevance to the tile
        uint lightIndex = min(i * numThreads + gl_LocalInvocationIndex, NUM_LIGHTS - 1);
        PointLight pointLight = gPointLights[lightIndex];
        if (sNumVisibleLights < MAX_LIGHTS_PER_TILE) // Loose cap; the index array is sized for all lights, so a small racy overshoot is harmless
        {
            bool inFrustum = true;
            for (int j = 3; j >= 0 && inFrustum; --j)
            {
                float distance = dot(frustumPlanes[j], gViewMatrix * vec4(pointLight._position, 1.0f)); // Distance of the point from the plane
                // https://gamedev.stackexchange.com/questions/79172/checking-if-a-vector-is-contained-inside-a-viewing-frustum
                inFrustum = -pointLight._radius <= distance;
            }
            if (inFrustum)
            {
                int mem = atomicAdd(sNumVisibleLights, 1);
                sVisibleLightIndices[mem] = lightIndex;
            }
        }
    }

    barrier();

    ivec2 texelSpaceTexCoords = ivec2(gl_GlobalInvocationID.xy);
    vec2 texCoords = vec2(texelSpaceTexCoords.x / SCREEN_WIDTH, texelSpaceTexCoords.y / SCREEN_HEIGHT);
    vec4 N = imageLoad(gNormalMap, texelSpaceTexCoords);
    vec3 P = calcWorldPos(texCoords, N.w);
    vec3 V = normalize(gViewPos - P);
    vec4 albedoSpecular = imageLoad(gAlbedoSpecularMap, texelSpaceTexCoords);

    vec4 fragColor = vec4(0.1f * albedoSpecular.rgb, 1.0f);
    for (int i = 0; i < sNumVisibleLights; i++)
    {
        fragColor += calcLighting(gPointLights[sVisibleLightIndices[i]], P, N.xyz, V, albedoSpecular);
    }

    imageStore(gOutput, texelSpaceTexCoords, fragColor);
}

The first thing that occurs is a fast extraction of the viewing frustum planes, or rather the normals of the frustum planes that bound our given tile. I won’t go too deep into the mathematics behind the method, but you can take a look at the link posted in the comments, as well as resources such as Mathematics for 3D Game Programming and Computer Graphics, to get a sense of what’s going on.
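For background, the code above is a tile-sized adaptation of the classic trick of pulling frustum planes directly out of a combined matrix. For the full view frustum, the extraction looks something like this sketch (per Gribb and Hartmann; note that GLSL matrices are column-major, so M[j][i] addresses row i of column j):

// Extract the six world-space frustum planes from a view-projection matrix
vec4 row(mat4 M, int i)
{
    return vec4(M[0][i], M[1][i], M[2][i], M[3][i]);
}

void extractFrustumPlanes(mat4 viewProjection, out vec4 planes[6])
{
    vec4 row0 = row(viewProjection, 0);
    vec4 row1 = row(viewProjection, 1);
    vec4 row2 = row(viewProjection, 2);
    vec4 row3 = row(viewProjection, 3);

    planes[0] = row3 + row0; // Left
    planes[1] = row3 - row0; // Right
    planes[2] = row3 + row1; // Bottom
    planes[3] = row3 - row1; // Top
    planes[4] = row3 + row2; // Near
    planes[5] = row3 - row2; // Far
    for (int i = 0; i < 6; ++i)
    {
        // Normalize so that dot(plane, point) yields a true signed distance
        planes[i] /= length(planes[i].xyz);
    }
}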

Once we have the normals, the next thing we’ll want to do is distribute the work so that every single light source is accounted for with respect to our tile. You are free to use any number of heuristics to achieve that end, but I decided to map each thread to its respective offset in the work group. Then, on each pass, every thread processes the light whose index is that pass’s multiple of the thread count plus the thread’s own offset, so the 256 threads collectively sweep all 4,096 lights in 16 passes. A brief calculation then determines whether or not the light source intersects the frustum associated with our current tile, and if so, we append its index to our per-tile list of lights.

Once all the lights have been processed and we have a complete list of lights that are relevant to the pixels on our tile, we can calculate the lighting for our pixel (i.e., for our shader invocation). We do that by iterating through the list of lights, which is stored as indices into the large buffer of lights that we processed in parallel earlier. The important thing to note here is the barrier() call: we have to wait until all threads have finished processing the lights buffer. Otherwise, we risk using an incomplete list in our calculations due to the non-deterministic ordering of our parallel execution. We also have to image load/store our texture data in terms of texel-space coordinates (i.e., 2560×1440) instead of the normalized coordinates that we frequently use in fragment shaders (to be fair, we still use NDC to extract the world position from our stored depth information).

That just about does it for this technique! Generally speaking, the main principles behind driving tiled shading with compute shaders are distributing the workload evenly across the threads in the work group and defining the interactions with shared memory so that we reduce hits on the L2 cache while the shared variables maintain the state that drives more efficient lighting calculations for our scene (or other things, like a blur post-effect).

There are several reasons to apply light culling and utilize lists of lights for our pixel evaluations, but perhaps the most important is that each fragment in a work group evaluates the same list, keeping execution coherent across the GPU’s warps, which has real performance ramifications. Other reasons include the potential reusability of the light lists, which might be useful for subsequent shading passes, and the ability to lower the framebuffer precision because the setup doesn’t have to accumulate the results of a large number of lights over numerous iterations.

Potential Optimizations

It’s always worth considering improvements to the culling routine, especially because our naive sphere/frustum test lends itself to false positives in certain situations. This is because our frustum emanates from a screen-space tile, so it’s often long, thin, and asymmetrical. To ameliorate the problem, we could add a sphere/box test after the initial test against the frustum planes, which would still be vulnerable to false positives but to a much lesser extent.
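As a sketch of what that refinement could look like, here’s a standard sphere vs. AABB test; it assumes we also build a view-space bounding box for each tile (the tileMin/tileMax inputs are hypothetical and would come from something like a per-tile min/max depth reduction):

// Reject lights that survive the plane test but still miss the tile's box
bool sphereIntersectsAABB(vec3 center, float radius, vec3 tileMin, vec3 tileMax)
{
    // Distance from the sphere's center to the closest point on the box
    vec3 closest = clamp(center, tileMin, tileMax);
    vec3 delta = closest - center;
    return dot(delta, delta) <= radius * radius;
}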

We could also consider clustered shading, which divides the view frustum into a set of three-dimensional cells or clusters. Whereas tiled shading subdivides a screen space quad, clustered shading also divides by z-depth slices. The difference between this approach and specific tiled shading algorithms that consider the minimum and maximum z extents is that clustered shading does not need rendered geometry to execute the culling because the z-depth subdivision is performed on the entire view frustum.
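To get a sense of the bookkeeping involved, cluster lookups typically slice depth exponentially so that the cells stay roughly cube-shaped; deriving a cluster’s z-slice can look like this sketch (the near/far distances and slice count are illustrative):

const float NEAR = 0.1f;
const float FAR = 300.0f;
const int NUM_SLICES = 24;

// Map a view-space depth to one of NUM_SLICES exponentially distributed slices
int clusterSlice(float viewDepth)
{
    float slice = log(viewDepth / NEAR) / log(FAR / NEAR) * float(NUM_SLICES);
    return clamp(int(slice), 0, NUM_SLICES - 1);
}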

There are definitely a lot of options to move this implementation forward (perhaps even to forward+), and that just adds to the fun of it all!

Resources

Mathematics for 3D Game Programming and Computer Graphics
Real-Time Rendering
Andrew Lauritzen’s Deferred Rendering for Current and Future Rendering Pipelines
Johan Andersson on DirectX 11 Rendering in Battlefield 3
NVIDIA Turing GPU Architecture Whitepaper

One thought on “Adventures in Deferred Rendering: Tiled Deferred Shading and Compute Shaders”

  1. Thanks for the nice explanation. Not having to use an L-Buffer is wonderful. From another explanation I read, I noticed that you can compute the tile grid bounds only when the screen resolution changes. This would require another compute and then inject the result into the final compute shader.
