Tuesday, January 10, 2012

Safer Data Sharing Between Threads

[This article was originally posted on AltDevBlogADay]

Let's say we are given the task of optimizing a serial program to run on a multi-core system. In the process, we've isolated one block of the code where we can achieve a very nice speedup by delegating it to a second thread, thus allowing the main thread to get a head start on the next portion of work. At some later point in time the main thread then syncs to this worker thread before it uses the results. Since the original serial code is all working correctly, we're trying to be as minimally invasive as possible so as to not inadvertently break anything; we decide to keep all the data structures as they are and allow the worker thread to directly modify the single 'global' copy of the data. Also, since we're interested in performance, any fine grained, per variable, locking is frowned upon. All of this should be fine, we reason, since as the worker thread is modifying the data, the main thread won't need access to it, and we have a sync in place for when the main thread does eventually need access to guarantee the worker thread is done. Here is the scenario in code:

Main thread:

    // single shared variable here, but may be hundreds of variables in practice
    int shared_var = 2;
    uint32_t task = SubmitAsyncTask(DoWorkerTask);

    // main thread goes to work on next task...

    SyncOnTask(task);

    // prints 5
    printf("%d\n", shared_var);

Worker thread:

    DoWorkerTask()
    {
        shared_var += 3;
    }

All is well until a new feature request comes in months down the road which causes us to innocently modify some of the shared data during a portion of the frame which we really shouldn't have:

Main thread:

    // single shared variable here, but may be hundreds of variables in practice
    int shared_var = 2;
    uint32_t task = SubmitAsyncTask(DoWorkerTask);

    // main thread goes to work on next task...

    // BAD, we shouldn't be touching this variable during this portion of the frame
    shared_var = 5;

    SyncOnTask(task);

    // is this 5, or 8?
    printf("%d\n", shared_var);

Don't let the simplicity of this contrived example fool you. The same access pattern, when intermixed with another 100k lines of code, may not be quite so easy to spot.

From my experience, this is a pretty common pattern in games programming. It manages to completely avoid the overhead from any fine grain synchronization primitives, such as critical sections, when accessing data. Instead, it depends on an implicit contract between the main thread and the worker thread, which goes something like: "I, the main thread, promise not to modify any elements belonging to a small, well defined, subset of data which I will share with you, during any point in the frame that it is possible that you, the worker thread, may be transforming it. In return, you, the worker thread, may only modify this small, well defined, subset of this data and nothing else."

Already we see that this contract is pretty fragile at best, filled with good intentions, but with little in the way of actual concrete enforcement. Maybe it's documented with comments, maybe not. Either way, the code as it stands is tough to maintain and modify. Fellow co-workers are likely to dread the thought of going near it.

The problem is that it's non-obvious, from just looking at any block of code partaking in this contract, just what data it's allowed to touch at that point in time. It may even be quite tough to reason about when in a frame it gets executed, and trickier still if that execution point jumps around from frame to frame due to unpredictable events. Comments can only get you so far; ideally we'd like some compiler guarantees about what we can and can't access. I don't know how to make the compiler detect these cases, but the next best thing, runtime checking, is actually quite simple.

During some recent delving into what's been going on in the C++11 world with regard to concurrency (lambdas, you rock!), I came across this surprisingly straightforward solution to the problem (the relevant section is around 19:45). Overlooking the syntactical C++ sugar coating, the concept itself is very basic and translates to any language: let's simply transfer ownership of the shared data when we cross these thread boundaries. We can do this by determining what data actually needs to be shared, putting it into its own class/struct, and then ensuring that only one pointer can ever point to it at any one instant in time.

In C++11, the machinery for this is built-in by making use of std::unique_ptr and the move semantics which r-value references allow (not important for the understanding of this article). It's perhaps simpler to visualize in C though; taking the previous example and wrapping the shared data up, as described, might look something like the code below:

Main thread:

    struct Shared
    {
        int shared_var;
    };

    Shared shared;

    Shared *main_thread_access = &shared;
    Shared *worker_thread_access = nullptr;

    main_thread_access->shared_var = 2;

    uint32_t task = SubmitAsyncTask(DoWorkerTask);

    // ...

    // error: race condition, we'll likely crash at some point during
    // development/testing
    main_thread_access->shared_var = 5;


    // ...

    SyncOnTask(task);

    printf("%d\n", main_thread_access->shared_var);

Worker thread:

    void DoWorkerTask()
    {
        Shared *worker_thread_access;
        DataPointerAcquire(&main_thread_access, &worker_thread_access);

        worker_thread_access->shared_var += 3;

        DataPointerRelease(&main_thread_access, &worker_thread_access);
    }

Notice how the conditions of the contract between the main and worker thread are now more explicit. We can see which data is intended to be shared between threads since it's accessed through specific pointers. We can even search in a text editor for all the spots in a program in which we are altering any particular block of shared data. Furthermore, we've introduced the concept of data ownership. Either the main thread or the worker thread owns the data at any point in time. If the main thread makes any attempt to modify a shared variable whilst the worker thread owns it, we'll dereference a null pointer and can readily see in the debugger where the race condition happened, rather than having to backtrack from the artifacts the race leaves behind at a later part in the frame (on someone else's machine; on a bug which can only be reproduced once every 10 hours of testing; on a final code build; oh, and we have to hit gold by next week; just sayin').

The DataPointerAcquire function should be called on the thread which wants to take ownership of the shared data. Later, a matching call to DataPointerRelease should be made from the same thread when it wants to relinquish ownership. Those functions might look something like:

    // called on the thread which wants to take
    // ownership of the shared data
    template <class T>
    inline void DataPointerAcquire(T **old_owner, T **new_owner)
    {
        assert(*old_owner);
        T *ptr = *old_owner;
        *old_owner = nullptr;
        MemoryBarrier();
        *new_owner = ptr;
    }

    // called on the thread which currently owns the shared data
    // (i.e. the thread which issued the previous DataPointerAcquire)
    template <class T>
    inline void DataPointerRelease(T **orig_owner, T **thread_ptr)
    {
        assert(*orig_owner == nullptr && *thread_ptr);

        // make the computation results visible before releasing our ownership
        MemoryBarrier();

        *orig_owner = *thread_ptr;

        // make the master pointer (orig_owner) visible to all other threads
        MemoryBarrier();

        *thread_ptr = nullptr;
    }
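
MemoryBarrier() is platform specific, so here is a rough portable variation on the two functions above using C++11 atomics. This is a sketch, not the original implementation: it assumes the master pointer lives in a std::atomic, and the names AtomicPointerAcquire, AtomicPointerRelease and AtomicHandOffExample are made up for illustration.

```cpp
#include <atomic>
#include <cassert>

struct Shared
{
    int shared_var;
};

// Hypothetical variant of DataPointerAcquire: atomically takes the pointer out
// of the master slot, leaving nullptr behind for everybody else.
template <class T>
inline T *AtomicPointerAcquire(std::atomic<T *> &owner)
{
    T *ptr = owner.exchange(nullptr, std::memory_order_acq_rel);
    assert(ptr && "shared data is already owned by another thread");
    return ptr;
}

// Hypothetical variant of DataPointerRelease: publishes our writes, hands the
// pointer back to the master slot, then clears the local copy.
template <class T>
inline void AtomicPointerRelease(std::atomic<T *> &owner, T *&thread_ptr)
{
    assert(thread_ptr);
    T *expected = nullptr;
    bool was_empty = owner.compare_exchange_strong(
        expected, thread_ptr,
        std::memory_order_release, std::memory_order_relaxed);
    assert(was_empty && "master slot was not empty on release");
    (void)was_empty;
    thread_ptr = nullptr;
}

// Single-threaded walk through the protocol; returns the final value.
inline int AtomicHandOffExample()
{
    Shared shared = {2};
    std::atomic<Shared *> master(&shared);

    Shared *mine = AtomicPointerAcquire(master);   // master is now nullptr
    mine->shared_var += 3;
    AtomicPointerRelease(master, mine);            // mine is now nullptr

    return master.load()->shared_var;
}
```

The exchange makes the take-and-clear a single step, so two threads racing to acquire can't both walk away with a valid pointer.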


You may have noticed there is now an extra dereference required to access the data. This, however, seems a small price to pay for the additional peace of mind and maintainability, both for you and your co-workers.
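
For comparison, here is roughly what the std::unique_ptr route mentioned earlier could look like. It is a minimal sketch under some assumptions: std::async stands in for the SubmitAsyncTask / SyncOnTask pair (which belong to a task system not shown here), and RunOwnershipTransfer is a made-up name.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <utility>

struct Shared
{
    int shared_var;
};

// The worker receives ownership by value and hands it back via its return value.
std::unique_ptr<Shared> DoWorkerTask(std::unique_ptr<Shared> data)
{
    data->shared_var += 3;
    return data;
}

int RunOwnershipTransfer()
{
    std::unique_ptr<Shared> shared(new Shared{2});

    // Moving the pointer into the task transfers ownership to the worker.
    std::future<std::unique_ptr<Shared>> task =
        std::async(std::launch::async, DoWorkerTask, std::move(shared));

    // The main thread's pointer is now null, so a stray access here would
    // fault immediately instead of silently racing with the worker.
    assert(!shared);

    // Syncing on the task also returns ownership to the main thread.
    shared = task.get();
    return shared->shared_var;
}
```

The ownership rules are the same as with the raw pointers above; the difference is that the moves are spelled out in the code, so the hand-off points are easy to find.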

Tuesday, June 21, 2011

How a DSLR could help to take better handheld photos

I got a Canon Digital Rebel XTi back in 2007, mainly to take family photos. It wasn't a coincidence that it happened to be the year my daughter was born. More recently, however, I've started to take more interest in the subject of photography. It's a great hobby for all the graphics programmers out there to build up their rendering intuition since there's no faking it - this is how light actually works!

In common with graphics programming, much of photography is an optimization problem. Most of the time we don't have as much light hitting the sensor as we'd like, so the problem becomes how to optimize the ISO, shutter speed, and f-number settings to give the best quality image free of camera shake. It may be that the Rebel XTi, being a low end camera, doesn't have all the whiz-bang features that the latest and greatest have, but I find the procedure of setting up the optimal combination of these 3 settings haphazard to say the least. How many times has what could have been a great shot turned out to be a little too "soft"? There's a powerful computer inside the DSLR in my hands, so why can't it better help me determine the optimal settings, in much the same spirit that it can auto-focus for us? I could use manual focus myself all the time, but auto-focus is usually faster and more accurate.

I've been thinking a bit lately about how the situation could be improved, and here is one idea. There are many photographic scenarios, but the one I find myself in the most is the case where I'm shooting handheld and light isn't abundant. The most common example would be taking indoor pictures. I try not to use the flash to avoid the harsh look it gives. So the question is then, given that I don't have all the light I would like, and am shooting handheld, what is the optimal combination of those 3 settings (namely, ISO, shutter speed and f-number)?

What I do currently for indoor handheld photos is set the ISO to something like 400 and the camera mode to Av (aperture priority). This allows me to set the focal length and aperture based on how I want to compose the image, which fully dictates the f-number the camera uses. The only variable left is the shutter speed, which the camera picks based on how much light it meters in the scene. This is quite a suboptimal arrangement. It'll happily pick relatively long shutter speeds, leaving me with blurry images which are only fit for the trash. A much more preferable scheme would be a new "handheld" mode. It might work like this:

  1. I tell the camera I'm shooting handheld (using a new setting on the mode dial).
  2. I set up a focal length and aperture based on my composition desires. This will result in an f-number which is usable by the camera.
  3. The camera computes a "smart" shutter speed which will keep the image sharp for the average handheld user. 

But what is a good value for this "smart" shutter speed? A typical heuristic for handheld shutter speeds is 1 / focal length. For example, if shooting at 100mm, then the shutter speed should be 1/100 second. This is generally regarded as an upper bound on the exposure time, meaning that we should really pick a value where the shutter is open for slightly less time. The heuristic also assumes a full frame sensor so we need to take the crop factor of the camera into consideration. So, in this example, we'd need less than 1/160 second for a Rebel (since it has a crop factor of 1.6). If we are using a lens with image stabilization or vibration reduction, then that should be taken into account as well (let's see how IS really measures up to the 3 to 4 stop manufacturer claims).
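
As a rough sketch of that heuristic in code (the function name is made up, and treating each claimed stop of stabilization as a doubling of the usable exposure time is an assumption, not a measured result):

```cpp
#include <cassert>
#include <cmath>

// Longest exposure time, in seconds, that the 1 / focal length rule of thumb
// allows for a handheld shot.
//   focal_length_mm     - actual focal length of the lens
//   crop_factor         - 1.0 for full frame, 1.6 for a Rebel
//   stabilization_stops - assumed benefit of IS/VR in stops (0 if none)
inline double HandheldShutterSeconds(double focal_length_mm,
                                     double crop_factor,
                                     double stabilization_stops)
{
    assert(focal_length_mm > 0.0 && crop_factor > 0.0);
    double base = 1.0 / (focal_length_mm * crop_factor);
    // Each stop doubles the amount of time the shutter may stay open.
    return base * std::pow(2.0, stabilization_stops);
}
```

For the example above, HandheldShutterSeconds(100.0, 1.6, 0.0) gives 1/160 second. A real implementation would presumably shave a safety margin off this and round to a shutter speed the camera supports.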

Once a shutter speed is chosen that will stack the odds in our favor against camera shake, the camera picks an ISO, which fills in the last remaining parameter. The settings menu should allow the user to set the maximum permissible ISO, and if the camera needs to pick an ISO higher than allowed, it just won't take the picture (much like it doesn't take a picture when it can't focus). So why have ISO be the final parameter as opposed to the first (as with Av mode)? Well, there are two reasons which spring to mind:

  1. I can do more with a photo with high noise levels than with a blurry photo. In both cases we're missing information and can't fully recover the original source data. But algorithmically, we can do a better job removing a certain amount of noise from an image than we can trying to make a blurry image sharp, which in most cases is a lost cause.
  2. Higher-end, more expensive DSLRs respond much better at high ISO settings than cheaper ones. A blurry photo is a blurry photo on any camera, but a noisy photo on a cheap camera may be a clean photo on an expensive one.

So how would this work in practice? Wikipedia has a good page on exposure value here. When a camera meters a scene it effectively comes up with a single exposure value (EV), which is a number the 3 camera settings of interest must combine to produce. Plotted logarithmically in 2D, with time on one axis, and the f-number on the other, we get diagonal lines of constant exposure value as seen here.

If we extended this concept into 3D by adding an extra dimension for ISO (which is probably what cameras are doing anyway, I'd guess?), we'd get planes of constant exposure instead. The EV metered by the DSLR would naturally dictate the plane we need to use. This, together with an f-number from the user, would allow the camera to compute a "smart" shutter speed, set the resultant ISO, and voilà.
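
To make that last step concrete, here is a small sketch of solving for ISO. It assumes the camera meters the scene as an exposure value referenced to ISO 100, and uses the standard relation EV = log2(N^2 / t) with one extra stop per doubling of ISO. PickHandheldIso and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Solves for the ISO that lands on the metered exposure plane, given the
// user's f-number and the "smart" shutter speed.
// Returns false (don't take the picture) if the ISO exceeds the user's limit.
inline bool PickHandheldIso(double ev_at_iso_100,
                            double f_number,
                            double shutter_seconds,
                            double max_iso,
                            double *iso_out)
{
    assert(f_number > 0.0 && shutter_seconds > 0.0 && iso_out);
    // EV at ISO 100 = log2(N^2 / t) - log2(ISO / 100), rearranged for ISO.
    double iso = 100.0 * (f_number * f_number) /
                 (shutter_seconds * std::pow(2.0, ev_at_iso_100));
    if (iso > max_iso)
        return false;
    *iso_out = iso;
    return true;
}
```

With a metered EV of 7, f/4 and 1/128 second this works out to ISO 1600. A camera would also clamp to its lowest ISO and round to the steps it supports, which is left out here.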

Sports photography, where we want to freeze motion, could also benefit from a variation on this mode. Instead of computing an appropriate handheld shutter speed, the camera would use a speed suitable for what's being shot, say 1/1000 second for a football game. The final step would remain the same and an appropriate ISO value would be computed.

This is a relatively new subject to me. Do cameras already do this, perhaps?

[Edit: it's been pointed out that handheld skill, fatigue, and environment also affect the optimal shutter speed. I think these would best fit into a single shutter speed compensation setting (in much the same spirit as cameras currently have an EV compensation setting).]

Thursday, February 24, 2011

Improved Normal-map Distributions

[Note, this was originally posted here on AltDevBlogADay]
Many moons ago at Insomniac, we used to use a partial derivative scheme to encode normals into texture maps. Artists complained that it was having detrimental effects on their normal maps, and rightly so. Then one day, our resident math guru Mike Day turned me onto this little trick in order to make amends. I hadn’t come across it before so thought it might be worth sharing.
A common technique for storing normals in a texture map is to compact them down to a 2 component xy pair and reconstruct z in the pixel shader. This is commonly done in conjunction with DXT5 compression since you can store one component in the DXT rgb channels and the other in the DXT alpha channel, and they won’t cross pollute each other during DXT compression stages. DirectX 10 and 11 go a step further and introduce a completely new texture format, BC5, which caters for storing 2 components in a texture map explicitly.
The code to reconstruct a 3 component normal from 2 components typically looks like this:

    float3 MakeNormalHemisphere(float2 xy)
    {
        float3 n;
        n.x = xy.x;
        n.y = xy.y;
        // saturate keeps the sqrt argument from going negative
        n.z = sqrt(1 - saturate(dot(n.xy, n.xy)));
        return n;
    }

Graphically, what we’re doing is projecting points on the xy plane straight up and using the height of intersection with a unit hemisphere as our z value. Since the resulting (x, y, z) point lies on the surface of the unit sphere, no normalization is required.



Is this doing a good job of encoding our normals? Ultimately the answer depends on the source data we’re trying to encode, but let’s assume that we make use of the entire range of directions over the hemisphere, with a good proportion of those nearing the horizon. In the previous plot, the quads on the surface of the sphere are proportional to the solid angle covered by each texel in the normal map. As our normals near the horizon, we can see these quads are getting larger and larger, meaning that we are losing more and more precision as we try to encode them. This can be illustrated in 3D by plotting the ratio of the differential solid angle over the corresponding 2D area we’d get from projecting it straight down onto the xy plane. Or equivalently, since the graph is symmetrical around the z-axis, we can take a cross section and show it on a 2D plot.


The vertical asymptotes at -1 and 1 (or 0 and 1 in uv space if you prefer) indicate that as we near the borders of our normal map texture, our precision gets worse and worse until eventually we have none. Your mileage may vary, but in our case, we tended to use wide ranges of directions in our normal maps and would like to trade off some of the precision in the mid-regions for some extra precision at the edges. Precision in the center is the most important so let’s try and keep that intact. What options do we have? Well it turns out we can improve the precision distribution to better meet our needs with a couple of simple changes…
Nothing is constraining us to using a hemisphere to generate our new z component. By experimenting with different functions which map xy to z, better options become apparent. One function we have had success with is that of a simplified inverted elliptic paraboloid (that’s a mouthful, but it is simpler than it sounds).
z = 1 - (x*x + y*y)
This is essentially a paraboloid with the a and b terms both set to 1, and inverted by subtracting the result from 1. Here is a graphical view:



Notice that we don’t have the large quads at the horizon anymore and overall, the distribution looks a lot more evenly spread over the surface. Wait a second though… this doesn’t give us back a normalized vector anymore! Fortunately the math to reconstruct the normalized version is roughly the same as before in terms of cost. We need to do a normalize operation but save on the square root.

    float3 MakeNormalParaboloid(float2 xy)
    {
        float3 n;
        n.x = xy.x;
        n.y = xy.y;
        n.z = 1 - saturate(dot(n.xy, n.xy));
        return normalize(n);
    }

Below, the overlaid green plot shows the angle to texture area ratio of the new scheme. The precision is still highest in the center where we want it, and we’re trading off some precision in the mid-ranges for the ability to better store normals near the horizon.
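
For anyone wanting to reproduce curves like these, the ratio of solid angle to texture area can be written in closed form for both encodings. The sketch below is my own derivation rather than the code behind the plots: it uses dω = sin θ dθ dφ and dA = r dr dφ, where r is the distance from the texture center and θ is the angle of the decoded normal from the z axis. The function names are made up.

```cpp
#include <cassert>
#include <cmath>

// Solid angle covered per unit of xy area at radius r, hemisphere encoding.
// Here theta = asin(r), which gives 1 / sqrt(1 - r^2): unbounded as r -> 1.
inline double HemisphereRatio(double r)
{
    assert(r >= 0.0 && r < 1.0);
    return 1.0 / std::sqrt(1.0 - r * r);
}

// Same ratio for the inverted paraboloid z = 1 - r^2.
// Here tan(theta) = r / (1 - r^2), which gives (1 + r^2) / d^(3/2)
// with d = r^2 + (1 - r^2)^2. This stays finite all the way to r = 1.
inline double ParaboloidRatio(double r)
{
    assert(r >= 0.0 && r <= 1.0);
    double r2 = r * r;
    double d = r2 + (1.0 - r2) * (1.0 - r2);
    return (1.0 + r2) / (d * std::sqrt(d));
}
```

A larger value means each texel has to cover more directions, so precision is lower. Both start at 1 in the center; the paraboloid rises faster through the mid-range but tops out at 2 on the border, where the hemisphere version blows up.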



Looking at the 3D graphs, another observation is that we’re wasting chunks of valid ranges in our encodings. In fact, a quick calculation shows that ~21% of the possible xy combinations are going to waste: only the unit disc is used, and it covers π/4 (about 79%) of the [-1, 1] square. A straightforward extension to the scheme would be to use a function which covers the entire [-1, 1] range in x and y to make use of this lost space. There are an infinite number of such functions, but a particularly simple one is this sort of dual inverted paraboloid shape.
z = (1-x^2)(1-y^2)
which looks like:



One downside of this is that we need a little more math in the shaders in order to reconstruct the normal. Because of this, on current generation consoles, we went with the basic inverted paraboloid encoding. Going forward though, with the ratio of ALU ops to memory accesses still getting higher and higher, the latter scheme might make more sense to squeeze a little extra precision out.
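
As an illustration of what the decode side of the square variant might involve, here is a CPU-side C++ sketch. It simply plugs z = (1-x^2)(1-y^2) into the same normalize step as before; the function name and the Vec3 type are assumptions, and this is not the shader we shipped.

```cpp
#include <cassert>
#include <cmath>

struct Vec3
{
    float x, y, z;
};

// Rebuilds a unit normal from an xy pair stored with the
// z = (1 - x^2)(1 - y^2) mapping. x and y are expected in [-1, 1].
inline Vec3 MakeNormalDualParaboloid(float x, float y)
{
    assert(x >= -1.f && x <= 1.f && y >= -1.f && y <= 1.f);
    Vec3 n;
    n.x = x;
    n.y = y;
    n.z = (1.f - x * x) * (1.f - y * y);
    // The length is never zero: z is 1 when x and y are both 0.
    float inv_len = 1.f / std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x *= inv_len;
    n.y *= inv_len;
    n.z *= inv_len;
    return n;
}
```

The whole border of the texture now maps to the horizon (z = 0), corners included, so no xy combination is left unused.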
Finally, here is the code used when generating the new normal map texels (the initial circular version, not the second square version). The input is the normal you want to encode and the output is the 2D xy which is written into the final normal map.

    Vec2 NormalToInvertedParaboloid(Vec3 const &n)
    {
        // Solve a*t^2 + b*t + c = 0 for the scale t at which the ray along n
        // hits the paraboloid z = 1 - (x*x + y*y).
        float a = (n.x * n.x) + (n.y * n.y);
        float b = n.z;
        float c = -1.f;
        Vec2 p;
        if (a < 1e-12f)
        {
            // n points straight along z; avoid dividing by zero
            p.x = 0.f;
            p.y = 0.f;
            return p;
        }
        float discriminant = b*b - 4.f*a*c;
        float t = (-b + sqrtf(discriminant)) / (2.f * a);
        p.x = n.x * t;
        p.y = n.y * t;
        return p;                              // scale and bias as necessary
    }
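
A quick way to gain confidence in an encoder like this is to round-trip a few normals through it and the decode function. The sketch below is a self-contained C++ version of that check; the names EncodeParaboloid, DecodeParaboloid and RoundTripError are made up, the decode mirrors MakeNormalParaboloid on the CPU, and the guard for normals pointing straight along z is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { float x, y; };
struct Vec3 { float x, y, z; };

// Unit normal -> xy on the paraboloid z = 1 - (x*x + y*y).
inline Vec2 EncodeParaboloid(Vec3 const &n)
{
    float a = n.x * n.x + n.y * n.y;
    Vec2 p = {0.f, 0.f};
    if (a > 1e-12f)   // straight-up normals encode to the texture center
    {
        float t = (-n.z + std::sqrt(n.z * n.z + 4.f * a)) / (2.f * a);
        p.x = n.x * t;
        p.y = n.y * t;
    }
    return p;
}

// xy -> unit normal, the CPU equivalent of the shader decode.
inline Vec3 DecodeParaboloid(Vec2 const &p)
{
    float d = p.x * p.x + p.y * p.y;
    if (d > 1.f) d = 1.f;                       // saturate
    Vec3 n = {p.x, p.y, 1.f - d};
    float inv_len = 1.f / std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    n.x *= inv_len;
    n.y *= inv_len;
    n.z *= inv_len;
    return n;
}

// Largest per-component difference after encoding and decoding (x, y, z).
inline float RoundTripError(float x, float y, float z)
{
    Vec3 n = {x, y, z};
    Vec3 r = DecodeParaboloid(EncodeParaboloid(n));
    float ex = std::fabs(r.x - n.x);
    float ey = std::fabs(r.y - n.y);
    float ez = std::fabs(r.z - n.z);
    return std::fmax(ex, std::fmax(ey, ez));
}
```

This runs at full float precision, so it only validates the math. Measuring the real-world error would also need the scale, bias and quantization to the texture format in the loop.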

Monday, January 24, 2011

Let’s get physical


[Note, this was originally posted here on AltDevBlogADay]
Not just a catchy Olivia Newton-John slice of the ’80s but also a motto I find myself increasingly trying to incorporate into my professional work life these days. Before our HR dept. comes down on me, let me elaborate. A conversation I had with one of our cinematic guys last year went something like this:
Me: “What do you mean you needed to remake all the shaders on that character for the real-time cinematic?”
Him: “Well, we added some extra lights for the cinematic but then that made the specular response too bright, so I tweaked the shaders a little so they looked good, but now they don’t look good in game anymore.”
Me: “Hmmm, that’s not good.”
The issue in this case was that as he was fine tuning the specular response on these surfaces, the renderer was inadvertently adding light energy into the system, in effect making those shaders “hardcoded” to that particular lighting environment. The solution involved examining how we handled light in our shaders and reworking the process using physical principles. Essentially, by not allowing surfaces to reflect more light than is incident on them, we, for a large part, have been able to avoid this problem. In hindsight, this sounds very simple but physically based approaches can have deep implications throughout a rendering pipeline.
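To give a flavor of what “not reflecting more light than is incident” means in numbers, here is a small C++ sketch. It is only an illustration, not what our shaders did: it takes a plain cosine-power specular lobe centered on the surface normal and numerically integrates how much of the incoming energy leaves the surface, with and without the standard (n + 2) / 2π normalization factor. The function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Integrates lobe * cos(theta) over the hemisphere, where the lobe is
// cos^exponent(theta), optionally scaled by (exponent + 2) / (2 * pi).
// A result above 1 means the surface reflects more light than it receives.
inline double ReflectedEnergy(double exponent, bool normalized)
{
    assert(exponent >= 0.0);
    const double pi = 3.14159265358979323846;
    const int steps = 100000;
    const double d_theta = (0.5 * pi) / steps;
    const double scale = normalized ? (exponent + 2.0) / (2.0 * pi) : 1.0;

    double sum = 0.0;
    for (int i = 0; i < steps; ++i)
    {
        double theta = (i + 0.5) * d_theta;            // midpoint rule
        double lobe = scale * std::pow(std::cos(theta), exponent);
        // cosine foreshortening times the ring of solid angle at this theta
        sum += lobe * std::cos(theta) * 2.0 * pi * std::sin(theta) * d_theta;
    }
    return sum;
}
```

Without the factor, the reflected energy is 2π / (n + 2), so it swings from about 2.1 at an exponent of 1 down to a few percent for tight highlights. Any exponent tweak then changes the overall brightness too, which is the kind of coupling that ties a shader to one lighting setup. With the factor, the total stays at 1 whatever the exponent.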
It’s all about context
As with almost every other area of programming, one of the particularly useful questions to ask is “what’s my context?”, and this is no different. This discussion only applies to games where our goal is to create authentic renderings of reality, which seems to be the majority of 3D games, and is especially the case with AAA titles. The points here may not apply if your goal is non-photorealistic rendering.
Why should we do it
Here are some reasons why physically based rendering is a desirable property of your rendering engine (assuming reality is your goal). Of course, a true physical basis isn’t always achievable in practice but making decisions based on a physically motivated thought process is a very powerful tool indeed.
  • We need to start somewhere. That somewhere might as well be physically based rather than arbitrary.
  • Author once, reuse everywhere. In nature, if we move an object from an indoor environment to an outdoor environment, it just looks right. You can’t make it look wrong. This is a very desirable goal for your renderer also.
  • Games are getting bigger. The manpower needed to make an AAA title is huge and we need to be smarter about asset creation. In practical terms, this means exposing fewer sliders for artists to tweak. The more sliders you expose, the larger the space of possible permutations they can use to get a desired result, and the more onus on them to home in on the optimal combination. On top of that, if one slider needs to be tweaked depending on the value of another slider, as is commonly the case, then these kinds of dependencies can add up to a lot of extra work over the course of a project. One goal therefore would be to remove the bad, unhelpful permutations and just expose useful sliders and ranges to work with. One solid way to accomplish this is to base your rendering on reality, and let nature fill in most of the blanks.
  • We can quantify rendering algorithms. Almost everything we do rendering-wise in games is an approximation of what’s going on in reality. At the end of the day, we’ve still got to hit 30/60fps and so we strive to make the smartest tradeoffs. I’ve had discussions with co-workers where we’re intuiting about the pros and cons of a particular rendering algorithm. Both of us would have a mental picture in our heads of what the results would look like but in a hazy abstract kind of way. Well, if you can compare it to ground truth images (which I’ll talk about later) then it’s much easier to visualize and quantify the shortcomings of different approaches.
  • Going forward, as processing power increases, real-time renderers are only going to become more and more physically based. With this generation of consoles, we’ve made large strides forward in achieving realism due to having the horsepower to incorporate more physically based behavior than we could in the past. For example, linear light computations and HDR buffers are now commonplace. Even with these current-gen consoles, we’ve already seen some progress toward dynamic global illumination solutions. SSAO convincingly mimics short range occlusion. Crytek’s light propagation volumes and Geomerics’ Enlighten spring to mind also. Next generation is where we can expect this type of tech to really take center stage.
But what about artistic expression?
This is a common reaction people sometimes bring up when discussing physically based rendering and I totally see their point. In the past, sliders could go up to 11 and beyond, and sometimes that can be just what a particular scene in the game needs. So to be clear, I’m not necessarily advocating we don’t allow that but just that we’re aware that with this freedom comes extra responsibility. The pros and cons of anything which prevents reusability, such as the specular example above, should be seriously considered. The more frequently an asset could potentially be reused, the more consideration it should be given.
It can be instructive to look to the world of film for where the creative limits we could hope to achieve via physically based rendering might be. It’s plain to see that a film like The Matrix has a very different look and feel than something like a Wallace and Gromit film. Yet both are results of filming real light bouncing off of real surfaces through real cameras. Animated films from the likes of Pixar and DreamWorks, and games such as LittleBigPlanet, are very much physically motivated in terms of rendering. Yet, contrast these to the latest Hollywood action blockbuster and we can see there’s actually quite a wide creative palette at our disposal (as shown below), all without violating (well maybe we’re violating, but hopefully not abusing) physical laws. Post effects such as motion blur and depth of field, as well as cinematography / lighting setups will play a much larger part in achieving a desired mood going forward.

Examples of physically based scenes: (1) the stylized post processing of Sin City, (2) stop motion in Wallace and Gromit, (3) synthetic image creation in Up, and (4) runtime global illumination in LittleBigPlanet.
So, yes, control is diminished in an absolute sense of what outputs you get from the lighting pipeline, but hopefully we are removing a large chunk of the non-desirable outputs, leaving primarily the desirable outputs remaining.
First steps toward physical rendering
When I first embarked down the path of physical correctness, my initial approach was to be very thorough when implementing lighting related code, to make sure equations and units were done “by the book” at each stage and leave it at that. In hindsight, this is the wrong way to approach the problem. I never knew for sure if there was a bug or not since I had no concrete frame of reference. Even if something looked right, I didn’t know definitively if it was right.
A much easier and more robust approach is to simply compare your results to “ground truth” images (akin to running regression tests on a codebase). A ground truth image is a rendering of what the scene should look like if you were to remove all resource constraints and compute the image to the best of our working knowledge of light (*). For this purpose, I recommend taking a look at the book Physically Based Rendering - From Theory to Implementation. I was drawn to it since the full source code is freely available and impeccably documented. As well as being thoroughly modern, it has a nice simple scene description language and obviously, as implied by the name, the principles and units are physically based throughout.
I plan to delve deeper into the specifics for my next post. I’ll take a closer look at the rendering equation and walk through the process of how we can isolate each of the various terms for purposes of comparing it with what your runtime renderer is doing. There are some prerequisites for a renderer before even starting down this path however. All shader computations should be done in linear space using HDR render targets where appropriate. You can find a good AltDevBlogADay post on this topic by Richard Sim here. Essentially we can store data whatever way makes sense, but we want to convert it so that we’re dealing with conceptually linear data in the shader stages of the pipeline.
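As a small illustration of the “store it however you like, compute in linear” idea, here is a C++ sketch of the standard sRGB transfer functions. It assumes color textures are stored sRGB-encoded, which is common but not something stated above, and in practice the GPU can usually do this conversion for you through sRGB texture and render target formats. The function names are made up.

```cpp
#include <cassert>
#include <cmath>

// sRGB-encoded value in [0, 1] -> linear light, using the piecewise sRGB curve.
inline double SrgbToLinear(double c)
{
    assert(c >= 0.0 && c <= 1.0);
    return (c <= 0.04045) ? c / 12.92
                          : std::pow((c + 0.055) / 1.055, 2.4);
}

// Linear light in [0, 1] -> sRGB-encoded value, the inverse of the above.
inline double LinearToSrgb(double c)
{
    assert(c >= 0.0 && c <= 1.0);
    return (c <= 0.0031308) ? c * 12.92
                            : 1.055 * std::pow(c, 1.0 / 2.4) - 0.055;
}
```

A mid-gray texel stored as 0.5 is only about 0.21 in linear terms, so adding or scaling light on the stored values directly gives noticeably different results from doing the same math after conversion.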
Additionally, a goal is to have inputs authored or generated in known radiometric units. For example, background environment maps would ideally be in units of radiance, convolved diffuse cubemaps would ideally be in units of irradiance, point and spot lights ideally in units of intensity, etc. Hopefully the conversion to such units should fall out from the runtime to ground truth comparison process. And yes, the above steps can amount to a tremendous amount of work and pipeline modifications if you’re not already set up for it.
Traditional gamedev rendering wisdom states that if it looks good, then it is good. Well this is only part of the story; I would suggest that if it looks good and scales to large scale production, then it is good. Until next time…
(*) In practice, even offline renderers don’t account for all physical phenomena due to negligible cost/reward benefits. For example they generally collapse the visible light spectrum down to 3 coefficients and ignore polarization effects, amongst other things. This is totally fine for our purposes.