Monday, 28 May 2012

Steve, Otto and the green-screen

This post is not about MTF Mapper or lens sharpness. This story is about Steve.

And about chroma keying, commonly called "green screen" or "blue screen" effects. This technique is mostly needed by Hollywood. You cannot have an actor carry on a spirited argument with another actor while actually driving in real traffic. Nor can you quickly pop out to outer space to get a few shots of the starship Enterprise.

The typical solution is to capture the footage of your foreground objects, typically the actors or your model starship, against a uniformly coloured green background. Then you assign a transparency value (called an alpha channel) to each pixel in your image so that where the green background is showing through, you have full transparency. Hopefully the pixels that comprise the actor will have zero transparency, but you have to allow for partial transparency in the actor's hair, for example.

The problem can thus be stated as follows:
The green screen problem: Given a shot of the foreground object (actor or whatnot) against a uniform green (or blue) background, compute the correct transparency value for each pixel in the image.

Of course, you have to ensure that your foreground object does not have any parts that are close to the background colour, or these parts will become transparent.

James Blinn and co-author A.R. Smith, in a paper titled "Blue screen matting", have shown that the green screen problem is in fact an underdetermined problem, meaning that you have more variables to determine than you have measurements. In the simplest case, you know the background colour, in RGBA format, (Rb, Gb, Bb, 1.0). Note that the background is fully opaque, so its alpha is exactly 1.0. The foreground object's colour is (Rf, Gf, Bf, Af), where it is assumed that the RGB values have already been pre-multiplied by the alpha (opacity) value. This means that the colour of a pixel in our captured image is simply

(Ri, Gi, Bi, 1.0) = (Rf, Gf, Bf, Af) + (1-Af)*(Rb, Gb, Bb, 1.0)

We can break this down to give

Ri = Rf + (1-Af)*Rb

and similarly for the other two colours. This gives us three equations with four unknown values (Rf, Gf, Bf, Af), which cannot be solved uniquely.

So, despite the fact that the maths shows you that green-screen techniques cannot work, Hollywood insists on using it! Of course, if you are willing to add some assumptions (constraints) to the above equations, you can find some unique solutions. One of these constraints is that the foreground colour is distinct from the background colour, e.g., no green actors. Even under such limiting assumptions we often find that the results are not really all that great; just think of the flying-on-the-broomstick scene from Harry Potter and the Order of the Phoenix movie.

Blinn and Smith have shown that unique solutions can be obtained by capturing the foreground object against two different backgrounds. You also have to capture an image of each background without any foreground objects. Although this is somewhat of a bother, it allows you to obtain much better results in practice. Remember that the foreground object's opacity, Af, and its pre-multiplied colour (Rf, Gf, Bf) are the same in the two shots; only the background contribution changes, so the observed pixel colour will differ wherever the object is not fully opaque. Combining the equations from both images thus gives us six equations in only four unknowns, which fortunately pins down a unique (least-squares) solution, provided that the two backgrounds really do differ at each pixel.

The solution suggested by Blinn and Smith goes as follows:
Let D = (Ri,1 - Ri,2, Gi,1 - Gi,2, Bi,1 - Bi,2)
where  Ri,1 denotes the red value of our pixel of interest in image 1, and Ri,2 denotes the red value of the same pixel in image 2. In other words, D is the difference between the two input images containing the foreground objects against the two different backgrounds.
Similarly, let E = (Rb,1 - Rb,2, Gb,1 - Gb,2, Bb,1 - Bb,2)
denote the difference between the two background-only images, without the foreground objects, i.e., the pair of background-only shots you have to capture as part of this method.
The alpha value for a pixel is thus given by
Af = 1 - (D · E) / (E · E)
where  (D · E) denotes the dot product between the two vectors.

And that is all there is to this method; you can reconstruct the foreground object's red component as
Rf = (Ri,1 - (1.0 - Af)*Rb,1)
and similarly for the other two colours.
Now you can recombine the foreground objects with a new background as
Rn = Rf + (1.0 - Af)*Rk
where Rk denotes the red component of the new background image. Just apply the same pattern for blue and green components, and you are done. Remember, Rf is pre-multiplied with Af, which explains why the re-blending with a new background appears in this form.
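
For reference, here is a minimal sketch of this pipeline in C++ with OpenCV. This is not the actual program I used for the images below; the file names are placeholders, and the clamping of alpha to [0, 1] plus the small epsilon guarding against a zero denominator are my own additions. The four input shots are assumed to be pixel-aligned (see the Limitations section).

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>

int main() {
    // foreground shot against background 1 and 2, plus the two bare backgrounds
    cv::Mat i1 = cv::imread("fg_bg1.png"), i2 = cv::imread("fg_bg2.png");
    cv::Mat b1 = cv::imread("bg1.png"),    b2 = cv::imread("bg2.png");
    cv::Mat bk = cv::imread("new_background.png");
    i1.convertTo(i1, CV_32FC3, 1.0/255.0);
    i2.convertTo(i2, CV_32FC3, 1.0/255.0);
    b1.convertTo(b1, CV_32FC3, 1.0/255.0);
    b2.convertTo(b2, CV_32FC3, 1.0/255.0);
    bk.convertTo(bk, CV_32FC3, 1.0/255.0);

    cv::Mat D = i1 - i2;   // difference of the two composited shots
    cv::Mat E = b1 - b2;   // difference of the two backgrounds

    cv::Mat alpha(i1.size(), CV_32F);
    cv::Mat out(i1.size(), CV_32FC3);
    for (int y = 0; y < i1.rows; y++) {
        for (int x = 0; x < i1.cols; x++) {
            cv::Vec3f d = D.at<cv::Vec3f>(y, x);
            cv::Vec3f e = E.at<cv::Vec3f>(y, x);
            float de = (float)d.dot(e);
            float ee = (float)e.dot(e);
            float a = 1.0f - de / std::max(ee, 1e-6f);   // Af = 1 - (D.E)/(E.E)
            a = std::min(std::max(a, 0.0f), 1.0f);
            alpha.at<float>(y, x) = a;

            // pre-multiplied foreground: Ff = I1 - (1 - Af)*B1
            cv::Vec3f fg = i1.at<cv::Vec3f>(y, x) - b1.at<cv::Vec3f>(y, x) * (1.0f - a);
            // recombine with the new background: N = Ff + (1 - Af)*Bk
            out.at<cv::Vec3f>(y, x) = fg + bk.at<cv::Vec3f>(y, x) * (1.0f - a);
        }
    }

    cv::Mat alpha8, out8;
    alpha.convertTo(alpha8, CV_8U, 255.0);
    out.convertTo(out8, CV_8UC3, 255.0);
    cv::imwrite("alpha.png", alpha8);
    cv::imwrite("composite.png", out8);
    return 0;
}
```

The per-pixel loop is obviously not the fastest way to do this in OpenCV, but it maps one-to-one onto the equations above, which makes it easy to check.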

How well does this work? I called in the help of Steve and Otto to demonstrate. Here is the first shot against background 1 (click for a larger version; applies to all images in this post):
Input image 1, shot against background 1
This image contains several interesting things. Note that there is quite a bit of green in the foreground objects (duck's head, for example). The champagne flute is also a bit tricky, because we can see most of the background right through it. Otto's fur would have presented endless trouble if you tried to remove the background manually in Gimp, for example. Yes, the dog is called Otto (Steve is the lime).

And the second shot:
Input image 2, shot against background 2
Lastly, a shot of each of the backgrounds, after physically removing the foreground objects from the scene.
background 1

background 2

I pushed these four images through a quick-and-dirty implementation of Blinn and Smith's method that I cobbled together in C++ using OpenCV. Here is the resulting alpha mask produced by the program without any manual intervention:
Alpha mask produced using Blinn and Smith's method
Note how solid Steve's alpha mask is --- the fact that he's green, and that one of the backgrounds happened to be green, made no difference whatsoever. Otto's interior is also completely opaque, but on the edges we see some fine details like these:

100% crop of the alpha mask near Otto's head

Notice how the fine fibres are partially transparent (top right), and how the interior gaps in the fur are also partially transparent (bottom left-ish).

The champagne flute has also been extracted quite nicely.

Now for the final test: to recombine the foreground objects with a new background. Here is the background image:
PovRay's benchmark scene should do nicely as a new background
And here is the blended result:
Recombining the extracted foreground with a new background
The new composition is essentially flawless. There are no tell-tale fringes or other signs that are commonly seen with green-screen techniques.

Here is a close-up of Otto's head after blending:
100% crop of Otto's head after blending

Of course, if you concentrate a bit, you will see something is amiss. Although the champagne flute has been composited perfectly, it is quite clear that Steve (the lime) is correctly refracted through the right-hand side of the glass, but that the PovRay logo in the background suffered no distortion on the left side of the glass. This is obviously a shortcoming of all such recomposition techniques, so I guess the only solution is to avoid strongly-refractive foreground objects.

Limitations

Obviously it is somewhat inconvenient to have to shoot the subject against two different backgrounds. It will be downright impossible to use this technique with a toddler, for example. An adult might be able to hold still enough, provided you use something like a television to display your background like I did above.

This technique is fine for inanimate objects, though.

Oh yes, of course you have to keep the camera very, very still during the whole capture process. I used a sturdy tripod + mirror lock-up + IR shutter release. In theory, you could use image-to-image registration to correct for camera movement, but I have not tried it yet for this problem.

I recently saw another paper (published in the proceedings of SIGGRAPH'06) by McGuire and Matusik where they extend this technique to live video. The secret is to use a polarised background material, which will obviously present different images through different polarisations of the light. A special camera uses a prism to send the incoming light to two different sensors according to its polarisation. Although this means the technique falls into the category of "requires special tools", it is still pretty cool. And it produces perfect results, just like the method above. So why didn't they use this for the dreaded broomstick scene in Harry Potter and the Order of the Phoenix?

Notes:

No animals were harmed during the production of this article. Steve, however, did not make it. That was some good lemonade, though!

Sunday, 27 May 2012

Pixels, AA filters, Box filters and MTF

What happens to a sensor's MTF curve when you remove the Anti-Aliasing (AA) filter (also called the Optical Low-Pass Filter, or OLPF)? I have chosen a rather long path to reach the answer, so this article is a bit longer than ideal, but I hope that you will agree that it is worth the effort to slog through it :)

To simplify the discussion, I am going to limit myself to grayscale sensors, e.g., the Leica M-Monochrom, or a lab camera such as a Prosilica. With no Bayer filter, there is little reason to add an AA filter to a sensor ... or is there?

First up we have to consider how such a grayscale sensor works. The sensor consists of a grid of photosites. The distance between the centres of two neighbouring photosites (left-right or top-bottom neighbours) is called the pixel pitch. In the ideal case, the pixel can be thought of as a little square with side lengths equal to the pixel pitch. In the real world, there has to be some space between adjacent pixels, and some space for circuitry to facilitate the read-out of a photosite. The fraction of the pixel that is actually collecting photons is called the fill factor. In the old days, this meant that only about 50% or so of your total sensor area was actually collecting photons, but with advances such as back-side illumination and microlens arrays we are doing much better today at about 90% fill factor.

Real world notwithstanding, it is convenient to think of a photosite as a little square, and a sensor as a grid of tightly packed photosites with practically no gaps between them.

This abstraction yields a very specific point spread function shaped like this:
The intensity of a grayscale sensor pixel is thus the sum of all the light collected over the surface of the photosite.

Enter the box filter

A 1D box filter is a filter that looks like this:

Clearly, the point spread function of an ideal square photosite is just the 2D version of a box filter.

The box filter pops up many times in signal processing contexts. It also crops up implicitly in unexpected places, but more on that later.

At this point I have to assume that you are familiar with convolution (check out the Wikipedia entry for a refresher). The convolution of a box filter f with itself (f * f) yields a triangular function like the black curve in this plot:
The red curve represents repeated convolution of a box filter with itself three times (((f * f) * f) * f), and the green curve 5 times. And yes, you guessed correctly, it does look a little like a Gaussian distribution. This should not surprise you any more.

I shall state without proof that repeated convolution of a box filter with itself does indeed converge on a Gaussian shape. The standard deviation grows with the number of convolutions; for a box of width 1 pixel, the five-fold convolution shown above (the green curve) already approximates a Gaussian with a standard deviation of √(6/12) = 1/√2 ≈ 0.7071 pixels.
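
If you want to convince yourself of this numerically, here is a throwaway C++ snippet (my own, not part of MTF Mapper) that convolves a finely sampled unit-width box with itself five times and compares the result to a Gaussian with standard deviation 1/√2:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int res = 200;                         // samples per pixel
    const double pi = 3.14159265358979;
    std::vector<double> box(res, 1.0 / res);     // unit-area box, 1 pixel wide
    std::vector<double> f = box;

    // five-fold self-convolution: six boxes convolved together in total
    for (int n = 0; n < 5; n++) {
        std::vector<double> g(f.size() + box.size() - 1, 0.0);
        for (size_t i = 0; i < f.size(); i++) {
            for (size_t j = 0; j < box.size(); j++) {
                g[i + j] += f[i] * box[j];
            }
        }
        f = g;
    }

    // compare against a Gaussian with sigma = sqrt(6/12) = 1/sqrt(2) pixels
    const double sigma = 1.0 / sqrt(2.0);
    const double centre = 0.5 * (f.size() - 1.0);
    double maxdiff = 0;
    for (size_t i = 0; i < f.size(); i++) {
        double x = (double(i) - centre) / res;   // position in pixels
        double gauss = exp(-x * x / (2 * sigma * sigma)) / (sigma * sqrt(2 * pi));
        maxdiff = std::max(maxdiff, fabs(f[i] * res - gauss));
    }
    // the difference should come out small (around 0.01) on a peak of roughly 0.56
    printf("max |six-fold box - Gaussian| = %.4f\n", maxdiff);
    return 0;
}
```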

The unexpected upshot (well, I certainly was not expecting it the first time) of this is that you can approximate a Gaussian blur by repeatedly blurring an image with a box filter. How do you blur an image with a box filter?

Well, convolution of an image with a 2D box filter that is 3x3 pixels in size is equivalent to replacing each pixel's value with the unweighted average of the group of 3x3 pixels centred at the pixel you are replacing. Repeating this box blur makes the effective kernel ever more Gaussian-like; after roughly six passes, for example, it approximates a Gaussian blur with a standard deviation of 3/√2 ≈ 2.12 pixels.
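
As a practical illustration, the sketch below (assuming a reasonably recent OpenCV build and a hypothetical grayscale test image called input.png) blurs an image six times with a 3x3 box filter and compares the result to a single Gaussian blur. One subtlety: the discrete 3-tap box has a slightly smaller standard deviation (√(2/3) per axis) than the continuous 3-pixel box, so six passes match a Gaussian with sigma = √(6·2/3) = 2.0 rather than 2.12; the two results should nevertheless be very close.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <cstdio>

int main() {
    cv::Mat img = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;
    img.convertTo(img, CV_32F, 1.0/255.0);

    cv::Mat boxed = img.clone();
    for (int i = 0; i < 6; i++) {
        cv::Mat tmp;
        cv::blur(boxed, tmp, cv::Size(3, 3));   // 3x3 unweighted average
        boxed = tmp;
    }

    cv::Mat gaussian;
    cv::GaussianBlur(img, gaussian, cv::Size(0, 0), 2.0);  // sigma = 2.0 pixels

    double rms = cv::norm(boxed, gaussian, cv::NORM_L2) / sqrt((double)img.total());
    printf("RMS difference between 6x box blur and Gaussian blur: %g\n", rms);
    return 0;
}
```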

This observation, namely that repeated convolution with a box filter converges on a Gaussian, will be recycled in a later post, so do not page it out of your main memory just yet ...

Area-weighted Anti-Aliasing in computer graphics rendering

Rendering an image of a black square on a white background is relatively straightforward. You scan through each pixel, and test whether the current pixel is inside the square (pixel set to black) or outside (pixel set to white). This produces perfect images as long as the square is aligned with pixel boundaries, which is rather limiting.

If you rotate your black square relative to the pixel grid, a third possibility arises: a pixel might fall partially inside the square, and partially outside. The most intuitive solution to the problem is to simply set the intensity of the pixel proportional to the degree to which it is inside the square. In computer graphics terms, you clip the square to the boundaries of your pixel, and measure the area of the clipped polygon. Here is a picture:


This allows us to draw the black square with much smoother-appearing edges, since we are effectively using our radiometric resolution (different shades of gray) to compensate for our inadequate spatial resolution (large pixels) --- a practice commonly referred to as Anti-Aliasing in computer graphics.

We can also view this from a sampling perspective. First, we generate a number of (x,y) coordinates within each pixel, preferably well spread throughout the current pixel. By counting how many of these coordinates also fall inside the square, we estimate the fraction of the current pixel that is covered by the square, which, in the limit, is identical to the fragment area that we obtained by clipping the square to the pixel boundaries above.

Lastly, we can consider the case where we sample at coordinates outside of the current pixel too, but that we apply a weight to each sample. If the weight is zero everywhere outside the current pixel, and exactly 1.0 inside the current pixel, then we can say that our weights represent the point spread function of the current pixel, which happens to be a 2D box function by construction.

This shows us that there are three equivalent methods of obtaining the correct intensity of a pixel that is only partially covered by the square:
  1. We can compute the area of intersection between the current pixel's boundaries and the square, or
  2. We can perform unweighted sampling at points spread uniformly throughout the current pixel, or
  3. We can perform weighted sampling at arbitrary points, and weight each point relative to the box filter centred at the current pixel (obviously we will improve our efficiency by only sampling close to the current pixel).
By now you can see the link between Area-weighted anti-aliased rendering and an image sensor with square pixels. If our sensor is observing a knife edge target (or just a black rectangle on a white background), then the intensity of each photosite will be proportional to the area covered by the black square, i.e., the sensor is implicitly applying a box filter while sampling the real-world scene.
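
To make method 2 concrete, here is a toy C++ sketch (entirely my own, with an arbitrary square position and rotation) that estimates the coverage of a single pixel by testing a grid of sample points inside it:

```cpp
#include <cmath>
#include <cstdio>

// is (x, y) inside a square centred at (cx, cy), rotated by 'angle' radians?
bool inside_square(double x, double y, double cx, double cy,
                   double half_side, double angle) {
    // express the point in the square's own (rotated) coordinate frame
    double dx =  (x - cx) * cos(angle) + (y - cy) * sin(angle);
    double dy = -(x - cx) * sin(angle) + (y - cy) * cos(angle);
    return fabs(dx) <= half_side && fabs(dy) <= half_side;
}

int main() {
    const int n = 64;                  // 64x64 samples per pixel
    const int px = 17, py = 12;        // a pixel straddling the square's edge
    int hits = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double x = px + (i + 0.5) / n;   // samples spread uniformly
            double y = py + (j + 0.5) / n;   // within the pixel
            if (inside_square(x, y, 12.0, 12.0, 5.0, 0.3)) hits++;
        }
    }
    double coverage = double(hits) / (n * n);
    // black square on a white background: intensity = 1 - coverage
    printf("estimated coverage = %.4f -> pixel intensity = %.4f\n",
           coverage, 1.0 - coverage);
    return 0;
}
```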

MTF of a box filter

The point spread function of a non-Bayer sensor without an AA filter is simply a box filter of width equal to the pixel pitch. The point spread function of a synthetic image generated with an area-weighted sampling algorithm is also a box of width equal to the pixel size, which means we can use the one to study the other. So what does the MTF of this box function look like?

Again, I shall state without proof that the Fourier transform of the box function is a sinc function, where sinc(x) = sin(x)/x. Thus, if our point spread function is a box of width 1 pixel, then our MTF will simply be |sinc(πf)| = |sin(πf)/(πf)|, with f measured in cycles per pixel, which looks like this:
A couple of things are important about this MTF curve (black curve):

  • Recall that in this case, a frequency of 0.5 cycles per pixel implies that our image contains a pattern of alternating black-and-white stripes that are exactly one pixel wide; together, one black stripe and one adjacent white stripe makes one cycle. This is the highest spatial frequency that can be represented correctly in our image --- if you try to make the stripes less than one pixel wide, then clearly you will not be able to preserve details exactly any more.
  • Note that the contrast drops to zero at twice the Nyquist frequency (1 cycle per pixel). When our stripes are exactly half a pixel wide, then the black and white pattern cancels exactly, leaving the entire image at 50% grey, hence zero contrast.
  • Also note that the box filter MTF curve is nonzero between 0.5 cycles per pixel and 1 cycle per pixel, which is highly undesirable. Aliasing is when a high frequency component masquerades as a lower frequency component (more on this below). As mentioned above, if the width of each stripe in a black-and-white pair falls below 1 pixel, then we can no longer represent this accurately in our image. 
  • At each integer k cycles per pixel, we can fit k pairs of black-and-white stripes inside one pixel, again cancelling each other and producing a 50% grey image. This is exactly where the MTF curve drops to zero each time.
  • The dashed grey curve is the MTF of a Gaussian PSF with a standard deviation of 0.568 pixels (MTF50=0.33), included for comparison.
The MTF curve of the box filter describes exactly the MTF of a synthetic image rendered using Area-weighted AA (as described above), and it will be a good model of the MTF of a grayscale image sensor in the absence of the lens MTF (i.e., no diffraction).
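
For reference, the little C++ snippet below (my own) evaluates both curves from the plot above at a handful of frequencies. At 0.9 cycles per pixel you should see the box MTF at roughly 0.11 and the Gaussian MTF well below 0.01 --- numbers that will come up again in the aliasing examples below:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.14159265358979;
    const double sigma = 0.568;                   // Gaussian PSF std dev, in pixels
    for (double f = 0.0; f <= 1.0001; f += 0.1) { // f in cycles per pixel
        double box_mtf   = (f == 0.0) ? 1.0 : fabs(sin(pi * f) / (pi * f));
        double gauss_mtf = exp(-2.0 * pi * pi * sigma * sigma * f * f);
        printf("f = %.1f c/p : box = %.3f, gaussian = %.3f\n",
               f, box_mtf, gauss_mtf);
    }
    return 0;
}
```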

Aliasing in practice

I have alluded to the fact that bad things happen when your scene contains detail that occurs at a higher frequency than what the sensor can capture, i.e., when you have details smaller than the pixels of your sensor. The type of aliasing I would like to highlight here is folding, which is when the high frequency information wraps around and re-appears as lower frequency information. To understand what it looks like, consider the following image:


This image does not exhibit any aliasing; it is merely to establish what we are looking for here. Firstly, the left panel is a stack of four sub-images (rows) separated by white horizontal bars. Each sub-image is simply a pattern of black-and-white bars, with both black and white bars being exactly 5 pixels wide. The four stacked sub-images differ only in phase, i.e., in each of the four rows the black-and-white pattern of bars is offset by a horizontal distance between 0 and 1 pixels in length.

The right panel is a 2x magnification of the left panel. Note that the third row in the stack is nice and crisp, containing only pure black and pure white. The other rows have some grey values at the transition between the black and white bars, because the image has been rendered with box-filtered anti-aliasing.

Here is the same image again, repeated to simplify comparisons further down:
Box filtered, bars are 5 pixels wide

Note that the edges of the bars exhibit the classical box-filtered anti-aliasing pattern: depending on the exact position of the bars relative to the pixels (i.e., the difference between the four rows), we see different shades of grey at the transitions. Contrast that with the same pattern rendered using a Gaussian filter (the one depicted by the dashed grey curve in the MTF plot above), rather than a box filter, to reduce aliasing:
Gaussian filtered (standard dev. = 0.568 pixels), bars are 5 pixels wide

Here we can see that the edges of the bars are noticeably more blurry, but they do appear much smoother than the version with box filtering above. Also note that the four rows now look much more alike, which is an improvement on the box filtered version.

Now for the promised frequency folding. If we define a cycle as one black bar followed by one white bar, then the 5-pixel wide bars give us a cycle, or period, of 10 pixels. Frequency is simply 1 over the period, or 1/10 cycles per pixel. We know that 0.5 cycles per pixel corresponds to the Nyquist limit, i.e., the highest frequency that can be represented, which corresponds to a pattern of a 1-pixel wide black bar followed by a 1-pixel wide white bar; since 1/10 cycles per pixel is comfortably below this limit, we can represent the bars accurately. Frequency folding dictates that a pattern with a frequency of 1 - (1/10) = 0.9 cycles per pixel will be aliased to a frequency of 1/10 cycles per pixel. A frequency of 0.9 cycles per pixel corresponds to a cycle of 1.1111' pixels in length, which implies that the bars will have a width of 0.5555' pixels, which we know cannot be represented accurately, since the bars are smaller than the pixels. If we render the same bar pattern with box filtering, but choose the bars to be 0.5555' pixels wide, we get this image:

Box filtered, bars are 0.5555' pixels wide


Not what you were expecting? One would have expected the bars to disappear into a uniform grey patch, since we know that 50% of each row should be black, and 50% should be white. We also know the bars are only supposed to be 0.5555' pixels wide, so why are we seeing bars that are 5 pixels wide? Well, this is a textbook case of aliasing --- the frequency of 0.9 cycles per pixel is aliased, or folded, around 0.5, which gives us 0.1 (0.5 - (0.9 - 0.5) = 0.1). We have seen above that 0.1 cycles per pixel gives us 10 pixels per cycle, or bars that are 5 pixels wide.

That explains the width of the bars. But why are they grey, and not black-and-white? The MTF plot of the box filter above provides the answer: at 0.9 cycles per pixel, contrast has dropped to roughly 10%, which manifests as both white and black moving closer to the intensity midpoint of 50% grey (remember, the images shown here are encoded with a gamma value of 2.2). If we see an image such as the one above (box filtered, bar width=0.5555' pixels), how do we know whether we are looking at a pattern with bar width 5 at 10% contrast, or a pattern with bar width 0.5555' pixels at 100% contrast? There is no way of telling, which is why aliasing is a destructive process. By the time you have captured the image with your sensor (without AA filter), you have already lost the ability to tell these two cases apart.
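
If you would like to reproduce this effect numerically, the small C++ sketch below (mine, not the renderer used for the figures) box-samples a bar pattern with bars 5/9 (= 0.5555') of a pixel wide and prints the resulting pixel intensities. You should see values hovering around 0.5, modulated by a low-contrast pattern that repeats every 10 pixels, i.e. apparent bars that are 5 pixels wide:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double period = 2.0 * 5.0 / 9.0;   // one black + one white bar = 1.1111' px
    const int oversample = 1000;             // sub-samples per pixel (the "box")
    for (int pixel = 0; pixel < 20; pixel++) {
        double sum = 0;
        for (int s = 0; s < oversample; s++) {
            double x = pixel + (s + 0.5) / oversample;
            double phase = fmod(x, period) / period;   // position within a cycle
            sum += (phase < 0.5) ? 0.0 : 1.0;          // black bar, then white bar
        }
        double intensity = sum / oversample;           // box-filtered pixel value
        printf("pixel %2d : %.3f\n", pixel, intensity);
    }
    return 0;
}
```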

Before concluding, we have to take a quick look at the same pattern, but filtered with a Gaussian point spread function:

Gaussian filtered (standard dev. = 0.568 pixels), bars are 0.5555' pixels wide
If you look closely, you can see that the grey patches are still composed of grey bands of slightly varying shades, but it is immediately clear that the contrast between the bands has dropped significantly. A Gaussian point spread function with a standard deviation of approximately 0.568 has an MTF curve described by the function exp(-6.365x²), which means that at 0.9 cycles per pixel, contrast has dropped to only 0.577%. Incidentally, this Gaussian point spread function appears to be quite close to the AA filter in a Nikon D40 or a Nikon D7000.

Personally, I prefer the fade-to-grey approach. While the box-filtered image (with bars of width 0.5555' pixels) appears to have detail at 0.1 cycles per pixel (bars of width 5 pixels), this detail is false, and was never present in the scene we captured with our sensor.

Conclusion

So which is better: a sensor with an AA filter, or one without? This depends a great deal on what you plan to do with that sensor. A sensor without an AA filter will be more susceptible to aliasing than one with a filter, but this depends critically on the rest of the parameters of the entire optical system.

For example, if your pixel pitch is rather small, say around 1.5 microns, like the sensors found in camera phones, then diffraction will act as a natural AA filter. For a larger pixel pitch of around 5 microns or larger, diffraction will only start acting as an AA filter at small relative apertures (say, beyond f/8, depending on physical sensor size). For large pixels at large apertures, the difference between an AA filter sensor, and one without, boils down to exactly the differences illustrated above. (Update: new post on taking diffraction into account).

Bayer sensors are another matter entirely, but suffice it to say that you must give up some resolution in order to reconstruct colours accurately with this type of sensor.

For photography, I would much rather have a slightly softer image, which can be sharpened afterwards, than a sharper image filled with artifacts that cannot be removed automatically. Remember, blurring the image after capture will not remove the aliasing; the blur must be applied before the sensor samples the incoming scene.

Lastly, always remember that aliasing is only present if your scene contains frequencies above the Nyquist limit of your sensor. Not all images captured without an AA filter will contain aliasing.