MTF Mapper: Limitations of using single-shot planar targets to perform automatic camera calibration

When you are trying to measure the performance of your system across the entire field, it is rather important to ensure that your test chart is parallel to your sensor. If you are not careful, then a slight tilt in your test chart could look very much like a tilted lens element if you are looking at the MTF values, i.e., two opposite corners of your MTF image would appear to be soft: is your lens titled along the diagonal, or is the chart tilted along the same diagonal?

My solution to this problem is to directly estimate the camera pose from the MTF test chart. I have embedded fiducial markers in the latest MTF Mapper test charts which will allow me to measure the angle between your sensor and your test chart. This post details a particular difficulty I encountered while implementing the camera pose estimation method as part of MTF Mapper.

The classical approach

Classical planar calibration target methods like Tsai [Tsai1987] or Zhang [Zhang2000] prescribe that you capture several images of your planar calibration target, while ensuring that there is sufficient translation and rotation between the individually captured images. From each of the images you can extract a set of correspondences, e.g., the location of a prominent image feature (corner of a square, for example) and the corresponding real-world coordinates of that feature.

This sounds tricky, until you realize that you are allowed to express the real-world coordinates in a special coordinate system attached to your planar calibration target. This implies that you can put all the reference features at z=0 in your world coordinate system (their other two coordinates are known through measurement with a ruler, for example), meaning that even if you moved the calibration object (rather than the camera) to capture your multiple calibration images, the model assumes that the calibration object was fixed and the camera moved around it.

A set of four such correspondences are sufficient to estimate a 3x3 homography matrix up to a scale factor, since four correspondences yields 8 equations to solve for the 8 free parameters of the matrix. A homography is a linear transformation that can map one plane onto another, such as mapping our planar calibration target onto the image sensor. For each of our captured calibration images we can solve these equations to obtain a different homography matrix. The key insight is that this homography matrix can be decomposed to separate the intrinsic camera parameters from the extrinsic camera parameters. We can use a top-down approach to understand how the homography matrix is composed.

To keep things a bit simpler, we can assume that the principal point of the system is fixed at the centre of the captured image. We can thus normalize our image coordinates so that the principal point maps to (0,0) in normalized image coordinates, and while we are at it we can divide the result by the width of the image so that x coordinates run from -0.5 to 0.5 in normalized image coordinates. This centering and rescaling generaly improves the numerical stability of the camera parameter estimation process. This gives us the intrinsic camera matrix K, such that

where f denotes the focal length of the camera. Note that I am forcing square pixels without skew. This appears to be a reasonable starting point for interchangeable lens cameras. We can combine the intrinsic camera parameters and the extrinsic camera parameters into a single 3x4 matrix P, such that

where the 3x3 matrix R represents a rotation matrix, and the vector t represents a translation vector. The extrinsic camera parameters R and t is often referred to as the camera pose, and represents the transformation required to transform from world coordinates (i.e., our calibration target local coordinates) to homogeneous camera coordinates. If we have multiple calibration images, then we obtain a different R and t for each image, but the intrinsic camera matrix K must be common to all views of the chart.

The process of estimating K and the set of R_i and t_i over all the images i is called bundle adjustment [Triggs1999]. Typically we will use all the available point correspondences (hopefully more than four) from each view to minimized the backprojection error, i.e., we take our known chart-local world coordinates from each correspondence, transform it with the appropriate P matrix, divide by the third (z) coordinate to convert homogeneous coordinates to normalized image coordinates, and calculate the Euclidean distance between this back-projected image point and the measured image coordinates (e.g., output of a corner-finding algorithm) of the corresponding point in the captured image. The usual recommendation is to use a Levenberg-Marquardt algorithm to solve this non-linear optimization problem to minimize the sum of the squared backprojection errors.

Strictly speaking, we usually include a radial distortion coefficient or two in the camera model to arrive at a more realistic camera model than the pinhole model presented here, but I am going to ignore radial distortion here to simplify the discussion.

Single-view calibration using a planar target

From the definition of the camera matrix P above we can see that even if we only have a single view of the planar calibration target, we can still estimate both our intrinsic and extrinsic camera parameters using the usual bundle adjustment algorithms. Zhang observed that when a planar calibration target is employed, we can estimate a 3x3 homography matrix H such that

where the vectors r₁ and r₂ define the first two basis vectors of the world coordinate frame in camera coordinates, and t is a translation vector. Since we require r₁ and r₂ to be orthonormal, the third basis vector of the world coordinate frame is just the cross product of r₁ and r₂. This little detail explains how the 8 free parameters of the homograph H are able to represent all the required degrees of freedom we expect in our full camera matrix P.

In the previous section we restricted our intrinsic camera parameters to a single unknown f, since both P_x and P_y are already know because we assume the principal point coincides with the image centre. With a little bit of algebraic manipulation we can see that Zhang's orthonormality constraints allows us to estimate the focal length f directly from the homography matrix H (see Appendix A below).

So this leaves me with a burning question: if we can estimate all the required camera parameters using only a single view of a planar calibration target, why do all the classical methods require multiple views (with different camera poses)?

Limitations of single-view calibration using planar targets

To answer that question, we simply have to find an example of where the single-view case would fail to estimate the camera parameters correctly. The simplest case would be to assume that our rotation matrix R is the 3x3 identity matrix (camera axis is perpendicular to planar calibration target), and that our translation vector is of the form [0 0 d] where d represents the distance of the calibration target from the camera's centre of projection. This scenario reduces our camera matrix P to

A given point [x y 0] in world coordinates is thus transformed to [fx fy d] in homogeneous camera coordinates. We can divide out the homogeneous coordinate to obtain our desired normalized image coordinates as [fx/d fy/d].

And there we see the problem: the normalized image coordinates depend only on the ratio f/d, which implies that we do not have sufficient constraints to estimate both f and d from this single view. The intuitive interpretation is simple to understand: you can always increase d, i.e., move further away from the calibration target while adjusting the focal length f (zooming in) to keep f/d constant without affecting the image captured by the camera.

This happens because there is no variation in the depth of the calibration target correspondence points expressed in camera coordinates, thus the depth-dependent properties of a perspective projection are entirely absent.

We can try to apply the formula in Appendix A to estimate the focal length directly from the homography corresponding to the matrix P above, but we quickly run into a divide-by-zero problem. This should give us a hint. If we choose to ignore the hint, we can apply a bundle adjustment algorithm to estimate both the intrinsic and extrinsic camera parameters from correspondences generated using the matrix P. All that this will achieve is that we will find an arbitrary pair of f and d values that satisfy the constant ratio f/d imposed by P.

The middle road

What happens if we have a slightly less pathological scenario? Let us assume that there is a small tilt between the calibration target plane and the sensor. For simplicity, we can just choose a rotation around the y axis so that

We know that for a small angle θ, sin(θ) ≈ 0, so our matrix P will be very similar to the sensor-parallel-to-chart case above. The corresponding homography H should be

We can apply the formula in Appendix A to H, which simplifies to f² = f², which is a relief. The question is: how accurately can we estimate the homography H using actual correspondences extracted from the captured images?

I know from simulations using MTF Mapper that the position of my circular fiducials can readily be estimated to an accuracy of 0.1 pixels under fairly heavy simulated noise. The objective now is to measure the impact of this uncertainty on the accuracy of the homography estimated using OpenCV's findHomography function. I start out with a camera matrix P like the one above with only a rotation around the y axis. A set of 25 points are generated on my virtual calibration target, serving as the world coordinates (with the same real-world dimensions as the actual A3 chart used by MTF Mapper). These are transformed using P to obtain the `perfect' simulated corresponding image coordinates representing the position of the fiducials. I perturb these perfect coordinates by adding Gaussian noise with a standard deviation of about 0.000020210 units, which corresponds to an error of 0.1 pixels, but expressed in normalized image coordinates (divided by 4948, the width of a D7000 raw image). Now I can systematically measure the uncertainty in the focal length estimated with the formula of Appendix A as a function of the angle between the chart and the sensor, θ. I ran 100000 iterations at a selection of angles, and calculated the difference between the 75th and 50th percentile of the estimated focal length as a measure of spread.

Figure 1

In Figure 1 we see that the spread of the focal length estimates increases dramatically once the angle θ drops below about 2 degrees. For the purpose of using the estimated camera pose to measure if you have aligned your chart parallel to your camera sensor, this is really terrible news: essentially, we cannot estimate the focal length of the camera reliably if the chart is close to being correctly aligned.

Figure 2

Figure 2 shows that the focal length estimate is relatively unbiased for angles above about 1 degree, but once the angle becomes small enough, we overestimate the focal length dramatically.

This experiment demonstrated that small errors in the estimated position of features (e.g., corners or centre of circular targets) leads to dramatic errors in focal length estimation. Intuitively, this makes sense, since the relative magnitude of perspective effects decreases the closer we approach a parallel alignment between the sensor and the calibration target. Since perspective effects depend on the distance from the chart, and the estimated distance from the chart is effectively controlled by the estimated focal length (assume the same framing), this seems reasonable.

I have tried using bundle adjustment, rather than homography estimation as an intermediate step, but clearly the problem lies with the unfavourable viewing geometry and the resulting subtlety of the perspective effects, not with the algorithm used to estimate the focal length. At least, as far as I can tell.

Hobson's choice

If we take the focal length of the camera as a given parameter, then the ambiguity is resolved, and we can obtain a valid, unique estimate of the calibration target distance d. This is not entirely surprising, since our assumed constrained intrinsic camera parameters depend only of the focal length f, i.e., K is known, thus the pose of the camera can be estimated for any given view, even the degenerate case where the calibration target is parallel to the sensor.

In other words, I see no way other than requiring the user to specify the focal length as an input to MTF Mapper. I will try to extract this information from the EXIF data when the MTF Mapper GUI is used, but it seems that not all cameras report this information. Fortunately, it seems that a user-provided focal length need not be 100% accurate in order to obtain a reasonable estimate of the chart orientation relative to the camera.

References

[Zhang2000], Z. Zhang, A flexible new technique for camera calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), pp. 1330-1334, 2000.
[Tsai1987], R. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Journal on Robotics and Automation, 3(4), pp. 323-344, 1987.
[Triggs1999], B. Triggs, P. McLauchlan, R. Hartley, A. Fitzgibbon, Bundle Adjustment — A Modern Synthesis, ICCV '99: Proceedings of the International Workshop on Vision Algorithms, Springer-Verlag, pp. 298-372, 1999.

Appendix A

If we have a homography H between our normalized image coordinate plane and our planar calibration target, such that

where h₃₃ is an arbitrary scale factor, then the focal length of the camera can be estimated assuming square pixels, zero skew and a principal point of (0,0) in normalized image coordinates, using the formula

Note that this is only one possibility, derived from the constraint that r₁ is a unit vector.

2 comments:

bokesan8 February 2017 at 13:22
Hello Frans,

while I think it's no problem to supply the focal length, this got me thinking about non-planar charts:

What if you could place one (or more) of the fiducial markers a bit closer to the camera - this would be easy if you have the chart on a magnetic board and the markers on a block with a magnet. I wonder how much 3D-depth you would need for acceptable precision - 1cm, 10cm?

Chris

MTF Mapper

Wednesday, 8 February 2017

Limitations of using single-shot planar targets to perform automatic camera calibration