June 2014

Volume 29 Number 6

DirectX Factor : The Canvas and the Camera

Charles Petzold

About 15 years ago, English artist David Hockney began developing a theory about Renaissance art that turned out to be quite controversial. Hockney marveled at how Old Masters such as van Eyck, Velázquez, and Leonardo were able to render the visual world on canvas with astounding accuracy in perspective and shading. He became convinced that this accuracy was possible only with the help of optical tools made from lenses and mirrors, such as the camera obscura and camera lucida. These devices flatten the 3D scene and bring it closer to the artist for reproduction. Hockney published his thesis in a gorgeous and persuasive book entitled “Secret Knowledge: Rediscovering the Lost Techniques of the Old Masters” (Viking, 2001).

More recently, inventor Tim Jenison (founder of NewTek, the company that developed Video Toaster and LightWave 3D) became obsessed with using optical tools to recreate Jan Vermeer’s 350-year-old painting “The Music Lesson.” He built a room similar to the original, decorated it with furniture and prop reproductions, recruited live models (including his daughter), and used a simple optical tool of his own invention to paint the scene. The jaw-dropping results are chronicled in the fascinating documentary, “Tim’s Vermeer.”

Why is it necessary to use optical devices—or, in these days, a simple camera—to capture 3D scenes on 2D surfaces accurately? Much of what we think we see in the real world is constructed in the brain from relatively sparse visual information. We think we see a whole real-world scene in one big panorama, but at any point in time, we’re really focused only on a small detail of it. It’s virtually impossible to piece together these separate visual fragments into a composite painting that resembles the real world. It’s a lot easier when the canvas is supplemented with a mechanism much like a camera—which itself mimics the optics of the human eye, but without the eye’s deficiencies.

A similar process occurs in computer graphics: When rendering 2D graphics, a metaphorical canvas works just fine. The canvas is a drawing surface, and in its most obvious form, it corresponds directly to the rows and columns of pixels that comprise the video display.

But when moving from 2D to 3D, the metaphorical canvas needs to be supplemented with a metaphorical camera. Like a real-world camera, this metaphorical camera captures objects in 3D space and flattens them onto a surface that can then be transferred to the canvas.

Camera Characteristics

A real-world camera can be described with several characteristics that are easily converted into the mathematical descriptions required by 3D graphics. A camera has a particular position in space, which can be represented as a 3D point.

The camera is also aimed in a particular direction: a 3D vector. You can calculate this vector by obtaining the position of an object the camera is pointed directly at, and subtracting the camera position.
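That subtraction can be sketched in a few lines of plain C++. The Vec3 type and the lookDirection helper are names made up for this illustration (they are not DirectX types), and normalizing the result is a convenience I've added rather than something the calculation requires:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal 3D type for illustration (not a DirectX type).
struct Vec3 { float x, y, z; };

// Direction the camera is aimed: the target position minus the camera
// position, scaled to unit length.
Vec3 lookDirection(Vec3 camera, Vec3 target)
{
    Vec3 d = { target.x - camera.x, target.y - camera.y, target.z - camera.z };
    float length = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    assert(length > 0.0f);   // camera and target must be distinct points
    return { d.x / length, d.y / length, d.z / length };
}
```

A camera at (0, 0, –100) aimed at the origin gets the direction (0, 0, 1), which is the positive Z axis.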

If you keep a camera fixed at a particular position and pointed in a particular direction, you still have the freedom to tilt the camera side to side, and actually rotate it 360 degrees. This means that another vector is required to indicate “up” relative to the camera. This vector is perpendicular to the camera direction.

After establishing how the camera is positioned in space, you get to fiddle around with the knobs on the camera.

Today’s cameras are often capable of zoom: You can adjust the camera lens from a wide angle that encompasses a big scene to a telephoto view that narrows for a close up. The difference is based on the distance between the plane that captures the image, and the focal point, which is a single point through which the light passes, as shown in Figure 1. The distance between the focal plane and the focal point is called the focal length.

Figure 1 The Focal Plane, Focal Point and Focal Length

The sizes of the focal plane and the focal length imply an angular field of view emanating from the focal point. Telephoto views (more correctly called “long focus”) are generally associated with a field of view less than 35 degrees, while a wide angle is greater than 65 degrees, with more normal fields of view in between.
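The relationship between those two sizes and the angle is simple trigonometry: half the focal plane and the focal length form a right triangle at the focal point. Here is a small sketch of the calculation; the function name is my own, and the sizes can be in any unit as long as both use the same one:

```cpp
#include <cassert>
#include <cmath>

// Angular field of view (in degrees) implied by the size of the focal plane
// along one axis and the focal length, both measured in the same units.
double fieldOfViewDegrees(double planeSize, double focalLength)
{
    assert(planeSize > 0.0 && focalLength > 0.0);
    const double pi = 3.14159265358979323846;
    // Half the plane and the focal length are the two legs of a right triangle.
    return 2.0 * std::atan2(planeSize / 2.0, focalLength) * 180.0 / pi;
}
```

For a focal plane 36 units wide, a focal length of 200 gives a field of view of roughly 10 degrees (long focus), while a focal length of 24 gives roughly 74 degrees (wide angle).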

Computer graphics add a characteristic of the camera not possible in real life: In 3D graphics programming you often have a choice between cameras that achieve either perspective or orthographic projection. Perspective is like real life: Objects further from the camera appear to be smaller because the field of view encompasses a greater range further from the focal point.

With orthographic projection, that’s not the case. Everything is rendered in a size relative to the object’s actual size, regardless of its distance from the camera. Mathematically, this is the simpler of the two projections, and is most appropriate for technical and architectural drawings.

The Camera Transforms

In 3D graphics programming, cameras are mathematical constructs. The camera consists of two matrix transforms much like those that manipulate objects in 3D space. The two camera transforms are called View and Projection.

The View matrix effectively positions and orients the camera in 3D space; the Projection matrix describes what the camera “sees” and how it sees it. These camera transforms are applied after all the other matrix transforms are used to position objects in 3D space, often called “world space.” Following all the other transforms, first the View transform is applied, and finally the Projection transform.

In DirectX programming—whether Direct3D or the exploration of 3D concepts in Direct2D—it’s easiest to construct these matrix transforms using the DirectX Math Library, the collection of functions in the DirectX namespace that begin with the letters XM and use the XMVECTOR and XMMATRIX data types. These two data types are proxies for CPU registers, so these functions are often quite speedy.

Four functions are available to calculate the View matrix:

  • XMMatrixLookAtRH (EyePosition, FocusPosition, UpDirection)
  • XMMatrixLookAtLH (EyePosition, FocusPosition, UpDirection)
  • XMMatrixLookToRH (EyePosition, EyeDirection, UpDirection)
  • XMMatrixLookToLH (EyePosition, EyeDirection, UpDirection)

The function arguments include the word “Eye” but the documentation uses the word “camera.”

The LH and RH abbreviations stand for left-hand and right-hand. I’ll be assuming a left-hand coordinate system for these examples. If you point the index finger of your left hand in the direction of the positive X axis, and your middle finger in the direction of positive Y, your thumb will point to positive Z. If the positive X axis goes right, and positive Y goes up (a common orientation in 3D programming) then the –Z axis comes out of the screen.

All four functions require three objects of type XMVECTOR and return an object of type XMMATRIX. In all four functions, two of these arguments indicate the camera position (labeled as EyePosition in the function template) and the UpDirection. The LookAt functions include a FocusPosition argument—a position the camera is pointed at—while the LookTo functions have an EyeDirection, which is a vector. It’s just a simple calculation to convert from one form to the other.

For example, suppose you want to position the camera at the point (0, 0, –100), pointing toward the origin (and, hence, in the direction of the positive Z axis), with the top of the camera pointing up. You can call either

  XMMatrixLookAtLH(XMVectorSet(0, 0, -100, 0),
                   XMVectorSet(0, 0, 0, 0),
                   XMVectorSet(0, 1, 0, 0));

or

  XMMatrixLookToLH(XMVectorSet(0, 0, -100, 0),
                   XMVectorSet(0, 0, 1, 0),
                   XMVectorSet(0, 1, 0, 0));

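In either case, the function creates this View matrix:

$$
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 100 & 1
\end{bmatrix}
$$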

This particular View matrix simply shifts the entire 3D scene 100 units in the direction of the positive Z axis. Many View matrices will also involve rotations of various sorts, but no scaling. After the View transform is applied to the 3D scene, the camera can be assumed to be positioned at the origin (with the top of the camera pointed in the direction of positive Y) and pointed in the positive Z direction (for a left-hand system) or the –Z direction (right-hand). This orientation allows the Projection transform to be much simpler than it would be otherwise.
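To make the relationship between the LookAt and LookTo forms concrete, here is a sketch that builds a left-hand View matrix by hand in plain C++. It isn't the library's code: the Vec3 and Matrix4 types and all the helper names are invented for this illustration, and it assumes the DirectX convention of row vectors multiplied on the left of the matrix, with the translation in the bottom row.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper types for this sketch (not DirectX types).
struct Vec3 { float x, y, z; };
struct Matrix4 { float m[4][4]; };   // row-major, used with row vectors

Vec3 subtract(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(Vec3 a, Vec3 b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}
Vec3 normalize(Vec3 v)
{
    float length = std::sqrt(dot(v, v));
    assert(length > 0.0f);   // a zero vector has no direction
    return { v.x / length, v.y / length, v.z / length };
}

// View matrix from a camera position, a direction vector and an up vector.
Matrix4 lookToLH(Vec3 eye, Vec3 direction, Vec3 up)
{
    Vec3 zAxis = normalize(direction);          // where the camera points
    Vec3 xAxis = normalize(cross(up, zAxis));   // the camera's right
    Vec3 yAxis = cross(zAxis, xAxis);           // the camera's true up
    Matrix4 view = {{
        { xAxis.x, yAxis.x, zAxis.x, 0.0f },
        { xAxis.y, yAxis.y, zAxis.y, 0.0f },
        { xAxis.z, yAxis.z, zAxis.z, 0.0f },
        { -dot(xAxis, eye), -dot(yAxis, eye), -dot(zAxis, eye), 1.0f }
    }};
    return view;
}

// The LookAt form needs only a subtraction to become the LookTo form.
Matrix4 lookAtLH(Vec3 eye, Vec3 focus, Vec3 up)
{
    return lookToLH(eye, subtract(focus, eye), up);
}
```

With the camera at (0, 0, –100), both functions produce an identity matrix except for a 100 in the third column of the bottom row, which is the 100-unit shift along positive Z.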

The Projection Conventions

In this column’s previous forays into 3D programming within Direct2D, I’ve converted objects from 3D space to the 2D video display simply by ignoring the Z coordinate.

It’s time to convert from 3D to 2D in a more professional manner using conventions that are encapsulated in the standard camera Projection matrices. The standard conversion of 3D to 2D actually occurs in two stages: first from 3D coordinates to normalized 3D coordinates, and then to 2D coordinates. The Projection transform specified by the program controls the first conversion. In Direct3D programming, the second conversion is usually performed automatically by the rendering system. A program that uses Direct2D to display 3D graphics must perform this second conversion itself.

The purpose of the Projection transform is in part to normalize all the 3D coordinates in the scene. This normalization defines what objects are visible in the final rendering and which are excluded. Following this normalization, the final rendered scene encompasses X coordinates ranging from –1 at the left and 1 at the right, Y coordinates ranging from –1 on the bottom to 1 at the top, and Z coordinates ranging from 0 (closest to the camera) to 1 (furthest from the camera). The Z coordinates are also used to determine what objects obscure other objects relative to the viewer.

Everything not in this space is discarded, and then the normalized X and Y coordinates are mapped to the width and height of the display surface, while the Z coordinates are ignored.
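Here is one way that second stage might look in code. This is a generic sketch rather than what any particular renderer does: the names are hypothetical, it tests single points instead of clipping whole triangles, and it assumes a display surface whose origin is the upper-left corner with Y increasing downward, which is why the Y coordinate is flipped.

```cpp
#include <cassert>

// Hypothetical 2D point type for this sketch.
struct Point2 { float x, y; };

// True if a normalized coordinate lies inside the visible space:
// x and y from -1 to 1, z from 0 to 1.
bool isInsideView(float nx, float ny, float nz)
{
    return nx >= -1.0f && nx <= 1.0f &&
           ny >= -1.0f && ny <= 1.0f &&
           nz >= 0.0f && nz <= 1.0f;
}

// Maps normalized x and y to pixel coordinates; z is ignored.
// Assumes pixel (0, 0) is the upper-left corner of the surface.
Point2 toPixels(float nx, float ny, float surfaceWidth, float surfaceHeight)
{
    assert(surfaceWidth > 0.0f && surfaceHeight > 0.0f);
    return { (nx + 1.0f) * 0.5f * surfaceWidth,
             (1.0f - ny) * 0.5f * surfaceHeight };   // +1 is the top
}
```

On an 800×600 surface, the normalized point (–1, 1) lands at pixel (0, 0), and (0, 0) lands in the center at (400, 300).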

To normalize the Z coordinates, the functions that compute a Projection matrix always require arguments of type float named NearZ and FarZ that indicate a distance from the camera along the Z axis. These two distances are converted to normalized Z coordinates of 0 and 1, respectively.

This is somewhat counterintuitive because it implies there’s an area of 3D space that’s too close to the camera to be visible, and another area that’s too far away. But for practical reasons it is necessary to limit depth in this way. Everything behind the camera must be eliminated, for example, and objects too close to the camera would obscure everything else. If Z coordinates out to infinity were allowed, the resolution of floating point numbers would be taxed when determining what objects overlap others.

Because the camera View matrix accounts for possible translation and rotation of the camera, the Projection matrix is always based on a camera located at the origin and pointing along the Z axis. I’ll be using left-hand coordinates for these examples, which means the camera is pointed in the direction of the positive Z axis. Left-hand coordinates are a little simpler when dealing with Projection transforms because NearZ and FarZ are equal to coordinates along the positive Z axis rather than the –Z axis.

The DirectX Math Library defines 10 functions for calculating the Projection matrix—four for orthographic projections and six for perspective projections, half for left-hand coordinates and half for right-hand.

In the XMMatrixOrthographicRH and LH functions, you specify ViewWidth and ViewHeight along with NearZ and FarZ. Figure 2 is a view looking down on the 3D coordinate system from a location on the positive Y axis. This figure shows how these arguments define a cuboid in a left-hand coordinate system viewable to an eyeball at the origin.

Figure 2 A Top View of an Orthographic Transform

Often, the ratio of ViewWidth to ViewHeight is the same as the aspect ratio of the display used for rendering. The projection transform scales everything from –ViewWidth / 2 to ViewWidth / 2 to the range –1 to 1, and later those normalized coordinates are scaled by half the pixel width of the display surface for rendering. The calculation is similar for ViewHeight.

Here’s a call to XMMatrixOrthographicLH with ViewWidth and ViewHeight set to 40 and 20, and NearZ and FarZ set to 50 and 100, which matches the diagram assuming tick marks every 10 units:

XMMATRIX orthographic =
  XMMatrixOrthographicLH(40, 20, 50, 100);

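This results in the following matrix:

$$
\begin{bmatrix}
0.05 & 0 & 0 & 0 \\
0 & 0.1 & 0 & 0 \\
0 & 0 & 0.02 & 0 \\
0 & 0 & -1 & 1
\end{bmatrix}
$$

The transform formulas are:

$$
\begin{aligned}
x' &= 0.05x \\
y' &= 0.1y \\
z' &= 0.02z - 1
\end{aligned}
$$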

You can see that x values of –20 and 20 are transformed to –1 and 1, respectively, and that y values of –10 and 10 are transformed to –1 and 1, respectively. A Z value of 50 is transformed to 0, and a Z value of 100 is transformed to 1.
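The same arithmetic can be written out without a matrix. This sketch applies the formulas of a centered left-hand orthographic projection directly; Vec3 and projectOrthographicLH are names I've made up for the illustration, not part of the DirectX Math Library:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal 3D type for illustration (not a DirectX type).
struct Vec3 { float x, y, z; };

// Scales x and y so the view width and height span -1 to 1, and maps
// z so that nearZ becomes 0 and farZ becomes 1.
Vec3 projectOrthographicLH(Vec3 p, float viewWidth, float viewHeight,
                           float nearZ, float farZ)
{
    assert(viewWidth > 0.0f && viewHeight > 0.0f && farZ > nearZ);
    return { p.x * 2.0f / viewWidth,
             p.y * 2.0f / viewHeight,
             (p.z - nearZ) / (farZ - nearZ) };
}
```

Calling it with the point (20, 10, 100) and the arguments 40, 20, 50 and 100 yields (1, 1, 1), the far upper-right corner of the normalized space.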

Two additional Orthographic functions contain the words OffCenter and let you specify left, right, top, and bottom coordinates rather than widths and heights.

The XMMatrixPerspectiveRH and LH functions have the same arguments as XMMatrixOrthographicRH and LH, but define a four-sided frustum (like a pyramid with its top cut off) as shown in Figure 3.

Figure 3 A Top View of a Perspective Transform

The ViewWidth and ViewHeight arguments in the transform functions control the width and height of the frustum at NearZ, but the width and height at FarZ are proportionally larger based on the ratio of FarZ to NearZ. This diagram also demonstrates how a greater range of x and y coordinates further from the camera is mapped into the same space (and, hence, made smaller) as the x and y coordinates nearer the camera.

Here’s a call to XMMatrixPerspectiveLH with the same arguments I used for XMMatrixOrthographicLH:

XMMATRIX perspective =
  XMMatrixPerspectiveLH(40, 20, 50, 100);

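The matrix created from this call is:

$$
\begin{bmatrix}
2.5 & 0 & 0 & 0 \\
0 & 5 & 0 & 0 \\
0 & 0 & 2 & 1 \\
0 & 0 & -100 & 0
\end{bmatrix}
$$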

Notice that the fourth column indicates a non-affine transform! In an affine transform, the m14, m24, and m34 values are 0, and m44 is 1. Here, m34 is 1 and m44 is 0.

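This is how perspective is achieved in 3D programming environments, so let’s look at the transform multiplication in detail:

$$
\begin{bmatrix} x & y & z & 1 \end{bmatrix}
\times
\begin{bmatrix}
2.5 & 0 & 0 & 0 \\
0 & 5 & 0 & 0 \\
0 & 0 & 2 & 1 \\
0 & 0 & -100 & 0
\end{bmatrix}
=
\begin{bmatrix} x' & y' & z' & w' \end{bmatrix}
$$

The matrix multiplication results in the following transform formulas:

$$
\begin{aligned}
x' &= 2.5x \\
y' &= 5y \\
z' &= 2z - 100 \\
w' &= z
\end{aligned}
$$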

Notice the w´ value. As I discussed in the April installment of this column, the use of w coordinates in 3D transforms ostensibly accommodates translation, but it also brings in the mathematics of homogeneous (or projective) coordinates. Affine transforms always take place in the 3D subset of 4D space where w equals 1, but this non-affine transform has moved coordinates out of that 3D space. The coordinates must be moved back into that 3D space by dividing them all by w´. The transform formulas that incorporate this necessary adjustment are instead:

$$
\begin{aligned}
x' &= \frac{2.5x}{z} \\
y' &= \frac{5y}{z} \\
z' &= \frac{2z - 100}{z} = 2 - \frac{100}{z}
\end{aligned}
$$

When z equals NearZ or 50, the transform formulas are the same as for the orthographic projection:

$$
\begin{aligned}
x' &= 0.05x \\
y' &= 0.1y \\
z' &= 0
\end{aligned}
$$

Values of x from -20 to 20 are transformed into x´ values from -1 to 1, for example.

For other values of z, the transform formulas are different, and when z equals FarZ or 100, they look like this:

$$
\begin{aligned}
x' &= 0.025x \\
y' &= 0.05y \\
z' &= 1
\end{aligned}
$$

At this distance from the camera, values of x from -40 to 40 are transformed into x´ values from -1 to 1. A larger range of x and y values at FarZ occupies the same visual field as a smaller range at NearZ.
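The two steps, the matrix multiplication and then the division by w´, can be tried out with a sketch like the following. The types and function names are hypothetical, and the code writes out the multiplication term by term rather than using the DirectX Math Library:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal types for illustration (not DirectX types).
struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };

// Multiplies the row vector (x, y, z, 1) by a centered left-hand
// perspective matrix built from ViewWidth, ViewHeight, NearZ and FarZ.
Vec4 applyPerspectiveLH(Vec3 p, float viewWidth, float viewHeight,
                        float nearZ, float farZ)
{
    assert(viewWidth > 0.0f && viewHeight > 0.0f);
    assert(nearZ > 0.0f && farZ > nearZ);
    float zScale = farZ / (farZ - nearZ);
    return { p.x * 2.0f * nearZ / viewWidth,
             p.y * 2.0f * nearZ / viewHeight,
             p.z * zScale - nearZ * zScale,
             p.z };                          // w' picks up the original z
}

// Moves a homogeneous coordinate back into the 3D space where w equals 1.
Vec3 divideByW(Vec4 h)
{
    assert(h.w != 0.0f);
    return { h.x / h.w, h.y / h.w, h.z / h.w };
}
```

With the arguments 40, 20, 50 and 100, the point (20, 10, 50) at NearZ and the point (40, 20, 100) at FarZ both end up with normalized x and y values of 1, while a point at (20, 10, 100) shrinks to x and y values of 0.5.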

As with the orthographic functions, two additional functions have the words OffCenter in the function names, and let you set left, right, top, and bottom coordinates rather than widths and heights.

The XMMatrixPerspectiveFovRH and LH functions let you specify an angular field of view (FOV) rather than a width and height. This field of view is likely different along the X and Y axes. You need to specify it along the Y axis, and also provide a ratio of width to height.

To create a Perspective matrix consistent with the preceding example, the field of view can be calculated with the atan2 function, with a y argument equal to half the height at NearZ, and the x argument equal to NearZ, and then doubling the result:

float angleY = 2 * atan2(10.0f, 50.0f);

The second argument is the ratio of width to height, or 2 in this example:

XMMATRIX perspective =
  XMMatrixPerspectiveFovLH(angleY, 2, 50, 100);

This call results in a Perspective matrix identical to the one just created with XMMatrixPerspectiveLH.
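The equivalence comes down to the x and y scaling factors in the first two diagonal cells of the matrix. The following sketch computes those two factors both ways so they can be compared; the Scales type and the two function names are my own inventions for this illustration:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pair holding the x and y scaling factors of a
// perspective matrix (its m11 and m22 cells).
struct Scales { float x, y; };

// Scaling factors from the width and height of the frustum at NearZ.
Scales scalesFromViewSize(float viewWidth, float viewHeight, float nearZ)
{
    assert(viewWidth > 0.0f && viewHeight > 0.0f);
    return { 2.0f * nearZ / viewWidth, 2.0f * nearZ / viewHeight };
}

// Scaling factors from a vertical field of view (in radians) and a
// width-to-height ratio.
Scales scalesFromFieldOfView(float fovAngleY, float aspectRatio)
{
    assert(aspectRatio > 0.0f);
    float yScale = 1.0f / std::tan(fovAngleY / 2.0f);
    return { yScale / aspectRatio, yScale };
}
```

With a width of 40, a height of 20 and NearZ of 50 on one side, and an angle of 2 * atan2(10.0f, 50.0f) with a ratio of 2 on the other, both functions give factors of 2.5 and 5, apart from floating-point rounding.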

A Circle of Text

In February’s installment of this column, I demonstrated how to create a 3D circle of text, and animate its rotation. However, I used Direct2D geometries for that exercise and encountered very poor performance. In the March column I demonstrated how to tessellate text into a collection of triangles, which seemed to have considerably better performance than geometries. In the May column, I demonstrated how to use triangles to create and render 3D objects.

It’s time to put these different techniques together. The downloadable source code for this column includes a Windows 8 Direct2D project called TessellatedText3D. For this program I defined a 3D triangle using XMFLOAT3 objects:

struct Triangle3D
{
  DirectX::XMFLOAT3 point1;
  DirectX::XMFLOAT3 point2;
  DirectX::XMFLOAT3 point3;
};

The constructor of the TessellatedText3DRenderer class loads in a font file, creates a font face, and generates an ID2D1PathGeometry object from the GetGlyphRunOutline method using an arbitrary font size of 100. This geometry is then converted into a collection of triangles using the Tessellate method with a custom tessellation sink. With the particular font, font size and glyph indices I specified, Tessellate generates 1,741 triangles.

The 2D triangles are then converted into 3D triangles by wrapping the text in a circle. Based on that arbitrary font size of 100, this circle happens to have a radius of about 200 (stored in m_sourceRadius), and the circle is centered on the 3D origin.
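The exact wrapping code isn't reproduced in this column, but the idea can be sketched under a few assumptions: that the horizontal extent of the text becomes the circumference of the circle, that the text is wrapped around the Y axis, and that the start of the text sits on the –Z side nearest the camera. All the names here are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical minimal 3D type for illustration (not a DirectX type).
struct Vec3 { float x, y, z; };

// Wraps a 2D point of the flat text around a circle centered on the
// origin. Assumes the full text width becomes the circumference.
Vec3 wrapOnCircle(float x, float y, float textWidth)
{
    assert(textWidth > 0.0f);
    const float twoPi = 6.2831853f;
    float radius = textWidth / twoPi;        // circumference = 2 * pi * radius
    float angle = twoPi * x / textWidth;     // distance along the text as an angle
    return { radius * std::sin(angle), y, -radius * std::cos(angle) };
}
```

Under these assumptions, text about 1,257 units wide wraps into a circle with a radius of about 200.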

In the Update method of the TessellatedText3DRenderer class, the XMMatrixRotationY and XMMatrixRotationX functions provide two transforms to animate a rotation of the text around the Y and X axes. These are stored in XMMATRIX objects named rotateMatrix and tiltMatrix, respectively.

The Update method then continues with the code shown in Figure 4. This code calculates the View and Projection matrices. The View matrix sets the camera position on the –Z axis based on the circle radius, so the camera is a radius length outside the circular text but pointed towards the center.

Figure 4 The View and Projection Matrices in TessellatedText3D

void TessellatedText3DRenderer::Update(DX::StepTimer const& timer)
{
  // ... (rotateMatrix and tiltMatrix are calculated earlier in this method)

  // Calculate camera view matrix
  XMVECTOR eyePosition = XMVectorSet(0, 0, -2 * m_sourceRadius, 0);
  XMVECTOR focusPosition = XMVectorSet(0, 0, 0, 0);
  XMVECTOR upDirection = XMVectorSet(0, 1, 0, 0);
  XMMATRIX viewMatrix = XMMatrixLookAtLH(eyePosition,
                                         focusPosition,
                                         upDirection);
  // Calculate camera projection matrix
  float width = 1.5f * m_sourceRadius;
  float nearZ = 1 * m_sourceRadius;
  float farZ = nearZ + 2 * m_sourceRadius;
  XMMATRIX projMatrix = XMMatrixPerspectiveLH(width,
                                              width,  // ViewHeight; assumed equal to width
                                              nearZ,
                                              farZ);
  // Calculate composite matrix
  XMMATRIX matrix = rotateMatrix * tiltMatrix *
                    viewMatrix * projMatrix;
  // Apply composite transform to all 3D triangles
  XMVector3TransformCoordStream(
    (XMFLOAT3 *) m_dstTriangles3D.data(),
    sizeof(XMFLOAT3),
    (XMFLOAT3 *) m_srcTriangles3D.data(),
    sizeof(XMFLOAT3),
    3 * m_srcTriangles3D.size(),
    matrix);

  // ... (conversion to 2D and mesh creation follow)
}

With arguments also based on the circle radius, the code continues by calculating a Projection matrix, and then multiplies all the matrices together. The XMVector3TransformCoordStream function uses parallel processing to apply this transform to an array of XMFLOAT3 objects (actually an array of Triangle3D objects), automatically performing the division by w´.

The Update method continues beyond what’s shown in Figure 4 by converting those transformed 3D coordinates to 2D using a scaling factor based on half the width of the video display, and ignoring the z coordinates. The Update method also uses the 3D vertices of each triangle to calculate a cross product, which is the surface normal—a vector perpendicular to the surface. The 2D triangles are then divided into two groups based on the z coordinate of the surface normal. If the z coordinate is negative, the triangle is facing toward the viewer, and if it’s positive, the triangle is facing away. The Update method concludes by creating two ID2D1Mesh objects based on the two groups of 2D triangles.
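The front and back test can be sketched in plain C++ as follows. The types and function names are hypothetical stand-ins for the program's own, and which sign of z means "front" depends on the order of the vertices: this sketch assumes triangles that appear clockwise to the viewer are front-facing, which gives them a normal with a negative z coordinate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the program's types (not DirectX types).
struct Vec3 { float x, y, z; };
struct Triangle { Vec3 point1, point2, point3; };

// Surface normal: the cross product of two edges that share point1.
Vec3 surfaceNormal(const Triangle& t)
{
    Vec3 a = { t.point2.x - t.point1.x, t.point2.y - t.point1.y, t.point2.z - t.point1.z };
    Vec3 b = { t.point3.x - t.point1.x, t.point3.y - t.point1.y, t.point3.z - t.point1.z };
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// A negative z in the normal means the triangle faces the viewer.
bool facesViewer(const Triangle& t)
{
    return surfaceNormal(t).z < 0.0f;
}

// Counts how many triangles would go into the front-facing group.
std::size_t countFrontFacing(const Triangle* triangles, std::size_t count)
{
    assert(triangles != nullptr || count == 0);
    std::size_t front = 0;
    for (std::size_t i = 0; i < count; ++i)
        if (facesViewer(triangles[i]))
            ++front;
    return front;
}
```

Reversing the order of any two vertices of a triangle flips the sign of its normal and moves it into the other group.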

The Render method then displays the mesh of rear triangles with a gray brush and the front mesh with a blue brush. The result is shown in Figure 5.

Figure 5 The TessellatedText3D Display

As you can see, the front-facing triangles closer to the viewer are much larger than the back-facing triangles further away. This program has none of the performance problems encountered with rendering geometries.

Shading with Light?

In the May installment of this column, I took light into account when displaying 3D objects. The program assumes light comes from a particular direction and it calculates different shading for the sides of the solid figures based on the angle that the light strikes each surface.
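A common way to turn that angle into a shade is the cosine rule for diffuse light: a surface is brightest when the light hits it head on and goes dark as the light becomes parallel to it. The sketch below shows that rule, not necessarily the exact calculation in the May program; the names are hypothetical, and lightDirection is assumed to be the direction in which the light travels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical minimal 3D type for illustration (not a DirectX type).
struct Vec3 { float x, y, z; };

// Fraction of full brightness (0 to 1) for a surface with the given
// normal, lit by light traveling in the given direction.
float diffuseIntensity(Vec3 normal, Vec3 lightDirection)
{
    float normalLength = std::sqrt(normal.x * normal.x +
                                   normal.y * normal.y +
                                   normal.z * normal.z);
    float lightLength = std::sqrt(lightDirection.x * lightDirection.x +
                                  lightDirection.y * lightDirection.y +
                                  lightDirection.z * lightDirection.z);
    assert(normalLength > 0.0f && lightLength > 0.0f);

    // Cosine of the angle between the normal and the reversed light direction.
    float cosine = -(normal.x * lightDirection.x +
                     normal.y * lightDirection.y +
                     normal.z * lightDirection.z) / (normalLength * lightLength);

    // Surfaces turned away from the light receive none of it.
    return std::max(0.0f, cosine);
}
```

A surface whose normal points directly back along the light gets an intensity of 1, one struck at 45 degrees gets about 0.71, and one facing away gets 0.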

Is something similar possible with this program?

In theory, yes. But my experiments revealed some serious performance problems. Rather than creating two ID2D1Mesh objects in the Update method, a program implementing shading of the text needs to render each triangle with a different color, and that requires 1,741 different ID2D1Mesh objects recreated in each Update call, and 1,741 corresponding ID2D1SolidColorBrush objects. This slowed the animation down to approximately one frame per second.

What’s worse, the visuals weren’t satisfactory. Each triangle gets a different discrete solid color based on its angle to the light source, but the boundaries between these discrete colors became visible! Triangles used to render 3D objects must more properly be colored with a gradient between the three vertices, but such a gradient is not supported by the interfaces that derive from ID2D1Brush.

This means I must dig even deeper into Direct2D and get access to the same shading tools that Direct3D programmers use.

Charles Petzold is a longtime contributor to MSDN Magazine and the author of “Programming Windows, 6th Edition” (Microsoft Press, 2013), a book about writing applications for Windows 8. His Web site is charlespetzold.com.

Thanks to the following Microsoft technical expert for reviewing this article: Doug Erickson
Doug Erickson is a lead programming writer for Microsoft’s OSG developer documentation team. When not writing and developing DirectX graphics code and content, he is reading articles like Charles’, because that’s how he likes to spend his free time. Well, that, and riding motorcycles.