How does JPEG actually work?

JPEG is an image encoding designed to compress photographs and
similar images effectively, often producing files 5 to 15 times
smaller than a raw bitmap format. It's a lossy format that exploits
properties of human vision to eliminate information the eye finds
difficult to distinguish. You see JPEGs
all the time, and if you've seen a JPEG compressed with a low quality
level, you have a good idea of the kinds of visual artifacts that
emerge when JPEG compression is pushed to its limits. But how does it
really work, and how does this cause JPEG artifacts?

The JPEG standard defines a number of different encoding options, and
several container file formats besides the usual JFIF, but I'm going
to discuss a single popular JPEG encoding method used by many
applications. This encoding combines three completely independent
methods of image compression: downsampling the chrominance channels,
quantization of the discrete cosine transformation, and entropy coding.

Method 1: Downsampling the chrominance channels

The most obvious way to make an image file smaller is to make the
image smaller - for example, a 100×100 bitmap is 1/4 the size of a
200×200 bitmap. You can expand the image back out to the original size
when viewing it by simply upsampling it. This isn't a great compression
method, since it causes clearly visible artifacts, usually blocky or
blurry effects, especially around detailed, sharp areas.
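
To make this concrete, here's a minimal sketch in Python with NumPy
(the function names are my own, not from any real codec) of 2×
downsampling by block averaging and crude nearest-neighbor
re-expansion:

```python
import numpy as np

def downsample_2x(img):
    """Average each 2x2 block of pixels into a single pixel."""
    h, w = img.shape  # assumes even dimensions
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_2x(img):
    """Expand each pixel back into a 2x2 block (nearest-neighbor)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

# A sharp vertical edge: the 2x2 blocks straddling the edge come back
# as a gray band, blurring the edge.
img = np.zeros((8, 8))
img[:, 3:] = 255.0
restored = upsample_2x(downsample_2x(img))
```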

A different approach is to downsample not the entire image, but only
certain channels of it. For example, the eye's red-green sensors are
more numerous and more sensitive than its blue-yellow sensors, so if we
shrink just the blue channel and later re-expand it, the effect is much
less noticeable. But the RGB color space is not our only
only choice - images can be encoded in many color systems, and for each
of these the channels we choose to downsample will have different
visual effects. This experiment
by Eric Setton of Stanford University demonstrates the effect on image
quality of downsampling various channels in various color systems.

The results are visually obvious: we are sensitive to red and green,
but even more sensitive to intensity or luminance, the overall
brightness of a pixel. This is expressed by the Y component of the
YCbCr system and the V component of the HSV system. Any downsampling
of this information is visually disastrous. On
the other hand, downsampling the chrominance, expressed by the
Cb and Cr components of the YCbCr system, has almost no visual effect
(Cb and Cr stand for "chrominance blue" and "chrominance red" - see a
sample image broken up into YCbCr channels in the middle of this page).
For this reason, typical JPEG files convert images to the YCbCr color
space, and then downsample both the Cb and Cr channels by a factor of 2
in both width and height. Further processing of the image proceeds on
each of the three channels independently.
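
As a sketch of this step, using the standard JFIF full-range
conversion formulas (the helper names are my own):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """JFIF full-range RGB -> YCbCr; rgb is an (h, w, 3) float array
    with values in [0, 255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr

def subsample(chan):
    """4:2:0 chroma subsampling: average each 2x2 block into one sample."""
    h, w = chan.shape
    return chan.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# Y keeps full resolution; Cb and Cr shrink to a quarter of the samples:
# y, cb, cr = rgb_to_ycbcr(img)
# cb_small, cr_small = subsample(cb), subsample(cr)
```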

This is the step of JPEG encoding largely responsible for the
artifact known as "color bleeding", where along sharp edges the color
on one side can "bleed" onto the other side. This happens because the
chrominance channels, which express the color of pixels, have had each
2×2 block of 4 pixels averaged into a single color, and some of these
blocks straddle the sharp edge.

Method 2: Quantization of the discrete cosine transformation

Normally, we view an image as simply a grid of pixels, and if we
alter a pixel or group of pixels dramatically, the effect will be
visually obvious. A different approach to storing images is to come up
with a set of "basis" images which can be added together in various
combinations to form the image we want. JPEG encodes each 8×8 square
of pixels using a fixed set of 64 basis images: cosine patterns of
increasing horizontal and vertical frequency, from a flat block in the
upper-left to a fine checkerboard in the lower-right.
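
These basis images can be generated directly from the cosine formula
discussed later in this section; here's a sketch (normalization
factors omitted for clarity):

```python
import numpy as np

def dct_basis(u, v, n=8):
    """The (u, v) basis image: a cosine wave with u half-cycles
    vertically and v half-cycles horizontally across an n x n block."""
    x = np.arange(n)
    return np.outer(np.cos((2 * x + 1) * u * np.pi / (2 * n)),
                    np.cos((2 * x + 1) * v * np.pi / (2 * n)))

# basis[0][0] is the flat block; basis[7][7] is the finest checkerboard.
basis = [[dct_basis(u, v) for v in range(8)] for u in range(8)]
```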

We assign each of these images a relative weight determining how
much influence it has on the image. This weight can be positive or
negative (when negative, the basis image is subtracted). Taken
together, these 64 weights can describe any 8×8 image at all, and they
describe the image precisely. This example at Wikipedia
shows an 8×8 image, its original pixel array, and its new 8×8 array of
weights after being converted using the discrete cosine transform
(which I'll discuss later).

But what's the point of transforming 64 pixel values into 64
weights? For one thing, the resulting data may exhibit more structure,
and so be easier to compress. For example, with blocks from typical
photographs, the low-frequency basis images in the upper-left have
large weights and the high-frequency ones in the lower-right have small
weights.

But to store the weights, which are real numbers, we have to encode
them somehow. The simplest encoding is to just convert them to integers
between -k and k by scaling and rounding them. This works, but there's a tradeoff in our choice of k:
if it's too big, the encoded image will take too much space, but if
it's too small, the rounding might visibly distort the image, since
it's effectively altering the weights.

A more effective approach is to use a different k for each
basis image. This works well because the eye's sensitivity to
variations in the weights differs from one basis image to another: it
can detect small variations in the weights of the images in the
upper-left, but overlooks much larger variations in the weights of the
images in the lower-right, so we can get away with a smaller k and
larger rounding error for those weights.

In practice, rather than deal with expensive floating-point numbers,
we round the initial weights to integers and then divide them by fixed
factors called the quantization factors, using integer division.
During decoding, we multiply them by these factors again. The
upper-left images have small quantization factors, so that little
information is lost in this process, while the lower-right images have
larger factors. This example shows a commonly used matrix of
quantization factors, one for each basis image, and the result of
quantizing the sample image's weights using them. Observe how much
simpler the weight matrix becomes - it now contains a large number of
entries that are small or zero, making it much easier to compress. This example
shows how the weight values expand back out during decoding - notice
that although some of the weights are now different, the visual
difference is almost undetectable.
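
Here's a sketch of the quantization round trip, using the example
luminance table from Annex K of the JPEG standard (a commonly used
matrix of exactly this kind):

```python
import numpy as np

# The example luminance quantization table from Annex K of the JPEG
# standard; note the small factors in the upper-left and the large
# factors in the lower-right.
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def quantize(weights, q=Q_LUMA):
    """Divide each DCT weight by its quantization factor and round."""
    return np.round(weights / q).astype(int)

def dequantize(quantized, q=Q_LUMA):
    """Multiply back out during decoding; the rounding error is gone."""
    return quantized * q
```

Quantizing and then dequantizing is not the identity: the rounding
error is exactly the information that has been thrown away.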

Quantization is the primary source of JPEG artifacts. Because the images in the lower-right
tend to have the largest quantization divisors, JPEG artifacts will
tend to resemble combinations of these images. The matrix of quantization factors can be directly controlled by
altering the JPEG's "quality level", which scales its values up or down.
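
Continuing the sketch above, libjpeg derives the actual table from a
base table and the quality setting roughly like this (its convention;
other encoders differ):

```python
def scale_quantization_table(base, quality):
    """Rescale a base table for a quality setting in (0, 100],
    following libjpeg's jpeg_quality_scaling convention."""
    scale = 5000 / quality if quality < 50 else 200 - 2 * quality
    return np.clip((base * scale + 50) // 100, 1, 255).astype(int)

# Quality 90 shrinks the divisors (fine detail survives); quality 10
# inflates them (heavy blocking and ringing artifacts).
```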

Quantization compresses the image in two important ways: first, it
limits the effective range of the weights, decreasing the number of
bits required to represent them; second, many of the weights become
identical or zero, improving compression in the third step, entropy
coding.

But this discussion leaves one important question unanswered: how do
we take an arbitrary square of 8×8 pixels and decompose it into a
combination of the basis images? As it turns out, the basis images
were chosen specifically to make this easy. They are formed by cosine
waves of various frequencies in the x and y directions, and a
transformation called the discrete cosine transform, a close relative
of the Fourier transform, lets us rapidly decompose a signal into
these parts in only O(n log n) time. Because of the importance of
JPEG, very fast and clever specialized implementations for 8×8 blocks
have been created over the years in both software and hardware. This
is also the reason we use small, constant-sized blocks instead of
processing the image as a whole: each 8×8 block takes constant time to
transform, so the whole image can be transformed in linear (O(n)) time
in the number of pixels.
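
As a sketch, SciPy's dct routine applied along each axis performs this
decomposition on an 8×8 block (the shift to a range centered on zero
mirrors what JPEG encoders do before transforming):

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """2-D DCT-II of an 8x8 block: transform rows, then columns."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(weights):
    """Inverse 2-D DCT, recovering the pixel block from its 64 weights."""
    return idct(idct(weights, axis=0, norm='ortho'), axis=1, norm='ortho')

# Shift pixel values from [0, 255] to [-128, 127], then transform.
block = np.random.randint(0, 256, (8, 8)).astype(float) - 128.0
weights = dct2(block)
assert np.allclose(idct2(weights), block)  # lossless until we quantize
```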

Method 3: Entropy coding

Unlike the other steps, entropy coding is lossless. If you've
studied the simple compression techniques used in ZIP files and BMP
files, the techniques of entropy coding should come as no surprise.

We start by turning our 8×8 grid of integers into a linear sequence
of integers. The obvious row-major and column-major orders are not
ideal here, since we expect there to be many zeros in the lower-right
area. Instead, we start with the top-left corner and zig-zag in a
diagonal pattern across the matrix, going back and forth until we reach
the lower-right corner.
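
One way to generate that zig-zag ordering (a sketch; real codecs
typically just use a precomputed table):

```python
def zigzag_order(n=8):
    """(row, col) indices of an n x n block in JPEG zig-zag order."""
    # Cells on the same anti-diagonal share row + col; the direction of
    # travel alternates between odd and even diagonals.
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

# zigzag_order()[:6] -> [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
# flat = [weights[r][c] for r, c in zigzag_order()]
```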

Next, we apply run-length encoding - the same simple technique used
by RLE-compressed BMP files - to condense sequences of identical
integer values. In areas where the values are small, we can expect
many such runs, particularly runs of zeros. A special code meaning
"all the remaining values are zero" comes in especially handy here.

Finally, we apply Huffman coding to the remaining sequence - a
standard lossless compression technique that encodes frequently
occurring values with fewer bits than infrequently occurring ones.
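
Here's a compact toy Huffman coder (a sketch; JPEG itself uses code
tables stored in the file rather than a tree built like this):

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Map each symbol to a bit string; frequent symbols get shorter codes."""
    counts = Counter(symbols)
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(n, i, {s: ''}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

# One valid assignment:
# huffman_codes("aaaabbc") -> {'c': '00', 'b': '01', 'a': '1'}
```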

Conclusion

And that's it! For each YCbCr channel, for each 8×8 block, we
perform these transformations, and we store the results in a bitstream.
This data is wrapped up in a JFIF container giving metadata about the
image, and the file goes out ready for web browsers to view, completely
unaware of all the data that has been lost. Just don't go sending JPEGs
into space and expect aliens to see them the same way.

Finally, no discussion of the JPEG encoding would be complete
without noting that JPEG is based on research ideas that are now very
old and largely supplanted. The more modern JPEG 2000 encoding
based on wavelets can be pushed much farther before visible artifacts
appear, and includes a surprisingly efficient lossless mode. Of course,
all this improvement is rather moot until common applications actually
support it.