# Meet me up in space!

Rather than closing the distance, however, the target seemed to move down and away in defiance of everyday intuition…

David Austin
Grand Valley State University

Complex space missions rely on the ability to bring two spacecraft together, a procedure called orbital rendezvous. A spacecraft docking at the International Space Station is a typical example.

Historically, rendezvous was a vital component of the Apollo lunar missions. While three astronauts traveled to the moon in a single craft, the command module, two of them descended to the lunar surface in a second craft, the lunar excursion module (LEM). A successful mission required the ascending LEM to rendezvous with the command module before returning to Earth.

Gemini 4 Capsule at the Smithsonian National Air and Space Museum.

The first attempt at orbital rendezvous was one component of the Gemini IV mission in 1965 when the pilot James McDivitt tried to bring his Gemini capsule close to the spent booster that lifted them into orbit. Upon first seeing the booster at an estimated 120 meters, McDivitt aimed the capsule at the booster and thrusted toward it. Rather than closing the distance, however, the target seemed to move down and away in defiance of everyday intuition. He repeated this procedure several times with similar results.

NASA engineer André Meyer later recalled, “There is a good explanation [for] what went wrong with rendezvous.” Mission planners “just didn’t understand or reason out the orbital mechanics involved” due to competing mission priorities. That’s what we’re going to do here. This column will explain the relative motion of two spacecraft flying closely in Earth orbit and why that motion behaves in such a counter-intuitive way.

Science fiction films often depict spacecrafts that continually alter their trajectories by firing their engines. In the real world, however, fuel is a scarce resource due to the energy required to lift the fuel into space. As a result, spacecraft spend most of their time coasting and only apply short bursts of thrust at opportune moments to adjust their trajectory. This means that a spacecraft operating near, say, Earth or the Moon will spend most of its time traveling in an elliptical orbit, as dictated by Kepler’s second law.

We therefore imagine a target spacecraft $T$ traveling in a circular orbit, and an interceptor spacecraft $I$ attempting to rendezvous with the target and traveling in another elliptical orbit. It’s not essential that the target’s orbit be circular, but it’s a convenient assumption for our purposes. The object of orbital rendezvous is to guide the interceptor so that it comes together with the target.

### Equations of relative motion

The interceptor perceives the target as a fixed point that it wants to move toward. For this reason, we are interested in describing the motion of the interceptor in a coordinate system that rotates with the target, which means we want to understand the evolution of the vector joining the spacecrafts.

As a note to the reader, we’ll walk through some calculations that explain the evolution of this vector and eventually find a rather elegant description. I’d like to give some intuition for what controls the motion, but I won’t give every detail and trust that the reader can fill in as much or as little as they wish. This exposition follows a 1962 technical report by NASA engineer Donald Mueller and a more recent survey by Bradley Carroll. Apollo 11 astronaut Edwin “Buzz” Aldrin’s M.I.T. Ph.D. thesis covers similar ground.

We’ll denote the interceptor’s location by $I$ and the target’s by $T$. The vector ${\mathbf r}$ measures the displacement of the interceptor from the target. The vectors ${\mathbf R}^*$ and ${\mathbf r}^*$ measure the displacement of the target and interceptor from Earth’s center.

Since our goal is to understand the motion of the interceptor $I$ relative to the target $T$, we will focus our attention on ${\mathbf r}$. Moreover, we will consider a coordinate system that rotates with the target $T$ as shown below.

For convenience, we will assume that the target is moving in a circular orbit of radius $R_0$ in the $xy$-plane. We see that ${\mathbf i}$ is a unit vector pointing in the direction of the target’s motion, ${\mathbf j}$ points radially from the center of Earth, and ${\mathbf k}$ is a unit vector pointing out of the page.

The vectors ${\mathbf i}^*$, ${\mathbf j}^*$, and ${\mathbf k}^* = {\mathbf k}$ are fixed. In general, quantities denoted with a $*$ are with respect to this fixed, Earth-centered coordinate system and quantities without a $*$ are with respect to the rotating, target-centered coordinate system.

We can describe $${\mathbf R^*} = R_0\left[\cos(\omega_0T){\mathbf i}^* -\sin(\omega_0t){\mathbf j}^*\right],$$ where $\omega_0$ is the angular velocity of the target. Differentiating twice gives the acceleration $$\frac{d^2}{dt^2}{\mathbf R}^* = R_0\omega_0^2\left[-\cos(\omega_0T){\mathbf i}^* +\sin(\omega_0t){\mathbf j}^*\right].$$

Since gravity is the only force acting on the target, the magnitude of the acceleration vector is $GM/R_0^2$, where $G$ is the gravitational constant and $M$ is Earth’s mass. This tells us that $R_0\omega_0^2 = GM/R_0^2$ so we have $$\omega_0=\sqrt{\frac{GM}{R_0^3}}.$$ Notice that the angular velocity decreases as the target’s altitude increases.

This is really an expression of Kepler’s Third Law for circular orbits. If $P$ is the period of the orbit, then $\omega_0P = 2\pi$, which says that $$P^2 = \frac{4\pi^2}{\omega_0^2} = \frac{4\pi^2}{GM}R_0^3$$ so that the square of the period is proportional to the cube of the radius.

We would like to relate rates of change measured in the fixed, Earth-centered coordinate system to those measured in the rotating, target-centered system. In particular, we are interested in the derivatives $$\frac{d^*}{dt}{\mathbf r},\hspace{24pt} \frac{d}{dt}{\mathbf r},$$ the first of which measures the rate of change of ${\mathbf r}$ in the fixed coordinate system while the second measures the rate of change in the rotating coordinate system.

Writing $${\mathbf r} = x{\mathbf i} + y{\mathbf j} + z{\mathbf k}$$ shows us that $$\frac{d^*}{dt}{\mathbf r} = \left(\frac{d}{dt}x\right){\mathbf i} + \left(\frac{d}{dt}y\right){\mathbf j} + \left(\frac{d}{dt}z\right){\mathbf k} + x\frac{d}{dt}{\mathbf i} + y\frac{d}{dt}{\mathbf j}.$$

Notice that there are some additional terms that come from differentiating the coordinate vectors. Introducing the angular velocity vector ${\mathbf \omega} = -\omega_0{\mathbf k}$ allows us to see, with a little calculation, that $$\frac{d^*}{dt}{\mathbf r} = \frac{d}{dt}{\mathbf r} + \omega\times{\mathbf r}.$$

After differentiating again, one can show that $$\frac{d^2}{dt^2}{\mathbf r} = \frac{{d^*}^2}{dt^2}{\mathbf r} – \omega\times(\omega\times{\mathbf r}) – 2\omega\times\frac{d}{dt}{\mathbf r}.$$

Remember that ${\mathbf r} = {\mathbf r}^* – {\mathbf R}^*$ so that $$\frac{{d^*}^2}{dt^2}{\mathbf r} = \frac{{d^*}^2}{dt^2}{\mathbf r}^* – \frac{{d^*}^2}{dt^2}{\mathbf R}^* = -\frac{GM}{|{\mathbf r}^*|^3}{\mathbf r}^* +\frac{GM}{|{\mathbf R}^*|^3}{\mathbf R}^*.$$ Since the target is moving in a circular orbit of radius $R_0$, we see that ${\mathbf R}^* = R_0{\mathbf j}$. Also, because ${\mathbf r}^* = {\mathbf R}^* + {\mathbf r}$, we have $${\mathbf r}^* = x{\mathbf i} + (R_0+y){\mathbf j} + z{\mathbf k}.$$

Since we are interested in guiding the interceptor to rendezvous with the target, we will assume that the distance between the target and interceptor is much less than the radius of the target’s orbit. For instance, ${\mathbf r} = x{\mathbf i} + y{\mathbf j} + z{\mathbf k}$ may represent a length of a few kilometers, while ${\mathbf R}^* = R_0{\mathbf j}$ represents a length of several thousands of kilometers.

This leads us to approximate $|{\mathbf r}^*| \approx R_0 + y$ and to apply the linear approximation $$\frac{1}{|{\mathbf r}^*|^3} \approx \frac{1}{R_0^3} -\frac{3y}{R_0^4}.$$ With a little algebra, we can write the components of the vector $\frac{d^2}{dt^2}{\mathbf r}$ as \begin{aligned} \frac{d^2x}{dt^2} & = -2\omega_0 \frac{dy}{dt} \\ \frac{d^2y}{dt^2} & = 3\omega_0^2y + 2\omega_0\frac{dx}{dt} \\ \frac{d^2z}{dt^2} & = -\omega_0^2z \end{aligned} These are known as the Hill equations that describe the interceptor’s path in the rotating coordinate frame, and we will see how to find explicit solutions to them.

### Ellipses everywhere

The equation for the $z$-coordinate is a simple harmonic oscillator whose solution is $$z(t) = A\cos(\omega_0t) + B\sin(\omega_0t).$$

Remember that the $z$-axis is fixed in both coordinate systems. This form for $z(t)$ simply expresses the fact that in the fixed, Earth-centric coordinate system, the interceptor is traveling in an elliptical orbit about Earth’s center. More specifically, this expression describes the motion of the interceptor out of the plane of the target’s orbit. To achieve orbital rendezvous, we need to align the plane of the interceptor with that of the target. In practice, this incurs a considerable cost in fuel so care is taken to launch the interceptor in an orbital plane that’s close to that of the target.

We will assume that the interceptor has performed such a maneuver so that its orbital plane agrees with that of the target. In other words, we’ll assume that $z(t) = 0$ constantly and focus on the other two equations: \begin{aligned} \frac{d^2x}{dt^2} & = -2\omega_0 \frac{dy}{dt} \\ \frac{d^2y}{dt^2} & = 3\omega_0^2y + 2\omega_0\frac{dx}{dt} \\ \end{aligned}

We can develop some intuition by thinking about these equations more carefully. Remember that gravity is the only force acting on the interceptor. These equations tell us, however, that in the rotating coordinate system, the interceptor behaves as if it were subjected to two forces per unit mass: \begin{aligned} {\mathbf F}_{\mbox{Tid}} & = 3\omega_0^2y{\mathbf j} \\ {\mathbf F}_{\mbox{Cor}} & = -2\omega_0\frac{dy}{dt}{\mathbf i}+2\omega_0\frac{dy}{dt}{\mathbf j} = -2\omega\times\frac{d}{dt}{\mathbf r}. \\ \end{aligned}

The first ${\mathbf F}_{\mbox{Tid}}$, known as the tidal force, tends to push the interceptor away from the target’s circular orbit and makes $y=0$ an unstable equilibrium.

 The tidal force The Coriolis force

The second force, the Coriolis force, causes the interceptor to rotate counterclockwise as it moves in the rotating coordinate system.

Imagine what happens if we start just a little above the $x$-axis. The tidal force pushes us upward further away from the axis and causes us to pick up speed. The Coriolis force then begins to pull us to the left.

One simple solution to the equations of motion in the rotating frame, which can be verified directly from the equations, is given by \begin{aligned} x(t) & = a\cos(\omega_0 t) \\ y(t) & = \frac a2\sin(\omega_0 t). \end{aligned} Of course, we know by Kepler’s Second Law that the interceptor’s trajectory is elliptical in the fixed, Earth-centered frame. What’s remarkable about this solution is that it says that the interceptor’s path in the rotating frame is also elliptical.

The following figures illustrate this solution in both frames along with the positions at equal time intervals. It’s worth taking a moment to study the relationship between them.

 In the rotating frame In the fixed frame

For this special solution, one can check that $$\frac{d^2}{dt^2}{\mathbf r} = -\omega_0^2{\mathbf r}.$$ This is the equation for a simple harmonic oscillator, the kind of motion observed by a mass on a spring. In other words, the interceptor moves as if it were connected to the target by a spring, which gives some insight into why orbital rendezvous is so counter-intuitive.

While this solution is certainly special, it shares many features with the general solution. Denoting the initial position by $(x_0, y_0)$ and the initial velocity by $(\dot{x}_0, \dot{y}_0)$, one can check that the general solution is \begin{aligned} x(t) & = \left(6y_0 + \frac{4\dot{x}_0}{\omega_0}\right) \sin(\omega_0 t) + \frac{2\dot{y}_0}{\omega_0}\cos(\omega_0t) + \left[x_0 – \frac{2\dot{y}_0}{\omega_0} – (3\dot{x}_0 + 6\omega_0y_0)t\right] \\ y(t) & = -\left(3y_0 + \frac{2\dot{x}_0}{\omega_0}\right)\cos(\omega_0 t) + \frac{\dot{y}_0}{\omega_0}\sin(\omega_0t) + \left[4y_0 + \frac{2\dot{x}_0}{\omega_0}\right] \end{aligned}

These expressions, known as the Clohessy-Wiltshire equations, initially look complicated, but a little study reveals a suprising simplicity. Defining \begin{aligned} C & = 3y_0 + \frac{2\dot{x}_0}{\omega_0} \\ D & = \frac{\dot{y}_0}{\omega_0} \\ x_c & = x_0 – \frac{2\dot{y}_0}{\omega_0} – (3\dot{x}_0 + 6\omega_0y_0)t \\ y_c & = 4y_0 + \frac{2\dot{x}_0}{\omega_0}, \end{aligned} enables us to write \begin{aligned} x(t) = 2C\sin(\omega_0 t) + 2D\cos(\omega_0 t) + x_c \\ y(t) = -C\cos(\omega_0 t) + D\sin(\omega_0 t) + y_c. \end{aligned}

This shows that the interceptor, once again, is traveling in an elliptical path centered about the point $(x_c, y_c)$. In fact, the semi-major axis $a$ and semi-minor axis $b$ of the ellipse are \begin{aligned} a & = 2\sqrt{C^2+D^2} \\ b & = a/2, \end{aligned} which shows that the eccentricity of the path is always $\sqrt{3}/2$.

Notice, however, that the point $(x_c, y_c)$ is moving since $x_c$ has velocity $$v_{\mbox{drift}} = – (3\dot{x}_0 + 6\omega_0y_0)t = -\frac32\omega_0 y_c.$$ This is important because it says that the center of the ellipse drifts left if $y_c \gt 0$ and drifts right if $y_c \lt 0$.

### Parking orbits

Here are some special orbits that Mueller calls parking orbits.

Remember that the orbit is determined by its initial conditions $(x_0,y_0)$ and $(\dot{x}_0,\dot{y}_0)$. With the right choice of initial conditions, we can arrange that $C=D=0$ so that the ellipse collapses to a point. Mueller’s type I parking orbit has the interceptor in a circular orbit, as seen in the fixed frame, at the same altitude as the target. In the rotating frame, the interceptor is not moving relative to the target.

A type III parking orbit has the interceptor again in a circular orbit but at a different altitude than the target. Since the angular velocity decreases with the altitude, the interceptor appears to move horizontally in the rotating frame. The Clohessy-Wiltshire equations explain this by noting that the center of the ellipse moves left when $y_c\gt0$ and right when $y_c\lt0$.

Things get a little more interesting if we allow $C$ and $D$ to be nonzero so that the interceptor moves on an elliptical trajectory in the rotating frame. A type II parking orbit has $y_c = 0$ so that the center of the ellipse is centered on the $x$-axis. In this case, the center does not move so the elliptical orbit is stationary. The special solution we looked at earlier is an example. The orbit depicted here shows how an astronaut could retrieve a tool accidentally dropped by just waiting one orbital period for its return.

A type IV orbit has the interceptor traveling an elliptical orbit whose center is drifting. The orbit shown has $y_c \gt 0$ so the center of the ellipse is moving to the left.

With any solution, one can check that the interceptor moves as if connected by a spring to the moving center of the ellipse, which again speaks to the counter-intuitive nature of the motion.

These orbits explain the behavior of Gemini IV, which we described in the introduction. Imagine that the interceptor is initially in a type I parking orbit, traveling in the same circular orbit as the target and 500 meters away.

McDivitt aimed at the target and fired Gemini’s thruster, which would make $\dot{x}_0 \gt 0$ and move the craft into the type IV orbit shown here. This explains why the target paradoxically appeared to move down and away from the capsule.

One scenario Carroll considers has an astronaut stranded outside the ISS at an initial position of $(100,100)$, which is 100 meters in front of the space station and 100 meters above. Applying a momentary thrust to create a velocity of 1 meter per second aimed directly at the space station leads to a type IV orbit that badly misses the station.

So how can we put the interceptor on a path to dock with the target? We’ll assume the interceptor is at an initial position $(x_0, y_0)$ and choose a time $T$ at which we’d like to dock. Setting $x(T) = 0$ and $y(T) = 0$ gives the equations \begin{aligned} x(T) = 0 & = \left(6y_0 + \frac{4\dot{x}_0}{\omega_0}\right) \sin(\omega_0 T) + \frac{2\dot{y}_0}{\omega_0}\cos(\omega_0T) + \left[x_0 – \frac{2\dot{y}_0}{\omega_0} – (3\dot{x}_0 + 6\omega_0y_0)T\right] \\ y(T) = 0 & = -\left(3y_0 + \frac{2\dot{x}_0}{\omega_0}\right)\cos(\omega_0 T) + \frac{\dot{y}_0}{\omega_0}\sin(\omega_0T) + \left[4y_0 + \frac{2\dot{x}_0}{\omega_0}\right] \end{aligned}

Once again, these equations appear complicated, but notice that we know everything except the components of the initial velocity. This means that these are two linear equations for the two components of the unknown initial velocity $(\dot{x}_0, \dot{y}_0)$. Once we have found those components, the interceptor applies a momentary thrust to change its velocity to $(\dot{x}_0, \dot{y}_0)$ bringing it the target’s position at time $T$.

Carroll’s paper shows how to apply this idea to reproduce the rendezvous technique used in the Apollo 11 mission as Neil Armstrong and Buzz Aldrin returned from the lunar surface to dock with Michael Collins in the command module.

### Summary

NASA learned from Gemini IV’s failure to rendezvous with the booster target, and several months later, the Gemini VI mission successfully navigated to within 130 feet of Gemini VII. The pilot of Gemini VI, Walter Schirra, later explained that the difficulty was getting to that distance but that completing docking from there is relatively straightforward. Carroll’s paper examines that claim and confirms that line-of-sight flying is effective within about 100 feet of the target.

The International Space Station in 2021.

In the past, spacecraft docking with the International Space Station would fly in formation with the station, which would grab the craft with the Canadarm2 and manually complete the docking procedure. Recently, however, SpaceX has automated the docking process so that software guides the craft rather than astronauts who are not involved at all.

Would you like to try to dock with the ISS? This SpaceX simulator gives you the chance.

# Lost (and found) in space

There are standard references, such as the Sloan Digital Sky Survey, that provide an atlas for the stars. What's needed is a way to search through this enormous atlas to find the view presented by a given image.

David Austin
Grand Valley State University

My son recently sent me this picture of the comet Neowise that he took from southern California in the summer of 2020.

Photo credit: Sam Austin

I had seen Neowise and knew roughly where it appeared, but being a comet, of course, its position changed over time. I wondered if there was a way to locate precisely where it was at the particular instant this picture was taken.

Well, there's an app for that. I uploaded the image to Astrometry.net and learned 36.1533 seconds later that we were looking here:

The image on the left places the picture on the celestial sphere, an imagined distant sphere centered at Earth. The coordinates depict declination (Dec), the angular distance north or south of the celestial equator, a projection of Earth's equator, and right ascension (RA), the angular distance east of the First Point of Ares, a reference point on the celestial equator. A closer view appears on the right where we see that we're looking at a portion of the constellation Ursa Major, which may be more familiar as the Big Dipper.

Here are some more details:

 Center (RA, Dec):  (135.836, 55.212)  Center (RA, hms):  09h 03m 20.603s  Center (Dec, dms):  +55° 12' 44.873"  Size:  16 x 10.7 deg  Radius:  9.616 deg  Pixel scale:  9.6 arcsec/pixel  Orientation:  Up is 313 degrees E of N

Astrometry.net also labels other features in the picture:

Measuring the positions and motions of celestial bodies falls to the branch of astronomy known as astrometry, and the problem of placing an image into a standard reference, as illustrated above, is called calilbrating the image's astrometry. The work of Astrometry.net, calibrating images without human intervention, is known as auto-calibration.

There are a few reasons why this is an important tool. First, calibrating an image taken by a camera mounted on a spacecraft in a fixed pose can be used to automate the determination of the spacecraft's orientation. Similarly, image calibration can be used to help a telescope track a target as Earth rotates.

Perhaps more importantly, decades of research have created a vast number of astronomical images. Sharing these images with a wide community of researchers is hampered by the fact that the meta-data surrounding these images is often inaccessible, of poor quality, or presented in a variety of formats. Automating the calibration process allows one to construct high quality meta-data in a standard format, enabling astronomical data to be shared and used more easily.

In broad terms, auto-calibration is a search problem similar to Google's internet search engine. There are standard references, such as the Sloan Digital Sky Survey, that provide an atlas for the stars. What's needed is a way to search through this enormous atlas to find the view presented by a given image.

Let's go behind the scenes to understand how Astrometry.net performs this search. There are really two phases. We begin with a known reference, like the Sloan Survey, and create an "index," as we'll soon describe. We only need to do this once so we don't mind committing time and computational power to this task. Next, we develop a means of searching the index to locate a specific image.

The basic idea is to detect certain features in an image and describe them in a way that is both independent of the orientation and scale of the image and easily searchable. The features we will use are configurations of four stars, called quads.

Given an astronomical image, it's fairly straightforward to pick out the brightest stars using standard image processing software. Using the OpenCV software library, I was able to pick out the 100 brightest stars from our original image. To make them more apparent, I reversed the image so that brighter parts of the original image appear darker here.

We will construct a description of quads, configurations of four stars, that is independent of their orientation and scale in the image. To illustrate, here's a quad with stars labeled $A$, $B$, $C$, and $D$.

We label the four stars so that $A$ and $B$ are separated by the greatest distance. We then use those two stars to form an orthogonal coordinate system with $A$ at the origin and $B$ at $(1,1)$. We only consider quads for which the other two stars, $C$ and $D$, are within the unit square in this coordinate system.

If $(x_C, y_C)$ and $(x_D, y_D)$ are the coordinates of the interior stars, we associate the four-dimensional point $(x_C, y_C, x_D, y_D)$ to the quad.

Of course, there's some ambiguity in how we label $A$ and $B$. Swapping the labels on $A$ and $B$ performs the transformation: $$(x_C, y_C, x_D, y_D) \mapsto (1-x_C, 1-y_C, 1-x_D, 1-y_D).$$ We choose the labeling of $A$ and $B$ that gives $x_C + x_D \leq 1$. There are also two choices for how we label $C$ and $D$, and we choose the one with $x_C\leq x_D$. With these choices, the point $(x_C, y_C, x_D, y_D)$ is uniquely associated to the quad. For instance, the quad above is uniquely associated to $(0.36367, 0.53644, 0.61088, 0.32359)$.

This association works well for our application since it doesn't change if the quad appears in a different orientation. For instance, if the quad is translated, rotated, or scaled, we still obtain the same four-dimensional point $(x_C, y_C, x_D, y_D)$.

To create an index, Astrometry.net chooses a collection of quads from a reference, such as the Sloan Survey, that covers the entire sky. As we'll see next, representing quads as four-dimensional points allows us to easily search for specific quads. When presented with an image to be calibrated, we find a set of quads in the image and then search the index for them. When we find a match, it's straightforward to construct the coordinate transformation between the pixels of the image and the celestial sphere.

### kd-Trees

Due to noise present in astronomical images, we cannot expect that the four-dimensional point associated to a quad obtained from our image will exactly match a point in the index. Images are distorted, for instance, by the particular optics of a telescope and by the refraction of Earth's atmosphere. If we're given a quad, what we'd really like to do is find all the nearby quads and consider each as a potential match.

But how can we efficiently search through all the quads to find nearby quads? We organize our four-dimensional points into a $kd$-tree. We'll illustrate by organizing the following collection of two-dimensional points into a binary search tree.

Any set of points sits inside a bounding box, the smallest rectangle that contains the points and whose sides are parallel to the coordinate axes. The set of all points and their bounding box constitutes the root of the tree.

Next, we determine the dimension along which the points are most widely separated. In our example, the points are more spread out in the horizontal direction so we divide the points into two equal sets at the median of the horizontal coordinates. The points to the left and their bounding box form the left child of the root and the points to the right form the right child.

Now, continue subdividing until each point lies in a single node. The resulting tree structure is known as a $2d$-tree due to the two-dimensional nature of the point set.

Suppose that we are presented with a new point $p$ and a distance $r$, and we'd like to find all the points in our original set that are within a distance $r$ of $p$.

Given a bounding box, we can compute the minimum distance from $p$ to the bounding box.

To begin the search, start with the root of the tree and ask whether the minimum distance from $p$ to the root's bounding box is less than $r$. If not, there are no points in the point set within a distance $r$ of $p$, and our search concludes. If the minimum distance is less than $r$, then we ask the same question of the two children and continue the search down the tree.

In this way, we eventually find all the points within a distance $r$ of $p$. In the example illustrated above, for instance, we have searched through 150 points by examining only 14 bounding boxes to find five points within the given distance.

This is an example of a $2d$-tree. Astronmetry.net constructs an index by organizing the quads present in the reference into a searchable $4d$-tree.

### Verifying a match

Once we have identified a quad in our image, we search through the $4d$-tree for quads that are within a reasonable tolerance. Since the index does not contain every quad, it's possible we do not find any quads that match. However, there are plenty of quads in our image so we continue with another. Usually, there will be a large number of possible matches so we need to determine which is the best.

A quad in the index that is close to a quad in our image produces an "alignment" of the image onto the celestial sphere, as returned by my search.

With many nearby quads, how do we know which to accept? If a quad produces the correct alignment, we will have two lists of stars, one in the image and one in the reference, and a correspondence between them. Therefore, we search through a second $kd$-tree to find stars that are near to those in the reference quad and check to see if the corresponding stars are present in the image. Of course, even with the correct alignment, it's possible that stars do not appear in their expected location as they may be occluded by a planet, satellite, or comet or they may not be correctly identified as stars when the image is processed.

Due to the probabilistic nature of this question, the choice of whether to accept a proposed alignment is made using Bayesian decision theory. This enables Astrometry.net to set a high threshold for accepting a potential match, which eliminates most false positives. In tests, Astrometry.net is able to correctly recognize 99.9% of testing images with no false positives.

### Summary

The central idea that leads to the impressive results achieved by Astrometry.net is the ability to identify certain features, quads, that can be expressed in a form that is easily searchable.

One can imagine other choices. For instance, using configurations of three stars, "tris," would allow us to represent features using 2-dimensional points. Our index would then squeeze all of our reference features into the unit square. Remember that noise in the images we are processing means that there is uncertainty in the coordinates of these features. Suppose, for instance, that we can only guarantee that a coordinate lies in an interval of width 0.1. The tri then lies somewhere in a square of area 0.01, which is 1% of the area of the unit square. We can detect at most 100 distinct tris.

Because quads live in 4-dimensional space, however, a quad would only occupy a volume of 0.0001 under these assumptions, which means we could distinguish 10,000 distinct quads.

Clearly, using a configuration of five stars would enable us to distinguish an even greater number of configurations. The trade-off, however, is that searching the $kd$-tree takes longer because we have more coordinates to work with. The choice to use quads is a compromise between the efficiency of the search and the number of nearby quads, and resulting proposed alignments, that need to be searched.

Astrometry.net, created by Dustin Lang, David Hogg, Keir Mierle, Michael Blanton, and Sam Roweis, enables a vast collection of images, including those made by amateur astronomers, to be used in astronomical research. What's presented here is a fairly broad overview that skims over many details of the implementation. Readers wanting to know more are encouraged to read the collaborators' paper or Lang's thesis, which presents a complete picture.

# Pooling strategies for COVID-19 testing

How could 10 tests yield accurate results for 100 patients?

David Austin
Grand Valley State University

While the COVID-19 pandemic has brought tragedy and disruption, it has also provided a unique opportunity for mathematics to play an important and visible role in addressing a pressing issue facing our society.

By now, it’s well understood that testing is a crucial component of any effective strategy to control the spread of the SARS-CoV-2 coronavirus. Countries that have developed effective testing regimens have been able, to a greater degree, to resume normal activities, while those with inadequate testing have seen the coronavirus persist at dangerously high levels.

Developing an effective testing strategy means confronting some important challenges. One is producing and administering enough tests to form an accurate picture of the current state of the virus’ spread. This means having an adequate number of trained health professionals to collect and process samples along with an adequate supply of reagents and testing machines. Furthermore, results must be available promptly. A person is who unknowingly infected can transmit the virus to many others in a week, so results need to be available in a period of days or even hours.

One way to address these challenges of limited resources and limited time is to combine samples from many patients into testing pools strategically rather than testing samples from each individual patient separately. Indeed, some well-designed pooling strategies can decrease the number of required tests by a factor of ten; that is, it is possible to effectively test, say, 100 patients with a total of 10 tests.

On first thought, it may seem like we’re getting something from nothing. How could 10 tests yield accurate results for 100 patients? This column will describe how some mathematical ideas from compressed sensing theory provide the key.

One of the interesting features of the COVID-19 pandemic is the rate at which we are learning about it. The public is seeing science done in public view and in real time, and new findings sometimes cause us to amend earlier recommendations and behaviors. This has made the job of communicating scientific findings especially tricky. So while some of what’s in this article may require updating in the near future, our aim is rather to focus on mathematical issues that should remain relevant and trust the reader to update as appropriate.

### Some simple pooling strategies

While the SARS-CoV-2 virus is new, the problem of testing individuals in a large population is not. Our story begins in 1943 when Robert Dorfman proposed the following simple method for identifying syphilitic men called up for induction through the war time draft.

A 1941 poster encourages syphilis testing (Library of Congress)

Suppose we have samples from, say, 100 patients. Rather than testing each of the samples individually, Dorfman suggested grouping them into 10 pools of 10 each and testing each pool.

If the test result of a pool is negative, we conclude that everyone in that pool is free of infection. If a pool tests positively, then we test each individual in the pool.

In the situation illustrated above, two of the 100 samples are infected, so we perform a total of 30 tests to identify them: 10 tests for the original 10 pools followed by 20 tests for each member of the two infected pools. Here, the number of tests performed is 30% of the number required had we tested each individual separately.

Of course, there are situations where this strategy is disadvantageous. If there is an infected person in every pool, we end up performing the original 10 tests and follow up by then testing each individual. This means we perform 110 tests, more than if we had just tested everyone separately.

What’s important is the prevalence $p$ of the infection, the expected fraction of infected individuals we expect to find or the probability that a random individual is infected. If the prevalence is low, it seems reasonable that Dorfman’s strategy can lead to a reduction in the number of tests we expect to perform. As the prevalence grows, however, it may no longer be effective.

It’s not too hard to find how the expected number of tests per person varies with the prevalence. If we arrange $k^2$ samples into $k$ pools of $k$ samples each, then the expected number of tests per person is $$E_k = \frac1k + (1-(1-p)^k).$$ When $k=10$, the expected number $E_{10}$ is shown on the right. Of course, testing each individual separately means we use one test per person so Dorfman’s strategy loses its advantage when $E_k\geq 1$. As the graph shows, when $p>0.2$, meaning there are more than 20 infections per 100 people, we are better off testing each individual.

Fortunately, the prevalence of SARS-CoV-2 infections is relatively low in the general population. As the fall semester began, my university initiated a randomized testing program that showed the prevalence in the campus community to be around $p\approx 0.01$. Concern is setting in now that that number is closer to 0.04. In any case, we will assume that the prevalence of infected individuals in our population is low enough to make pooling a viable strategy.

Of course, no test is perfect. It’s possible, for instance, that an infected sample will yield a negative test result. It’s typical to characterize the efficacy of a particular testing protocol using two measures: sensitivity and specificity. The sensitivity measures the probability that a test returns a positive result for an infected sample. Similarly, the specificity measures the probability that a test returns a negative result when the sample is not infected. Ideally, both of these numbers are near 100%.

Using Dorfman’s pooling method, the price we pay for lowering the expected number of tests below one is a decrease in sensitivity. Identifying an infected sample in this two-step process requires the test to correctly return a positive result both times we test it. Therefore, if $S_e$ is the sensitivity of a single test, Dorfman’s method has a sensitivity of $S_e^2$. For example, a test with a sensitivity of 95% yields a sensitivity around 90% when tests are pooled.

There is, however, an increase in the specificity. If a sample is not infected, testing a second time increases the chance that we detect it as such. One can show that if the sensitivity and specificity are around 95% and the prevalence at 1%, then pooling 10 samples, as shown above, raises the specificity to around 99%.

### Some modifications of Dorfman’s method

It’s possible to imagine modifying Dorfman’s method in several ways. For instance, once we have identified the infected pools in the first round of tests, we could apply a pooling strategy on the smaller set of samples that still require testing.

A second possibility is illustrated below where 100 samples are imagined in a square $10\times10$ array. Each sample is included in two pools according to its row and column so that a total of 20 tests are performed in the first round. In the illustration, the two infected samples lead to positive results in four of the pools, two rows and two columns.

We know that the infected samples appear at the intersection of these two rows and two columns, which leads to a total of four tests in the second round. Once again, it’s possible to express $E_k$, the number of expected tests per individual in terms of the prevalence $p$. If we have $k^2$ tests arranged in a $k\times k$ array, we see that $$E_k =\frac2k + p + (1-p)(1-(1-p)^{k-1},$$ if we assume that the sensitivity and specificity are 100%.

The graph at right shows the expected number of tests using the two-dimensional array, assuming $k=10$, in red with the result using Dorfman’s original method in blue. As can be seen, the expected number of tests is greater using the two-dimensional approach since we invest twice as many tests in the first round of testing. However each sample is included in two tests in the initial round. For an infected sample to be misidentified, both tests would have to return negative results. This means that the two-dimensional approach is desirable because the sensitivity of this strategy is greater than the sensitivity of the individual tests and we still lower the expected number of tests when the prevalence is low.

While it is important to consider the impact that any pooling strategy has on these important measures, our focus will, for the most part, take us away from discussions of specificity and sensitivity. See this recent Feature column for a deep dive into their relevance.

### Theoretical limits

There has been some theoretical work on the range of prevalence values over which pooling strategies are advantageous. In the language of information theory, we can consider a sequence of test results as an information source having entropy $I(p) = -p\log_2(p) – (1-p)\log_2(1-p)$. In this framework, a pooling strategy can be seen as an encoding of the source to effectively compress the information generated.

Sobel and Groll showed that $E$, the expected number of tests per person, for any effective pooling method must satisfy $E \geq I(p)$. On the right is shown this theoretical limit in red along with the expected number of tests under the Dorfman method with $k=10$.

Further work by Ungar showed that when the prevalence grows above the threshold $p\geq (3-\sqrt{5})/2 \approx 38\%$, then we cannot find a pooling strategy that is better than simply testing everyone individually.

### RT-qPCR testing

While there are several different tests for the SARS-CoV-2 virus, at this time, the RT-qPCR test is considered the “gold standard.” In addition to its intrinsic interest, learning how this test works will help us understand the pooling strategies we will consider next.

A sample is collected from a patient, often through a nasal swab, and dispersed in a liquid medium. The test begins by converting the virus’ RNA molecules into complementary DNA through a process known as reverse transcription (RT). A sequence of amplification cycles, known as quantitative polymerase chain reaction (qPCR) then begins. Each cycle consists of three steps:

• The liquid is heated close to boiling so that the transcribed DNA denatures into two separate strands.
• Next the liquid is cooled so that a primer, which has been added to the liquid, attaches to a DNA strand along a specific sequence of 100-200 nucleotides. This sequence characterizes the complementary DNA of the SARS-CoV-2 virus and is long enough to uniquely identify it. This guarantees that the test has a high sensitivity. Attached to the primer is a fluorescent marker.
• In a final warming phase, additional nucleotides attach to the primer to complete a copy of the complementary DNA molecule.

Through one of these cycles, the number of DNA molecules is essentially doubled.

The RT-qPCR test takes the sample through 30-40 amplification cycles resulting in a significant increase in the number of DNA molecules, each of which has an attached fluorescent marker. After each cycle, we can measure the amount of fluorescence and translate it into a measure of the number of DNA molecules that have originated from the virus.

The fluorescence, as it depends on the number of cycles, is shown below. A sample with a relatively high viral load will show significant fluorescence at an early cycle. The curves below represent different samples and show how the measured fluorescence grows through the amplification cycles. Moving to the right, each curve is associated with a ten-fold decrease in the initial viral load of the sample. Taken together, these curves represent a range of a million-fold decrease in the viral load. In fact, the test is sensitive enough to detect a mere 10 virus particles in a sample.

The FDA has established a threshold, shown as the red horizontal line, above which we can conclude that the SARS-CoV-2 virus is present in the original sample. However, the test provides more than a binary positive/negative result; by matching the fluorescence curve from a particular sample to the curves above, we can infer the viral load present in the original sample. In this way, the test provides a quantitative measure of the viral load that we will soon use in developing a pooling method.

Pooling samples from several individuals, only one of whom is infected, will dilute the infected sample. The effect is simply that the fluorescence response crosses the FDA threshold in a later cycle. There is a limit, however. Because noise can creep into the fluorescence readings around cycle 40, FDA standards state that only results from the first 39 cycles are valid.

Recent studies by Bilder and Yelin et al investigated the practical limits of pooling samples in the RT-qPCR test and found that a single infected sample can be reliably detected in a pool of up to 32. (A recent study by the CDC, however, raises concerns about detecting the virus using the RT-qPCR test past the 33rd amplification cycle. )

Dorfman’s pooling method and its variants described above are known as adaptive methods because they begin with an initial round of tests and use those results to determine how to proceed with a second round. Since the RT-qPCR test requires 3 – 4 hours to complete, the second round of testing causes a delay in obtaining results and ties up testing facilities and personnel. A non-adaptive method, one that produces results for a group of individuals in a single round of tests, would be preferable.

Several non-adaptive methods have recently been proposed and are even now in use, such as P-BEST. The mathematical ideas underlying these various methods are quite similar. We will focus on one called Tapestry.

We first collect samples from $N$ individuals and denote the viral loads of each sample by $x_j$. We then form these samples into $T$ pools in a manner to be explained a little later. This leads to a pooling matrix $A_i^j$ where $A_i^j = 1$ if the sample from individual $j$ is present in the $i^{th}$ test and 0 otherwise. The total viral load in the $i^{th}$ test is then $$y_i = \sum_{j} A_i^j x_j,$$ which can be measured by the RT-qPCR test. In practice, there will be some uncertainty in measuring $y_i$, but it can be dealt with in the theoretical framework we are describing.

Now we have a linear algebra problem. We can express the $T$ equations that result from each test as $$\yvec = A\xvec,$$ where $\yvec$ is the known vector of test results, $A$ is the $T\times N$ pooling matrix, and $\xvec$ is the unknown vector of viral loads obtained from the patients.

Because $T\lt N$, this is an under-determined system of equations, which means that we cannot generally expect to solve for the vector $\xvec$. However, we have some additional information: because we are assuming that the prevalence $p$ is low, the vector $\xvec$ will be sparse, which means that most of its entries are zero. This is the key observation on which all existing non-adaptive pooling methods rely.

It turns out that this problem has been extensively studied within the area of compressed sensing, a collection of techniques in signal processing that allow one to reconstruct a sparse signal from a small number of observations. Here is an outline of some important ideas.

First, we will have occassion to consider a couple of different measures of the size of a vector.

• First, $\norm{\zvec}{0}$ is the number of nonzero entries in the vector $\zvec$. Because the prevalence of SARS-CoV-2 positive samples is expected to be small, we are looking for a solution to the equation $\yvec=A\xvec$ where $\norm{\xvec}{0}$ is small.
• The 1-norm is $$\norm{\zvec}{1} = \sum_j~|z_j|,$$
• and the 2-norm is the usual Euclidean length: $$\norm{\zvec}{2} = \sqrt{\sum_j z_j^2}$$

Remember that an isometry is a linear transformation that preserves the length of vectors. With the usual Euclidean length of a vector $\zvec$ written as $||\zvec||_2$, then the matrix $M$ defines an isometry if $||M\zvec||_2 = ||\zvec||_2$ for all vectors $\zvec$. The columns of such a matrix form an orthonormal set.

We will construct our pooling matrix $A$ so that it satisfies a restricted isometry property (RIP), which essentially means that small subsets of the columns of $A$ are almost orthonormal. More specifically, if $R$ is a subset of $\{1,2,\ldots, N\}$, we denote by $A^R$ the matrix formed by pulling out the columns of $A$ labelled by $R$; for instance, $A^{\{2,5\}}$ is the matrix formed from the second and fifth columns of $A$. For a positive integer $S$, we define a constant $\delta_S$ such that $$(1-\delta_S)||\xvec||_2 \leq ||A^R\xvec||_2 \leq (1+\delta_S)||\xvec||_2$$ for any set $R$ whose cardinality is no more than $S$. If $\delta_S = 0$, then the matrices $A^R$ are isometries, which would imply that the columns of $A^R$ are orthonormal. More generally, the idea is that when $\delta_S$ is small, then the columns of $A^R$ are close to being orthonormal.

Let’s see how we can use these constants $\delta_S$.

Because we are assuming that the prevalence $p$ is low, we know that $\xvec$, the vector of viral loads, is sparse. We will show that a sufficiently sparse solution to $\yvec = A\xvec$ is unique.

For instance, suppose that $\delta_{2S} \lt 1$, that $\xvec_1$ and $\xvec_2$ are two sparse solutions to the equation $\yvec = A\xvec$, and that $\norm{\xvec_1}{0}, \norm{\xvec_1}{0} \leq S$. The last condition means that $\xvec_1$ and $\xvec_2$ are sparse in the sense that they have fewer than $S$ nonzero components.

Now it follows that $A\xvec_1 = A\xvec_2 = \yvec$ so that $A(\xvec_1-\xvec_2) = 0$. In fact, if $R$ consists of the indices for which the components of $\xvec_1-\xvec_2$ are nonzero, then $A^R(\xvec_1-\xvec_2) = 0$.

But we know that the cardinality of $R$ equals $\norm{\xvec_1-\xvec_2}{0} \leq 2S$, which tells us that $$0 = ||A^R(\xvec_1-\xvec_2)||_2 \geq (1-\delta_{2S})||\xvec_1-\xvec_2||_2.$$ Because $\delta_{2S}\lt 1$, we know that that $\xvec_1 – \xvec = 0$ or $\xvec_1 = \xvec_2$.

Therefore, if $\delta_{2S} \lt 1$, any solution to $\yvec=A\xvec$ with $\norm{\xvec}{0} \leq S$ is unique; that is, any sufficiently sparse solution is unique.

Now that we have seen a condition that implies that sparse solutions are unique, we need to explain how we can find sparse solutions. Candès and Tao show, assuming $\delta_S + \delta_{2S} + \delta_{3S} \lt 1$, how we can find a sparse solution to $\yvec = A\xvec$ with $\norm{\xvec}{0} \leq S$ by minimizing: $$\min \norm{\xvec}{1} ~~~\text{subject to}~~~ \yvec = A\xvec.$$ This is a convex optimization problem, and there are standard techniques for finding the minimum.

Why is it reasonable to think that minimizing $\norm{x}{1}$ will lead to a sparse solution? Let’s think visually about the case where $\xvec$ is a 2-dimensional vector. The set of all $\xvec$ satisfying $\norm{\xvec}{1} = |x_1| + |x_2| \leq C$ for some constant $C$ is the shaded set below:

Notice that the corners of this set fall on the coordinate axes where some of the components are zero. If we now consider solutions to $\yvec=A\xvec$, seen as the line below, we see that the solutions where $\norm{\xvec}{1}$ is minimal fall on the coordinates axes. This forces some of the components of $\xvec$ to be zero and results in a sparse solution.

This technique is related to one called the lasso (least absolute shrinkage and selection operator), which is well known in data science where it is used to eliminate unnecessary features from a data set.

All that remains is for us to find a pooling matrix $A$ that satisfies $\delta_{S} + \delta_{2S} + \delta_{3S} \lt 1$ for some $S$ large enough to find vectors $\xvec$ whose sparsity $\norm{\xvec}{0}$ is consistent with the expected prevalence. There are several ways to do this. Indeed, a matrix chosen at random will work with high probability, but the application to pooling SARS-CoV-2 samples that we have in mind leads us to ask that $A$ satisfy some additional properties.

The Tapestry method uses a pooling matrix $A$ formed from a Steiner triple system, an object studied in combinatorial design theory. For instance, one of Tapestry’s pooling matrices is shown below, where red represents a 1 and white a 0.

This is a $16\times40$ matrix, which means that we perform 16 tests on 40 individuals. Notice that each individual’s sample appears in 3 tests. This is a relatively low number, which means that the viral load in a sample is not divided too much and that, in the laboratory, time spent pipetting the samples is minimzed. Each test consists of samples from about eight patients, well below the maximum of 32 needed for reliable RT-qPCR readings.

It is also important to note that two samples appear together in at most one test. Therefore, if $A^j$ is the $j^{th}$ column of $A$, it follows that the dot product $A^j\cdot A^k = 0$ or $1$. This means that two columns are either orthogonal or span an angle of $\arccos(1/3) \approx 70^\circ$. If we scale $A$ by $1/\sqrt{3}$, we therefore obtain a matrix whose columns are almost orthonormal and from which we can derive the required condition, $\delta_S + \delta_{2S} + \delta_{3S} \lt 1$ for some sufficiently large value of $S$.

There is an additional simplification we can apply. For instance, if we have a sample $x_j$ that produces a negative result $y_i=0$ in at least one test in which it is included, then we can conclude that $x_j = 0$. This means that we can remove the component $x_j$ from the vector $\xvec$ and the column $A^j$ from the matrix $A$. Removing all these sure negatives often leads to a dramatic simplication in the convex optimization problem.

Tapestry has created a variety of pooling matrices that can be deployed across a range of prevalences. For instance, a $45\times 105$ pooling matrix, which means we perform 45 tests on 105 individuals, is appropriate when the prevalence is roughly 10%, a relatively high prevalence.

However, there is also a $93\times 961$ pooling matrix that is appropriate for use when the prevalence is around 1%. Here we perform 93 tests on 961 patients in pools of size 32, which means we can test about 10 times the number of patients with a given number of tests. This is a dramatic improvement over performing single tests on individual samples.

If the prevalence turns out to be too high for the pooling matrix used, the Tapestry algorithm detects it and fails gracefully.

Along with non-adaptive methods comes an increase in the complexity of their implementation. This is especially concerning since reliability and speed are crucial. For this reason, the Tapestry team built an Android app that guides a laboratory technician though the pipetting process, receives the test results $y_i$, and solves for the resulting viral loads $x_j$ returning a list of positive samples.

Using both simulated and actual lab data, the authors of Tapestry studied the sensitivity and specificity of their algorithm and found that it performs well. They also compared the number of tests Tapestry performs with Dorfman’s adaptive method and found that Tapestry requires many fewer tests, often several times fewer, in addition to finishing in a single round.

### Summary

As we’ve seen here, non-adaptive pooling provides a significant opportunity to improve our testing capacity by increasing the number of samples we can test, decreasing the amount of time it takes to obtain results, and decreasing the costs of testing. These improvements can play an important role in a wider effort to test, trace, and isolate infected patients and hence control the spread of the coronavirus.

In addition, the FDA recently gave emergency use authorization for the use of these ideas. Not only is there a practical framework for deploying the Tapestry method, made possible by their Android app, it’s now legal to do so.

Interestingly, the mathematics used here already existed before the COVID-19 pandemic. Dating back to Dorfman’s original work of 1943, group pooling strategies have continued to evolve over the years. Indeed, the team of Shental et al. introduced P-BEST, their SARS-CoV-2 pooling strategy, as an extension of their earlier work to detect rare alleles associated to some diseases.

### Thanks

Mike Breen, recently of the AMS Public Awareness Office, oversaw the publication of the monthly Feature Column for many years. Mike retired in August 2020, and I’d like to thank him for his leadership, good judgment, and never-failing humor.

# Transmitting Data with Polar Codes

This then is the significance of Arikan’s polar codes: they provide encodings for an important class of channels that enable us to transmit information at the greatest possible rate and with an arbitrarily small error rate. …

David Austin
Grand Valley State University

### Introduction

Wireless communication is currently transitioning to a new 5G standard that promises, among other advantages, faster speeds. One reason for the improvement, as we’ll explain here, is the use of polar codes, which were first introduced by Erdal Arikan in 2009 and which are optimal in a specific information-theoretic sense.

This column is a natural continuation of an earlier column that described Shannon’s theory of information. To quickly recap, we considered an information source $X$ to be a set of symbols $\cx$, such as letters in an alphabet, that are generated with a specific frequency $p(x)$. The amount of information generated by this source is measured by the Shannon entropy: $$H(X) = -\sum_x p(x)\lg p(x).$$

A simple example is an information source consisting of two symbols, such as 0, which is generated with probability $p$, and 1, generated with probability $1-p$. In this case, the Shannon entropy is $$H(p) = – p\lg(p) -(1-p)\lg(1-p).$$

We described $H(X)$ as a measure of the freedom of expression created by the source. If $p=1$, then the source can only generate a string of 0’s, so there is only one possible message created by this source; it therefore conveys no information. On the other hand, when $p=1/2$, each symbol is equally possible so any string of 0’s and 1’s is equally likely. There are many possible messages, each of which is equally likely, so a given message can contain a specific meaning, and we think of this as a high-information source.

Entropy is measured in units of bits per symbol. However, if we imagine that one symbol is generated per unit time, we may also think of entropy–measured in bits per unit time–as measuring the rate at which information is generated. We earlier saw how this interpretation helped us determine the maximum rate at which messages generated by the source could be transmitted over a noiseless channel, a fact that is particularly relevant for us here.

Polar codes were created to maximize the rate at which information can be transmitted through a noisy channel, one that may introduce errors in transmission. So before we introduce these codes, we will first describe how Shannon’s theory tells us the maximum rate at which information can be transmitted over a noisy channel,

### Transmission over a noisy channel

Let’s consider a channel $W$ through which we send a symbol $x$ and receive a symbol $y$. Ideally, we receive the same symbol that we send, or at least, the sent symbol can be uniquely recovered from the received symbol. In practice, however, this is not always the case.

For example, the following channel sends symbols from the set $\cx=\{0,1\}$ and receives symbols in $\cy=\{0,1\}$. However, there is a chance that the symbol is corrupted in transmission. In particular, the probability is $p$ that we receive the same symbol we send and $1-p$ that we receive the other symbol. Such a channel is called a binary symmetric channel, which we’ll denote as $BSC(p)$.

Our central problem is to understand the maximum rate at which information can be transmitted through such a channel and to find a way to do so that minimizes errors.

To this end, Shannon generalizes the concept of entropy. If $X$ is an information source whose symbols are sent through $W$ and $Y$ is the information source of received symbols, we have joint probabilities $p(x,y)$ that describe the frequency with which we send $x$ and receive $y$. There are also the conditional probabilities $p(y|x)$, the probability that we receive $y$ assuming we have sent $x$, and $p(x|y)$, the probability that we sent $x$ assuming we received $y$. These give conditional entropies \begin{aligned} H(Y|X) = & -\sum_{x,y} p(x,y)\lg p(y|x) \\ H(X|Y) = & -\sum_{x,y} p(x,y)\lg p(x|y). \end{aligned}

We are especially interested in $H(X|Y)$, which Shannon calls the channel’s equivocation. To understand this better, consider a fixed $y$ and form $$H(X|y) = -\sum_x p(x|y) \lg p(x|y),$$ which measures our uncertainty in the sent symbol $x$ if we have received $y$. The equivocation is the average of $H(X|y)$ over the received symbols $y$: $$H(X|Y) = \sum_y p(y) H(X|y).$$ Thinking of entropy as a measure of uncertainty, equivocation measures the uncertainty we have in reconstructing the sent symbols from the received. We can also think of it as the amount of information lost in transmission.

For example, suppose that our channel is $BSC(1)$, which means we are guaranteed that $y=x$, so the received symbol is the same as the transmitted symbol. Then $p(x|y) = 1$ if $y=x$ and $p(x|y) = 0$ otherwise, which leads us to conclude that $H(X|Y) = 0$. In this case, the equivocation is zero, and no information is lost.

On the other hand, working with $BSC(1/2)$, we are as likely to receive a 0 as we are a 1, no matter which symbol is sent. It is therefore impossible to conclude anything about the sent symbol so all the information is lost. A simple calculation shows that the equivocation $H(X|Y) = H(X)$.

The difference $H(X) – H(X|Y)$, which describes the amount of information generated minus the amount lost in transmission, measures the amount of information transmitted by the channel. Shannon therefore defines the capacity of a noisy channel $W$ to be the maximum $$I(W) = \max_X[H(X) – H(X|Y)]$$ over all information sources using the set of symbols $\cx$.

• For example, if $W = BSC(1)$, we have $H(X|Y) = 0$ so $$I(BSC(1)) = \max_X[H(X)] = 1,$$ which happens when the symbols 0 and 1 appear with equal frequency. The capacity of this channel is therefore 1 bit per unit time.
• However, if $W=BSC(1/2)$, we have $H(X|Y) = H(X)$ so $$I(BSC(1/2)) = \max_X[H(X) – H(X|Y)] = 0,$$ meaning this channel has zero capacity. No information can be transmitted through it.

The term capacity is motivated by what Shannon calls the Fundamental Theorem of Noisy Channels:

Suppose $X$ is an information source whose entropy is no more than the capacity of a channel $W$; that is, $H(X) \leq I(W)$. Then the symbols of $X$ can be encoded into $\cx$, the input symbols of $W$, so that the source can be transmitted over the channel with an arbitrarily small error rate. If $H(X) \gt I(W)$, there is no such encoding.

This result may initially appear surprising: how can we send information over a noisy channel, a channel that we know can introduce errors, with an arbitrarily small rate of errors? As an example, consider the channel $W$ with inputs and outputs $\cx=\cy=\{a,b,c,d\}$ and whose transmission probabilities are as shown:

Remembering that $I(W) = \max_X[H(X) – H(X|Y)]$, we find with a little work that the maximum occurs when the symbols of $X$ appear equally often so that $H(X) = 2$. It turns out that the equivocation $H(X|Y) = 1$, which implies that the capacity $I(W) = 1$ bit per unit time.

Suppose now that we have the information source whose symbols are $\{0,1\}$, where each symbol occurs with equal frequency. We know that $H(X) = 1$ so Shannon’s theorem tells us that we should be able to encode $\{0,1\}$ into $\{a,b,c,d\}$ with an arbitrarily small error rate. In fact, the encoding \begin{aligned} 0 & \to a \\ 1 & \to c \end{aligned} accomplishes this goal. If we receive either $a$ or $b$, we are guaranteed that $0$ was sent; likewise, if we receive either $c$ or $d$, we know that $1$ was sent. It is therefore possible to transmit the information generated by $X$ through $W$ without errors.

This example shows that we can use a noisy channel to send error-free messages by using a subset of the input symbols. Another technique for reducing the error rate is the strategic use of redundancy. As a simple example, suppose our channel is $BSC(0.75)$; sending every symbol three times allows us to reduce the error rate from 25% to about 14%.

Unfortunately, the proof of Shannon’s theorem tells us that an encoding with an arbitrarily small error rate exists, but it doesn’t provide a means of constructing it. This then is the significance of Arikan’s polar codes: they provide encodings for an important class of channels that enable us to transmit information at the greatest possible rate and with an arbitrarily small error rate. As we will see, the strategy for creating these codes is a creative use of redundancy.

### Symmetric, binary-input channels

Before describing Arikan’s polar codes, let’s take a moment to clarify the type of channels $W:\cx\to\cy$ we will be working with. The inputs are $\cx=\{0,1\}$ and we require the channel to be symmetric, which means there is a permutation $\pi:\cy\to\cy$ such that $\pi^{-1} = \pi$ and $p(\pi(y)|1) = p(y|0)$. Such a channel is called a symmetric, binary-input channel. The symmetry condition simply ensures that the result of transmitting 0 is statistically equivalent to transmitting 1.

A typical example is the binary symmetric channel $BSC(p):\{0,1\}\to\{0,1\}$ that we described earlier. The symmetry condition is met with the permutation $\pi$ that simply interchanges 0 and 1.

There are two quantities associated to a symmetric channel $W$ that will be of interest.

• The first, of course, is the capacity $I(W)$, which measures the rate at which information can be transmitted through the channel. It’s not hard to see that the capacity $$I(W) = \max_X[H(X) – H(X|Y)]$$ is found with the information source $X$ that generates 0 and 1 with equal probability. This means that $p(x) = 1/2$ for both choices of $x$.

Remembering that conditional probabilities are related by $$p(x|y)p(y) = p(x,y) = p(y|x)p(x),$$ we have $$\frac{p(x)}{p(x|y)} = \frac{p(y)}{p(y|x)}.$$ This implies that \begin{aligned} -\sum_{x,y} p(x,y) & \lg p(x) + \sum_{x,y}p(x,y)\lg p(x|y) = \\ & -\sum_{x,y} p(x,y) \lg p(y) + \sum_{x,y}p(x,y)\lg p(y|x) \\ \end{aligned} and so $$H(X) – H(X|Y) = H(Y) – H(Y|X).$$

Since $p(x) = \frac12$, we also have that $p(y) = \frac12 p(y|0) + \frac12 p(y|1)$ and $p(x,y) = \frac12 p(y|x)$. This finally leads to the expression $$I(W) = \sum_{x,y} \frac12 p(y|x) \lg \frac{p(y|x)}{\frac12 p(y|0) + \frac12p(y|1)}.$$ This is a particularly convenient form for computing the capacity since it is expressed only in terms of the conditional probabilities $p(y|x)$.

In particular, for the binary symmetric channel, we find $$I(BSC(p)) = 1-H(p) = 1-p\lg(p) – (1-p)\lg(1-p),$$ reinforcing our earlier observation that $H(p)$ is a measure of the information lost in the noisy channel.

Since the capacity of a symmetric, binary input channel is computed as $\max_X[H(X) – H(X|Y)]$ where $H(X) \leq 1$, it follows that $0\leq I(W) \leq 1$. When $I(W) = 1$, the equivocation $H(Y|X) = 0$ so no information is lost and this is a perfect channel. In the example above, this happens when $p=0$ or $1$.

At the other extreme, a channel for which $I(W) = 0$ is useless since no information is transmitted.

• A second quantity associated to a symmetric channel $W$ measures the reliability with which symbols are transmitted. Given a received symbol $y$, we can consider the product $$p(y|0)~p(y|1)$$ as a reflection of the confidence we have in deducing the sent symbol when $y$ is received. For instance, if this product is 0, one of the conditional probabilities is zero, which means we know the sent symbol when we receive $y$.

We therefore define the Bhattacharyya parameter $$Z(W) = \sum_y \sqrt{p(y|0)~p(y|1)}$$ as a measure of the channel’s reliability. It turns out that $0\leq Z(W)\leq 1$, and lower values of $Z(W)$ indicate greater reliability. For instance, when $Z(W) = 0$, then every product $p(y|0)~p(y|1) = 0$, which means the symbols are transmitted without error.

We expect $Z(W)$, the channel’s reliability, to be related to $I(W)$, the rate of transmission. Indeed, Arikan shows that $Z(W)\approx 0$ means that $I(W)\approx 1$ and that $Z(W)\approx 1$ means that $I(W)\approx 0$. So a high-quality channel will have $I\approx 1$ and $Z\approx 0$, and a poor channel will have $I\approx 0$ and $Z\approx 1$.

For the binary symmetric channel $BSC(p)$, we see the following relationship:

Besides the binary symmetric channel, another example of a symmetric, binary-input channel is the binary erasure channel $W:\{0,1\}\to\{0,*,1\}$ as shown below. We think of the received symbol $*$ as an erasure, meaning the symbol was received yet we do not know what it is, so $\ep$ represents the probability that a sent symbol is erased.

Once again, the permutation $\pi$ that interchanges 0 and 1 while fixing $*$ confirms this as a symmetric channel that we denote as $BEC(\ep)$. The diagram gives the conditional probabilities that we need to find the capacity and Bhattacharyya parameter \begin{aligned} I(BEC(\ep)) & = 1-\ep \\ Z(BEC(\ep)) & = \ep. \end{aligned} Notice that if $\ep=0$, there is no possibility of an erasure and the capacity $I(BEC(0)) = 1$ bit per unit time. On the other hand, if $\ep=1$, every symbol is erased so we have $I(BEC(1)) = 0$ telling us this channel can transmit no information.

### Polarization: A first step

Polar codes are constructed through a recursive process, the first step of which we’ll now describe. Beginning with a channel $W$, we’ll bundle together two copies of $W$ into a vector channel $W_2:\cx^2\to\cy^2$. Then we’ll pull the vector channel apart into two symmetric, binary-input channels $\chan{1}{2}$ and $\chan{2}{2}$ and observe how the capacity of these two channels is distributed.

In what follows, we will be considering a number of different channels. We will therefore denote the conditional probabilities of a particular channel using the same symbol as the channel. For instance, if $W=BEC(\ep)$, we will write, say, the conditional probability $W(*|1)=\ep$.

Our vector channel $W_2:\cx^2\to\cy^2$ begins with a two-dimensional input $(u_1,u_2)$ and forms $(x_1,x_2) = (u_1\oplus u_2, u_2)$, where $\oplus$ denotes integer addition modulo 2. Then $x_1$ and $x_2$ are transmitted through $W$, as shown below.

This gives the transmission probabilities $$W_2((y_1,y_2) | (u_1,u_2)) = W(y_1|u_1\oplus u_2)W(y_2|u_2).$$ It is fairly straightforward to see that $I(W_2) = 2I(W)$ so that we have not lost any capacity.

From here, we obtain two channels: \begin{aligned} \chan{1}{2}&: \cx\to\cy^2 \\ \chan{2}{2}&: \cx\to\cy^2\times \cx. \end{aligned} whose transition probabilities are defined by \begin{aligned} \chan{1}{2}((y_1,y_2)|u_1) & = \frac12 \left(W_2((y_1,y_2)|(u_1,0)) + W_2((y_1,y_2)|(u_1,1))\right) \\ \chan{2}{2}(((y_1,y_2),u_1)|u_2) & = \frac12 W_2((y_1,y_2)|(u_1,u_2)). \end{aligned} This may look a little daunting, but we’ll soon look at an example illustrating how this works.

The important point is that we can verify that the total capacity is preserved, $$I(\chan{1}{2})+I(\chan{2}{2}) = I(W_2) = 2I(W),$$ and redistributed advantageously $$I(\chan{1}{2})\leq I(W) \leq I(\chan{2}{2}).$$ That is, $\chan{1}{2}$ surrenders some of its capacity to $\chan{2}{2}$, pushing $\chan{1}{2}$ toward a useless channel and $\chan{2}{2}$ toward a perfect channel.

Let’s consider what happens when our channel is a binary erasure channel $BEC(\ep)$.

Working out the transition probabilities for $\chan{1}{2}$, we have $$\begin{array}{c||c|c} (y_1,y_2) \backslash x & 0 & 1 \\ \hline (0,0) & \frac12(1-\ep)^2 & 0 \\ (0,1) & 0 & \frac12(1-\ep)^2 \\ (0,*) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (1,0) & 0 & \frac12(1-\ep)^2 \\ (1,1) & \frac12(1-\ep)^2 & 0 \\ (1,*) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (*,0) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (*,1) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (*,*) & \ep^2 & \ep^2 \\ \end{array}$$ Rather than focusing on the individual probabilities, notice that five of the nine received symbols satisfy $\chan{1}{2}(y|0)~\chan{1}{2}(y|1) \neq 0$. More specifically, if either $y_1=*$ or $y_2=*$, it is equally likely that 0 or 1 was sent.

This observation is reflected in the fact that the capacity and reliability have both decreased. More specifically, we have \begin{aligned} I(\chan{1}{2}) = (1-\ep)^2 & \leq (1-\ep) = I(W) \\ Z(\chan{1}{2}) = 2\ep-\ep^2 & \geq \ep = Z(W), \\ \end{aligned} where we recall that larger values of the Bhattacharyya parameter indicate less reliability.

Let’s compare this to the second channel $\chan{2}{2}$. The received symbols are now $\cy^2\times \cx$. If we work out the conditional probabilities for half of them, the ones having the form $((y_1,y_2),0)$, we find $$\begin{array}{c||c|c} ((y_1,y_2),0) \backslash x & 0 & 1 \\ \hline ((0,0),0) & \frac12(1-\ep)^2 & 0 \\ ((0,1),0) & 0 & 0 \\ ((0,*),0) & \frac12(1-\ep)\ep & 0 \\ ((1,0),0) & 0 & \frac12(1-\ep)^2 \\ ((1,1),0) & 0 & 0 \\ ((1,*),0) & 0 & \frac12(1-\ep)\ep \\ ((*,0),0) & \frac12(1-\ep)\ep & 0 \\ ((*,1),0) & 0 & \frac12(1-\ep)\ep \\ ((*,*),0) & \frac12\ep^2 & \frac12\ep^2 \\ \end{array}$$ This channel behaves much differently. For all but one received symbol, we can uniquely determine the sent symbol $x$. In particular, we can recover the sent symbol even if one of the received symbols has been erased. This leads to an increase in the capacity and a decrease in the Bhattacharyya parameter: \begin{aligned} I(\chan{2}{2}) = 1-\ep^2 & \geq 1-\ep = I(W) \\ Z(\chan{2}{2}) = \ep^2 & \leq \ep = Z(W). \\ \end{aligned}

The capacities of these channels, as a function of $\epsilon$, are shown below.

It’s worth thinking about why the second channel $\chan{2}{2}$ performs better than the original. While the vector channel $W_2:\cx^2\to\cy^2$ transmits the pair $(u_1,u_2)$, $\chan{2}{2}$ assumes that $u_1$ is transmitted without error so that the received symbol is $((y_1,y_2), u_1)$. Since we know $u_1$ arrives safely, we are essentially transmitting $u_2$ through $W$ twice. It is this redundancy that causes the capacity and reliability to both improve. In practice, we will need to deal with our assumption that $u_1$ is transmitted correctly, but we’ll worry about that later.

If we had instead considered a binary symmetric channel $W = BSC(p)$, we find a similar picture:

### Polarization: The recursive step

We have now taken the first step in the polarization process: We started with two copies of $W$ and created two new polarized channels, one with improved capacity, the other with diminished capacity. Next, we will repeat this step recursively so as to create new channels, some of which approach perfect channels and some of which approach useless channels.

When $N$ is a power of 2, we form the vector channel $W_{N}:\cx^{N}\to\cy^{N}$ from two copies of $W_{N/2}$. The recipe to create $W_4$ from two copies of $W_2$ is shown below.

From the inputs $(u_1, u_2, \ldots, u_{N})$, we form $(s_1,\ldots, s_{N})$ as $s_{2i-1} = u_{2i-1}\oplus u_{2i}$ and $s_{2i} = u_{2i}$. We then use a reverse shuffle permutation to form $$(v_1,v_2,\ldots,v_{N}) = (s_1, s_3, \ldots, s_{N-1}, s_2, s_4,\ldots, s_{N}),$$ which serve as the inputs for two copies of $W_{N/2}$.

While that may sound a little complicated, the representation of $W_8$ shown below illustrates the underlying simplicity.

We now form new symmetric, binary-input channels $$\chan{i}{N}:\cx\to \cy^N\times \cx^{i-1}$$ as we did before: \begin{aligned} \chan{i}{N}(((y_1,\ldots,y_N),& (u_1,\ldots,u_{i-1}))~|~u_i) = \\ & \frac1{2^{N-1}}\sum_{(u_{i+1},\ldots,u_N)} W_N((y_1,\ldots,y_N)|(u_1,\ldots,u_N)). \end{aligned}

Arikan then proves a remarkable result:

If we choose $\delta\in(0,1)$, then as $N$ grows through powers of two, the fraction of channels satisfying $I(\chan{i}{N}) \in (1-\delta, 1]$ approaches $I(W)$. Likewise, the fraction of channels with $I(\chan{i}{N}) \in [0,\delta)$ approaches $1-I(W)$.

Suppose, for instance, that we begin with a channel whose capacity $I(W) = 0.75$. As we repeat the recursive step, we find that close to 75% of the channels are nearly perfect and close to 25% are nearly useless.

More specifically, with the binary erasure channel $BEC(0.25)$, whose capacity is 0.75, and $N=1024$, the channel capacities are as shown below. About three-quarters of the channels are close to perfect and about one-quarter are close to useless.

For comparison, here’s the result with $BEC(0.75)$, whose capacity is 0.25.

It should be no surprise that the channels $\chan{i}{N}$ with small $i$ are close to useless while the ones with large $i$ are close to perfect. Remember that $\chan{i}{N}$ assumes that the symbols $(u_1,\ldots,u_{i-1})$ are transmitted without error. When $i$ is large, we know that many of the symbols are reliably transmitted so the channel transmits $u_i$ with a lot of redundancy. This redundancy causes an increase in channel capacity $I(\chan{i}{N})$ and a corresponding decrease in our reliability measure $Z(\chan{i}{N})$.

### Encoding and decoding

Now that we have a large number of nearly perfect channels, we can describe how data is encoded and decoded to enable transmission through them. For some large value of $N$, we have approximately $NI(W)$ nearly perfect channels and $N-NI(W)$ nearly useless channels. Our goal is to funnel all the data through the nearly perfect channels and ignore the others.

To illustrate, consider the channel $W = BEC(0.5)$ with $I(W) = 0.5$, and suppose we have created $N=8$ channels. This is a relatively small value of $N$, but we expect $NI(W) = 4$ nearly perfect channels and $N-NI(W)=4$ nearly useless channels. The capacity of the 8 channels is as shown here.

We will choose channels 4, 6, 7, and 8 to transmit data. To do so, we will fix values for $u_1$, $u_2$, $u_3$, and $u_5$ and declare these to be “frozen” (these are polar codes, after all). Arikan shows that any choice for the frozen bits works equally well; this shouldn’t be too surprising since their corresponding channels transmit hardly any information. The other symbols $u_4$, $u_6$, $u_7$, and $u_8$ will contain data we wish to transmit.

Encoding the data means evaluating the $x_i$ in terms of the $u_j$. As seen in the figure, we perform $N/2=4$ additions $\lg(N) = 3$ times, which illustrates why the complexity of the encoding operation is $O(N\log N)$.

Now we come to the problem of decoding: if we know ${\mathbf y} = (y_1,\ldots, y_N)$, how do we recover ${\mathbf u} = (u_1,\ldots, u_N)$? We decode ${\mathbf y}$ into the vector $\hat{\mathbf u} = (\hat{u}_1,\ldots,\hat{u}_N)$ working from left to right.

First, we declare $\hat{u}_i = u_i$ if $u_i$ is frozen. Once we have found $(\hu_1, \ldots, \hu_{i-1})$, we define the ratio of conditional probabilities $$h_i = \frac{\chan{i}{N}(({\mathbf y}, (\hu_1,\ldots,\hu_{i-1}))|0)} {\chan{i}{N}(({\mathbf y}, (\hu_1,\ldots,\hu_{i-1}))|1)}.$$ If $h_i \geq 1$, then $u_i$ is more likely to be 0 than 1 so we define $$\hu_i = \begin{cases} 0 & \mbox{if}~ h_i \geq 1 \\ 1 & \mbox{else.} \end{cases}$$

Two observations are important. First, suppose we begin with ${\mathbf u}=(u_1,\ldots,u_N)$ and arrive at $\hat{\mathbf u} = (\hu_1,\ldots,\hu_N)$ after the decoding process. If $\hat{\mathbf u} \neq {\mathbf u}$, we say that a block error has occurred. Arikan shows that, as long as the fraction of non-frozen channels is less than $I(W)$, then the probability of a block error approaches 0 as $N$ grows. This makes sense intuitively: the channels $\chan{i}{N}$ we are using to transmit data are nearly perfect, which means that $Z(\chan{i}{N})\approx 0$. In other words, these channels are highly reliable so the frequency of errors will be relatively small.

Second, Arikan demonstrates a recursive technique for finding the conditional probabilities $\chan{i}{N}$ in terms of $\chan{j}{N/2}$, which means we can evaluate $h_i$ with complexity $O(\log N)$. Therefore, the complexity of the decoding operation is $O(N\log N)$, matching the complexity of the encoding operation.

The complexity of the encoding and decoding operations means that they can be practically implemented. So now we find ourselves with a practical means to transmit data at capacity $I(W)$ and with an arbitrarily small error rate. We have therefore constructed an encoding that is optimal according to the Fundamental Theorem of Noisy Channels.

### Summary

Arikan’s introduction of polar codes motivated a great deal of further work that brought the theory to a point where it could be usefully implemented within the new 5G standard. One obstacle to overcome was that of simply constructing the codes by determining which channels should be frozen and which should transmit data.

In our example above, we considered a specific channel, a binary erasure channel $W = BSC(\ep)$, for which it is relatively easy to explicitly compute the capacity and reliability of the polarized channels $\chan{i}{N}$. This is an exception, however, as there is not an efficient method for finding these measures for a general channel $W$. Fortunately, later work by Tal and Vardy provided techniques to estimate the reliability of the polarized channels in an efficient way.

Additional effort went into improving the decoding operation, which was, in practice, too slow and error prone to be effective. With this hurdle overcome, polar codes have now been adopted into the 5G framework, only 10 years after their original introduction.

## What’s the 411?

The intent of this column is to explain how [Claude] Shannon proposed that we measure information and how this measure helps us understand the rate at which it can be transmitted. …

David Austin
Grand Valley State University

### Introduction

Tech news these days is filled with stories about the upcoming 5G revolution in wireless networking. And though the revolution seems a little slow getting started, we're promised it will bring significantly faster transmission rates and shorter delays.

While there are several reasons to expect improvements, one significant factor is the use of polar codes to encode information for transmission. Introduced by Erdal Arikan in 2009, polar codes are optimal, in a precise sense describd by information theory, and well suited for this application.

Originally, I intended for this column to be about polar codes, but, after reading around for a while, I thought it could be useful to back up and describe how information is quantified. Much of this theory originates with Claude Shannon's seminal 1948 paper, which is brilliant but can be challenging to read. On the other hand, many recent expositions aim for more rigor and lose some of the intuition that newcomers may seek to develop a clear sense of what the subject is ultimately about.

So the intent of this column is to explain how Shannon proposed that we measure information and how this measure helps us understand the rate at which it can be transmitted.

### An information source

The first thing to know is that information, as quantified by Shannon, is a measure of the amount of choice we have in expression.

In Grand Rapids, Michigan, where I live, a local television station frequently promotes its aptly named Weatherball, a large ball, prominently displayed on a tower, whose color gives a simple weather forecast:

Weatherball blue, cooler in view.
Weatherball green, no change foreseen.

To Grand Rapids residents, the Weatherball is a much-loved icon inspiring an amount of affection that, in spite of my 20 years of residence, has never been adequately explained. To Shannon, however, the Weatherball is an information source capable of transmitting messages composed of three symbols: red, blue, and green. His measure of information isn't defined by the symbol transmitted at any one time, but rather by the fact that there are three choices for symbols. This would be a low information source.

In contrast, a forecast from the National Weather Service gives tomorrow's predicted high temperature. As an information source, there is greater choice in the message transmitted, which makes this a higher information source. For instance, if we see that the high temperature today will be 25 degrees, we also know that it won't be 37 or 62.

So our measure of information will relate to the number of possible symbols we can produce. There is, however, also a statistical component to this measure.

I haven't analyzed the Weatherball to know whether one color is displayed more frequently than others, but it seems possible that the colors (red, blue, and green) are displayed equally often. That is, the probability of seeing one of the colors is 1/3.

Suppose instead that we imagine another Weatherball in Pleasantville, a city in a more moderate climate where the temperature from one day to the next doesn't change a lot. In that case, green ("no change foreseen") would be more likely than the other colors, red or blue. Let's imagine that the probability of seeing each of the colors is:

 Green: 1/2 Red: 1/4 Blue: 1/4

While a typical sequence in Grand Rapids might look like this:

a typical sequence in Pleasantville may be:

Shannon tells us that the Weatherball in Grand Rapids is a higher information source than Pleasantville's in spite of the fact that they use the same three symbols. Here's why. Suppose we want to transmit the Weatherball's state every day over a wireless network. We first need to encode it into a suitable format, and a binary code seems like a good candidate. We can leverage the fact that we are more likely to see green in Pleasantville to create an efficient encoding. Consider, for instance, the encoding:

 Green: 0 Red: 10 Blue: 11

That is, we use a shorter code for the symbol we see more frequently. The average number of binary digits sent is then $$\frac12\cdot 1 + \frac 14\cdot 2 + \frac14\cdot 2 = \frac32.$$ That is, we transmit, on averate, 3/2 binary digits per symbol.

In Grand Rapids, the average number of digits sent is $$\frac13\cdot 1 + \frac 13\cdot 2 + \frac13\cdot 2 = \frac53,$$ which is larger than the 3/2 binary bits per symbol in Pleasantville.

The fact that we can squeeze the Pleasantville symbols into a smaller number of binary digits without losing information indicates that the Pleasantville Weatherball is a lower information source. (Of course, we've only looked at one encoding so there may be better choices for Grand Rapids.)

Remember that we can think about information as a measure of the freedom we have to choose a message. Since the Weatherball in Pleasantville causes us to choose green more frequently than red or blue, our choice is somewhat restricted, which lowers the amount of information.

The real power of this interpretation, as we will see later, is that we can use it to understand the maximal rate at which information can be transmitted.

Embedded in this example is a pipeline, as illustrated on the left.

We begin with an information source, the Weatherball in our example, that produces a message as a sequence of symbols. These symbols are encoded, in some way, into a new set of symbols that is suitable for transmission through a communication channel.

The channel could be any means of conveying encoded messages, such as a wireless network, a telegraph line, or semaphore flags. As illustrated here, the message that enters the channel is the same one that exits. It's possible that the message is somehow corrupted in the channel, and this can be built into the theory, though we won't do so here.

If we are given an information source and a channel, our central task will be to find an encoding that maximizes the rate at which information is transmitted through the channel.

Our first task will be to state explicitly what we mean by an information source and then quantify the information generated by it.

### Information sources

As in the examples above, an information source will produce a message composed of a sequence of symbols. The Weatherball might produce $$GRBRGRBRGRBRGRBGRBGGGRRBRGR\ldots$$ if we record the color every day for a period of time. However, the source also includes statistical data about the frequency with which the symbols appear.

Shannon describes the English language as an information source. In its simplest form, this source is capable of generating strings of letters and other punctuation symbols. Focusing just on the 26 letters, we know that letters are not equally likely; for instance, "E" appears much more frequently than "Z."

But there is additional statistical information besides the frequencies with which letters appear. For instance, the letter "Q" is almost always followed by "U" (though Scrabble players know a few other options). That is, the probability of encountering a symbol may depend on what comes before it.

For this reason, we consider an information source to be a Markov or stochastic process. At any time, the source is in a state $S$, and the probabilities for generating the next symbol depend on $S$.

We can represent this situation graphically, as in the following example. Suppose our source generates messages with three symbols $A$, $B$, and $C$ and that the state we are in depends only on the last symbol generated. We label the states with the last symbol and represent the probabilities for generating the next symbol, and hence moving to the next state, as shown below:

If we consider an information source in which the probabilities for the symbols are independent of what has come before, then our Markov process has a single state that we never leave.

The processes we consider will always be ergodic, which has the following intuitive interpretation: Two very long messages produced by the source should have similar statistical properties. For instance, consider two typical novels written in English. Since English is an ergodic information source, we expect that the frequencies with which letters, digrams, trigrams, and so forth appear should be nearly the same for the two novels.

### Entropy

Following Shannon, we would like to create a measure of information associated to a Markov process. For the time being, suppose that the $n$ symbols appear with probabilities $p_1$, $p_2$, …, and $p_n$. Our measure $H(p_1,p_2,\ldots,p_n)$ should satisfy the following criteria:

• $H$ is continuous in the probabilities $p_i$.

• Adding more symbols creates more choice and, hence, more information. More specifically, considering sources where each of $n$ symbols is equally likely, that is, $p_i = 1/n$, we expect $H$ to increase with $n$.

• If the process of choosing a symbol can be broken into two successive choices, then $H$ adds appropriately. For instance, consider the following description of the Pleasantville Weatherball: we first choose, with equal probability, whether the weather will change or not. If the weather changes, we choose, again with equal probability, whether it warms up or cools down. The following diagram illustrates:

In this case, we want $$H\left(\frac14,\frac14,\frac12\right) = H\left(\frac12,\frac12\right) + \frac12H\left(\frac12,\frac12\right).$$

Shannon shows that these three properties determine $H$ up to a multiplicative constant, which leads to the definition: $$H(p_1,p_2,\ldots,p_n) = -\sum_i p_i\log_2(p_i).$$

A similar quantity appears in statistical mechanics as entropy, a measure of a system's disorder, so we call $H$ the entropy of the information source.

Here are some examples.

• Suppose the information source generates two symbols, 0 and 1, with equal probability. Then, $$H\left(\frac12,\frac12\right) = -\frac12\log_2\left(\frac12\right) -\frac12\log_2\left(\frac12\right) = 1.$$

More generally, if the source generates 0 with probability $p$ and 1 with probability $1-p$, then $$H(p) = -p\log_2(p) – (1-p)\log_2(1-p),$$ whose graph is

Notice that $H(0) = H(1) = 0$. For example, if $p=0$, the source is forced to generate only 1's so that its only possible message is 11111111111…. Here, the source has no information since we know what the message will be before even receiving it.

We see that $H$ is a maximum when $p=1/2$, when both 0 and 1 appear with equal probability, since this means the source offers the greatest freedom of choice when generating messages.

As this source generates one binary digit per symbol, we say that the units on $H$ are bits per symbol. As the graph above shows, however, we shouldn't equate bits with binary digits since $H$ is not necessarily an integer.

• Another example Shannon describes is a teletype, an electromechanical device for transmitting messages. Pressing a key on the keyboard of a teletype machine punches some combination of five possible holes on a piece of paper tape. When read back, each of five electrical contacts open or close depending on pattern of holes punched in the tape. In this way, each of $2^5 = 32$ keys are encoded in a system of five binary digits. If each key is pressed with equal probability, the entropy of this source is $\log_2(32) = 5$ bits per symbol. In this example, a bit faithfully corresponds to a binary digit.

• The Grand Rapids Weatherball has three symbols, which we assumed appear with equal probability. This leads to $H = \log_2(3) \approx 1.58.$ By contrast, the Weatherball in Pleasantville has $H = 3/2 = 1.5$, confirming our earlier suspicion that the Weatherball in Pleasantville is a lower information source. In fact, we encoded the Pleasantville source into a stream of binary digits using $3/2$ bits per symbol, which implies, as we'll see later, that this is the best possible encoding.

• The entropy of the English language may be understood through a series of approximations. For convenience, let's focus only on the 26 letters. As a zero-th order approximation, Shannon begins with the source having 26 symbols, each appearing with equal probability. We then have $H=\log_2(26) \approx 4.70$.

As a first-order approximation, we imagine that the letters are generated with their naturally occurring frequencies so that, for instance, "E" appears with probability 0.12 and "Z" with 0.02. We would expect that the entropy decreases from the zero-th order approximation since we are now imposing some restrictions on the source. Indeed, we find that $H=4.18$.

As mentioned earlier, the probability that a given letter is generated depends on the letter before it, which leads to a second order approximation. For instance, when "Q" appears, there is a probability of 0.92 that "U" comes next but only 0.02 that "A" follows. Our source is now a Markov process with states $S$ defined by the last letter generated. For each state $S$, we find an entropy $H(S)$ using the probabilities of generating the next letter. The entropy of the Markov process is then the weighted average: $$H = \sum_i P_i~H(S_i)$$ where the sum is over all states $S_i$ and $P_i$ is the probability of being in state $S_i$ at any time. We find the entropy to be $H = 3.66$.

We may, of course, consider a third-order approximation where a state is determined by the previous two letters generated; higher-order approximations appear in the same way. With increasing order, messages generated start to look more and more like intelligible English.

To summarize, we have

 Order Entropy 0 4.70 1 4.18 2 3.66

Each increase in order adds more restrictions on how letters are generated, which leads to less freedom in choosing messages, which causes the entropy to decrease.

Given an information source, Shannon defines the relative entropy to be the ratio of the entropy to the maximum entropy using the same set of symbols. For instance, the Pleasantville Weatherball has an entropy of 3/2 bits per symbol. If we give the same three symbols equal probability, as in the Grand Rapids Weatherball, we have a maximal entropy of $\log_2(3)$. The relative entropy of the Pleasantville Weatherball is then $(3/2)/\log_2(3) = 0.946$. The redundancy of an information source is one minus the relative entropy so, in this case, we have a redundancy of about 5.4%, which presents an opportunity to compress the information, which we did above by encoding it into a sequence of binary digits.

Using the approximations described above, Shannon estimated the entropy of English to be about 2.3 bits per letter, giving a redundancy of about 50%. This means that it's possible to delete half the letters in an English message and reconstruct the meaning of the whole, a fact that's not surprising to anyone who texts or tweets frequently. This is also good news for those of you who only read half of Moby Dick, provided you read an appropriate half.

### Interpretations of entropy

So far, we have described entropy as a measure of the amount of freedom we have in creating a message. Let's examine a couple of implications.

First, suppose that an information source with entropy $H$ has $n$ symbols appearing with probabilities $p_1$, $p_2$, …, $p_n$ and that this source generates a message as a long sequence having $N$ symbols. Each symbol appears about $p_iN$ times in the message so that the probability of obtaining that message is approximately $$p \approx p_1^{p_1N} p_2^{p_2N} \ldots p_n^{p_nN}.$$ Notice that $$\log_2(p) \approx p_1N\log_2(p_1) + p_2N\log_2(p_2) + \ldots p_n\log_2(p_n) = -NH.$$ so that $$p \approx 2^{-NH}.$$

This leads to two important conclusions. First, the probability of obtaining a long message is approximately constant. This is perhaps not too surprising if we remember our assumption that the information source is ergodic. We also conclude that there are about $2^{NH}$ such sequences. Notice that larger values of $H$ imply a larger number of messages, reinforcing our interpretation of $H$ as a measure of our freedom to choose a message.

This observation can be made more precise by stating that the messages of $N$ symbols, as $N$ grows large, can be divided into two sets: one is a set of about $2^{NH}$ messages each of which occurs with roughly equal probability (we will refer to these as high probability messages) and the other is a set whose total probability becomes vanishingly small as $N$ goes to infinity.

Relative entropy provides additional meaning here. Suppose we have an information source $I$ whose relative entropy is $r$ and redundancy is $\rho = 1-r$. If $\overline{H}$ is the maximal entropy of an information source having the same set of $n$ symbols as $I$, then $\overline{H} = \log_2(n)$. This maximal source produces $n^N=2^{N\overline{H}}$ messages of length $N$, each of which is equally probable. Contained within this set is a smaller set of $2^{NH}$ high probability messages for the information source $I$. Therefore, the fraction of high probability messages for $I$ is $$2^{NH}/n^N = 2^{N (H-\overline{H})} = 2^{-N\overline{H}(1-r)} = 1/(n^N)^{1-r} = 1/(n^N)^\rho$$ where $\rho$ is the redundancy of $I$.

Notice that, if $\rho=0$, there is no redundancy so every possible message is a high probability message. In contrast, we said that the redundancy of English is about $\rho=1/2$. If we take the 26 letters of the alphabet and randomly form sequences of 100 letters, then about $$1/(26^{100})^{1/2} = 1/26^{50} \approx 1.8\times 10^{-71}$$ of those will be meaningful English sequences. In this way, we could quantify how long it would take a randomly typing monkey to write a meaningful sentence of 100 characters.

### Rates of transmission

We've been looking at entropy as a measure of information and have seen how it intuitively feels like a useful measure. With a little more work, however, we will see that entropy is exactly the right measure for determining the rate at which information can be transmitted.

To begin, recall our pipeline. We have an information source generating messages as sequences of symbols. We would like to transmit messages through a channel, which could be something like a wireless network or telegraph line. We can also encode the messages generated by the source in an effort to (a) more efficiently transmit them and (b) improve the reliability of the transmission in the event there is noise in the channel that might corrupt the message.

While Shannon presents a general framework for describing the rate at which the channel can transmit symbols, we will consider a simplified version that retains the essential ingredients. To that end, suppose our channel transmits $C$ binary digits in one time unit. In $T$ time units, the channel is capable of transmitting $CT$ digits and hence $2^{CT}$ possible messages.

Suppose that our source generates symbols at the rate of $R$ symbols per unit time. In $T$ time units, we have $RT$ symbols and therefore $2^{RTH}$ high probability messages. To reliably transmit these messages in $T$ units of time, we need to be capable of sending more than the number of high probability messages: $$2^{CT} \geq 2^{RTH}$$ and hence $CT \geq RTH$. This shows that the transmission rate is bounded by $$R \leq \frac CH$$ symbols per unit time. In this way, the entropy provides an upper bound for the rate at which symbols can be transmitted. Higher entropy leads to more information, which leads to a lower rate.

So the transmission rate is bounded by $C/H$, but we can say more. In fact, it's always possible to find an encoding that allows a transmission rate as close to $C/H$ as we wish, and Shannon shows us how to construct this encoding using what we now call a Shannon code.

First, note that we are going to encode messages of $N$ symbols rather than individual symbols. Let's begin by collecting all possible messages of $N$ symbols and recording the probability with which they are generated. Next, we'll order the messages so that their probabilities are non-increasing. That is, if $q_i$ is the probability of the $i^{th}$ message, then $$q_1 \geq q_2 \geq q_3 \geq \ldots.$$ By $C_i$, we will mean the cumulative probability of this sequence so that $$C_i = \sum_{j\lt i} q_j.$$ Here's the punchline: the $i^{th}$ message will be encoded as the binary expansion of $C_i$ using $d_i=\lceil\log_2(1/q_i)\rceil$ binary digits.

• That probably seems like a lot to take in so let's return to Pleasantville and look at an example. Remember we have three symbols $G$, $R$, and $B$ and probabilities $p_G = 1/2$, $p_R = 1/4$, and $p_B = 1/4$. Also, the entropy is $H=3/2$.

We will encode messages of length $N=1$, which are just single symbols. Ordering these three messages, we have the following table, which includes the binary expansion of $C_i$. $$\begin{array}{c|c|c|c|c} {\bf Message} & q_i & C_i & d_i = \lceil\log_2(1/q_i)\rceil & {\bf Encoding} \\ \hline G & 1/2 & 0=0.0 & 1 & 0 \\ R & 1/4 & 1/2 = 0.10 & 2 & 10 \\ B & 1/4 & 3/4 = 0.11 & 2 & 11 \\ \end{array}$$

Therefore, we have found the same encoding we used earlier:

 Green: 0 Red: 10 Blue: 11

Notice that the average length of a message is $$L = \frac 12\cdot 1 + \frac 14 \cdot 2 + \frac 14 \cdot 2 = 3/2$$ binary digits per symbol and that this average length equals the entropy $L = H$. If our channel transmits $C$ bits per unit time and we have an average length of $L$ bits per symbol, then we can transmit at the rate of $$\frac {C}{L} = \frac CH.$$ In other words, this encoding leads to the best possible transmission rate.

• Let's consider another information source with symbols $A$, $B$, $C$ generated with probabilities $p_A = 2/3$, $p_B = 1/6$, and $p_C=1/6$. The entropy of this source is $$H = -\frac23\log_2(2/3) – \frac16\log_2(1/6) – \frac16\log_2(1/6) \approx 1.25$$ bits per symbol.

As before, let's consider encoding messages consisting of $N=1$ symbols. We then find $$\begin{array}{c|c|c|c|c} {\bf Message} & q_i & C_i & d_i = \lceil\log_2(1/q_i)\rceil & {\bf Encoding} \\ \hline A & 2/3 & 0 = 0.0 & 1 & {\rm 0} \\ B & 1/6 & 2/3 = 0.101 & 3 & {\rm 101} \\ C & 1/6 & 5/6 = 0.110 & 3 & {\rm 110} \\ \end{array}$$

The average length of a message is $$L_1 = \frac23\cdot 1 + \frac 16\cdot 3 + \frac 16\cdot 3 = \frac53 \approx 1.66.$$ Notice that $L_1 = 1.66 > 1.25 = H$, which means that we can use this encoding to transmit at the rate $C/L_1$, which is only 75% of the optimal rate of $C/H$.

Let's see if there is an improvement if we encode messages composed of $N=3$ symbols. There are $3^3 = 27$ such messages so rather than writing the encoding explicitly, we will focus on finding $L_3$, the average number of bits per symbol. The table below groups messages according to their probability and records the number in each group under the column labeled "#". $$\begin{array}{c|c|c|c|c|c} {\bf Message} & q_i & {\rm \#} & d_i = \lceil\log_2(1/q_i)\rceil \\ \hline AAA & 8/27 & 1 & 2 \\ AAB,… & 4/54 & 6 & 4 \\ ABB,… & 2/108 & 12 & 6 \\ BBB,… & 1/216 & 8 & 8 \\ \end{array}$$ This means that the average length per symbol is $$L_3 = \frac13\left(\frac{8}{27} \cdot 2 + \frac{4}{54}\cdot 6 \cdot 7 + \frac{2}{108}\cdot12\cdot6 + \frac{1}{216}\cdot8\cdot8\right) = \frac43 \approx 1.33.$$ Using this encoding, we are able to transmit at the rate of $C/L_3$, which is about 94% of the optimal rate of $C/H$. This improvement is due to the fact that more frequently occurring messages are encoded with a shorter string of bits.

As we see, the length 3 encoding results in greater compression, which produces an improvement in the transmission rate. We might expect that encoding messages of even greater length will offer continued improvement. In fact, Shannon proves that $L_N$ is a non-increasing function of $N$ that approaches $H$. In this way, it is possible to create an encoding that gives a transmission rate as close as possible to the optimal rate of $C/H$.

If you work out a few examples, you'll be able to see that each message is uniquely encoded and that an encoded message can be uniquely decoded, requirements that are essential for properly decoding messages.

We have now seen what Shannon calls the Fundamental Theorem for a Noiseless Channel:

Given an information source with entropy $H$ and a channel transmitting at $C$ bits per unit time, it is possible to find an encoding that leads to a transmission rate as close to $C/H$ symbols per unit time as desired. It is not possible to transmit at a rate greater than $C/H$.

Of course, there are some drawbacks. First, we need to enumerate and sort all the messages with $N$ symbols, and, when $N$ is large, this is likely to be difficult. Second, encoding long messages may result in delays in transmission. Imagine, for instance, having a phone conversation in which a sentence cannot be transmitted until it has been completely spoken.

Shannon's code is useful for demonstrating that the optimal rate of transmission is approachable, but it is only one possible encoding. Perhaps another encoding scheme will approach the optimal rate more quickly or even equal the optimal rate. That's where Arikan's polar codes enter our story, but that message is being encoded and will be transmitted over this channel in June!

### Summary

In this column, we've been looking at transmission over a clear channel; that is, what comes out of the channel is exactly what went in to it. Anyone who's had a conversation by cell phone knows it doesn't work that way, as noise frequently garbles the message. (It's ironic that, among all its capabilities, a cell phone performs worse at being an actual telephone.)

Shannon's paper also considers noisy channels and develops a measure of how much information can be transmitted over such a channel. A key idea is conditional entropy. If the symbols going into the channel form a set $X$ and the symbols coming out form $Y$, we would hope that knowing we received $y$ should enable us to determine a unique $x$ that went in to the channel to produce $y$. Noise, however, means that we can't be completely confident about the input $x$ so instead we consider the probabilities of different inputs $x$ leading to $y$. That is, we consider conditional probabilities $p(x|y)$, which leads to a conditional entropy, a measure of the amount of uncertainty in reconstructing $x$ if we know $y$. A noiseless channel will have zero conditional entropy since there is no uncertainty.

Using conditional entropy, Shannon is able to determine the optimal rate at which information can be transmitted through a noisy channel. This result may initially seem surprising: if the channel is noisy, how can we trust anything that we receive? Redundancy in the source comes to the rescue, however. For instance, readers will be able to make sense of this sentence that emerged from a noisy channel:

Npisy chamnels still ttansmkt ingormqtion.

In this column, we've been looking at discrete information sources so it's worth noting that Shannon also developed a theory that applies to continuous sources.

Entropy, in fact, appears with some ubiquity. We have already mentioned its appearance in statistical mechanics. Ecologists use a similar quantity to measure biodiversity. Once you start thinking about information sources, you may begin to see them all around you. In particular, you may wonder whether mathematics itself is an information source that is encoded in a highly compressed format using mathematical notation. What is its entropy and its redundancy?

Finally, readers are surely eager to know more about the Grand Rapids Weatherball so I will share that the current ball, pictured above, is the second incarnation. The first Weatherball, located on the roof of a downtown building, weighed 64 tons. We've already said this is a low information source, but now we can see that the entropy per pound is astonishingly small.

### References

• Claude Shannon and Warren Weaver. The Mathematical Theory of Communication. University of Illinois Press. 1964.

This volume contains Shannon's original paper and also includes, as a preface, a non-technical overview of the subject and summary of Shannon's results by Weaver.

• Claude Shannon. Prediction and entropy of printed English. Bell System Technical Journal. 30.1 (1951): 50-64.

This paper describes Shannon's efforts to estimate the entropy of the English language.

• English Letter Frequencies, Practical Cryptography.

This site provides the frequencies of letters and digrams that I used to compute the entropy of the approximations to the English language.

# David Austin’s Column Archive

Here are David Austin’s older Feature Columns. David’s newest columns may be found here.