Principal Component Analysis:

Three Examples and some Theory

Very often, especially in applications to the life sciences, useful low-dimensional projections exist and allow humans to grasp a data set that would otherwise be inscrutable.

Tony Phillips
Stony Brook University

Introduction

Principal component analysis (PCA), an algorithm for helping us understand large-dimensional data sets, has become very useful in science (for example, a search in Nature for the year 2020 picks it up in 124 different articles). Human minds are good at recognizing patterns in two dimensions and to some extent in three, but are essentially useless in higher dimensions. The goal of PCA is to produce the most useful possible 2 or 3-dimensional projection of a high-dimensional data set—most useful in that the smallest amount of information is lost by the projection. One can construct artificial examples where all the dimensions must be considered, but very often, especially in applications to the life sciences, useful low-dimensional projections exist and allow humans to grasp a data set that would otherwise be inscrutable. In this column we will look at PCA in three recently published applications (to anthropology and paleontology), and go through the mathematics that makes it work.

Example 1. Distinguishing nearby soil samples in a 78,000-year-old burial

burial PCA
In this PCA, 13-dimensional data from some 80 soil samples are projected into the plane spanned by their two principal components. The projection shows a clear distinction (highlighted by the superimposed 95% confidence ellipses) between samples from the burial pit (red dots) and samples (purple dots) from outside the pit at the same level (Layer 19) of the excavation. Image adapted from Nature, 593, 95-100, Extended Data Figure 4

“Earliest known human burial in Africa” was published in Nature on May 5, 2021. The authors are an international team of 36 led by María Martinón-Torres, Nicole Boivin and Michael Petraglia. In documenting the 78,000-year-old grave of a child in Panga ya Saidi, Kenya, they needed to establish that there had actually been a burial; they did this in part by showing that the earth immediately surrounding the skeleton was different from the earth at that level (layer 19) in the archaeological excavation, i.e. that a pit had been dug to receive the child’s corpse. “The burial fill is a mix of ferruginous silt and sand, compositionally similar to the top of layer 18 and the base of layer 17, and different from layer 19, in which the pit was excavated.” They confirmed the difference by measuring the concentrations of 13 chemical elements occurring in the burial fill and in those three layers. To visualize their ensemble of measurements (80 data points, each with 13 components), “[w]e performed PCA of EDXRF concentrations for the thirteen major, minor, and trace elements more frequently detected (Si, K, Ca, Ti, Mn, Fe, Zn, Ga, As, Rb, Y, Zr and Ba).” As they report, “The burial pit samples markedly differ from layer 19.”

The basic mathematics

In the “Earliest … burial” example, PCA gives a useful 2-dimensional representation of a cloud of points in a 13-dimensional space. (Astronomers describe the general procedure as “finding the principal axes of inertia of the cloud of samples about their mean in feature space.”) The a priori unlikely combination of basic elements of statistics, multivariable calculus and linear algebra that makes it work will be outlined in this section.

To be specific, suppose the data set consists of $n$ measurements taken on $M$ samples; geometrically it would be a set of $M$ points distributed in $n$-dimensional space, where each sample is a point and its $n$ measurements are its $n$ coordinates.

  • To fix notation, let measurements $x_1, \dots, x_n$ be taken on $M$ samples $s_1, \dots, s_M$.

  • For each $x_i$ we calculate the average, or mean

    $$\overline{x_i} = \frac{1}{M}\sum_{j=1}^M x_i(s_j).$$

  • Then for each $x_i$ we calculate the variance: this is the average of the square of the difference between the $x_i$-measurements and the mean: $${\rm Var}(x_i) = \frac{1}{M}\sum_{k=1}^M (x_i(s_k)-\overline{x_i})^2.$$

  • The variance ${\rm Var}(x_i)$ estimates the spread of $x_i$. Squaring the differences puts deviations to the right or the left of the mean on the same footing; it also gives more weight to the larger differences (the “outliers”). (The standard deviation, a more common measure of spread, is defined as the square root of the variance.)

  • For each pair $(x_i, x_j)$ of measurements we calculate the covariance $${\rm Cov}(x_i, x_j) = \frac{1}{M}\sum_{k=1}^M (x_i(s_k)-\overline{x_i})(x_j(s_k)-\overline{x_j}).$$ Note that ${\rm Cov}(x_i, x_i)={\rm Var}(x_i)$.

The various covariances are naturally displayed in  the covariance matrix

$${\bf Cov} ={\bf Cov}(x_1,\dots, x_n)=\left (\begin{array}{cccc}{\rm Cov}(x_1,x_1)& {\rm Cov}(x_1,x_2)&\dots&{\rm Cov}(x_1,x_n)\\ {\rm Cov}(x_2,x_1)&{\rm Cov}(x_2,x_2) &\dots &{\rm Cov}(x_2,x_n)\\ \vdots&\vdots&\vdots&\vdots\\ {\rm Cov}(x_n,x_1)&{\rm Cov}(x_n,x_2)&\dots&{\rm Cov}(x_n,x_n) \end{array} \right ).$$

Note that since ${\rm Cov}(x_i, x_j) = {\rm Cov}(x_j, x_i)$ the matrix ${\bf Cov}$ is symmetric. This fact will be very important when we study its algebraic properties.

So much for the statistics. The next calculations will be simpler to write if we assume our initial measurements have been centered (by subtracting from each one its original mean), so that each one has mean equal to zero. So now ${\rm Cov}(x_i, x_j) = \frac{1}{M}\sum_{k=1}^M x_i(s_k)x_j(s_k), $ for all $i,j=1,\dots,n$.
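For readers who want to experiment, here is a minimal numpy sketch of these statistics. It is an illustration, not code from any of the studies discussed: it assumes the measurements are stored as an $M\times n$ array, uses random numbers as stand-ins for real data, and follows the same $1/M$ convention as the formulas above.

import numpy as np

# Hypothetical data: M = 80 samples (rows), n = 13 measurements (columns),
# roughly the shape of the burial-soil data set; the values are random stand-ins.
X = np.random.rand(80, 13)

Xc = X - X.mean(axis=0)          # center each measurement so its mean is zero
M = Xc.shape[0]
Cov = (Xc.T @ Xc) / M            # Cov[i, j] = (1/M) * sum_k x_i(s_k) * x_j(s_k)

# Cov is symmetric, reflecting Cov(x_i, x_j) = Cov(x_j, x_i)
assert np.allclose(Cov, Cov.T)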

PCA starts with finding the direction in $n$-space (measurement-space) along which the data are maximally spread out, so that the clustering and spacing of the data points along that direction carry the most information. A direction is given by a unit vector ${\bf a}$ with components $a_1, \dots, a_n$, and the measurement in this direction is the linear combination $y=a_1 x_1 + \dots + a_n x_n$. Since the $x_i$ all have mean zero, the same must hold for $y$, so it is automatically centered. “Maximally spread out” in our context means having the greatest variance ${\rm Var}(y) = \frac{1}{M}\sum_{j=1}^M y(s_j)^2 $.

So we want to find the components $a_1, \dots, a_n$ leading to the maximum of ${\rm Var}(y) = {\rm Var}(a_1 x_1 + \dots + a_n x_n)$, subject to the constraint $ a_1 ^2 +\cdots +a_n^2=1$. Geometrically, we are maximizing ${\rm Var}(y)$ on the surface of the unit sphere in $n$-space. The way we learn to do this in multivariable calculus is to look for the points on the sphere $a_1 ^2 +\cdots +a_n^2=1$ where the gradient of ${\rm Var}(y)$ is perpendicular to the surface of the sphere or, equivalently, parallel to the gradient of the function $a_1 ^2 +\cdots +a_n^2$, which is $(2a_1, \dots, 2a_n)$. One of these points will be the maximum. (The  gradient of a function $f$ of $n$ variables $x_1, \dots, x_n$ is the vector $(\partial f/\partial x_1, \dots, \partial f/\partial x_n)$. It points in the direction of steepest increase.)

Numerically, the statement that the gradients of ${\rm Var}(y)$ and $a_1 ^2 +\cdots +a_n^2$ are parallel means that there is a non-zero scalar $\lambda$ such that one is $\lambda$ times the other (this is the basis of the method of Lagrange multipliers):

$$ \frac{\partial {\rm Var}(y)}{\partial a_1}= 2\lambda a_1$$

$$~~~~~~~~~~~~~~~~\vdots ~~~~~~~~~~~~~~~~(*)$$

$$ \frac{\partial {\rm Var}(y)}{\partial a_n}   =  2\lambda a_n. $$

Let us calculate $\partial {\rm Var}(y)/\partial a_j$.

$$\frac{\partial}{\partial a_j}{\rm Var}(y) = \frac{\partial}{\partial a_j}{\rm Var}(a_1 x_1 + \dots + a_n x_n).$$

Remembering that the $x_i$ are centered,

$${\rm Var}(a_1 x_1 + \dots + a_n x_n)= \frac{1}{M}\sum_{k=1}^M (a_1 x_1(s_k) + \dots + a_n x_n(s_k))^2$$

so by the chain rule

$$\frac{\partial}{\partial a_j}{\rm Var}(y)= \frac{1}{M} \sum_{k=1}^M [2(a_1 x_1(s_k) + \dots + a_n x_n(s_k))\cdot x_j(s_k)]$$

$$= 2a_1{\rm Cov}(x_1,x_j) + \cdots + 2a_n {\rm Cov}(x_n,x_j).$$

Equations $(*)$ then imply that for each $j$ between 1 and $n$,

$$2a_1 {\rm Cov}(x_1,x_j) + \cdots + 2a_n {\rm Cov}(x_n,x_j) = 2\lambda a_j.$$

Here is where linear algebra makes a miraculous intervention. Cancelling the 2s, and remembering that  ${\rm Cov}(x_i,x_j) = {\rm Cov}(x_j,x_i)$, these equations together can be rewritten as

$$ \left (
\begin{array}{cccc}
{\rm Cov}(x_1,x_1)& {\rm Cov}(x_1,x_2)&\dots&{\rm Cov}(x_1,x_n)\\
{\rm Cov}(x_2,x_1)&{\rm Cov}(x_2,x_2) &\dots &{\rm Cov}(x_2,x_n)\\
\vdots&\vdots&\vdots&\vdots\\
{\rm Cov}(x_n,x_1)&{\rm Cov}(x_n,x_2)&\dots&{\rm Cov}(x_n,x_n)
\end{array}
\right ) \left(
\begin{array}{c}a_1\\a_2\\\vdots\\a_n
\end{array} \right )= \lambda \left(
\begin{array}{c}a_1\\a_2\\\vdots\\a_n
\end{array} \right ).~(**)$$

We say that the column vector ${\bf a}$  is an eigenvector for the covariance matrix ${\bf Cov}$, with eigenvalue $\lambda$.

Finding the eigenvalues and eigenvectors for the $n\times n$ matrix ${\bf Cov}$ is a standard set of calculations in linear algebra, which I will summarize here.

  • The eigenvalues of ${\bf Cov}$ are defined as the roots of the degree-$n$ polynomial $\det({\bf Cov}-\lambda {\rm I})=0$, where ${\rm I}$ is the $n\times n$ identity matrix; but for large $n$ this definition is not practical: eigenvectors and eigenvalues are calculated by the Lagrange multiplier method as above. In applications to the life sciences, where there are not likely to be hidden symmetries in the problem, the eigenvalues will be all distinct; furthermore, since ${\bf Cov}$ is a symmetric matrix these are all real numbers, and can therefore be ordered $\lambda_1 > \lambda_2 > \cdots > \lambda_n$.

  • For each index $j$, the vector equation ${\bf Cov} \cdot {\bf a} = \lambda_j {\bf a}$ has a non-zero solution, unique up to a scalar multiple, so unique up to sign if we fix its length to be 1. This solution, call it ${\bf a}_j$, is the eigenvector corresponding to $\lambda_j$.

  • (For future reference, here is a useful fact: because the matrix is symmetric, eigenvectors corresponding to different eigenvalues are perpendicular.)

In particular the column vector ${\bf a}_1$, with components $ a_{11}, \dots, a_{1n}$, corresponding to the maximum eigenvalue $\lambda_1$, gives us the answer to our original question: the direction in measurement-space along which the data are maximally spread out. Notice that the indeterminacy in the sign of ${\bf a}_1$ does not affect its direction. In PCA, the quantity $y_1 = a_{11}x_1 + \dots + a_{1n}x_n$ is called PC1, the first principal component.
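Continuing the numpy sketch above (again, purely an illustration), the eigenvector ${\bf a}_1$ and the first principal component can be computed as follows; numpy's eigh routine is designed for symmetric matrices and returns real eigenvalues.

eigvals, eigvecs = np.linalg.eigh(Cov)   # symmetric matrix: real eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]        # sort so that lambda_1 > lambda_2 > ... > lambda_n
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]                       # unit eigenvector for the largest eigenvalue
pc1 = Xc @ a1                            # PC1 evaluated on each of the M samples

# The variance of PC1 equals lambda_1, as the calculation below verifies in general.
assert np.isclose(pc1.var(), eigvals[0])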

What is this maximum variance? The answer involves some calculation.

Remember that for $\displaystyle{y_1 = a_{11}x_1 +\dots + a_{1n}x_n}$,
$${\rm Var}(y_1) ={\rm Var}(a_{11} x_1 + \dots + a_{1n} x_n)= \frac{1}{M}\sum_{k=1}^M (a_{11} x_1(s_k) + \dots + a_{1n} x_n(s_k))^2$$
so
$${\rm Var}(y_1) =\frac{1}{M}\sum_{k=1}^M \sum_{i=1}^n\sum_{j=1}^n a_{1i}a_{1j}x_i(s_k)x_j(s_k)$$

$$= \sum_{i=1}^n\sum_{j=1}^n a_{1i}a_{1j}\frac{1}{M}\sum_{k=1}^M x_i(s_k)x_j(s_k) = \sum_{i=1}^n\sum_{j=1}^n a_{1i}a_{1j} {\rm Cov}(x_i,x_j)$$

$$ = (a_{11} ~ a_{12}~ \cdots ~a_{1n}) \left (
\begin{array}{cccc}
{\rm Cov}(x_1,x_1)& {\rm Cov}(x_1,x_2)&\dots&{\rm Cov}(x_1,x_n)\\
{\rm Cov}(x_2,x_1)&{\rm Cov}(x_2,x_2) &\dots &{\rm Cov}(x_2,x_n)\\
\vdots&\vdots&\vdots&\vdots\\
{\rm Cov}(x_n,x_1)&{\rm Cov}(x_n,x_2)&\dots&{\rm Cov}(x_n,x_n)
\end{array}
\right ) \left(
\begin{array}{c}a_{11}\\a_{12}\\\vdots\\a_{1n}
\end{array} \right ) $$
$$=(a_{11} ~ a_{12}~ \cdots ~a_{1n})~ \lambda_1 \left(
\begin{array}{c}a_{11}\\a_{12}\\\vdots\\a_{1n}
\end{array} \right ) = \lambda_1 (a_{11}^2 + a_{12}^2 + \cdots a_{1n}^2) = \lambda_1.$$

Amazingly enough, the variance of the linear combination $y_1 = a_{11}x_1 + \dots + a_{1n}x_n$ using the coefficients of ${\bf a}_1$ is exactly the corresponding eigenvalue $\lambda_1$.

Once we have found ${\bf a}_1$, we go on to look for the next most useful direction in which to sort the data. We look in the hyperplane (in measurement-space) perpendicular to ${\bf a}_1$ and repeat the analysis. As before we want to maximize ${\rm Var}(y) = {\rm Var}(b_1 x_1 + \dots + b_n x_n)$ subject to the constraint $b_1 ^2 +\cdots +b_n^2=1$; now we add the additional constraint $b_1a_{11} + \cdots + b_na_{1n} = 0$. The solution will be a point satisfying $b_1 ^2 +\cdots +b_n^2=1$ where the gradient of ${\rm Var}(y)$ is parallel to the gradient of $b_1 ^2 +\cdots +b_n^2$ and lies in the plane $b_1a_{11} + \cdots + b_na_{1n} = 0$. As before, the solution is an eigenvector of the covariance matrix; the condition $b_1a_{11} + \cdots + b_na_{1n} = 0$ eliminates ${\bf a}_1$, leaving ${\bf a}_2$ as the direction maximizing variance; $y_2 = a_{21}x_1 + \cdots + a_{2n}x_n$ will be PC2, the second principal component, with variance calculated as above to be $\lambda_2$.

Some more linear algebra: equation $(**)$ specialized to the vector ${\bf a}_i$ can be written in matrix notation ${\bf Cov}\cdot{\bf a}_i = \lambda_i{\bf a}_i$. Amalgamating the $n$ column vectors ${\bf a}_1, \dots {\bf a}_n$ into a matrix ${\bf A}$ will give $ {\bf Cov}\cdot{\bf A} = {\bf A}\cdot{\bf \Lambda}$, where ${\bf \Lambda}$ is the diagonal matrix with entries $\lambda_1, \dots, \lambda_n$. Now ${\bf A}$, being made up of orthogonal columns of length 1, is an orthogonal matrix; its inverse is easy to calculate: it is the transpose ${\bf A}^{\top}$. Multiplying both sides of the last equation on the left by ${\bf A}^{\top}$ gives ${\bf A}^{\top}\cdot{\bf Cov}\cdot{\bf A} = {\bf A}^{\top}\cdot {\bf A}\cdot {\bf \Lambda}= {\bf \Lambda}$. On the other hand, ${\bf A}^{\top}\cdot {\bf Cov}\cdot{\bf A}$, being similar to ${\bf Cov}$, has the same trace ${\rm Cov}(x_1, x_1) + \cdots + {\rm Cov}(x_n, x_n)$, which then must be equal to the trace of ${\bf \Lambda}$, i.e. $\lambda_1 + \cdots + \lambda_n$. The sum of the eigenvalues is equal to the total variance. Often in applications to the life sciences the first two eigenvalues together account for a large proportion of the variance, and their associated principal components give a satisfactory two-dimensional representation of the entire $n$-dimensional panorama of the measurements (in our first two examples they account for 76% and 95.5%, respectively, of the total variance). Often also the third principal component is calculated to further refine the picture.

To complete the picture the $M$ points in $n$-space corresponding to the original measurements can be mapped into the (PC1, PC2)-plane using the matrix ${\bf A}$: $y_1(s_k) = a_{11}x_1(s_k)+\cdots+a_{1n}x_n(s_k)$, $y_2(s_k) = a_{21}x_1(s_k)+\cdots+a_{2n}x_n(s_k)$, or into (PC1, PC2, PC3)-space by using also $y_3(s_k) = a_{31}x_1(s_k)+\cdots+a_{3n}x_n(s_k)$.
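In the same illustrative sketch, the projection and the proportion of the total variance captured by the first two components look like this:

A2 = eigvecs[:, :2]                  # the columns a_1 and a_2
scores = Xc @ A2                     # row k is (y_1(s_k), y_2(s_k))

explained = eigvals[:2].sum() / eigvals.sum()
print(f"PC1 and PC2 together account for {100*explained:.1f}% of the total variance")
# A scatter plot of scores[:, 0] against scores[:, 1] gives pictures like the ones shown here.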

Example 2. Sorting Kuehneotherium teeth

Kuehneotherium
Kuehneotherium (right) and its 200-million- years-ago contemporary Morganucodon are imagined by John Sibbick foraging for insects in what is now part of Wales. The painting, which appeared on the cover of Nature, is reproduced courtesy of Pamela Gill.

This example, which comes from a review article on PCA by Ian T. Jolliffe and Jorge Cadima in  Philosophical Transactions of the Royal Society A, “examines a dataset consisting of nine measurements on 88 fossil teeth from the early mammalian insectivore Kuehneotherium . …  Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK. The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of  Kuehneotherium in the sample.” The teeth came from the collection of the Natural History Museum London, and the measurements were made by Pamela G. Gill, School of Earth Sciences, Bristol. Her nine measurements are identified in the next figure.

tooth parameters
The nine parameters measured by Pamela Gill on each Kuehneotherium tooth. Image courtesy of Pamela Gill.

Cadima and Jolliffe ran a PCA of a subset of Gill’s measurements (all the teeth from one of the fissures), and obtained the plot shown here.

tooth PCA
Cadima and Jolliffe’s PCA of the Kuehneotherium dataset. The original 88 data points have been projected onto the (PC1, PC2) plane, along with vectors corresponding to the nine original measurements. Image courtesy of Jorge Cadima.

They remark: “The first two PCs account for 78.8% and 16.7%, respectively, of the total variation in the dataset, so the two-dimensional scatter-plot of the 88 teeth given by [the figure] is a very good approximation to the original scatter-plot in nine-dimensional space.” Furthermore “All of the loadings in the first PC have the same sign, so it is a weighted average of all variables, representing ‘overall size’. In [the figure], large teeth are on the left and small teeth on the right. The second PC has negative loadings for the three length variables and positive loadings for the other six variables, representing an aspect of the ‘shape’ of teeth. Fossils near the top of [the figure] have smaller lengths, relative to their heights and widths, than those towards the bottom.”

They conclude: “The relatively compact cluster of points in the bottom half of [the figure] is thought to correspond to a species of Kuehneotherium, while the broader group at the top cannot be assigned to Kuehneotherium, but to some related, but as yet unidentified, animal.”

When the measurement space has very high dimension

In applications to genetics or to image analysis (where each pixel is a “measurement”) the number $n$ of measurements is often enormous compared to the number $M$ of samples. Following the PCA algorithm sketched above would mean dealing with an intractably large matrix.

Now this problem is only potentially $n$-dimensional, because all we have to work with are data from $M$ points. One way to conceptualize it is to think of each of the aggregate measurement results $x_i$ as a vector ${\bf x}_i$ with components $x_i(s_1), \dots x_i(s_M)$. Now we have $n$ vectors in an $M$-dimensional space, with $n>M$, so there is a set of (at most) $M$ linearly independent vectors which give all the ${\bf x}_i$ by linear combination. Suppose for simplicity there are exactly $M$, say ${\bf z}_1,\dots, {\bf z}_M$. Then our problem of finding a variance-maximizing linear combination $y=a_1x_1 + \dots +a_nx_n$, i.e. ${\bf y} = a_1{\bf x}_1 + \dots +a_n{\bf x}_n$ reduces to finding the much smaller linear combination ${\bf y} = c_1{\bf z}_1 + \dots +c_M{\bf z}_M$ maximizing variance. But how do we identify ${\bf z}_1,\dots, {\bf z}_M$?

Again, linear algebra has just the tools we need. With the same notation as above, let us compute for every $i,j$ between 1 and $M$ the quantity $ \frac{1}{n}(x_1(s_i)x_1(s_j)+x_2(s_i)x_2(s_j) + \cdots + x_n(s_i)x_n(s_j))$. These numbers fit in a new, $M\times M$, matrix we can call ${\bf Voc}$ to distinguish it from ${\bf Cov}$ defined above. We calculate the eigenvectors ${\bf b}_1, \dots, {\bf b}_M$ and corresponding eigenvalues $\lambda_1, \dots, \lambda_M$ for ${\bf Voc}$ as we did before.

There is a useful relation between ${\bf Cov}$ and ${\bf Voc}$. Suppose we call ${\bf X}$ the matrix of all the measurements of all the samples, so ${\bf X}_{ij} = x_i(s_j)$, and its transpose ${\bf X}^{\top}_{ij} = {\bf X}_{ji} = x_j(s_i)$. The matrix ${\bf X}$ has $n$ rows and $M$ columns, and vice-versa for ${\bf X}^{\top}$. The matrix product ${\bf X}\cdot {\bf X}^{\top}$ is defined by $[{\bf X}\cdot {\bf X}^{\top}]_{ij} = {\bf X}_{i1} {\bf X}^{\top}_{1j} + \cdots + {\bf X}_{iM} {\bf X}^{\top}_{Mj} = x_i(s_1)x_j(s_1) + \cdots + x_i(s_M)x_j(s_M) = M\, {\rm Cov}(x_i, x_j)$, whereas $[{\bf X}^{\top}\cdot {\bf X}]_{ij} = {\bf X}^{\top}_{i1} {\bf X}_{1j} + \cdots + {\bf X}^{\top}_{in} {\bf X}_{nj} = x_1(s_i)x_1(s_j) + \cdots + x_n(s_i)x_n(s_j) = n\, {\rm Voc}(s_i,s_j)$. So ${\bf X}\cdot {\bf X}^{\top} = M\, {\bf Cov}$, $M$ times the intractable matrix we are interested in, and ${\bf X}^{\top}\cdot {\bf X} = n\, {\bf Voc}$, $n$ times its smaller cousin. Now a little more linear algebra:

  • ${\bf X}^{\top}\cdot {\bf X}$ and ${\bf X}\cdot {\bf X}^{\top}$ have the same non-zero eigenvalues, $\lambda_1, \dots, \lambda_M$
  • For each eigenvector ${\bf b}_i$ of ${\bf X}^{\top}\cdot {\bf X}$, the vector ${\bf X}\cdot {\bf b}_i$ is an eigenvector for ${\bf X}\cdot {\bf X}^{\top}$ with the same eigenvalue $\lambda_i$.

The proofs of these a priori unlikely statements are immediate: If there is a non-zero vector ${\bf b}$ with ${\bf X}^{\top}\cdot {\bf X}\cdot {\bf b}=\lambda{\bf b}$, then ${\bf X}\cdot {\bf X}^{\top}\cdot {\bf X}\cdot {\bf b} =\lambda{\bf X}\cdot {\bf b}$, so ${\bf X}\cdot {\bf b}$ is an eigenvector for ${\bf X}\cdot {\bf X}^{\top}$ with eigenvalue $\lambda$.

Now if ${\bf b}$ is an eigenvector of the matrix ${\bf A}$ with eigenvalue $\lambda$, then ${\bf b}$ will also be an eigenvector of $k$ times ${\bf A}$, with eigenvalue $k\lambda$ (since $k{\bf A}\cdot {\bf b} = k({\bf A}\cdot {\bf b}) = k(\lambda {\bf b})=(k\lambda) {\bf b} $). So once we know the eigenvectors and eigenvalues for the smaller matrix ${\bf Voc}$, we can retrieve the eigenvectors and eigenvalues for ${\bf Cov}$, and complete our PCA.
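Here is a small numpy sketch of this “transpose trick,” with hypothetical sizes $n = 10{,}000$ and $M = 50$ and random stand-in data; note that the $n \times n$ matrix is never formed.

import numpy as np

n, M = 10000, 50
X = np.random.rand(n, M)                 # X[i, j] = x_i(s_j); random stand-in data
X -= X.mean(axis=1, keepdims=True)       # center each measurement across the samples

small = X.T @ X                          # M x M, proportional to Voc: cheap to work with
vals, B = np.linalg.eigh(small)          # eigenvalues in ascending order
b1 = B[:, -1]                            # leading eigenvector of the small matrix
a1 = X @ b1                              # then X.b1 is an eigenvector of X.X^T, same eigenvalue
a1 /= np.linalg.norm(a1)                 # rescale to unit length: the PC1 direction

# Spot-check the eigenvector property without ever forming the n x n matrix X.X^T:
lhs = X @ (X.T @ a1)
assert np.allclose(lhs, vals[-1] * a1)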

The switch to the transpose makes examples like the next one possible.

Example 3. Parsing Native American genes in Polynesian ancestry

“Native American gene flow into Polynesia predating Easter Island settlement” was published in Nature, July 8, 2020. The authors are a team of 32 led by Alexander G. Ioannidis (Stanford), Javier Blanco-Portillo and Andrés Moreno-Estrada (both at LANGEBIO, Irapuato, Mexico). They begin: “The possibility of voyaging contact between prehistoric Polynesian and Native American populations has long intrigued researchers.” They mention Thor Heyerdahl’s controversial suggestion “that prehistoric South American populations had an important role in the settlement of east Polynesia and particularly of Easter Island (Rapa Nui),” and that the matter is still “hotly contested.” No longer, presumably: “We find conclusive evidence for prehistoric contact of Polynesian individuals with Native American individuals (around AD 1200)… Our analyses suggest strongly that a single contact event occurred in eastern Polynesia, before the settlement of Rapa Nui, between Polynesian individuals and a Native American group most closely related to the indigenous inhabitants of present-day Colombia.”

Part of their evidence is a detailed analysis of the ancestry of present-day inhabitants of Rapa Nui. They split them into two groups: those who have European ancestry and those who do not. They present a PCA comparing these two groups (along with other Pacific islanders) with a spectrum of present-day Native American inhabitants of Central and South America. It shows that the Rapa Nui inhabitants with some European ancestry group with populations in the South of Chile (Mapuche, Pehuenche, Huilliche, Chilote) while those with no European ancestry group with the Zenu, a population in Colombia, at the opposite end of South America.

ioannides PCA
Part of Supplemental Figure 14 from Nature 583, 572-577 (2020), used with permission. For legibility I have cut the PC1 axis in two and have suppressed the top of the range of PC2. The color key is shown separately below.
color key
Color key for the PCA by Ioannidis et al.

The caption in Nature gives more detail:  “Our new weighted ancestry-specific SVD-completed principal component analysis (PCA) applied to the Native American component from Pacific islanders and reference individuals from the Americas. The Native American ancestries of reference individuals are plotted with open circles. The Native American ancestries of Pacific islanders are represented as solid points. … In addition, labels are plotted for the location of the allele frequency vector created from aggregating all Native American ancestry haplotypes in each island population. The Rapanui are split into two populations, those without European ancestry tracts (dark green) and those with high European and Native American ancestry (blue). The first principal component axis separates the reference Native Americans on a north-south axis, while the second principal component separates the Yamana of southern Patagonia. Pacific islanders’ Native American ancestry is seen to cluster with northern South American references with the exception of the Rapanui with high European ancestry, who cluster with the southern references (Chilean).”
Some background details about this research can be found in a Stanford press release.

Conversations with Folkert Tangerman were very helpful in the preparation of this column.

Updated 31 August in response to a comment from a reader.


An epidemic is a sequence of random events

If a contact is made, then whether or not infection is transferred is much like tossing a (loaded) coin. How can a simulation take all this uncertainty into account?

Bill Casselman
University of British Columbia

Just recently, I started thinking about making my own epidemics. On a computer, of course, with a digital population.

What I had in mind was not very sophisticated. I would assemble a population of `people’, a few of which are infected with the virus, and then progress from one day to the next, letting the virus spread around through interactions in the population. At each moment, a person would be in one of several possible states:

  • S Uninfected but susceptible to infection
  • E Infected, but not yet infectious (exposed)
  • A Infectious, but not yet showing symptoms (asymptomatic)
  • I Infectious and showing symptoms
  • R Recovered, or otherwise incapable of becoming infected (say, through vaccination)

This is not quite a complete representation of reality. For example, in the current epidemic a very small number of people get reinfected. But it is not too far from being correct.

In general, as the simulation goes on, a person would progress from one state in this list to the next, except of course that being vaccinated is a shortcut from the first to the last state. Infections take place because susceptible persons interact with contagious ones. Even when an interaction takes place, whether or not infection is transmitted is a function of many accidental circumstances (for example, surrounding ventilation) as well as how contagious the infected person is.

There is some further internal detail to some of these states. The degree to which a person is infectious changes in time, usually rising to a peak after a few days, and then decreasing to zero. Hence in a simulation each person has attached to him (i) a designation of state and, in states A and I, (ii) a number measuring infectiousness. A further datum is (iii) the frequency of contacts, especially close contacts, a person has with others. This can change with time. For example, when a person starts showing symptoms, he will presumably reduce the frequency of his contacts.
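In code, one convenient way to carry these data around is a small record per person; the field names below are my own choices for a sketch, not anything prescribed by a particular simulation.

from dataclasses import dataclass

@dataclass
class Person:
    state: str = "S"               # one of "S", "E", "A", "I", "R"
    infectiousness: float = 0.0    # meaningful only in states "A" and "I", varies from day to day
    contact_rate: float = 4.0      # average close contacts per day; may drop once symptoms appear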

Where’s the mathematics? An epidemic is driven by random events. The moment at which a person moves from one state to the next is not fixed by circumstances, but is instead a matter of probabilities. The severity of a person’s infection is a matter of chance, as is the length of time from when he is infected to when he becomes infectious. Even if we know the average rate at which an infectious person makes contacts, the exact number of contacts made in one day is also a matter of chance. If a contact is made, then whether or not infection is transferred is much like tossing a (loaded) coin. How can a simulation take all this uncertainty into account?

Generating contacts

Take the matter of contacts. The most important parameter governing contacts is the average number $c$ of contacts made by a person in one day, but that does not mean that the number of contacts in one day is constant. It might well vary from day to day. Instead, it is reasonable to assume that personal interaction is a Poisson process, which means that the probability of making $k$ contacts during one day is $p_{k} = c^{k} e^{-c} / k!$. Note that the infinite sum of the $p_{k}$ is $1$, because of the well known formula

$$ e^{c} = 1 + c + {c^{2} \over 2!} + { c^{3} \over 3! } + \cdots \, . $$
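In a program these probabilities are easy to tabulate; the cutoff kmax in this sketch is my own choice, made large enough that the neglected tail is negligible.

from math import exp, factorial

def poisson_probs(c, kmax):
    # p_k = c**k * exp(-c) / k!  for k = 0, 1, ..., kmax
    return [c**k * exp(-c) / factorial(k) for k in range(kmax + 1)]

p = poisson_probs(2.5, 20)
print(sum(p))    # very close to 1; the terms beyond kmax are negligible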

For example, here are the graphs of some examples with a couple of values of $c$:

In a simulation, one will be dealing with a large number of people. Each of them will have his own regimen of interactions. Some of them will be more interactive than others. Thus, we are likely to find ourselves simulating a large number of independent Poisson processes, each one a sequence of random events. How to do this? In a program, this will involve a call to a routine, call it p_random(c) that returns on each call a random non-negative integer whose distribution matches the Poisson process with mean $c$.

Almost every programming language has built into it a routine random() that does something like this. On each call it returns a real number uniformly distributed in the open interval $[0,1)$. (David Austin’s previous FC gives some idea of how this works.) What we would like to do is use that routine to generate non-negative integers following a specified Poisson distribution. To give you some idea of how things go, we can see how this technique can generate integers uniformly distributed in any integral range $[0,n-1]$: get a random number $x$ in $[0,1)$ and then replace it by $\lfloor nx \rfloor$, the integral part of $nx$. If $n=2$ this offers a simulation of coin tossing, and if $n=6$ a simulation of throwing a die.
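For instance, a sketch of that trick in Python (the helper name uniform_int is my own):

from random import random

def uniform_int(n):
    # an integer uniformly distributed in [0, n-1], built from random() in [0, 1)
    return int(n * random())

coin = uniform_int(2)       # 0 or 1: tossing a coin
die = uniform_int(6) + 1    # 1, ..., 6: throwing a die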

There is a reasonably well known procedure that does what we want, and very generally. This is explained in Knuth’s classic text. Suppose we are given an arbitrary probability distribution of integers with given probabilities $p_{k}$ for $k \ge 0$. That is to say, we are looking at some repeated event somewhat like coin tossing, in which a non-negative integer $k$ occurs with probability $p_{k}$. How can a program generate integers distributed according to these statistics?

Let $P(k)$ be the cumulative distribution

$$ P(k) = {\sum}_{i = 0}^{k} p(i) $$

Thus $P(k)$ is the probability that the integer $j$ occurring is $\le k$. The original distribution has the property that each $p_{i} \ge 0$ and ${\sum} p(i) = 1$, so $P(k)$ increases from $0$ to $1$. For example, if $c = 2.5$ and $p(k) = e^{-c} c^{k} / k!$ then the graph of $P$ looks like the figure below. Given a random number $t$ in $[0,1)$ we can determine an integer according to the recipe indicated—draw a line to the right from the point $(0,t)$ and select the $x$-coordinate of the point at which it hits this graph.

There is another suggestive way to see this. Make up a rectangle of total height $1$, partitioned into boxes, with the one labeled $k$ of height $p_{k}$. Given the number $x$, mark a point at height $x$ in the rectangle. Select the label of the box that contains it.

In the long run the number of times you hit the box labeled $k$ will be proportional to its area, hence to $p_{k}$. But how do you tell what that label is? There is one straightforward answer to this question:

from random import random

def p_random():
    # p is the list of probabilities p[0], p[1], ..., summing to 1
    x = random()
    # this is the built-in random number generator
    s = p[0]
    i = 0
    while s <= x:
        i += 1
        s += p[i]
    # at exit p[0] + ... + p[i-1] <= x < p[0] + ... + p[i]
    return i

But this is somewhat inefficient, since each call will on average involve $n/2$ steps. Does there exist an algorithm that requires a number of steps independent of $n$? The answer is yes. A clever method whose basic idea is apparently due to Alastair Walker does this, at the small cost of building some preliminary structures.

Walker’s trick

As far as I know, Walker never explained how he discovered his method, but an elegant interpretation has been offered by Keith Schwartz. The basic idea is what we have already seen:

  1. Start with a box of some kind. Partition it into smaller labeled boxes in such a way that the area of box $k$ is proportional to $p_{k}$.
  2. To generate integers with a given probability distribution, choose points at random inside the box, and return the label of the region hit.
  3. Arrange a way to assign to every random $x$ in $[0,1)$ a point of the box.

The problem is to figure out how to make the partition in such a way that figuring out the label from the geometry of the partition can be done efficiently.

I’ll explain how Walker’s method works for a few simple cases, but first I’ll generalize the problem so that we are not restricted to the Poisson distribution. Suppose very generally that we are given probabilities $p_{i}$ for $i$ in $[0, n-1]$. We now want a method to generate random integers that follow the distribution assigned by $p_{i}$. That is to say, if we generate in this way a large number of integers, we want the proportion of occurrences of $i$ to be roughly $p_{i}$.

The case $n=2$ is like tossing a biased coin, and there is a simple solution. In this case, we are given two probabilities $p_{0}$, $p_{1}$ with $p_{0} + p_{1} = 1$. Partition the unit square in this fashion:

Choose a point $(x, y)$ randomly in the square. In fact, we do not have to pay any attention to $x$. If $y \le p_{0}$ we return $i = 0$ and otherwise we return $i = 1$.

But now, following Keith Schwartz and intending to show how Walker’s algorithm works in this very simple case, I will set up a rectangular region a bit differently. First of all, make its dimensions $2 \times 1$. Partition it twice: once into halves, each half a unit square …

… and then build in each half, say in the $i$-th half, a box of dimensions $1 \times p_{i}$. Label these boxes. Unless $p_{0} = p_{1}$, one of these will overflow at the top:

So then we cut off the overflow and paste it (with label) into the other box:

This shows the case $p_{0} \le p_{1}$. If $p_{1} < p_{0}$ things look like this:

How do we use these diagrams to generate the random integers we want? Choosing a random uniform number $x$ in $[0,1)$ amounts as before to choosing a point in the rectangle. But we do this differently, and we interpret it differently. Given $x$, set $X = 2x$. Let $m$ be the integer part of $X$, which will be either $0$ or $1$: $m = \lfloor X \rfloor$. Let $y = X - m$, the fractional part of $X$. Associate to $x$ the point in the $m$-th square at height $y$. If $y \lt p_{m}$, we are in the box labeled $m$; otherwise we are in the pasted-in box carrying the other label. In either case, the process selects the label of the box containing the point.
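In code, this two-box case might look as follows (a sketch, with the boxes represented implicitly):

from random import random

def biased_coin(p0, p1):
    # Walker's picture for n = 2; p0 + p1 is assumed to be 1.
    X = 2 * random()            # a random point in the 2 x 1 rectangle
    m = int(X)                  # which unit square it falls in: 0 or 1
    y = X - m                   # height within that square
    if y < [p0, p1][m]:
        return m                # the square's own (principal) label
    else:
        return 1 - m            # the pasted-in (alternate) label

The value 0 is returned with probability $\frac{1}{2}p_0 + \frac{1}{2}(1 - p_1) = p_0$, as it should be.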

Now look at the case $n = 3$, and suppose that we are given probabilities $p_{0}, p_{1}, p_{2}$ with $\sum p_{i} = 1$. We start off with a rectangle of size $3 \times 1$, partitioned into $1 \times 1$ boxes:

There are again different cases, depending on the relative sizes of the $p_{i}$. The easiest case is that in which two of the values of $p$, say $p_{0}$ and $p_{1}$, are less than $1/3$, which implies that the third is greater than $1/3$. Draw the $i$ boxes of dimension $1 \times p_{i}$ in the $i$-th square, like this:

Now cut off two pieces from the large box and paste them into the smaller one, getting:

I’ll explain in a moment how to use this to generate random numbers.

There is a second case, in which two of the $p_{i}$ are larger than $1/3$:

Here, we want to cut off from the tops and fill in the third. It is tempting to cut off exactly the overflows in the large regions and paste them in, but this would give the third region three labels, which is not what we want. So we fill in from just one of the large regions. This will leave some space in it.

We fill in the new empty space from the other large region. We are now finished:

How to use what we have constructed? In each case, we have partitioned the rectangle of size $3 \times 1$. First, into three unit squares, and then each of these in turn into one or two labeled rectangles. Given a random $x$ in $[0,1)$, we want to come up with some integer in $[0,3)$. How? We first scale it to get $X = 3x$. This will lie in the interval $[m, m+1)$ for $m = \lfloor X \rfloor$. We now turn attention to the $m$-th unit square. The integer we return will be one of the labels found in that square. Let $y = X - m$, the fractional part of $X$, which will be in $[0,1)$. If $y \lt p_{m}$ (the height of the bottom rectangle), p_random returns $m$, otherwise the alternate label in that square.

In effect we are assigning principal and alternate labels to the boxes. Except that there won’t be an alternate label if the box is full.

In the literature, the array I call `alternate’ is called alias, and the method described here is called the alias method.

The full algorithm

This method generalizes nicely. The original version seems to be due to Alastair Walker. It became well known because Knuth called attention to it (although mostly in exercises). Michael Vose then came up with a more efficient version, and made it handle rounding errors more stably.

I quote below, almost verbatim, the algorithm found originally in Vose’s paper. It improves the running time of Walker’s program, and corrects its handling of rounding errors. It has two parts. One is an initialization that sets up arrays prob and alias from the probability array p. These are used in the function p_random, which returns a random variable in the range $[0,n-1]$, whose probabilities are specified in p of length n. The call to random returns a variable uniformly distributed in $[0, 1)$.

There are several loops in the program. The first assigns integers in the range $[0,n-1]$ to one of two explicit arrays large and small. Those in small are the $i$ such that $p_{i} \le 1/n$. As the program proceeds, the integers in these arrays are those whose boxes have not yet been filled. Implicit is a third subset of $[0,n-1]$, which I’ll call finished. This contains all those indices for which no further processing is necessary—i. e. whose box is filled.

In the loop [0] the two arrays small and large are initialized, and the subset finished is left empty. In every pass through loop [1], an index $j$ is taken from small, its box is filled, and $j$ joins finished. The filling material comes from the box of an index $k$ taken from large, which therefore becomes smaller; $k$ is then put back into either small or large, according to how much of it is left. So each pass through this loop decrements the total size of large and small.

from random import random
from math import floor

def init(p):
    # Build the global arrays prob and alias from the list p of n probabilities.
    # Note that p is modified in place, so pass a copy if you need to keep it.
    global prob, alias
    n = len(p)
    b = 1.0/n                  # the target height of each box
    prob = [0.0]*n
    alias = [0]*n
    large = [0]*n
    small = [0]*n
    l = 0
    s = 0
    # [0] sort the indices: small gets the i with p[i] <= 1/n, large gets the rest
    for i in range(n):
        if p[i] > b:
            large[l] = i
            l += 1
        else:
            small[s] = i
            s += 1
    # [1] fill the box of one small index per pass with material from a large one
    while s > 0 and l > 0:
        s -= 1
        j = small[s]
        l -= 1
        k = large[l]
        prob[j] = n*p[j]
        alias[j] = k
        p[k] += (p[j]-b)
        if p[k] > b:
            large[l] = k
            l += 1
        else:
            small[s] = k
            s += 1
    # [2] any indices left in small have (up to rounding) full boxes
    while s > 0:
        s -= 1
        prob[small[s]] = 1
    # [3] likewise for any indices left in large
    while l > 0:
        l -= 1
        prob[large[l]] = 1

def p_random():
    n = len(prob)
    x = n*random()                   # uniform in [0, n)
    m = floor(x)                     # the unit square the point falls in
    if (x - m) < prob[m]:
        return m                     # the square's own label
    else:
        return alias[m]              # the alternate label

The last loops [2] and [3] of Vose’s code are necessary to deal with rounding errors, which I include without further comment.
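Here is a sketch of how the two routines might be driven to produce Poisson-distributed contact counts with mean 4; the truncation point kmax = 20 is my own choice, made so that the neglected tail is negligible.

from math import exp, factorial

c, kmax = 4.0, 20
p = [c**k * exp(-c) / factorial(k) for k in range(kmax + 1)]
p[-1] += 1.0 - sum(p)            # fold the tiny neglected tail into the last entry

init(list(p))                    # init modifies its argument, so pass a copy
draws = [p_random() for _ in range(100000)]
print(sum(draws) / len(draws))   # should be close to the mean, 4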

Here is a typical run for a Poisson process with mean $\mu = 4$.

The simulation

Let’s see how to use Walker’s method to simulate how a person just infected goes on to infect others. Suppose that he starts to be infectious on the fifth day, and that the probability that he infects a contact is specified in the following table:

$$ \matrix { i = \hbox{day after infection} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & \ge 10 \cr r_{i} = \hbox{probability of infection} & 0 & 0 & 0 & 0 & 0.1 & 0.3 & 0.4 & 0.4 & 0.2 & 0 \cr } $$

Suppose also that he makes an average of $4$ close contacts per day, and that these follow a Poisson distribution. Applying Walker’s algorithm, we get a sample run of contacts like this:

$$ \matrix { i = \hbox{day after infection} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\ \dots \cr c_{i} = \hbox{number of contacts} & 5 & 2 & 3 & 4 & 3 & 3 & 2 & 1 & 3 & 3\ \dots \cr } $$

In this run, how many people does he infect? There is no unique answer to this question. It depends on something like a coin toss at each contact. What we can calculate is an average. On the fifth day he infects an average of $0.1 \cdot 3$ people, on the sixth … etc. All in all:

$$ \hbox{average number of infections} = {\sum} c_{i} r_{i} = 0.1 \cdot 3 + 0.3 \cdot 3 + 0.4 \cdot 2 + 0.4 \cdot 1 + 0.2 \cdot 3 = 3.0 \, . $$

This is much lower than the number of people he infects on average over all possible runs, which is called the $R_{0}$ value for this example. Here it is $(0.1 + 0.3 + 0.4 + 0.4 + 0.2) \cdot 4 = 5.6$.
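To close the loop, here is a sketch of the run-by-run simulation just described; it assumes init has already been called with the truncated Poisson probabilities for mean 4, as in the sketch above, and averaging many runs should land near 5.6.

from random import random

# r[i] is the probability of transmission at a contact on day i+1 after infection.
r = [0, 0, 0, 0, 0.1, 0.3, 0.4, 0.4, 0.2, 0]

def infections_in_one_run():
    total = 0
    for day in range(len(r)):
        contacts = p_random()          # Poisson number of contacts that day, mean 4
        for _ in range(contacts):
            if random() < r[day]:      # a loaded coin toss for each contact
                total += 1
    return total

runs = [infections_in_one_run() for _ in range(20000)]
print(sum(runs) / len(runs))           # an estimate of R_0, close to 5.6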

Reading further