Pseudonyms in Mathematics

Why would there be pseudonyms for the publication of research papers in mathematics? One possible motive might be… that a group of people collaborated on a piece of work and wanted to simplify the multiple authorship to a single name. Another reason appears to be playfulness or whimsy.

Joseph Malkevitch
York College (CUNY)

Introduction

Are you familiar with the novels of Currer Bell or Ellis Bell? Many people who have enjoyed reading Jane Eyre (1847) and Wuthering Heights (1848) are unaware that the authors of these books at the time they were published were known as Currer Bell and Ellis Bell.

In the 19th century there were several examples of ultimately famous women writing under a male pseudonym. Among the most notable were George Eliot as the name for Mary Ann Evans and George Sand as the name for Amantine Lucile Aurore Dupin. (Eliot attended courses on mathematics and physics in London and Geneva and incorporated mathematical themes in her novels.)

Portrait of George Eliot

Figure 1 (George Eliot, aka Mary Ann Evans, as painted by Alexandre-Louis-François d'Albert-Durade. Image courtesy of Wikipedia.)
 

Few who have not studied English literature will recognize the names of Currer Bell, Ellis Bell and Acton Bell as the names under which Charlotte (1816-1855), Emily (1818-1848) and Anne (1820-1849) Brontë published their work in their own times! Another interesting example, this time of a man writing under a pseudonym in the 19th century, was Lewis Carroll, whose true name was C.L. Dodgson.

Lewis Carroll

Figure 2 (C.L. Dodgson, otherwise known as Lewis Carroll, courtesy of Wikipedia.)
 

Dodgson was a distinguished mathematician, but not nearly as distinguished as he was to become as an author! Few people know Dodgson for his mathematics, but many have read Alice in Wonderland or seen a movie version of it! In recent times, Dodgson's mathematical fame has been revived in association with his appealing method of using ranked ballots to decide elections.

While not a pseudonym, the family name that James Joseph Sylvester (1814-1897) was born into was "Joseph," not Sylvester. The Sylvester family practiced Judaism and Sylvester's older brother decided to change the family name from Joseph to Sylvester to avoid some of the anti-semitism that plagued British Jews at the time.

James Sylvester

Figure 3 (James Joseph Sylvester, courtesy of Wikipedia)
 

Sylvester was a student at Cambridge, but when it came time to graduate and receive his Master's degree, regulations would have required him to sign the "Articles," pledging allegiance to the Church of England, something he refused to do. It was his good fortune that Trinity College in Dublin, more open-minded about religion, was willing to grant him his degree based on his work at Cambridge! Ireland was a country dominated by Catholicism at the time, while Britain had an established church because Henry VIII had broken with Catholicism and set in motion years of religious turmoil by bringing a Protestant reform movement to England. A big part of Henry's motive was that the Vatican would not allow him to divorce his first wife, Catherine of Aragon, so that he could marry Anne Boleyn, who gave birth to the woman who would become Queen Elizabeth I!

Women who wanted to pursue mathematics, or more generally an education, had limited opportunities in the 19th century. It was not until 1869 that Girton College at Cambridge admitted women and thus offered a path for women who wanted to study mathematics in a university setting. Winifred Edgerton Merrill became the first American woman to earn a PhD in mathematics, which Columbia University granted her in 1886. Rather amazingly, Cambridge did not grant degrees to women until 1948. The situation for women and Jews had similarities. Despite his growing reputation as a mathematician, Sylvester was not able to get an appointment at Oxford because of his religion. His famous stays in America were attempts to have his talents recognized by academic institutions, but for complex reasons, some of which seem to have been related to his personality, he was unable to get a stable long-term position in the United States. Ironically, he eventually returned from America to Britain when he was offered the Savilian Professorship in Geometry at Oxford University, the rules forbidding Jews from holding that position having changed. (For more about Sylvester's career and the careers of prominent women mathematicians such as Emmy Noether and Sofya Kovalevskaya, the first woman to earn a doctorate in mathematics, see the April 2015 Feature Column on Mathematical Careers.)

Sofya Kovalevskaya

Figure 4 (Sofya Kovalevskaya, courtesy of Wikipedia.)
 

Why would there be pseudonyms for the publication of research papers in mathematics? One possible motive is protection of reputation: a paper might seem, to a community used to a particular person's "impressive" work, to be less impressive, and one might prefer to publish it under another name. Another reason might be that a group of people collaborated on a piece of work and wanted to simplify the multiple authorship to a single name. Still another reason appears to be playfulness or whimsy. Of the many pseudonyms that have been used by groups of mathematicians, the best known is Bourbaki. Bourbaki represented a group of mathematicians, initially operating primarily out of France but eventually having members from many countries, whose aim was to put mathematics on a "sound" foundation rather than to look at its separate pieces apart from the total edifice. Bourbaki stood for the idea of showing, with "hindsight," how to structure the vast river of mathematical knowledge so that it builds organically from a well-structured foundation. I will not attempt to survey the members of Bourbaki here, many of whom had distinguished careers as individuals, or its accomplishments, not because they are vast, though they are, but in favor of pseudonyms used by groups whose work can be understood with a relatively limited mathematics background. My goal here is to discuss two relatively recent groups of mathematicians who published under a shared pseudonym, to say something about why they might have done this, and to call attention to some of the research these people accomplished as individuals.

So I will look in turn at the work of Blanche Descartes and G.W. Peck. Blanche started "her" work earlier than "G.W.," so I will start with her.

Blanche Descartes

What associations do you have with the name Blanche Descartes? Perhaps Blanche Descartes was René Descartes's (1596-1650) wife or daughter? In fact Descartes never married, though he did have a daughter, Francine, who died at the age of 5. But seemingly the name Blanche Descartes has no direct connection with Descartes, the great French mathematician and philosopher. Blanche Descartes was a pseudonym initially used for joint work by Rowland Brooks, Cedric Smith, Arthur Stone and William Tutte. The four of them met as undergraduate students at Cambridge University in 1936 and were members of the Trinity Mathematical Society. The photo below shows them attending a meeting of the Trinity Mathematical Society, which, as you will notice, has no female members. Female students were not allowed to study at Trinity College until much later, the first women arriving in 1976!

Trinity Mathematical Society

Figure 5 (The Trinity Mathematical Society, photo courtesy of Wikipedia.)
 

Trinity College at Cambridge, where Isaac Newton had been an undergraduate and later a Fellow, was especially associated with mathematics. The four decided to use the name Blanche Descartes to publish some initial joint work they did. It has been argued that "Blanche" derives from letters in their names: using the first letters of Bill (William), Leonard (Brooks's middle name), Arthur and Cedric, one gets BLAC, and with some "padding" this becomes BLAnChe. The surname may involve a play on the phrase "carte blanche," and perhaps also a nod to René Descartes, who, like the four of them, dabbled in things other than mathematics.

One early question that the four looked at has come to be known as "squaring the square." The problem involves taking a square and decomposing (tiling) it into smaller squares of all different side lengths. One can also consider the more general question of when a rectangle can be tiled with squares of different side lengths. Figure 6 shows an example, discovered by A. Duijvestijn.

A squared square

Figure 6 (A square decomposed into squares of all different sizes. This one uses the smallest possible number of squares for such a decomposition. Image courtesy of Wikipedia.)
 

This tantalizing question attracted much interest over the years and has been shown to have unexpected connections to mathematical ideas that seem to have nothing to do with tilings. Some of the interest in this problem came from the "recreational" mathematics community. It is not exactly clear why Brooks, Smith, Stone and Tutte chose to use a pseudonym, but perhaps they were concerned that the question seemed of less mathematical "importance" than the topics their teachers at Cambridge considered worth a student's time. Mathematics is full of examples of problems that were considered recreational at one time but turned out to be rooted in deeper mathematical ideas; the squared-square problem is one of them.
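Since squared-square and squared-rectangle data is easy to state but tedious to check by hand, here is a minimal Python sketch (my own, not code from the Blanche Descartes papers) that verifies whether a proposed list of squares exactly tiles a given rectangle with all side lengths distinct. The sample data is Moroń's classic 33 x 32 squared rectangle of order 9, entered from memory, so treat the coordinates as illustrative; the checker itself does not depend on them.

```python
# A minimal checker for "squared rectangle" data (a sketch).  Each square is
# given as (x, y, side), with (x, y) its top-left corner and y measured
# downward, all in integer units.

def check_squared_rectangle(width, height, squares):
    """True if the squares tile the width x height rectangle exactly
    and all side lengths are distinct."""
    sides = [s for _, _, s in squares]
    if len(set(sides)) != len(sides):
        return False                            # the sides must all be different
    if sum(s * s for s in sides) != width * height:
        return False                            # the total area must match
    covered = set()                             # unit cells covered so far
    for x, y, s in squares:
        if x < 0 or y < 0 or x + s > width or y + s > height:
            return False                        # square sticks out of the rectangle
        for i in range(x, x + s):
            for j in range(y, y + s):
                if (i, j) in covered:
                    return False                # two squares overlap
                covered.add((i, j))
    return len(covered) == width * height       # every cell covered exactly once

# Moron's 33 x 32 squared rectangle of order 9 (coordinates entered from memory).
moron = [(0, 0, 18), (18, 0, 15), (18, 15, 7), (25, 15, 8), (0, 18, 14),
         (14, 18, 4), (14, 22, 10), (24, 22, 1), (24, 23, 9)]
print(check_squared_rectangle(33, 32, moron))   # expected output: True
```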

For each of the members of Blanche Descartes I will indicate some information about them and try to provide an example of some mathematics associated with them. All of them were to achieve some measure of fame and acclaim under their "real" names, sometimes in limited venues but in Tutte's case on a big stage.

William Tutte (1917-2002)

William Tutte

Figure 7 (Photo of William Tutte. Courtesy of Wikipedia.)
 

Although Tutte was born in England, his academic career was spent largely in Canada at the University of Waterloo. Mathematics at this university is spread across several units:

  • Department of Applied Mathematics.
  • Department of Combinatorics and Optimization.
  • David R. Cheriton School of Computer Science.
  • Department of Pure Mathematics.
  • Department of Statistics and Actuarial Science.

When Tutte first went to Canada, he did so at the invitation of H.S.M. Coxeter to join the faculty at the University of Toronto. Not long after the University of Waterloo was founded, Tutte moved there and was associated with the Department of Combinatorics and Optimization, which at the present time has about 30 members. The reason his work in England before taking up residency at Waterloo is less well known is that it was done as part of the team assembled at Bletchley Park; the people who worked at Bletchley Park were not allowed to speak of their work there, and the secrecy was lifted only quite recently. This team included Alan Turing, who is justly famous for his work on breaking the Enigma machine cipher. However, even after the Enigma ciphers were being read, a cipher known as Fish, used for messages sent by the German leadership in Berlin to army field commanders, had not been broken. Tutte was one of a team that managed to read Fish even though, unlike the case for Enigma, the code breakers had no physical machine in their possession to help them. In essence Tutte figured out how the machine used to send Fish messages worked, reconstructing its exact operation without ever having seen it! Once this was accomplished, it became possible to decipher Fish messages for the remainder of the war without the Germans realizing that their codes were being read. This work was done before Tutte completed his doctorate.

When WWII ended, Tutte returned to Cambridge as a graduate student in 1945 to work on his doctorate. During this period he produced an example related to the famous four-color theorem which is known today as the Tutte graph in his honor. In this context, a graph is a collection of points (called vertices) and edges, which can be curved or straight, that join up vertices. There are various definitions of graphs, but the one preferred by Tutte allows vertices to be joined to themselves (these edges are called loops) and allows pairs of distinct vertices to be joined by more than one edge. We see in Figure 8 on the right the Tutte graph, put together using three copies of the configuration on the left, known as a Tutte triangle. This graph has three edges at every vertex, that is, it is 3-valent or degree 3, is planar (can be drawn in the plane so that edges meet only at vertices) and is 3-connected. When a graph has been drawn in the plane so that edges meet only at vertices, it is called a plane graph. 3-connected means that for any two vertices $u$ and $v$ in the graph one can find at least 3 paths from $u$ to $v$ that have only the vertices $u$ and $v$ in common. What is special about the Tutte graph is that there is no tour of its vertices that visits each vertex once and only once in a simple circuit, a tour usually known as a Hamilton circuit. If it were true that every 3-valent 3-connected planar graph had a Hamilton circuit, there would be a simple proof of the four-color theorem (every graph drawn in the plane can be face colored with 4 or fewer colors, where the coloring rule is that any two faces that share an edge must get different colors).

A Tutte triangle The non-hamiltonian Tutte graph

Figure 8 (A Tutte triangle on the left and three Tutte triangles assembled to form a plane, 3-valent, 3-connected graph with no Hamilton circuit. Images courtesy of Wikipedia.)
 

While the Tutte graph is not the smallest planar, 3-valent, 3-connected graph with no simple closed tour of all of its vertices (such a tour is called a Hamilton circuit or HC), it is relatively easy to see why it can't have an HC. This follows from the fact that if an HC visits the vertices of a Tutte triangle, then the HC must enter using one of the two edges shown at the bottom and exit the Tutte triangle using the edge at the top. For those interested in applications of mathematics, the problem known as the Traveling Salesperson Problem (TSP) requires finding a minimum-cost Hamilton circuit in a graph which has positive weights (costs) on its edges. Finding routes for school buses, group taxis, or your own errands involves problems of this kind. Solving the TSP for large graphs seems to require rapidly growing amounts of computation as the number of sites to be visited (the vertices of the graph) grows.
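To make the notion of a Hamilton circuit concrete, here is a small backtracking search in Python. It is a brute-force sketch for experimentation, not an efficient TSP method: it tries to extend a path one vertex at a time and reports a Hamilton circuit if one exists. The sample graph is the 3-dimensional cube graph, which is 3-valent and does have such a circuit; run on the 46-vertex Tutte graph, the same search would report that none exists, though it would take noticeably longer.

```python
# A brute-force backtracking search for a Hamilton circuit (an illustrative
# sketch).  A graph is a dictionary mapping each vertex to the set of its neighbors.

def hamilton_circuit(graph):
    """Return a Hamilton circuit as a list of vertices, or None if there is none."""
    vertices = list(graph)
    start = vertices[0]

    def extend(path, visited):
        if len(path) == len(vertices):
            # a circuit must close up: the last vertex has to neighbor the start
            return path if start in graph[path[-1]] else None
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found is not None:
                    return found
        return None

    return extend([start], {start})

# The graph of the 3-dimensional cube: vertices are the numbers 0-7 read as
# 3-bit strings, adjacent when they differ in exactly one bit.
cube = {v: {v ^ (1 << i) for i in range(3)} for v in range(8)}
print(hamilton_circuit(cube))   # one possible output: [0, 1, 3, 2, 6, 7, 5, 4]
```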

In addition to his famous example of a graph without a Hamilton circuit, Tutte also proved a very appealing theorem about when a graph must have one:

Theorem (W. T. Tutte).
Every 4-connected planar graph has a Hamilton circuit.

The 4-connected condition means that given any pair of vertices $u$ and $v$ in the graph there must be at least 4 paths from $u$ to $v$ that have only $u$ and $v$ in common. In particular, this means that the graph must have at least 4 edges at each vertex. Figures 9 and 10 show a very regular 4-connected planar graph, the graph of the Platonic solid known as the icosahedron, and a much more irregular one. Can you find a Hamilton circuit in each of these graphs? (A computational sketch for the icosahedron appears after Figure 10.) While there are algorithms (procedures) for finding an HC in a planar 4-connected graph that run relatively fast, the problem of finding an HC in an arbitrary graph is known to be computationally hard.

Graph of the icosahedron

Figure 9 (A 4-connected planar graph. This is the graph of the regular icosahedron. Image courtesy of Wikipedia.)
 

A plane 4-connected graph

Figure 10 (A planar 4-connected graph which is not very symmetric.)
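If you would rather let a computer answer the question for the icosahedron, the sketch below builds its graph from a standard description (two "poles" each joined to a pentagon, with the two pentagons joined to each other in antiprism fashion) and runs the same kind of backtracking search as above. The vertex labeling is my own convenience, not the one in Figure 9.

```python
# Build the graph of the icosahedron and search for a Hamilton circuit.
# Vertices 0 and 11 are the two "poles"; 1-5 and 6-10 are the two pentagons.

edges = set()
for i in range(1, 6):
    j = i % 5 + 1                              # next vertex around a pentagon
    edges |= {(0, i), (i, j)}                  # pole 0 and the upper pentagon
    edges |= {(11, i + 5), (i + 5, j + 5)}     # pole 11 and the lower pentagon
    edges |= {(i, i + 5), (i, j + 5)}          # antiprism edges between the pentagons

graph = {v: set() for v in range(12)}
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def hamilton_circuit(graph):
    vertices = list(graph)
    start = vertices[0]
    def extend(path, visited):
        if len(path) == len(vertices):
            return path if start in graph[path[-1]] else None
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found is not None:
                    return found
        return None
    return extend([start], {start})

assert all(len(graph[v]) == 5 for v in graph)  # every vertex of the icosahedron has degree 5
print(hamilton_circuit(graph))                 # prints one Hamilton circuit
```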
 

Cedric A. B. Smith (1917-2002)

While Smith studied mathematics at Cambridge and published some research in mathematics, his career path did not involve becoming an academic mathematician in a mathematics department. Rather, he pursued a career as a statistician and worked at the Galton Laboratory of University College London. Some of Smith's work was related to genetics, helping to locate where particular genes lie on a chromosome.

Like Tutte, Smith made an easy-to-state contribution to the theory of Hamilton circuits. Conditions guaranteeing that the graphs in some family must have an HC, and examples of graphs where no HC exists, are both intriguing, but it is also natural to ask whether a graph can have exactly one HC. Smith answered this question for the important class of graphs in which every vertex has exactly three edges, the 3-valent (cubic) graphs.

Theorem (C. A. B. Smith).
In a 3-valent (cubic) graph, each edge lies on an even number of circuits that pass through every vertex once and only once (Hamilton circuits).

Because this graph has one Hamilton circuit tour…

A cubic graph with an HC highlighted

Figure 11 (The dark edges show a Hamilton circuit in a 3-valent (cubic) graph.)
 

…by Smith's Theorem it must have another! (Pick any edge of the circuit shown: that edge lies on an even number of Hamilton circuits, and it already lies on one, so it must lie on at least one more.)
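The parity in Smith's Theorem is easy to see in a tiny example by direct computation. The sketch below (my own illustration, not taken from Smith's work) lists the Hamilton circuits of the complete graph $K_4$, which is cubic, and counts how many pass through each edge; every count comes out even.

```python
# Illustrate Smith's theorem on K4: count the Hamilton circuits through each edge.
# (K4 is complete, so every cyclic arrangement of its vertices is a circuit;
# for a general graph one would also have to check adjacency.)
from itertools import permutations

vertices = [0, 1, 2, 3]
edges = [(a, b) for a in vertices for b in vertices if a < b]    # the 6 edges of K4

# Enumerate each Hamilton circuit once: fix 0 as the start and identify a
# circuit with its reversal.
circuits = set()
for perm in permutations(vertices[1:]):
    cycle = (0,) + perm
    circuits.add(min(cycle, (0,) + perm[::-1]))

for a, b in edges:
    count = 0
    for cycle in circuits:
        cycle_edges = {frozenset((cycle[i], cycle[(i + 1) % 4])) for i in range(4)}
        if frozenset((a, b)) in cycle_edges:
            count += 1
    print((a, b), count)    # each of the 6 edges lies on 2 Hamilton circuits: an even number
```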

Rowland Leonard Brooks (1916-1993)

Rowland Brooks is best known for a theorem about coloring graphs that is named for him. Let me begin with an application and show how Brooks' Theorem gives insight into such situations.

Suppose we have a collection of committees, say, committees of students and faculty at a college. The collection of individuals who are on some committee will be denoted by S. The committee names are a, b, …, h and a table (Figure 12) indicates with an X when a pair of committees has one or more members of S on both committees. It would be nice to be able to schedule the 8 committees into as few hourly time slots as possible so that no two committees that have a common member meet at the same time. Committees with no common members could meet at the same time. A person who serves on two committees can't attend both meetings if they are held at the same time.

A table of committee conflicts data

Figure 12 (A table indicating, with an X, when two committees have one or more people in common and therefore should not meet at the same time.)
 

We can display the conflict information geometrically using a graph by having a dot for each committee and joining two vertices representing committees with an edge if these committees can't meet at the same time because they have members in common.

A graph coding the previous committee conflict data

Figure 13 (A conflict graph for 8 committees. Committees joined by an edge should have their meetings at different times. The graph can be vertex colored with 4 colors.)
 

The vertex coloring problem for graphs assigns a label, called a color, to each vertex, with the requirement that two vertices joined by an edge get different colors. We could assign a different color to each dot (vertex) in Figure 13, which would mean that we would use 8 time slots for the committee meetings. The minimum number of colors needed to color the vertices of a graph is called the chromatic number of the graph. Brooks found what is often a good estimate for the chromatic number of a graph.

Brooks' Theorem:
If G is a connected graph which is neither a complete graph nor a circuit of odd length, then the chromatic number of G is at most (less than or equal to) the largest degree (valence) that occurs in G.

For a general graph it is a computationally difficult problem to determine the chromatic number, but Brooks' Theorem often allows one to get a good estimate quickly. In particular, if a connected graph has a million vertices, then no matter how complex the graph, if its maximum valence is 5 it will not require more than 5 colors (though the chromatic number may be less).
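Brooks' Theorem is a statement about the existence of a coloring; in practice one often runs a simple greedy procedure, which is only guaranteed to use at most one more color than the largest degree. The sketch below colors the vertices of a made-up conflict graph (the data behind Figures 12 and 13 is not reproduced here, so this graph is hypothetical), giving each vertex the smallest color not already used by a neighbor.

```python
# Greedy vertex coloring (a sketch).  Greedy uses at most (max degree + 1) colors;
# Brooks' Theorem guarantees that max degree colors suffice for a connected graph
# that is neither complete nor an odd circuit.
# The conflict graph below is hypothetical, not the one shown in Figure 13.

conflicts = {
    'a': {'b', 'c', 'e'},
    'b': {'a', 'c', 'd'},
    'c': {'a', 'b', 'f'},
    'd': {'b', 'e', 'g'},
    'e': {'a', 'd', 'h'},
    'f': {'c', 'g'},
    'g': {'d', 'f', 'h'},
    'h': {'e', 'g'},
}

def greedy_coloring(graph):
    """Assign each vertex the smallest color (0, 1, 2, ...) not used by a colored neighbor."""
    coloring = {}
    for v in graph:                      # any vertex order works; the order affects the answer
        used = {coloring[u] for u in graph[v] if u in coloring}
        color = 0
        while color in used:
            color += 1
        coloring[v] = color
    return coloring

coloring = greedy_coloring(conflicts)
print(coloring)
print("time slots used:", len(set(coloring.values())))   # at most max degree + 1 = 4 here
```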

Can you find a coloring of the vertices of the graph in Figure 13 that improves on the estimate given by Brooks' Theorem?

Arthur Stone (1916-2000)

Arthur Stone eventually made his way to the United States and taught for many years at the University of Rochester. His wife Dorothy (Dorothy Maharam) was also a mathematician, and his son David (who died in 2014) and daughter Ellen also became mathematicians.

While Stone contributed to mathematics in a variety of theoretical ways perhaps he is best known for his invention of what are known as flexagons. The hexaflexagon shown in Figure 14 can be folded from the triangles shown in Figure 15.

A hexaflexagon

Figure 14 (An example of a hexaflexagon, image courtesy of Wikipedia.)
 

A template of triangles which folds to a hexaflexagon

Figure 15 (A template for a hexaflexagon, image courtesy of Wikipedia.)
 

While the members of Blanche Descartes published several articles, most of these publications were in fact the work of William Tutte. I now turn to a more recent example of a group of mathematicians who published together under one name.

G.W. Peck

G. W. Peck's publications can be found here. Unlike Blanche Descartes, whose members are now all deceased, three of the six mathematicians who have sometimes written under the pseudonym G.W. Peck are still alive (2020). What associations do you have with this name? Perhaps you might think of the distinguished actor Gregory Peck, but Eldred Gregory Peck (April 5, 1916 – June 12, 2003), the actor, did not write under the name G.W. Peck. The individuals involved in G.W. Peck's publications were Ronald Graham, Douglas West, George Purdy, Paul Erdős, Fan Chung, and Daniel Kleitman; the pseudonym's initials come from their surnames: Graham, West, Purdy, Erdős, Chung, Kleitman.

Let me make a few brief comments about each member of this remarkable collaboration of distinguished mathematicians in turn. As individuals they illustrate the varied ways that people have mathematical careers.

Ron (Ronald) Graham (1935-2020)

Ronald Graham

Figure 16 (A photo of Ronald Graham, courtesy of Wikipedia.)
 

Some people perhaps know Ron Graham best for his juggling and for studying mathematical problems associated with juggling. However, before his recent death he was one of America's best known research mathematicians. Some of Graham's fame beyond the mathematics community came from the fact that Martin Gardner, a prolific popularizer of mathematics, often wrote about ideas that Graham called to his attention. Somewhat unusually for a mathematician known for theoretical work, Ron Graham also had a career as a distinguished applied mathematician. He spent many years at Bell Laboratories, where he worked at times under the direction of Henry Pollak, who in addition to his work at Bell Laboratories also served as a President of NCTM and MAA. Graham during his career was President of the American Mathematical Society and of the MAA. He also was head of the Mathematics Section (the position no longer exists) of the New York Academy of Sciences. During his long stay at Bell Laboratories he championed examples of discrete mathematical modeling. This included popularizing the use of graphs and digraphs to solve problems involved in scheduling machines efficiently. He also called attention to the way that problems about packing bins with weights are connected to machine scheduling problems.

Graham published on a vast array of topics ranging from very technical to expository articles. His many contributions to mathematics can be sampled here.

Douglas West (1953- )

Douglas West

Figure 17 (A photo of Douglas West, courtesy of Wikipedia.)
 

In addition to his many research papers, West has published several comprehensive books in the areas of discrete mathematics and combinatorics. He is also noteworthy for maintaining and collecting lists of unsolved problems in the general area of discrete mathematics. In recent years he has helped edit the Problems Section of the American Mathematical Monthly.

George Purdy (1944-2017)

Purdy made many contributions to discrete geometry. He was particularly interested in questions related to arrangements of lines and the Sylvester-Gallai Theorem.

Paul Erdős (1913-1996)

The best known author of this group was Paul Erdős (1913-1996), whose career was marked by not holding a permanent academic or industrial position. For much of his life Erdős traveled from one location to another to work privately with individuals and to give public lectures, where he often promoted easy-to-state but typically hard-to-solve problems in geometry, combinatorics and number theory.

Paul Erdos

Figure 18 (A photo of Paul Erdős, courtesy of Wikipedia.)
 

Here is a sample problem in discrete geometry that Paul Erdős called attention to, whose investigation over the years has sprouted many new techniques and ideas. It has also generated new questions as aspects of the initial problem were settled. Suppose that $P$ is a finite set of $n$ distinct points in the plane, with the distance between points computed using Euclidean distance. One must be careful to specify which distance is to be used because one could instead compute the result using another distance function, say, taxicab distance. Let $D(n)$ denote the set of numbers that occur as distances between pairs of points in $P$. Erdős raised questions such as:

  1. What is the largest possible number of elements in $D(n)$?
  2. What is the smallest possible number of elements in $D(n)$?
  3. Can all of the numbers in $D(n)$ be integers?
  4. Can all of the integers in $D(n)$ be perfect squares?

Such problems have come to be known as distinct distance questions. Question 1 is quite easy. However, question 2 is not fully understood even today. Initially Erdős was concerned with the behavior of the smallest number of values that could occur in $D(n)$ as $n$ got larger and larger. As the tools for getting insight into such problems have grown, along with interesting examples with particular properties, many new questions have come to be asked. For example, if the points of the original set lie in equal numbers on two disjoint circles, what is the smallest number of distinct distances that can occur?

Here is a problem in this spirit to contemplate. In each of Figures 19 and 20 we have a configuration of 20 points, together with some of the line segments that join up these points when we think of the diagrams as graphs. Figure 19 shows a polyiamond, a configuration of congruent equilateral triangles that meet edge to edge, while Figure 20 shows a polyomino, congruent squares which meet edge to edge.

A polyiamond with 20 vertices

Figure 19 (A polyiamond with 20 vertices.)
 

A polyomino with 20 vertices

Figure 20 (A polyomino with 20 vertices)
 

Question: Investigate the number of distinct distances and integer distances that can occur in each of the configurations in Figures 19 and 20. Can you formulate interesting questions to ask about distances based on these two configurations and your thinking about them?
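To experiment with such questions by machine, the short sketch below computes the distances among the 20 points of a 4 x 5 grid of lattice points, a stand-in of my own choosing for the vertices of a polyomino like the one in Figure 20. Squared distances are kept as exact integers to avoid rounding, and the script reports how many distinct distances occur and which of them are whole numbers.

```python
# Count distinct distances in a finite point set (a sketch).  The point set is a
# 4 x 5 grid of lattice points, chosen as a simple stand-in for the 20 vertices
# of a polyomino like the one in Figure 20.
from itertools import combinations
from math import isqrt

points = [(x, y) for x in range(4) for y in range(5)]          # 20 points

squared = {(p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2             # squared distances, exact
           for p, q in combinations(points, 2)}

integer_distances = sorted(isqrt(d) for d in squared if isqrt(d) ** 2 == d)

print("point pairs:", len(points) * (len(points) - 1) // 2)    # 190
print("distinct distances:", len(squared))                     # 13 for this grid
print("distances that are integers:", integer_distances)       # [1, 2, 3, 4, 5]
```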

Fan Chung (1949- )

Paul Erdös, Ronald Graham and Fan Chung

Figure 21 (A photo of Ron Graham, Paul Erdős and Fan Chung, photo courtesy of Wikipedia.)
 

Fan Chung is an extremely distinguished mathematician. She was born in Taiwan but did her graduate studies in America, and Herbert Wilf was her doctoral thesis advisor. She was the wife of Ronald Graham, with whom she wrote many joint papers. Like her husband, she worked in both the industrial world and in academia. She worked at Bell Laboratories and at Bellcore after the breakup of the Bell System. Her research covers an unusually broad range of topics, but includes especially important work in number theory, discrete mathematics, and combinatorics. Fan Chung and Ron Graham both made important contributions to Ramsey Theory. Fan Chung still teaches at the University of California, San Diego.

Daniel Kleitman (1934- )

Daniel Kleitman

Figure 22 (A photo of Daniel Kleitman, courtesy of Wikipedia.)
 

The person who perhaps has most often published under the G.W. Peck moniker has been Daniel Kleitman, whose long career has been associated with MIT. Many of his publications are in the area of combinatorics and discrete mathematics, in particular results about partially ordered sets.

Concluding Thoughts

To the extent that there are individuals who can't study or publish mathematics because of discrimination, we can all work to tear down these barriers. Mathematical progress needs all the help it can get, and denying people the fulfillment that comes from doing mathematics is most unfortunate. While in the past pseudonyms were sometimes chosen to avoid prejudicial treatment, we can hope this will seem less and less necessary in the future.

References

Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society's MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be found via the ACM Digital Library, which also provides bibliographic services.

  • Appel, K., and W. Haken, J. Koch, Every Planar Map is Four Colorable. I: Discharging. Illinois J. Math. 21 (1977) 429-490.
  • Appel, K. and Haken, W. 1977 Every Planar Map is Four-Colorable, II: Reducibility. Illinois J. Math. 21, 491-567.
  • Ball, Derek Gordon. Mathematics in George Eliot's Novels. PhD thesis, University of Leicester, 2016.
  • Brooks, R., and C. Smith, A. Stone, W. Tutte, The Dissection of Rectangles into Squares, Duke Mathematical Journal, 7 (1940) 312-40.
  • Chiba, N. and T. Nishizeki, "The hamiltonian cycle problem is linear-time solvable for 4-connected planar graphs." Journal of Algorithms 10.2 (1989): 187-211.
  • Chung, Fan RK, Lectures on spectral graph theory, CBMS Lectures, Fresno, 92 (1996): 17-21.
  • Chung, Fan, and R. Graham, Sparse quasi-random graphs, Combinatorica 22 (2002): 217-244.
  • Descartes, Blanche, Network colorings, Math. Gaz. 32 (1948) 67-69.
  • Erdős, P., and A. H. Stone, On the structure of linear graphs, Bull. Amer. Math. Soc. 52 (1946) 1087-1091.
  • Erdős, Paul, and George Purdy. "Some extremal problems in geometry." Discrete Math. 1974.
  • Graham, R., The Combinatorial Mathematics of Scheduling, Scientific American 238 (1978), 124-132.
  • Graham, R., Combinatorial Scheduling Theory, in Mathematics Today: Twelve Informal Essays, ed. L. Steen, Springer-Verlag, N.Y. (1978), 183-211.
  • Henle, M. and B. Hopkins, eds. Martin Gardner in the twenty-first century. Vol. 75. American Mathematical Soc., Providence, 2012.
  • Kleitman, D., On an extremal property of antichains in partial orders. The LYM property and some of its implications and applications, Combinatorics, 2 (1974) 77-90.
  • Kleitman, Daniel J., and Douglas B. West. "Spanning trees with many leaves." SIAM Journal on Discrete Mathematics 4.1 (1991): 99-106.
  • Purdy, George. "Two results about points, lines and planes." Discrete mathematics 60 (1986): 215-218.
  • Robertson, N. and D. Sanders, P. Seymour, R. Thomas, New Proof of the Four Colour Theorem. Electronic Research Announcement, Amer. Math. Soc. 2, (1996) 17-25.
  • Solymosi, J., and C. Tóth, Distinct distances in the plane, Discrete & Computational Geometry 25 (2001): 629-634.
  • Smith, C., and S. Abbott, The story of Blanche Descartes, Mathematical Gazette (2003) 23-33.
  • Stone, A. H., Paracompactness and product spaces, Bulletin of the American Mathematical Society 54 (1948): 977-982.
  • Székely, L., Crossing numbers and hard Erdős problems in discrete geometry, Combinatorics, Probability and Computing 6 (1997): 353-358.
  • Tutte, W. T., Squaring the Square, Mathematical Games column, ed. M. Gardner, Scientific American, Nov. 1958.
  • West, D., Introduction to Graph Theory. Vol. 2. Upper Saddle River, NJ: Prentice Hall, 1996.

Completing the Square

The prehistory of the quadratic formula

Completing the square is the essential ingredient in the generation of our handy quadratic formula. And for all the years before the formula existed, that was how quadratic equations were solved…

Tony Phillips
Stony Brook University

Quadratic equations have been considered and solved since Old Babylonian times (c. 1800 BC), but the quadratic formula students memorize today is an 18th century AD development. What did people do in the meantime?

The difficulty with the general quadratic equation ($ax^2+bx+c=0$ as we write it today) is that, unlike a linear equation, it cannot be solved by arithmetic manipulation of the terms themselves: a creative intervention, the addition and subtraction of a new term, is required to change the form of the equation into one which is arithmetically solvable. We call this maneuver completing the square.

Here is how students learn about the maneuver today:

The coefficient $a$ is non-zero (or else the equation is linear) so you can divide through by $a$, giving $x^2 +\frac{b}{a}x + \frac{c}{a} =0$. Now comes the maneuver: you add and subtract $\frac{b^2}{4a^2}$, giving $x^2+\frac{b}{a}x +\frac{b^2}{4a^2} -\frac{b^2}{4a^2}+ \frac{c}{a} =0.$ Now the first three terms are a perfect square: $x^2+\frac{b}{a}x +\frac{b^2}{4a^2} = (x+\frac{b}{2a})^2$: you have completed the first two terms to a square. Instead of having both the unknown $x$ and its square $x^2$ in the equation, you have only the second power $(x+\frac{b}{2a})^2$ of a new unknown. Now arithmetic can do the rest. The equation gets rearranged as $(x+\frac{b}{2a})^2 = \frac{b^2}{4a^2} - \frac{c}{a}$ so $x+\frac{b}{2a} = \pm\sqrt{\frac{b^2}{4a^2} - \frac{c}{a}} = \pm\sqrt{\frac{b^2-4ac}{4a^2}}=\pm\frac{\sqrt{b^2-4ac}}{2a}$, giving the solution $x=-\frac{b}{2a}\pm\frac{\sqrt{b^2-4ac}}{2a} =\frac{-b\pm\sqrt{b^2-4ac}}{2a}$. This expression of the solution to the equation as an algebraic function of the coefficients is known as the quadratic formula.
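As a quick sanity check of the algebra, here is a small Python sketch that follows exactly the steps above, completing the square numerically rather than calling a library root finder. The sample equation, $x^2-7x-60=0$, is the modern form of the Babylonian problem discussed later in this column.

```python
# Solve a x^2 + b x + c = 0 by the completing-the-square steps described above
# (a sketch assuming real coefficients, a nonzero, and a nonnegative discriminant).
from math import sqrt

def roots_by_completing_the_square(a, b, c):
    p, q = b / a, c / a              # divide through by a:  x^2 + p x + q = 0
    half = p / 2                     # add and subtract (p/2)^2 to complete the square:
    rhs = half ** 2 - q              # (x + p/2)^2 = (p/2)^2 - q
    root = sqrt(rhs)                 # assumes rhs >= 0
    return (-half + root, -half - root)

print(roots_by_completing_the_square(1, -7, -60))   # (12.0, -5.0) for x^2 - 7x - 60 = 0
```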

So completing the square is the essential ingredient in the generation of our handy quadratic formula. And for all the years before the formula existed, that was how quadratic equations were solved. The maneuver was originally explicitly geometric: numbers were identified with areas, and a certain L-shaped polygon was completed to a geometric square. With the development of algebra the job could be done by manipulation of symbols, but for about a millennium the symbolic solution was always buttressed by a geometric argument, as if the algebra alone could not be trusted. Meanwhile our conception of the numbers that the symbols represented evolved, but very slowly. Negative solutions were considered “false” even by Descartes (1596-1650), but once they were accepted, the identification of equations with actual geometric problems eventually evaporated: in Euler’s Algebra (1770) it is gone without a trace.

In this column I will explore some episodes in this history, starting with the Old Babylonians and ending with an occurrence of square-completion in an important calculation in 20th century physics.

  • In Old Babylonian Mathematics
  • The connection to Plimpton 322
  • In Islamic mathematics
  • Aryabhata and Brahmagupta
  • In Fibonacci’s Liber Abaci
  • In Simon Stevin’s L’Arithmétique
  • Descartes’ La Géométrie
  • In Maria Agnesi’s Instituzioni Analitiche …
  • In Leonhard Euler’s Vollständige Anleitung zur Algebra
  • In 20th century physics

    In Old Babylonian Mathematics

    Obverse of tablet YBC 6967
    Reverse of tablet YBC 6967

    The Yale Babylonian Collection’s tablet YBC 6967, as transcribed in Neugebauer and Sachs, Mathematical Cuneiform Texts, American Oriental Society, New Haven, 1986. Size 4.5 $\times$ 6.5cm.

    The Old Babylonian tablet YBC 6967 (about 1900 BC) contains a problem and its solution. Here is a line-by-line literal translation from Jöran Friberg, A Remarkable Collection of Babylonian Mathematical Texts, (Springer, New York, 2007).

    The igi.bi over the igi 7 is beyond.
    The igi and the igi.bi are what?
    You:
    7 the the igi.bi over the igi is beyond
    to two break, then 3 30.
    3 30 with 3 30 let them eat each other, then 12 15
    To 12 15 that came up for you
    1, the field, add, then 1 12 15.
    The equalside of 1 12 15 is what? 8 30.
    8 30 and 8 30, its equal, lay down, then
    3 30, the holder,
    from one tear out,
    to one add.
    one is 12, the second 5.
    12 is the igi.bi, 5 the igi.

    The Old Babylonians used a base-60 floating-point notation for numbers, so that the symbol corresponding to 1 can represent for example 60 or 1 or 1/60. In the context of YBC 6967, the reciprocal numbers, the igi and the igi.bi, have product 1 0. Their difference is given as 7.

    Robson's YBC 6967 solution diagram

    Here is a diagram of the solution to the YBC 6967 problem, adapted from Eleanor Robson’s “Words and Pictures: New Light on Plimpton 322” (MAA Monthly, February 2002, 105-120). Robson uses a semi-colon to separate the whole and the fractional part of a number, but this is a modern insertion for our convenience. The two unknown reciprocals are conceptualized as the sides of a rectangle of area (yellow) 1 0 [or 60 in decimal notation]. A rectangle with one side 3;30 [$= 3\frac{1}{2}$] is moved from the side of the figure to the top, creating an L-shaped figure of area 1 0 which can be completed to a square by adding a small square of area 3;30 $\times$ 3;30 = 12;15 [$= 12\frac{1}{4}$]. The area of the large square is 1 0 + 12;15 = 1 12;15 [$ = 72\frac{1}{4}$] with square root 8;30 [$=8\frac{1}{2}$]. It follows that our unknown reciprocals must be 8;30 + 3;30 = 12 and 8;30 − 3;30 = 5 respectively.

    In modern notation, the YBC 6967 problem would be $xy = 60, x-y = 7$, or $x^2-7x-60=0$. In this case the term to be added in completing the square is $\frac{b^2}{4a^2}=\frac{49}{4}=12\frac{1}{4}$, corresponding exactly to the area of the small square in the diagram.
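    Readers who want to replay the scribe's arithmetic can do so in decimal; the sketch below carries out the steps of the tablet ("break in two," "let them eat each other," add the "field," take the "equalside," then "tear out" and "add") for two reciprocals with product 60 and difference 7.

```python
# Replay the YBC 6967 recipe in decimal arithmetic (a modern sketch of the
# scribe's steps, not a transcription of the tablet).
from math import sqrt

product, difference = 60, 7      # the igi and the igi.bi multiply to 60 and differ by 7

half = difference / 2            # "to two break":             3;30  -> 3.5
square = half * half             # "let them eat each other":  12;15 -> 12.25
total = square + product         # "add the field":            1 12;15 -> 72.25
side = sqrt(total)               # "the equalside":            8;30  -> 8.5
larger = side + half             # "to one add":               12
smaller = side - half            # "from one tear out":        5

print(larger, smaller)           # 12.0 5.0, and indeed 12 * 5 = 60
```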

     

    This tablet, and the several related ones from the same period that exist in various collections (they are cataloged in Friberg’s book mentioned above), are significant because they hold a piece of real mathematics: a calculation that goes well beyond tallying to a creative attack on a problem. It should also be noted that none of these tablets contains a figure, even though Old Babylonian tablets often have diagrams. It is as if those mathematicians thought of “breaking,” “laying down” and “tearing out” as purely abstract operations on quantities, despite the geometrical/physical language and the clear (to us) geometrical conceptualization.

    The connection to Plimpton 322

    Plimpton 322
    The Old Babylonian tablet Plimpton 322 (approx. $13 \times 9$ cm) is in the Columbia University Library. Image courtesy of Bill Casselman (larger image). For scholarly discussions of its content see Friberg’s book or (available online) Eleanor Robson’s article referenced above.

    There have been many attempts at understanding why Plimpton 322 was made and also how that particular set of numbers was generated. It has been described as a list of Pythagorean triples and as an exact sexagesimal trigonometry table. An interpretation mentioned by R. Creighton Buck and attributed to D. L. Voils (in a paper that was never published) is “that the Plimpton tablet has nothing to do with Pythagorean triplets or trigonometry but, instead, is a pedagogical tool intended to help a mathematics teacher of the period make up a large number of igi-igibi quadratic equation exercises having known solutions and intermediate solution steps that are easily checked” (igi, igi.bi problems involve a number and its reciprocal; the one on YBC 6967 is exactly of this type).

    In the solution of the problem of YBC 6967, we squared 3;30 to get 12;15, added 1 to get $1\, 12;15$ and took the square root to get 8;30. Then the solutions were 8;30 + 3;30 and 8;30 $-$ 3;30. So to set up an igi, igi.bi problem which will work out evenly we need a square like 12;15 which, when a 1 is placed in the next base-60 place to the left, becomes another square.

    Now the first column of Plimpton 322 contains exactly numbers of this type. For example, in row 5, the first undamaged row, the column I entry is $48\, 54\, 01\, 40$ [=10562500] with square root $54\, 10$ [=3250]. Adding 1 gives $1\, 48\, 54\, 01\, 40$ [=23522500] with square root $80\, 50$ [=4850]. The corresponding igi, igi.bi problem would ask for two reciprocals differing by two times $54\, 10$, i.e. $1\, 48\, 20$; the answer would be $80\, 50 + 54\, 10 = 2\, 15\, 00$ and $80\, 50 - 54\, 10 = 26\, 40$.
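    The base-60 bookkeeping in the last paragraph is easy to mistype, so here is a short sketch that converts the quoted sexagesimal strings to ordinary integers and checks each step (the digits are the ones given above, written as space-separated base-60 "digits" in decimal).

```python
# Check the row-5 computation by converting sexagesimal digit lists to integers.
from math import isqrt

def from_sexagesimal(digits):
    """Interpret a list of base-60 digits, most significant first, as an integer."""
    value = 0
    for d in digits:
        value = 60 * value + d
    return value

col1 = from_sexagesimal([48, 54, 1, 40])              # 10562500
assert isqrt(col1) == 3250 and 3250 ** 2 == col1      # square root 54 10

with_one = from_sexagesimal([1, 48, 54, 1, 40])       # 23522500: a 1 in the next place
assert isqrt(with_one) == 4850 and 4850 ** 2 == with_one   # square root 80 50 (i.e. 1 20 50)

x, y = 4850 + 3250, 4850 - 3250                       # the two reciprocals
assert x - y == from_sexagesimal([1, 48, 20])         # they differ by 1 48 20
assert x == from_sexagesimal([2, 15, 0]) and y == from_sexagesimal([26, 40])
print(x, y, x * y)    # 8100 1600 12960000; the product is 60**4, i.e. "1" in floating sexagesimal
```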

    Unfortunately, neither the problem on YBC 6967 nor any of the five igi, igi.bi problems recorded by Friberg from tablet MS 3971 corresponds to parameters on Plimpton 322. It is possible that lines on the bottom and reverse of the tablet mean that it was supposed to be extended by an additional 20 or so rows, where those missing parameters would appear. In fact, none of the proposed explanations is completely satisfactory. As Robson remarked, “The Mystery of the Cuneiform Tablet has not yet been fully solved.”

    In Islamic Mathematics

    Solving quadratic equations by completing the square was treated by Diophantus (c.200-c.285 AD) in his Arithmetica, but the explanations are in the six lost books of that work. Here we’ll look at the topic as covered by Muhammad ibn Musa, born in Khwarazm (Khiva in present-day Uzbekistan) and known as al-Khwarizmi, in his Compendium on Calculation by Completion and Reduction dating to c. 820 AD. (I’m using the translation published by Frederic Rosen in 1831). Negative numbers were still unavailable, so al-Khwarizmi, to solve a general quadratic, has to consider three cases. In each case he supposes a preliminary division has been done so the coefficient of the squares is equal to one (“Whenever you meet with a multiple or sub-multiple of a square, reduce it to the entire square”).

    1. “roots and squares are equal to numbers” [$x^2 + bx = a$]
    2. “squares and numbers are equal to roots” [$x^2 +a = bx$]
    3. “roots and numbers are equal to squares” [$x^2=bx+a$]

    Case 1. al-Khwarizmi works out a specific numerical example, which can serve as a template for any other equation of this form: “what must be the square which, when increased by ten of its roots, amounts to thirty-nine.”

    “The solution is this: you halve the number of the roots, which in the present case equals five. This you multiply by itself; the product is twenty-five. Add this to thirty-nine, the sum is sixty-four. Now take the root of this, which is eight, and subtract from it half the number of the roots, which is five; the remainder is three. This is the root of the square which you sought for; the square itself is nine.”

    Note that this is exactly the Old Babylonian recipe, updated from $x(x+7)=60$ to $x^2 +10x = 39$, and that the figure Eleanor Robson uses for her explanation is essentially identical to the one al-Khwarizmi gives for his second demonstration, reproduced here:

    al-Khwarizmi's first figure

    “We proceed from the quadrate AB, which represents the square. It is our next business to add to it the ten roots of the same. We halve for this purpose the ten, so it becomes five, and construct two quadrangles on two sides of the quadrate AB, namely, G and D, the length of each of them being five, as the moiety of the ten roots, whilst the breadth of each is equal to a side of the quadrate AB. Then a quadrate remains opposite the corner of the quadrate AB. This is equal to five multiplied by five: this five being half of the number of roots which we have added to each side of the first quadrate. Thus we know that the first quadrate, which is the square, and the two quadrangles on its sides, which are the ten roots, make together thirty-nine. In order to complete the great quadrate, there wants only a square of five multiplied by five, or twenty-five. This we add to the thirty-nine, in order to complete the great square SH. The sum is sixty-four. We extract its root, eight, which is one of the sides of the great quadrangle. By subtracting from this the same quantity which we have before added, namely five, we obtain three as the remainder. This is the side of the quadrangle AB, which represents the square; it is the root of this square, and the square itself is nine.”

    Case 2. “For instance: ‘a square and twenty-one in numbers are equal to ten roots of the same square.’”

    Solution: Halve the number of the roots; the moiety is five. Multiply this by itself; the product is twenty-five. Subtract from this the twenty-one which are connected with the square; the remainder is four. Extract its root; it is two. Subtract this from the moiety of the roots, which is five; the remainder is three. This is the root of the square you required, and the square is nine.

    Here is a summary of al-Khwarizmi’s demonstration. The last of the four figures appears (minus the modern embellishments) in Rosen, p. 18.

    al-Khwarizmi's demonstration

    The problem set up geometrically. I have labeled the unknown root $x$ for modern convenience. The square ABCD has area $x^2$, the rectangle CHND has area $10x$, the rectangle AHNB has area 21, so $x^2+21=10x$.


    al-Khwarizmi's demonstration part 2

    The side CH is divided in half at G, so the segment AG measures $5-x$. The segment TG parallel to DC is extended by GK with length also $5-x$. Al-Khwarizmi says this is done “in order to complete the square.”


    al-Khwarizmi's demonstration part 3

    The segment TK then measures 5, so the figure KMNT, obtained by drawing KM parallel to GH and adding MH, is a square with area 25.


    al-Khwarizmi's demonstration part 4

    Measuring off KL equal to KG, and drawing LR parallel to KG leads to a square KLRG. Since HR has length $5-(5-x)=x$ the rectangles LMHR and AGTB have the same area, so the area of the region formed by adding LMHR to GHNT is the same as that of the rectangle formed by adding AGTB to GHNT, i.e. 21. And since that region together with the square KLRG makes up the square KMNT of area 25, it follows that the area of KLRG is $25-21=4$, and that its side-length $5-x$ is equal to 2. Hence $x=3$, and the sought-for square is 9.

     

    Al-Khwarizmi remarks that if you add that 2 to the length of CG then “the sum is seven, represented by the line CR, which is the root to a larger square,” and that this square is also a solution to the problem.

    Case 3. Example: “Three roots and four simple numbers are equal to a square.”

    Solution: “Halve the roots; the moiety is one and a half. Multiply this by itself; the product is two and a quarter. Add this to the four; the sum is six and a quarter. Extract its root; it is two and a half. Add this to the moiety of the roots, which was one and a half; the sum is four. This is the root of the square, and the square is sixteen.”

    As above, we summarize al-Khwarizmi’s demonstration. The last figure minus decoration appears on Rosen, p. 20.

    al-Khwarizmi's demonstration with total square

    We represent the unknown square as ABDC, with side-length $x$. We cut off the rectangle HRDC with side-lengths 3 and $x$. Since $x^2 = 3x + 4$ the remaining rectangle ABRH has area 4.


    al-Khwarizmi's demonstration with total square part 2

    Halve the side HC at the point G, and construct the square HKTG. Since HG has length $1\frac{1}{2}$, the square HKTG has area $2\frac{1}{4}$.


    al-Khwarizmi's demonstration with total square part 3

    Extend CT by a segment TL equal to AH. Then the segments GL and AG have the same length, so drawing LM parallel to AG gives a square AGLM. Now TL = AH = MN and NL = HG = GC = BM, so the rectangles MBRN and KNLT have equal area, and so the region formed by AMNH and KNLT has the same area as ABRH, namely 4. It follows that the square AMLG has area $4+2\frac{1}{4}=6\frac{1}{4}$ and consequently side-length AG = $2\frac{1}{2}$. Since GC = $1\frac{1}{2}$, it follows that $x = 2\frac{1}{2} + 1\frac{1}{2} = 4$.
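    All three of al-Khwarizmi's verbal recipes reduce to the same few arithmetic steps. The sketch below carries them out in decimal (a modern paraphrase, not a transcription) and reproduces the numerical answers of his three worked examples.

```python
# al-Khwarizmi's three recipes in decimal arithmetic (a modern paraphrase).
# In each case the coefficient of the square is already 1, as he requires.
from math import sqrt

def case1(b, a):     # "roots and squares are equal to numbers":  x^2 + b x = a
    half = b / 2
    return sqrt(half ** 2 + a) - half

def case2(b, a):     # "squares and numbers are equal to roots":  x^2 + a = b x
    half = b / 2     # two positive roots, provided (b/2)^2 >= a
    return (half - sqrt(half ** 2 - a), half + sqrt(half ** 2 - a))

def case3(b, a):     # "roots and numbers are equal to squares":  x^2 = b x + a
    half = b / 2
    return half + sqrt(half ** 2 + a)

print(case1(10, 39))   # 3.0       : a square plus ten of its roots amounts to thirty-nine
print(case2(10, 21))   # (3.0, 7.0): a square and twenty-one equal ten roots
print(case3(3, 4))     # 4.0       : three roots and four numbers equal a square
```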


    Aryabhata and Brahmagupta

    The study of quadratic equations in India dates back to Aryabhata (476-550) and Brahmagupta (598-c.665). Aryabhata’s work on the topic was referenced at the time but is now lost; Brahmagupta’s has been preserved. He gives a solution algorithm in words (in verse!) which turns out to be equivalent to part of the quadratic formula—it only gives the root involving $+$ the radical. Here’s Brahmagupta’s rule with a translation, from Brahmagupta as an Algebraist (a chapter of Brahmasphutasiddhanta, Vol. 1):

    Brahmagupta's quadratic formula figure

    “The quadratic: the absolute quantities multiplied by four times the coefficient of the square of the unknown are increased by the square of the coefficient of the middle (i.e. unknown); the square-root of the result being diminished by the coefficient of the middle and divided by twice the coefficient of the square of the unknown, is (the value of) the middle.”

    There is only the rule, and no indication of how it was derived.

    In Fibonacci’s Liber Abaci

    The third section of chapter 15 of the Liber Abaci, written by Fibonacci (Leonardo of Pisa, c. 1170-c. 1245) in 1202, concerns “Problems of Algebra and Almuchabala,” referring directly to the Arabic words for “Completion and Reduction” in the title of al-Khwarizmi’s compendium. I am using the translation of Liber Abaci by L. E. Sigler, Springer, 2002. In that section, a short introduction presenting the methods is followed by a collection of over a hundred problems.

    Fibonacci follows al-Khwarizmi in distinguishing three “modes” of compound problems involving a square (“census”) and its roots (his second mode corresponds to al-Khwarizmi’s case 3 and vice-versa).

    • “the first mode is when census plus roots are equal to a number.” The example Fibonacci gives, “the census plus ten roots is equal to 39,” is inherited directly from al-Khwarizmi. But the demonstration he gives is different: he starts with a square $abcd$ of side length greater than 5, marks off lengths of 5 in two directions from one corner and thus partitions $abcd$ into a square of area 25, two rectangles and another square, which he identifies with the unknown census. Then the two rectangles add up to 10 times the root, but since “the census plus 10 roots is equal to 39 denari” the area of $abcd$ must be $25+39=64$, so its side length is 8, our root is $8-5=3$ and the census is 9. The figure and the calculation are essentially the same as al-Khwarizmi’s, but the “completion” narrative has disappeared.
    • “let the census be equal to 10 roots plus 39 denari.” Here the example is new (figure from Sigler, p. 557). Fibonacci starts by representing the census as a square $abcd$ of side length larger than 10. He draws a segment $fe$ parallel to $ab$ so that the segments $fd$ and $ec$ each have length 10 and assigns a midpoint $g$ to $ec$. So the ten roots will be equal to the area of $fecd$, and the area of the rectangle $abfe$, which is also $fe\times eb$, is 39. But $fe=bc$, therefore $be \times bc = 39$. “If to this is added the square of the line segment $eg$, then 64 results for the square of the line segment $bg$.” [$bg^2 = be^2 + 2be\times eg +eg^2$ $= be\times(be + 2eg)+ eg^2 = be\times(be+eg+gc)+eg^2$ $= be\times bc +eg^2 = 39+25=64$]. So $bg = 8$ and $bc = bg+gc = 8+5=13$ is the root of the sought census, and the census is 169.
    • “when it will occur that the census plus a number will equal a number of roots, then you know that you can operate whenever the number is equal to or less than the square of half the number of roots.” Here Fibonacci goes beyond al-Khwarizmi in remarking that (in modern notation) $x^2+c=bx$ has no solution if $c>(b/2)^2$. The example he gives is “let the census plus 40 equal 14 roots” (figures from Sigler, pp. 557, 558). Fibonacci gives two constructions for the two positive roots. In each case he starts with a segment $ab$ of length 14, with center at $g$. He chooses another point $d$ dividing $ab$ into two unequal parts.

      For the first root he constructs a square $dbez$ over $db$—this will be the census—and extends $ez$ to $iz$ matching $ab$. The rectangle $abzi$ now represents 14 roots, so that $adei$ measures (in modern notation) $14x - x^2 = 40$. Now $dg = 7-x$ so $dg^2 = 49-14x+x^2$ and $dg^2 + 40 = 49$. It follows that $dg=3$ and the root $x$ is $db=gb-gd=7-3=4$.

      For the second root the square $adlk$ is over $ad$; the segment $kl$ is extended to $km$ matching $ab$, and $lmbd$ measures $14x - x^2 = 40$. Now $gd = x-7$ so again $gd^2 = x^2-14x+49 = 9$ and $gd=3$. This time the root is $x=ag+gd = 10$. (The arithmetic of all three modes is checked numerically in the sketch after this list.)
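    As a quick check of the arithmetic in all three modes, here is a short numerical verification of Fibonacci's answers (a modern check of the numbers, not a rendering of his geometric arguments).

```python
# Verify Fibonacci's numerical answers for the three modes.

# First mode:  census plus ten roots equals 39; his root is 3 (census 9).
print(3 ** 2 + 10 * 3 == 39)          # True
# Second mode: census equals ten roots plus 39 denari; his root is 13 (census 169).
print(13 ** 2 == 10 * 13 + 39)        # True
# Third mode:  census plus 40 equals 14 roots; his two positive roots are 4 and 10.
print(4 ** 2 + 40 == 14 * 4)          # True
print(10 ** 2 + 40 == 14 * 10)        # True
# His solvability condition: the number must not exceed the square of half the roots.
print(40 <= (14 / 2) ** 2)            # True; with 50 in place of 40 it would be False
```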

     


    In Simon Stevin’s L’Arithmétique

    Stevin’s L’Arithmétique was published in 1594 (excerpts here are from the 1625 edition, supervised by Albert Girard). Among other things it treated quadratic equations. Stevin, following Bombelli, used a notation for powers that turns out to be intermediate between cossic notation (which used different symbols for the unknown, its square, its cube etc.) and the exponents that started with Descartes. For Stevin, the unknown was represented by a 1 in a circle, its square by a 2 in a circle, etc., and the numerical unit by a 0 in a circle. Here let us write ${\bf 1}$ for 1 in a circle, etc. He also had an idiosyncratic way of expressing the solution to an equation in one unknown, using the “fourth proportional.” For example, his setting of the problem of finding $x$ such that $x^2=4x+12$ could be schematized as $${\bf 2} : 4\,{\bf 1} + 12\,{\bf 0} : : {\bf 1} : ?$$ (he would actually write: given the three terms for the problem — the first ${\bf 2}$, the second $4\,{\bf 1} + 12\,{\bf 0}$, the third ${\bf 1}$, find the fourth proportional). As Girard explains, the “proportion” is equality. So the problem should be read as: “if ${\bf 2} = 4\,{\bf 1} + 12\,{\bf 0}$, then ${\bf 1} =$ what?”

    Some writers have claimed that “[Stevin’s Arithmetic] brought to the western world for the first time a general solution of the quadratic equation …” but in fact only this step towards modern notation, his use of minus signs in his equations, and his admission of irrational numbers as coefficients separate him from Fibonacci. Notably he also considers separately three types of quadratic equations [his examples, in post-Descartes notation]:

    1. “second term ${\bf 1} + {\bf 0}$” [$x^2=4x+12$]
    2. “second term $-{\bf 1} + {\bf 0}$” [$x^2 = -6x + 16$]
    3. “second term ${\bf 1} - {\bf 0}$” [$x^2=6x-5$].

    and does not entertain negative solutions, so for the first equation he gives only $6$ and not $-2$, for the second he gives $2$ and not $-8$; for the third he gives the two (positive) solutions $5$ and $1$.

     

    Stevin gives a geometric justification for his solution of each type of equation. For example for the first type his solution is:

    Half of the 4 (from the $4\,{\bf 1}$) is 2
    Its square 4
    To the same added the given ${\bf 0}$, which is 12
    Gives the sum 16
    Its square root 4
    To the same added the 2 from the first line 6

    I say that 6 is the sought-for fourth proportional term.

    Here is his geometrical proof:

    Stevin's root-finding figure

    Figure from Stevin, L’Arithmétique p. 267.

    With modern notation: Stevin starts with a square ABCD representing $x^2$, so its side BC has length $x$. He draws EF parallel to AD, with AE $= 4$. So the rectangle ADFE measures $4x$ and then the rectangle EFCB has area $x^2-4x = 12$. Pick G the midpoint of AE, and draw the square GLKB.

    The rest of the argument as it appears in L’Arithmétique:

    Half of AE $=4$ which is GE 2
    Its square GEHI 4
    Add to the same the given ${\bf 0}$, i.e. EFCB 12
    Gives sum for the gnomon HIGBCF
    or for the square GBKL of same area 16
    Its root BK 4
    Add to the same GE$=2$ or instead KC = GE, makes for BC 6
    Q.E.D. [My translations -TP]

    Notice that the argument and even the diagram are inherited directly from al-Khwarizmi.

    Descartes’ La Géométrie

    René Descartes (1596-1650) published La Géométrie (1885 edition in modern French) in 1637. One of the first topics he covers is the solution of quadratic equations. Besides the actual geometry in his book, two notational and conceptual features started the modern era in mathematics. The first was his use of exponents for powers. This started as a typographical convenience: he still usually wrote $xx$ instead of $x^2$. It turned out to be a short series of steps (but one he could not have imagined) from his $x^3$ to Euler’s $e^x$. The second was his use of $x$ and $y$ as coordinates in the plane. The first would eventually allow the general quadratic equation to be written as $ax^2 + bx +c =0$, and the second would allow the solution to be viewed as the intersection of a parabola with the $x$-axis. But Descartes’ avoidance of negative numbers (the original cartesian plane was a quadrant) kept him from the full exploitation of his own inventions.

    In particular, he still had to follow al-Khwarizmi and Stevin in distinguishing three forms of quadratic equations. In his notation:

    • $z^2=az + b^2$
    • $y^2=-ay + b^2$
    • $z^2 = az-b^2$.

     

    His solutions, however, are completely different from the earlier methods. He shows in all three cases how a ruler-and-compass construction can lead from the lengths $a$ and $b$ to a segment whose length is the solution.

    Descartes' root-finding figure

    Cases 1 and 2. “I construct the right triangle NLM with one leg $LM$ equal to $b$, and the other $LN$ is $\frac{1}{2}a$, half of the other known quantity which was multiplied by $z$, which I suppose to be the unknown length; then extending $MN$, the hypotenuse of this triangle, to the point $O$, such that $NO$ is equal to $NL$, the entirety $OM$ is the sought-for length; and it can be expressed in this way: $$z=\frac{1}{2}a + \sqrt{\frac{1}{4}aa + bb}.$$ Whereas if I have $yy=-ay+bb$, with $y$ the unknown quantity, I construct the same right triangle $NLM$, and from the hypotenuse $MN$ I subtract $NP$ equal to $NL$, and the remainder $PM$ is $y$, the sought-for root. So that I have $y=-\frac{1}{2}a + \sqrt{\frac{1}{4}aa + bb}.$”

    Descartes' root-finding part 2

    Case 3. “Finally, if I have $$z^2=az-bb$$ I draw $NL$ equal to $\frac{1}{2}a$ and $LM$ equal to $b$, as before; then, instead of joining the points $MN$, I draw $MQR$ parallel to $LN$, and having drawn a circle with center $N$ which cuts it at the points $Q$ and $R$, the sought-for length is $MQ$, or $MR$; since in this case it can be expressed two ways, to wit, $z=\frac{1}{2}a + \sqrt{\frac{1}{4}aa - bb}$ and $z=\frac{1}{2}a - \sqrt{\frac{1}{4}aa - bb}.$ And if the circle, which has its center at $N$ and passes through $L$ neither cuts nor touches the straight line $MQR$, the equation has no root, so one can state that the construction of the proposed problem is impossible.” [My translation. Checking Descartes’ constructions involves some standard Euclidean geometry.-TP]

    In Maria Gaetana Agnesi’s Instituzioni analitiche ad uso della gioventù italiana

    Agnesi’s textbook was published in 1748. Agnesi does algebraically complete the square for one case of the quadratic equation:

    “Consider the equation $xx+2ax = bb$; add to one and to the other sides the square of half the coefficient of the second term, i.e. $aa$, and it becomes $xx+2ax+aa = aa + bb$ and, extracting the root, $x=\pm\sqrt{aa+bb} -a$.”

    Specifically (with the letters Descartes used) she takes $a$ as a positive quantity (necessary for the geometric construction) and gives

    • MP and $-$MO as roots of $xx+ax-bb=0$
    • MO and $-$MP as roots of $xx-ax-bb=0$
    • $-$MQ and $-$MR as roots of $xx+ax+bb=0$
    • MQ and MR as roots of $xx-ax+bb=0$.

    Imaginary numbers do not yet have a geometric meaning. “Whenever the equation, to which the particulars of the problems have led us, produces only imaginary values, this means that the problem has no solution, and that it is impossible.” [My translations -TP]

    Note that she uses the $\pm$ notation. She explains: “So the ambiguity of the sign attached to the square root implies two values for the unknown, which can be both positive, both negative, one positive and the other negative, or even both imaginary, depending on the quantities from which they have been computed.”

    But when Agnesi comes to a general treatment she follows Descartes, using identical geometric constructions, with one significant improvement, as implied above: negative roots are calculated and given the same status as positive ones:

    “Negative values, which are still called false, are no less real than the positive, and have the only difference, that if in the solution to the problem the positives are taken from the fixed starting point of the unknown toward one side, the negatives are taken from the same point in the opposite direction.”

    In Leonhard Euler’s Vollständige Anleitung zur Algebra

    Euler’s text was published in St. Petersburg in 1770; John Hewlett’s English translation, Elements of Algebra, appeared in 1840. Chapter VI covers the general quadratic equation: Euler writes it as $ax^2\pm bx\pm c=0$, and then remarks that it can always be put in the form $x^2 + px = q$, where $p$ and $q$ can be positive or negative. He explains how the left-hand side can be made into the square of $(x+\frac{1}{2}p)$ by adding $\frac{1}{4}p^2$ to both sides, leading to $x+\frac{1}{2}p = \sqrt{\frac{1}{4}p^2 + q}$ and, “as every square root may be taken either affirmatively or negatively,” $$x = -\frac{1}{2}p \pm \sqrt{\frac{1}{4}p^2 + q}.$$ In deriving this solution, completely equivalent to the quadratic formula, Euler has completed the square in a purely algebraic manner. The gnomons and Euclidean diagrams, which for some 2500 years had seemed necessary to justify the maneuver, have evaporated.
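    As a quick check of Euler’s derivation (a sketch of my own, not anything in Euler’s text), the function below completes the square for $x^2 + px = q$ and returns both signs of the root. For Stevin’s equation $x^2 = 4x + 12$ (that is, $p = -4$, $q = 12$) it also recovers the negative root $-2$ that the earlier authors discarded, and it reports Euler’s “impossible” roots as complex numbers.

```python
import cmath  # complex square root, so "impossible" (imaginary) roots show up too

def complete_the_square(p, q):
    """Roots of x^2 + p*x = q via Euler's completion of the square:
    (x + p/2)^2 = p^2/4 + q, hence x = -p/2 +/- sqrt(p^2/4 + q)."""
    r = cmath.sqrt(p**2 / 4 + q)
    return (-p/2 + r, -p/2 - r)

print(complete_the_square(-4, 12))   # (6+0j), (-2+0j): Stevin kept only the 6
print(complete_the_square(2, -10))   # (-1+3j), (-1-3j): a case Euler calls impossible
```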

    In Elements of Algebra Euler is receptive to imaginary numbers:

    §145. “But notwithstanding [their being impossible] these numbers present themselves to the mind; they exist in our imagination, and we still have a sufficient idea of them; since we know that by $\sqrt{-4}$ is meant a number which, multiplied by itself, produces $-4$; for this reason also, nothing prevents us from making use of these imaginary numbers, and employing them in calculation.”

    but does not consider them “possible” as roots of quadratic equations. For example:

    §700. “A very remarkable case sometimes occurs, in which both values of $x$ become imaginary, or impossible; and it is then wholly impossible to assign any value for $x$, that would satisfy the terms of the equation.”

    In 20th century physics

    A Gaussian function $f(x)=C\exp(-\frac{1}{2}ax^2)$ corresponds to a familiar “bell-shaped curve.” In multivariable calculus we learn that $\int_{-\infty}^{\infty}\,f(x)\,dx=C\sqrt{\frac{2\pi}{a}}$; this also holds for $\int_{-\infty}^{\infty}\,f(x-\mu)\,dx$, with $\mu$ any finite number.

    Gaussian curve

    The width of the Gaussian $f(x)=C\exp(-\frac{1}{2}ax^2)$, defined as the distance between the two points where $\,f=\frac{C}{2}$, can be calculated to be  $2\sqrt{\frac{2\ln 2}{a}}$. In this image with $C=4$, $a=\frac{1}{2}$, the width is 3.33. Note that the width does not depend on the factor $C$.

    The Fourier transform of $f$ (taking $C=1$) is the function $${\hat f}(y)= \int_{-\infty}^{\infty}\exp(ixy)\,f(x)~dx = \int_{-\infty}^{\infty}\exp(-\frac{1}{2}ax^2 + ixy)~dx.$$ This integral can be computed by completing the square: write $-\frac{1}{2}ax^2 + ixy$ as $-\frac{1}{2}a(x^2 -\frac{2iyx}{a} +\frac{y^2}{a^2}-\frac{y^2}{a^2})= -\frac{1}{2}a (x-\frac{iy}{a})^2 - \frac{y^2}{2a}$. Then $$ {\hat f}(y)= \int_{-\infty}^{\infty}\exp(-\frac{y^2}{2a})\,\exp\left (-\frac{1}{2}a(x-\frac{iy}{a})^2\right )\,dx=\sqrt{\frac{2\pi}{a}}\,\exp(-\frac{y^2}{2a}).$$ This means that the Fourier transform of $f$ is again a Gaussian; the parameter $a$ has become $\frac{1}{a}$, so the product of the widths of $\,f$ and its Fourier transform ${\,\hat f}$ is constant. The wider $\,f$ is, the narrower ${\,\hat f}$ must be, and vice-versa. This phenomenon is the mathematical form of the uncertainty principle.
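    A quick numerical check (my own sketch; the quadrature routine from scipy and the parameter value are illustrative choices, not part of the original discussion): compute $\hat f$ by direct integration, compare it with the closed form $\sqrt{2\pi/a}\,\exp(-y^2/(2a))$, and confirm that the product of the widths of $f$ and $\hat f$, namely $2\sqrt{2\ln 2/a}\cdot 2\sqrt{2\ln 2\,a} = 8\ln 2$, does not depend on $a$.

```python
import numpy as np
from scipy.integrate import quad

a = 0.5                                    # the parameter used in the figure above

def f(x):                                  # the Gaussian, with C = 1
    return np.exp(-0.5 * a * x**2)

def fhat(y):                               # Fourier transform by direct integration
    re, _ = quad(lambda x: f(x) * np.cos(x * y), -np.inf, np.inf)
    return re                              # the sine part vanishes by symmetry

for y in (0.0, 0.5, 1.0, 2.0):
    closed_form = np.sqrt(2 * np.pi / a) * np.exp(-y**2 / (2 * a))
    print(y, fhat(y), closed_form)         # the two columns agree

width = lambda param: 2 * np.sqrt(2 * np.log(2) / param)
print(width(a) * width(1/a), 8 * np.log(2))    # equal: the product of widths is constant
```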


Pooling strategies for COVID-19 testing

How could 10 tests yield accurate results for 100 patients?

David Austin
Grand Valley State University

While the COVID-19 pandemic has brought tragedy and disruption, it has also provided a unique opportunity for mathematics to play an important and visible role in addressing a pressing issue facing our society.

By now, it’s well understood that testing is a crucial component of any effective strategy to control the spread of the SARS-CoV-2 coronavirus. Countries that have developed effective testing regimens have been able, to a greater degree, to resume normal activities, while those with inadequate testing have seen the coronavirus persist at dangerously high levels.

Developing an effective testing strategy means confronting some important challenges. One is producing and administering enough tests to form an accurate picture of the current state of the virus’ spread. This means having an adequate number of trained health professionals to collect and process samples along with an adequate supply of reagents and testing machines. Furthermore, results must be available promptly. A person who is unknowingly infected can transmit the virus to many others in a week, so results need to be available in a period of days or even hours.

One way to address these challenges of limited resources and limited time is to combine samples from many patients into testing pools strategically rather than testing samples from each individual patient separately. Indeed, some well-designed pooling strategies can decrease the number of required tests by a factor of ten; that is, it is possible to effectively test, say, 100 patients with a total of 10 tests.

On first thought, it may seem like we’re getting something from nothing. How could 10 tests yield accurate results for 100 patients? This column will describe how some mathematical ideas from compressed sensing theory provide the key.

One of the interesting features of the COVID-19 pandemic is the rate at which we are learning about it. The public is seeing science done in public view and in real time, and new findings sometimes cause us to amend earlier recommendations and behaviors. This has made the job of communicating scientific findings especially tricky. So while some of what’s in this article may require updating in the near future, our aim is rather to focus on mathematical issues that should remain relevant and trust the reader to update as appropriate.

Some simple pooling strategies

While the SARS-CoV-2 virus is new, the problem of testing individuals in a large population is not. Our story begins in 1943 when Robert Dorfman proposed the following simple method for identifying syphilitic men called up for induction through the wartime draft.

syphilis testing public service poster

A 1941 poster encourages syphilis testing (Library of Congress)

Suppose we have samples from, say, 100 patients. Rather than testing each of the samples individually, Dorfman suggested grouping them into 10 pools of 10 each and testing each pool.

If the test result of a pool is negative, we conclude that everyone in that pool is free of infection. If a pool tests positive, then we test each individual in the pool.

In the situation illustrated above, two of the 100 samples are infected, so we perform a total of 30 tests to identify them: 10 tests for the original 10 pools followed by 20 more tests, one for each member of the two infected pools. Here, the number of tests performed is 30% of the number required had we tested each individual separately.

Of course, there are situations where this strategy is disadvantageous. If there is an infected person in every pool, we end up performing the original 10 tests and follow up by then testing each individual. This means we perform 110 tests, more than if we had just tested everyone separately.

What’s important is the prevalence $p$ of the infection: the fraction of individuals we expect to be infected, or the probability that a random individual is infected. If the prevalence is low, it seems reasonable that Dorfman’s strategy can lead to a reduction in the number of tests we expect to perform. As the prevalence grows, however, it may no longer be effective.

It’s not too hard to find how the expected number of tests per person varies with the prevalence. If we arrange $k^2$ samples into $k$ pools of $k$ samples each, then the expected number of tests per person is $$ E_k = \frac1k + (1-(1-p)^k). $$ When $k=10$, the expected number $E_{10}$ is shown on the right. Of course, testing each individual separately means we use one test per person so Dorfman’s strategy loses its advantage when $E_k\geq 1$. As the graph shows, when $p>0.2$, meaning there are more than 20 infections per 100 people, we are better off testing each individual.
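Here is a minimal sketch (mine; the prevalence values are illustrative) that evaluates $E_k$ for $k=10$ and locates the break-even prevalence where pooling stops paying off.

```python
def dorfman_tests_per_person(p, k=10):
    """Expected tests per person for Dorfman pooling with pools of size k,
    assuming a perfect test: 1/k of a pooled test per person, plus an
    individual test whenever the pool contains at least one infection."""
    return 1/k + (1 - (1 - p)**k)

for p in (0.01, 0.05, 0.10, 0.20, 0.25):
    print(f"p = {p:.2f}   E_10 = {dorfman_tests_per_person(p):.3f}")

# break-even: E_k = 1 exactly when (1-p)^k = 1/k, i.e. p = 1 - (1/k)**(1/k)
print("break-even prevalence for k = 10:", 1 - (1/10)**(1/10))   # about 0.206
```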

Fortunately, the prevalence of SARS-CoV-2 infections is relatively low in the general population. As the fall semester began, my university initiated a randomized testing program that showed the prevalence in the campus community to be around $p\approx 0.01$. Concern is setting in now that that number is closer to 0.04. In any case, we will assume that the prevalence of infected individuals in our population is low enough to make pooling a viable strategy.

Of course, no test is perfect. It’s possible, for instance, that an infected sample will yield a negative test result. It’s typical to characterize the efficacy of a particular testing protocol using two measures: sensitivity and specificity. The sensitivity measures the probability that a test returns a positive result for an infected sample. Similarly, the specificity measures the probability that a test returns a negative result when the sample is not infected. Ideally, both of these numbers are near 100%.

Using Dorfman’s pooling method, the price we pay for lowering the expected number of tests below one is a decrease in sensitivity. Identifying an infected sample in this two-step process requires the test to correctly return a positive result both times we test it. Therefore, if $S_e$ is the sensitivity of a single test, Dorfman’s method has a sensitivity of $S_e^2$. For example, a test with a sensitivity of 95% yields a sensitivity around 90% when tests are pooled.

There is, however, an increase in the specificity. If a sample is not infected, testing a second time increases the chance that we detect it as such. One can show that if the sensitivity and specificity are around 95% and the prevalence at 1%, then pooling 10 samples, as shown above, raises the specificity to around 99%.
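The claim in the last sentence is easy to check numerically. The sketch below is my own; it uses the quoted parameter values and the simplifying assumption that a pool containing any infected sample behaves, for the pooled test, like a single infected sample.

```python
def dorfman_sens_spec(Se, Sp, p, k=10):
    """Effective sensitivity and specificity of Dorfman's two-stage scheme.
    A person is reported positive only if their pool tests positive and
    their individual follow-up test is also positive."""
    sens = Se * Se                            # pool test and retest must both detect
    others = 1 - (1 - p)**(k - 1)             # chance someone else in the pool is infected
    pool_positive = others * Se + (1 - others) * (1 - Sp)
    spec = 1 - pool_positive * (1 - Sp)       # a false positive needs two mistakes
    return sens, spec

print(dorfman_sens_spec(Se=0.95, Sp=0.95, p=0.01))   # roughly (0.90, 0.99)
```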

 

Some modifications of Dorfman’s method

 

It’s possible to imagine modifying Dorfman’s method in several ways. For instance, once we have identified the infected pools in the first round of tests, we could apply a pooling strategy on the smaller set of samples that still require testing.

A second possibility is illustrated below where 100 samples are imagined in a square $10\times10$ array. Each sample is included in two pools according to its row and column so that a total of 20 tests are performed in the first round. In the illustration, the two infected samples lead to positive results in four of the pools, two rows and two columns.

We know that the infected samples appear at the intersection of these two rows and two columns, which leads to a total of four tests in the second round. Once again, it’s possible to express $E_k$, the expected number of tests per individual, in terms of the prevalence $p$. If we have $k^2$ samples arranged in a $k\times k$ array, we see that $$ E_k =\frac{2}{k} + p + (1-p)(1-(1-p)^{k-1}), $$ if we assume that the sensitivity and specificity are 100%.
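To compare the two schemes one can evaluate the two expressions side by side. This sketch is mine; it uses the formula exactly as given above, together with the Dorfman formula from the previous section, and again assumes perfect tests.

```python
def dorfman(p, k=10):
    return 1/k + (1 - (1 - p)**k)

def two_dimensional_array(p, k=10):
    # expression given above: 2k pooled tests for k^2 samples, then the retests
    return 2/k + p + (1 - p) * (1 - (1 - p)**(k - 1))

for p in (0.01, 0.02, 0.05, 0.10):
    print(f"p = {p:.2f}   Dorfman {dorfman(p):.3f}   2-D array {two_dimensional_array(p):.3f}")
```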

The graph at right shows the expected number of tests using the two-dimensional array, assuming $k=10$, in red with the result using Dorfman’s original method in blue. As can be seen, the expected number of tests is greater using the two-dimensional approach since we invest twice as many tests in the first round of testing. However each sample is included in two tests in the initial round. For an infected sample to be misidentified, both tests would have to return negative results. This means that the two-dimensional approach is desirable because the sensitivity of this strategy is greater than the sensitivity of the individual tests and we still lower the expected number of tests when the prevalence is low.

While it is important to consider the impact that any pooling strategy has on these important measures, our focus will, for the most part, take us away from discussions of specificity and sensitivity. See this recent Feature column for a deep dive into their relevance.

 

Theoretical limits

 

There has been some theoretical work on the range of prevalence values over which pooling strategies are advantageous. In the language of information theory, we can consider a sequence of test results as an information source having entropy $I(p) = -p\log_2(p) - (1-p)\log_2(1-p)$. In this framework, a pooling strategy can be seen as an encoding of the source to effectively compress the information generated.

Sobel and Groll showed that $E$, the expected number of tests per person, for any effective pooling method must satisfy $E \geq I(p)$. On the right is shown this theoretical limit in red along with the expected number of tests under the Dorfman method with $k=10$.

Further work by Ungar showed that when the prevalence grows above the threshold $p\geq (3-\sqrt{5})/2 \approx 38\%$, no pooling strategy is better than simply testing everyone individually.
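Both limits are easy to tabulate; the short sketch below (mine, with illustrative prevalence values) compares the entropy lower bound $I(p)$ with the Dorfman value $E_{10}$ and evaluates Ungar’s cutoff.

```python
import math

def entropy(p):
    """Information-theoretic lower bound I(p) on expected tests per person."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def dorfman(p, k=10):
    return 1/k + (1 - (1 - p)**k)

for p in (0.001, 0.01, 0.05, 0.1):
    print(f"p = {p:<5}   I(p) = {entropy(p):.3f}   E_10 = {dorfman(p):.3f}")

print("Ungar threshold:", (3 - math.sqrt(5)) / 2)   # about 0.382
```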

 

RT-qPCR testing

 

While there are several different tests for the SARS-CoV-2 virus, at this time, the RT-qPCR test is considered the “gold standard.” In addition to its intrinsic interest, learning how this test works will help us understand the pooling strategies we will consider next.

A sample is collected from a patient, often through a nasal swab, and dispersed in a liquid medium. The test begins by converting the virus’ RNA molecules into complementary DNA through a process known as reverse transcription (RT). A sequence of amplification cycles, known as quantitative polymerase chain reaction (qPCR) then begins. Each cycle consists of three steps:

  • The liquid is heated close to boiling so that the transcribed DNA denatures into two separate strands.
  • Next the liquid is cooled so that a primer, which has been added to the liquid, attaches to a DNA strand along a specific sequence of 100-200 nucleotides. This sequence characterizes the complementary DNA of the SARS-CoV-2 virus and is long enough to uniquely identify it. This guarantees that the test has a high sensitivity. Attached to the primer is a fluorescent marker.
  • In a final warming phase, additional nucleotides attach to the primer to complete a copy of the complementary DNA molecule.

Through one of these cycles, the number of DNA molecules is essentially doubled.

 

The RT-qPCR test takes the sample through 30-40 amplification cycles resulting in a significant increase in the number of DNA molecules, each of which has an attached fluorescent marker. After each cycle, we can measure the amount of fluorescence and translate it into a measure of the number of DNA molecules that have originated from the virus.

The fluorescence, as it depends on the number of cycles, is shown below. A sample with a relatively high viral load will show significant fluorescence at an early cycle. The curves below represent different samples and show how the measured fluorescence grows through the amplification cycles. Moving to the right, each curve is associated with a ten-fold decrease in the initial viral load of the sample. Taken together, these curves represent a range of a million-fold decrease in the viral load. In fact, the test is sensitive enough to detect a mere 10 virus particles in a sample.

The FDA has established a threshold, shown as the red horizontal line, above which we can conclude that the SARS-CoV-2 virus is present in the original sample. However, the test provides more than a binary positive/negative result; by matching the fluorescence curve from a particular sample to the curves above, we can infer the viral load present in the original sample. In this way, the test provides a quantitative measure of the viral load that we will soon use in developing a pooling method.

Pooling samples from several individuals, only one of whom is infected, will dilute the infected sample. The effect is simply that the fluorescence response crosses the FDA threshold in a later cycle. There is a limit, however. Because noise can creep into the fluorescence readings around cycle 40, FDA standards state that only results from the first 39 cycles are valid.

Recent studies by Bilder and by Yelin et al. investigated the practical limits of pooling samples in the RT-qPCR test and found that a single infected sample can be reliably detected in a pool of up to 32. (A recent study by the CDC, however, raises concerns about detecting the virus using the RT-qPCR test past the 33rd amplification cycle.)

Non-adaptive testing strategies

Dorfman’s pooling method and its variants described above are known as adaptive methods because they begin with an initial round of tests and use those results to determine how to proceed with a second round. Since the RT-qPCR test requires 3 – 4 hours to complete, the second round of testing causes a delay in obtaining results and ties up testing facilities and personnel. A non-adaptive method, one that produces results for a group of individuals in a single round of tests, would be preferable.

Several non-adaptive methods have recently been proposed and are even now in use, such as P-BEST. The mathematical ideas underlying these various methods are quite similar. We will focus on one called Tapestry.

We first collect samples from $N$ individuals and denote the viral loads of each sample by $x_j$. We then form these samples into $T$ pools in a manner to be explained a little later. This leads to a pooling matrix $A_i^j$ where $A_i^j = 1$ if the sample from individual $j$ is present in the $i^{th}$ test and 0 otherwise. The total viral load in the $i^{th}$ test is then $$ y_i = \sum_{j} A_i^j x_j, $$ which can be measured by the RT-qPCR test. In practice, there will be some uncertainty in measuring $y_i$, but it can be dealt with in the theoretical framework we are describing.

Now we have a linear algebra problem. We can express the $T$ equations that result from each test as $$ \yvec = A\xvec, $$ where $\yvec$ is the known vector of test results, $A$ is the $T\times N$ pooling matrix, and $\xvec$ is the unknown vector of viral loads obtained from the patients.

Because $T\lt N$, this is an under-determined system of equations, which means that we cannot generally expect to solve for the vector $\xvec$. However, we have some additional information: because we are assuming that the prevalence $p$ is low, the vector $\xvec$ will be sparse, which means that most of its entries are zero. This is the key observation on which all existing non-adaptive pooling methods rely.

It turns out that this problem has been extensively studied within the area of compressed sensing, a collection of techniques in signal processing that allow one to reconstruct a sparse signal from a small number of observations. Here is an outline of some important ideas.

First, we will have occasion to consider a couple of different measures of the size of a vector.

  • $\norm{\zvec}{0}$ is the number of nonzero entries in the vector $\zvec$. Because the prevalence of SARS-CoV-2 positive samples is expected to be small, we are looking for a solution to the equation $\yvec=A\xvec$ where $\norm{\xvec}{0}$ is small.
  • The 1-norm is $$ \norm{\zvec}{1} = \sum_j~|z_j|, $$
  • and the 2-norm is the usual Euclidean length: $$ \norm{\zvec}{2} = \sqrt{\sum_j z_j^2} $$

Remember that an isometry is a linear transformation that preserves the length of vectors. With the usual Euclidean length of a vector $\zvec$ written as $||\zvec||_2$, the matrix $M$ defines an isometry if $||M\zvec||_2 = ||\zvec||_2$ for all vectors $\zvec$. The columns of such a matrix form an orthonormal set.

We will construct our pooling matrix $A$ so that it satisfies a restricted isometry property (RIP), which essentially means that small subsets of the columns of $A$ are almost orthonormal. More specifically, if $R$ is a subset of $\{1,2,\ldots, N\}$, we denote by $A^R$ the matrix formed by pulling out the columns of $A$ labelled by $R$; for instance, $A^{\{2,5\}}$ is the matrix formed from the second and fifth columns of $A$. For a positive integer $S$, we define a constant $\delta_S$ such that $$ (1-\delta_S)||\xvec||_2 \leq ||A^R\xvec||_2 \leq (1+\delta_S)||\xvec||_2 $$ for any set $R$ whose cardinality is no more than $S$. If $\delta_S = 0$, then the matrices $A^R$ are isometries, which would imply that the columns of $A^R$ are orthonormal. More generally, the idea is that when $\delta_S$ is small, then the columns of $A^R$ are close to being orthonormal.
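For very small matrices these constants can be estimated by brute force, simply checking every subset $R$ of at most $S$ columns. The sketch below is purely illustrative (a random matrix with unit-norm columns, not one of the actual pooling matrices); it measures how far each submatrix $A^R$ is from an isometry through its extreme singular values.

```python
import itertools
import numpy as np

def rip_constant(A, S):
    """Brute-force delta_S for the (non-squared) restricted isometry property:
    (1 - delta_S)||x|| <= ||A^R x|| <= (1 + delta_S)||x|| for every |R| <= S.
    The ratio ||A^R x|| / ||x|| ranges over the singular values of A^R."""
    _, N = A.shape
    delta = 0.0
    for size in range(1, S + 1):
        for R in itertools.combinations(range(N), size):
            sv = np.linalg.svd(A[:, list(R)], compute_uv=False)
            delta = max(delta, sv[0] - 1, 1 - sv[-1])
    return delta

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 20))
A /= np.linalg.norm(A, axis=0)        # normalize the columns

for S in (1, 2, 3):
    print(S, rip_constant(A, S))      # delta_1 is 0 here since every column has norm 1
```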

Let’s see how we can use these constants $\delta_S$.

Because we are assuming that the prevalence $p$ is low, we know that $\xvec$, the vector of viral loads, is sparse. We will show that a sufficiently sparse solution to $\yvec = A\xvec$ is unique.

For instance, suppose that $\delta_{2S} \lt 1$, that $\xvec_1$ and $\xvec_2$ are two sparse solutions to the equation $\yvec = A\xvec$, and that $\norm{\xvec_1}{0}, \norm{\xvec_2}{0} \leq S$. The last condition means that $\xvec_1$ and $\xvec_2$ are sparse in the sense that they have at most $S$ nonzero components.

Now it follows that $A\xvec_1 = A\xvec_2 = \yvec$ so that $A(\xvec_1-\xvec_2) = 0$. In fact, if $R$ consists of the indices for which the components of $\xvec_1-\xvec_2$ are nonzero, then $A^R(\xvec_1-\xvec_2) = 0$.

But we know that the cardinality of $R$ equals $\norm{\xvec_1-\xvec_2}{0} \leq 2S$, which tells us that $$ 0 = ||A^R(\xvec_1-\xvec_2)||_2 \geq (1-\delta_{2S})||\xvec_1-\xvec_2||_2. $$ Because $\delta_{2S}\lt 1$, we know that $\xvec_1 - \xvec_2 = 0$, or $\xvec_1 = \xvec_2$.

Therefore, if $\delta_{2S} \lt 1$, any solution to $\yvec=A\xvec$ with $\norm{\xvec}{0} \leq S$ is unique; that is, any sufficiently sparse solution is unique.

Now that we have seen a condition that implies that sparse solutions are unique, we need to explain how we can find sparse solutions. Candès and Tao show, assuming $\delta_S + \delta_{2S} + \delta_{3S} \lt 1$, how we can find a sparse solution to $\yvec = A\xvec$ with $\norm{\xvec}{0} \leq S$ by minimizing: $$ \min \norm{\xvec}{1} ~~~\text{subject to}~~~ \yvec = A\xvec. $$ This is a convex optimization problem, and there are standard techniques for finding the minimum.
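Since viral loads are nonnegative, the minimization can even be phrased as an ordinary linear program: minimize $\sum_j x_j$ subject to $A\xvec = \yvec$ and $x_j \geq 0$. The following toy sketch is mine (a random pooling matrix in which each sample appears in 3 of 16 tests, echoing the Tapestry design but not using its actual matrix), solved with scipy’s linear-programming routine. For a signal this sparse the program typically returns the true loads exactly.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
T, N = 16, 40                                  # 16 pooled tests on 40 individuals

# build a 0/1 pooling matrix: each column gets a distinct random set of 3 tests
supports = list(itertools.combinations(range(T), 3))
chosen = rng.choice(len(supports), size=N, replace=False)
A = np.zeros((T, N))
for j, idx in enumerate(chosen):
    A[list(supports[idx]), j] = 1.0

x_true = np.zeros(N)                           # sparse viral loads: 2 infected samples
x_true[[5, 23]] = [3.0, 7.5]
y = A @ x_true                                 # the pooled measurements

# minimize sum(x) subject to A x = y; linprog's default bounds already give x >= 0
res = linprog(c=np.ones(N), A_eq=A, b_eq=y)

print("support of recovered x:", np.flatnonzero(res.x > 1e-6))
print("matches the true infected samples [5, 23]:", np.allclose(res.x, x_true, atol=1e-6))
```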

Why is it reasonable to think that minimizing $\norm{x}{1}$ will lead to a sparse solution? Let’s think visually about the case where $\xvec$ is a 2-dimensional vector. The set of all $\xvec$ satisfying $\norm{\xvec}{1} = |x_1| + |x_2| \leq C$ for some constant $C$ is the shaded set below:

Notice that the corners of this set fall on the coordinate axes where some of the components are zero. If we now consider solutions to $\yvec=A\xvec$, seen as the line below, we see that the solutions where $\norm{\xvec}{1}$ is minimal fall on the coordinate axes. This forces some of the components of $\xvec$ to be zero and results in a sparse solution.

This technique is related to one called the lasso (least absolute shrinkage and selection operator), which is well known in data science where it is used to eliminate unnecessary features from a data set.

All that remains is for us to find a pooling matrix $A$ that satisfies $\delta_{S} + \delta_{2S} + \delta_{3S} \lt 1$ for some $S$ large enough to find vectors $\xvec$ whose sparsity $\norm{\xvec}{0}$ is consistent with the expected prevalence. There are several ways to do this. Indeed, a matrix chosen at random will work with high probability, but the application to pooling SARS-CoV-2 samples that we have in mind leads us to ask that $A$ satisfy some additional properties.

The Tapestry method uses a pooling matrix $A$ formed from a Steiner triple system, an object studied in combinatorial design theory. For instance, one of Tapestry’s pooling matrices is shown below, where red represents a 1 and white a 0.

This is a $16\times40$ matrix, which means that we perform 16 tests on 40 individuals. Notice that each individual’s sample appears in 3 tests. This is a relatively low number, which means that the viral load in a sample is not divided too much and that, in the laboratory, time spent pipetting the samples is minimized. Each test consists of samples from about eight patients, well below the maximum pool size of 32 that allows reliable RT-qPCR readings.

It is also important to note that two samples appear together in at most one test. Therefore, if $A^j$ is the $j^{th}$ column of $A$, it follows that the dot product $A^j\cdot A^k = 0$ or $1$. This means that two columns are either orthogonal or span an angle of $\arccos(1/3) \approx 70^\circ$. If we scale $A$ by $1/\sqrt{3}$, we therefore obtain a matrix whose columns are almost orthonormal and from which we can derive the required condition, $\delta_S + \delta_{2S} + \delta_{3S} \lt 1$ for some sufficiently large value of $S$.

There is an additional simplification we can apply. For instance, if we have a sample $x_j$ that produces a negative result $y_i=0$ in at least one test in which it is included, then we can conclude that $x_j = 0$. This means that we can remove the component $x_j$ from the vector $\xvec$ and the column $A^j$ from the matrix $A$. Removing all these sure negatives often leads to a dramatic simplification in the convex optimization problem.

Tapestry has created a variety of pooling matrices that can be deployed across a range of prevalences. For instance, a $45\times 105$ pooling matrix, which means we perform 45 tests on 105 individuals, is appropriate when the prevalence is roughly 10%, a relatively high prevalence.

However, there is also a $93\times 961$ pooling matrix that is appropriate for use when the prevalence is around 1%. Here we perform 93 tests on 961 patients in pools of size 32, which means we can test about 10 times the number of patients with a given number of tests. This is a dramatic improvement over performing single tests on individual samples.

If the prevalence turns out to be too high for the pooling matrix used, the Tapestry algorithm detects it and fails gracefully.

Along with non-adaptive methods comes an increase in the complexity of their implementation. This is especially concerning since reliability and speed are crucial. For this reason, the Tapestry team built an Android app that guides a laboratory technician through the pipetting process, receives the test results $y_i$, and solves for the resulting viral loads $x_j$, returning a list of positive samples.

Using both simulated and actual lab data, the authors of Tapestry studied the sensitivity and specificity of their algorithm and found that it performs well. They also compared the number of tests Tapestry performs with Dorfman’s adaptive method and found that Tapestry requires many fewer tests, often several times fewer, in addition to finishing in a single round.

Summary

As we’ve seen here, non-adaptive pooling provides a significant opportunity to improve our testing capacity by increasing the number of samples we can test, decreasing the amount of time it takes to obtain results, and decreasing the costs of testing. These improvements can play an important role in a wider effort to test, trace, and isolate infected patients and hence control the spread of the coronavirus.

In addition, the FDA recently gave emergency use authorization for the use of these ideas. Not only is there a practical framework for deploying the Tapestry method, made possible by their Android app, but it is now also legal to do so.

Interestingly, the mathematics used here already existed before the COVID-19 pandemic. Dating back to Dorfman’s original work of 1943, group pooling strategies have continued to evolve over the years. Indeed, the team of Shental et al. introduced P-BEST, their SARS-CoV-2 pooling strategy, as an extension of their earlier work to detect rare alleles associated to some diseases.

Thanks

Mike Breen, recently of the AMS Public Awareness Office, oversaw the publication of the monthly Feature Column for many years. Mike retired in August 2020, and I’d like to thank him for his leadership, good judgment, and never-failing humor.



Sensitivity, Specificity, and COVID-19 Testing

As every doctor should know, medical tests are rarely 100% accurate. Interpreting them is a matter of probability. And probability, of course, is a matter of mathematics…

Bill Casselman
University of British Columbia, Vancouver, Canada

Well, make up your mind. Does he have it or not?

I remarked in an earlier column that there were two major ways in which mathematics has contributed to our understanding of the disease COVID-19 and the coronavirus SARS-CoV-2 that causes it. One was by modeling the development of the epidemic in order to predict possible outcomes, and the other was by uncovering the sources of particular outbreaks by tracking the evolution of genomes. This column will be concerned with a third way in which mathematics helps to control the epidemic, and that is by interpreting procedures that test for the presence of the disease.

The basic problem was illustrated dramatically at the beginning of August. The President of the United States was about to visit the state of Ohio, and naturally it was expected that he and the governor of Ohio (Mike DeWine, a Republican) would meet. As a matter of routine, the governor was tested for coronavirus before the meeting, and the test returned a positive result. It was a definite possibility, even likely, that DeWine was afflicted with COVID-19. He put himself into quarantine immediately, and of course the meeting with the President had to be called off. A short time later a second, presumably more accurate, test was given and this time it came back negative. This second result was later confirmed by third and fourth tests. (You can read more about this incident in an article in The New York Times, and in another NYT article.)

It was suggested in the media (the Washington Post, among others) that this sequence of events would reduce the trust of the population at large in the validity of such tests, and—very, very unfortunately—discourage many from having them. The governor himself, one of the state governors to take the epidemic extremely seriously, stated very clearly that in his opinion avoiding tests was not at all a good idea. Medical experts agree—testing is in fact a crucial tool in controlling the epidemic, and the United States is almost certainly not doing enough of it. At the very least it is doing it awkwardly. It is therefore important to try to understand better what is going on.

The basic fact is very simple: as every doctor should know, medical tests are rarely 100% accurate. Interpreting them is a matter of probability. And probability, of course, is a matter of mathematics.

images of test tubes
COVID-19 test image from Wikimedia Commons.

Measuring a test’s accuracy

Many medical tests are abstractly similar: they give a yes/no answer to a question about the status of a patient. In the case of the governor of Ohio, the question is, “Does this person have COVID-19?” His first test answered “yes”, while the later three answered “no”. Clearly, they were not all correct. The tests are certainly inaccurate a certain percentage of the time.

How is the accuracy of such tests measured? There is no single number that does this. Instead, there are two: sensitivity and specificity. To quote one instructive web page, sensitivity and specificity “are the yin and yang of the testing world and convey critical information about what a test can and cannot tell us. Both are needed to fully understand a test’s strengths as well as its shortcomings.”

Sensitivity characterizes how a test deals with people who are infected. It is the percentage of those who test as infected when they are in fact infected. A test with high sensitivity will catch almost every infected person, and will hence allow few of those infected to escape detection. This is the most important criterion. For example, it was reported recently that Spanish health authorities returned thousands of tests to one firm after finding that its tests had sensitivity equal to 30%. This means that 70% of infected people would not be diagnosed correctly by this test.

Specificity characterizes how a test deals with people who are not infected. It is the percentage of those who will test as uninfected when they are in fact uninfected. A test with high specificity will hence report a small number of false positives.

For detecting currently infected people, high sensitivity is important, because infected people who avoid detection (false negatives) are potential sources of further infection. High specificity is not so crucial. Tagging uninfected people as infected (making false positives) can be unfortunate, but it does not have possibly fatal consequences.

It might not be immediately apparent, but sensitivity and specificity are independent features. For example, it could happen that a test lazily returns a “yes” no matter what the condition of the patient happens to be. In this case, it will have sensitivity 100%, but specificity 0%. Or it could behave equally lazily, but in the opposite way, and return a “no” in all cases: sensitivity 0% and specificity 100%. Both of these would be extremely poor tests, but these simple examples demonstrate that the two measures are in some way complementary. (Hence the yin and yang.) We shall see later on another illustration of complementarity.

For COVID-19, the most commonly used tests are of two very different kinds – one attempts to tell whether the tested patient is currently infected, while the second detects antibodies against COVID-19. Tests in this second group do not detect whether the patient is currently infected, but only whether or not the patient has been infected in the past.

In the first group of tests, which detect current infection, there is again a division into two types. One, called a PCR test, looks for genetic fragments of the virus. The other tries to find bits of the virus’ protein component. This last is called an antigen test, because it is this protein component that a body’s immune system interacts with. (An antigen, according to Wikipedia, is a foreign substance invading the body and stimulating a response of the immune system.) Tests of the first type are generally more accurate than those of the second, but tests of the second type produce results much more rapidly – in minutes rather than hours or even days.

None of these tests is guaranteed to return correct results.

The first test taken by Governor DeWine was an antigen test, and his later tests were all PCR. In all these tests, samples were taken by nasal swabs. The antigen test was relatively new, and produced by the company Quidel. Initial versions were claimed to possess a sensitivity of around 80%, but more recent ones are claimed to have about 97% sensitivity, which is certainly comparable with PCR tests. They also claimed from the start a specificity of 100%.

The outcome of tests

What do these numbers mean? Given that Governor DeWine tested positive, what is the probability that he is infected with COVID-19?

One way to figure this out is to run some mental experiments. Suppose we give the tests to 1000 people chosen at random from a US population. What happens? It turns out that the answer to this depends on the infection level of the population, as we shall see. In other words, in order to carry out this experiment, I have to assume something about this level. Like many other things about COVID-19, this statistic is not known accurately, and in any case varies widely from place to place—perhaps, at this stage of the epidemic, a plausible range is between 1% and 15%. I’ll try various guesses, using the values of sensitivity and specificity given by Quidel.

1. Suppose at first 5% of the population to be infected.

  • In the sample of 1000, there will be around 50 who are currently infected.
  • Because the sensitivity of the test is 97%, of these about 48 will be labeled as positive, and the remaining 2 will not be correctly detected.
  • But there remain 950 people in the sample who are not infected. For these it is the specificity that matters. I have assumed this to be 100%, which means that 100% of those who are not infected will be labeled as uninfected, and 0% will be found infected.
  • Therefore, with these values of sensitivity and specificity, a nominal 48 will be found to be infected, and $950 + 2$ uninfected.

Wait a minute! Governor DeWine turned out ultimately not to be infected, but he was nonetheless found to be infected by the test. This contradicts the claim of 100% specificity! Something has gone wrong.

It is important to realize that the assigned values of sensitivity and specificity are somewhat theoretical, valid only if the test is applied in ideal conditions. Often in the literature these are called clinical data. But in the real world there are no ideal conditions, and the clinical specifications are not necessarily correct. Many errors are related to the initial sample taking, and indeed it is easy enough to imagine how a swab could miss crucial material. However, it is hard to see how an error could turn a sample without any indication of COVID-19 to one with such an indication. In any case, following the inevitable Murphy’s Law, errors will certainly degrade performance.

2. This suggests trying out some slightly less favorable values for sensitivity and specificity, say 90% sensitivity and 95% specificity, but with the same 5% infection rate.

  • In this case, 45 out of the true 50 infected will be caught, and 5 will not be tagged.
  • There will still be 950 who are not infected, but 5% = (100 – 95)% of these, i.e. about 48, will return positive.
  • That makes another 48, and a total of 93 positive test results.
  • In this experiment, Governor DeWine is one of 93, of whom 45 are infected, 48 not. Given the data in hand, it makes sense to say that the probability he is one of the infected is 45/93 = 0.49 or 49%. Definitely not to be ignored.

3. We might also try different guesses of the proportion of infection at large. Suppose it to be as low as 1%.

  • Then of our 1000, 10 will be infected. Of these, 90% = 9 will test positive.
  • In addition, there will be 990 who are not infected, and 5% or about 49 of these will test as positive, making a total of 58.
  • Now the probability that the Governor is infected is 9/58 = 15%, much lower than before. So in this case, when the proportion of the overall population who are infected is rather small, the test is swamped by false positives.

4. Suppose the proportion of infected at large to be as high as 20%.

  • Then of our 1000, 200 will be infected. Of these, 90% = 180 will test positive.
  • In addition, there will be 800 who are not infected, and 5% or about 40 of these will test as positive, making a total of 220.
  • Now the probability that the Governor is infected would be 180/220 = 82%, much higher than before. Looking at these three examples it seems that the test becomes more accurate as the overall probability of infection increases.

5. To get a better idea of how things go without rehearsing the same argument over and over, we need a formula. We can follow the reasoning in these calculations to find it. Let $a$ be the sensitivity (with $0 \le a \le 1$ instead of being measured in a percentage), and let $b$ be the specificity. Suppose the test is given to $N$ people, and suppose $P$ of them to be infected.

  • Then $aP$ of these will be infected and test positive.
  • Similarly, $(1-a)P$ will be infected but test negative.
  • There are $N – P$ who are not infected, and of these the tests of $b(N-P)$ will return negative.
  • That also means that the remainder of the $N-P$ uninfected people, or $(1-b)(N-P)$, will test positive (these are the false positives).
  • That makes $aP + (1-b)(N-P)$ in total who test positive. Of these, the fraction who are infected, which we can interpret as the probability that a given person who tests positive is actually one of the infected, is $$ { aP \over aP + (1 - b)(N-P) } $$
  • We can express this formula in terms of probabilities instead of population sizes. The ratio $p = P/N$ is the proportion of infected in the general population. The ratio $q = (N-P)/N$ is the proportion of uninfected. Then the ratio in the formula above is $$ { ap \over ap + (1-b)q } $$

The advantage of the formula is that we don’t have to consider each new value of $P/N$ on its own. Given $a$ and $b$, we can graph the formula as a function of the proportion $p = P/N$ of infected. For $a = 0.90$, $b = 0.95$ we get:
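Packaged as a function (a sketch of mine), the formula reproduces the three mental experiments above:

```python
def prob_infected_given_positive(a, b, p):
    """Probability that a person who tests positive is actually infected,
    for sensitivity a, specificity b and prevalence p."""
    q = 1 - p
    return a * p / (a * p + (1 - b) * q)

for p in (0.01, 0.05, 0.20):
    print(f"prevalence {p:.0%}: {prob_infected_given_positive(0.90, 0.95, p):.0%}")
# prints 15%, 49%, 82%, matching the counts worked out above
```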

 

graph of proportion infected vs. probability a positive sample is infected

A good test is one for which the number of those who test positive is close to the number of those who are infected. The graph shows that this fails when the proportion of infected in the population at large is small. As I said earlier, the test is swamped by false positives.

Likewise, the proportion of those who test negative who are in fact infected (the potentially dangerous false negatives) is

$$ { (1-a)p \over (1 - a)p + bq } $$

And in a graph:

 

graph of proportion infected vs. probability a negative sample is infected

This time, the test is better if the graph lies low—if the number of infected who escape is small. The two graphs again illustrate that sensitivity and specificity are complementary. As the proportion infected in the population at large goes up, one part of the test improves and the other degrades.

The moral is this: sensitivity and specificity are intrinsic features of the tests, but interpreting them depends strongly on the environment.

Repeating the tests

As the case of Governor DeWine showed, the way to get around unsatisfactory test results is to test again.

To explain succinctly how this works, I first formulate things in a slightly abstract manner. Suppose we apply a test with sensitivity $a$ and specificity $b$ to a population with a proportion of $p$ infected and $q = 1-p$ uninfected. We can interpret these as probabilities: $p$ is the probability in the population at large of being infected, $q$ that of not being infected.

The test has the effect of dividing the population into two groups, those who test positive and those who test negative. Since the tests are not perfect, each of those should again be partitioned into two smaller groups, the infected and the uninfected. Those who test positive have [infected : uninfected] equal to $[ap : (1-b)q]$.

Those who test negative have [infected: uninfected] equal to $[(1-a)p:bq]$.

So the effect of the test is that we have divided the original population into two new populations, each of which also has both infected and uninfected. For those who tested positive, the new values of $p$ and $q$ are

$$ p_{+} = { ap \over ap + (1-b)q }, \quad q_{+} = { (1-b)q \over ap + (1-b)q } , $$

while for those who tested negative they are

$$ p_{-} = { (1-a)p \over (1-a)p + bq }, \quad q_{-} = { bq \over (1-a)p + bq }. $$

These are the new infection data we feed into any subsequent test, possibly with new $a$, $b$.

Now we can interpret Governor DeWine’s experience. The first test was with the Quidel antigen test, say with $a = 0.90$ and $b = 0.95$. Suppose $p = 0.05$. Since the test was positive, the probability of the Governor being infected is therefore 0.49. But now he (and, in our imaginations, his cohort of those whose tests were positive) is tested again, this time by the more accurate PCR test, with approximate accuracy $a = 0.80$ and $b = 0.99$. This produces a new probability $p = 0.16$ that he is infected, still not so good. But we repeat, getting successive values of $0.037$ and then $0.008$ that he is infected. The numbers are getting smaller, and the trend is plain. Governor DeWine has escaped (this time).
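The chain of numbers quoted for Governor DeWine can be reproduced by iterating the update rules above (a sketch of mine, using the same illustrative sensitivities and specificities):

```python
def update(p, a, b, positive):
    """Posterior probability of infection after one test with sensitivity a
    and specificity b, starting from prior probability p."""
    q = 1 - p
    if positive:
        return a * p / (a * p + (1 - b) * q)
    return (1 - a) * p / ((1 - a) * p + b * q)

p = 0.05                                        # assumed prevalence
p = update(p, a=0.90, b=0.95, positive=True)    # antigen test comes back positive
print(round(p, 3))                              # 0.486, the "0.49" above
for _ in range(3):                              # then three negative PCR tests
    p = update(p, a=0.80, b=0.99, positive=False)
    print(round(p, 3))                          # 0.161, 0.037, 0.008
```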

 

Estimating the number of infected

In effect, the test is a filter, dividing the original population into two smaller ones. One is an approximation to the set of infected, the other an approximation to the set of uninfected.

A second use of these tests is to try to figure out what proportion of the population is in fact infected. This is especially important with COVID-19, because many of the cases show no symptoms at all. The basic idea is pretty simple, and can be best explained by an example.

In fact, let’s work through an example along the same lines, with $N = 1000$, sensitivity $0.90$, specificity $0.95$, and 150 infected. Of the 1000, 177 test positive. We run the same test on those 177 and get 124 testing positive. Repeat: we get 109 positives.

What can we do with these numbers? The positives are comprised of true positives and false positives. The number of true positives is cut down by a factor of $0.90$ in each test, so inevitably the number of true positives is going to decrease as tests go on. But they decrease by a known ratio $a$! After one test, by a ratio $0.90$, after two tests, a ratio $0.90^{2} = 0.81$, after three by $0.90^{3} = 0.729$. Taking this into account, in order to find an approximation to the original number of infected we divide the number of positives by this ratio. We get the following table:

$$ \matrix { \hbox{ test number } & \hbox{ number of positives $N_{+}$ } & \hbox{ ratio $r$ } & N_{+}/r \cr 1 & 177 & 0.9 & 197 \cr 2 & 124 & 0.81 & 152 \cr 3 & 109 & 0.729 & 150 \cr } $$

Already, just running the test twice gives a pretty good guess for the original number infected.
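The bookkeeping behind the table can be sketched with expected counts (my own illustration, not a random simulation); up to rounding it reproduces the numbers above, and the estimate $N_{+}/r$ settles near the true count of 150.

```python
a, b = 0.90, 0.95                     # sensitivity and specificity
infected, uninfected = 150.0, 850.0   # the group still being tested

for t in (1, 2, 3):
    infected *= a                     # true positives: a fraction a keep testing positive
    uninfected *= 1 - b               # false positives: a fraction (1-b) slip through
    positives = infected + uninfected
    print(t, round(positives), round(positives / a**t))   # N_+, then N_+ / a^t
```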

One of my colleagues pointed out to me a simple way to visualize what’s going on. Suppose that we are looking at a population of 1000, 20% of whom are infected:

 

Square with left portion colored red for infected

Then we run a test on it, with 20% sensitivity and 90% specificity.

 

Square showing infected patients and test results

The regions at the top are those whose tests are false.



Remembering Richard Kenneth Guy: Games and Taking on Mountains

This essay is dedicated to the memory of Richard K. Guy (1916-2020) whose life story and accomplishments display a variety of aspects of the complicated landscape of finding fulfilling work on the part of individuals, the concern of society with nourishing those who can produce and use mathematics, and the community of people who have devoted their lives to encourage the flourishing of mathematics. …

Joseph Malkevitch
York College (CUNY)

Introduction

What background is required to do research in mathematics? Who produces new mathematical results and why do such individuals do mathematical research? While mathematics has become a profession and there are individuals who, when asked what they do for a living, respond that they are mathematicians, in the grand scheme of human history of mathematics, its being a profession is of quite recent origin. This essay is dedicated to the memory of Richard K. Guy (1916-2020), whose life story and accomplishments display a variety of aspects of the complicated landscape of finding fulfilling work on the part of individuals, the concern of society with nourishing those who can produce and use mathematics, and the community of people who have devoted their lives to encourage the flourishing of mathematics.

Certainly there are many people who today would say they are “mathematicians” rather than that they teach mathematics, are professors of mathematics or work in industry or government as mathematicians. This group is also notable for having typically earned a doctorate degree in mathematics. Note, however, that many of the individuals who are thought of as great mathematicians of the 19th century and early 20th century who were British or Irish (e.g. Arthur Cayley, James Joseph Sylvester, Augustus de Morgan, William Rowan Hamilton) did not earn doctorate degrees. This was because of the uneven history of where doctorate degrees were awarded. Whereas doctorate degrees were awarded in France, Italy, and Germany as early as the 17th century and even in the United States in late 19th century, well into the 20th century it was common for British mathematicians not to get doctorate degrees. Galileo (1564-1642), though known as a physicist by the public, taught mathematics. And there were early universities in Paris and Bologna where mathematics was part of the curriculum and scholars of mathematics were trained.

However, one can also pursue a “career” involving mathematics if one studies mathematics education, physics, chemistry, economics, biology, computer science, statistics, and a multitude of other “academic” preparations for being employed and one builds on one’s mathematical skills or abilities. I will return to this issue later after taking a look at Richard Guy’s career, noteworthy for the variety of experiences he had in pursuit of his interest and love of mathematics but whose only doctorate was an honorary doctorate. Another thread I will touch on is the view that when mathematics leaps forward it is through the work of relatively young practitioners, and Richard Guy lived and actively worked into old age. He died at 103! Guy was an inspiration for those who love mathematics and are nervous that contributing to mathematics is done only by the relatively young.

Richard Guy–a brief biography

Richard Kenneth Guy was born in England in the town of Nuneaton, which is in the part of the country known as Warwickshire.

Photo of Richard K. Guy

Figure 1 (Photo of Richard Guy by Thane Plambeck, from Palo Alto, CA, courtesy of Wikipedia.)

Eventually he made his way to Cambridge University, whose structure is that one studies at one of various “competing” colleges. In the area of mathematics perhaps the best known of the Cambridge Colleges is Trinity College, partly because this was the college associated with Isaac Newton (1642-1727) but also because such other famous mathematicians as Isaac Barrow (1630-1677), Arthur Cayley (1821-1895), G.H. Hardy (1877-1947), Srinivasa Ramanujan (1887-1920), Bertrand Russell (1872-1970), William Tutte (1917-2002), and William Timothy Gowers studied or worked there. However, Guy went to Caius College, though the “official” name of the college is Gonville and Caius College. For a young country like America it is hard to realize that Gonville and Caius was founded in 1348 (well before Christopher Columbus voyaged to the New World). Some other mathematicians or mathematical physicists who were students at Caius and whose names you might recognize were:

George Green (1793-1841), John Venn (1834-1923), Ronald Fisher (1890-1962), Stephen Hawking (1942-2018)

Photo of Gonville-Caius College at Cambridge

Figure 2 (Photo of Gonville-Caius College at Cambridge, courtesy of Wikipedia.)

John Conway (1937-2020), who died recently, was another person well known to the mathematics community who attended Caius College. Conway’s and Richard Guy’s careers crossed many times! Perhaps Guy’s most influential writing was his joint book with Conway and Elwyn Berlekamp, Winning Ways.

Photo of John Conway

Figure 3 (Photo of John Horton Conway.)

Guy served during World War II in the Royal Air Force where he did work related to forecasting the weather–a critical part of the British war effort.

Subsequently, at various times Guy worked as a teacher or took courses at various places. Guy married Nancy Louise Thirian in 1940. She died in 2010. Richard and Louise had three children. His son Michael Guy also did important work in mathematics. Guy relocated from Britain to Singapore in 1950 and taught mathematics at the University of Malaya, where he worked for 10 years. After leaving Singapore he spent time at the new Indian Institute of Technology in Delhi. In 1965, he “settled down” by moving to Canada where he taught in the mathematics department of the University of Calgary. Although he eventually “retired” from the University of Calgary he continued to show up at his office regularly until not long before he died. In Calgary he supervised several doctoral students. The following information is drawn from the Mathematics Genealogy Project

Roger Eggleton, University of Calgary, 1973
Richard Nowakowski, University of Calgary, 1978
Dan Calistrate, University of Calgary, 1998
Jia Shen, University of Calgary, 2008

In turn Eggleton had at least 8 doctoral students and Nowakowski had at least 10.

While at Calgary, Guy helped organize various mathematics conferences, some of which took place at a location in the mountains not far from Calgary.

 

Photo of Banff Center for the Arts and Creativity

Figure 4 (Photo of the Banff Center for the Arts and Creativity, in Banff, Alberta [Canada], site for some conferences organized by Richard Guy. Photo courtesy of Wikipedia.)

Richard Guy and Louise both enjoyed the exercise and challenge of hiking. Well after many people give up heading out on the trail, he and Louise continued to hike. Guy also continued to hike after his wife died. A point of pride for Guy was that a Canadian Alpine Hut was named for him and Louise. The hut is not open year-round, but in a policy that would have pleased Guy and his wife, “to avoid pressuring the bear habitat, the Guy Hut will be closed between May 1 and November 30 annually.”

Richard Guy eventually got a doctorate degree but it was an honorary degree awarded by the University of Calgary in 1991, relatively late in his long and very productive career as an active mathematician. Much of Guy’s career was in academia but the qualifications for a career in academia vary considerably from one country to another and from one era to another. While most people who are famous for their contributions to mathematics in recent years have earned doctorate degrees, there are some famous examples, including Richard Guy, who found different routes to becoming distinguished contributors to mathematics. Guy, having been born in 1916, presents a relatively common case because of the history of granting the doctorate degree in Great Britain. Let me say something about this situation.

For those attempting to understand the history of mathematics and the development of particular branches of mathematics–geometry, combinatorics, non-associative algebras, partial differential equations, Markov chains, entire functions (a topic in the theory of functions of a complex variable), Banach spaces, etc.–a wonderfully useful tool is the previously noted Mathematics Genealogy Project. On this website, searches for British mathematicians yield mathematical ancestors, but often the person shown as an ancestor was not a doctoral thesis supervisor but rather someone who “tutored” or was a primary influence on someone who went on to earn a reputation in mathematical research without holding a doctorate. Another historical source of information about individual mathematicians (though not Guy, yet) and various aspects of mathematics is the MacTutor History of Mathematics Archive. Also see this fascinating history of the doctorate in Britain.

Richard Guy’s collaborators

MathSciNet lists a large number of people with whom Guy collaborated as an author. He also edited quite a few books. As mentioned earlier, his best-known book, Winning Ways, was a joint project with John Horton Conway and Elwyn Berlekamp.

Photo of Elwyn Berlekamp

Figure 5 (Photo of Elwyn Berlekamp, courtesy of Wikipedia.)

 

Photo of John Horton Conway

Figure 6 (Photo of John Conway)

But during his long life, Guy had many collaborators. Some of these were “traditional” academics who taught at colleges and universities but two of his collaborators stand out for having benefited mathematics somewhat differently: Paul Erdős (1913-1996) and Martin Gardner (1914-2010).

Paul Erdős was a mathematical prodigy, born in Hungary. He not only showed talent for mathematics as a youngster, but also skill at developing new mathematical ideas and proving new results while quite young. (Musical prodigies show a similar split: there are those who at an early age wow audiences with how well they play the violin or piano, and there are prodigies like Mozart and Mendelssohn who at amazingly young ages composed new and beautiful music.) Some prodigies continue to impress with their accomplishments as they age but some die young–Mozart died at 35, Schubert at 31, and Mendelssohn at 38. One can only be sad about the wonderful music they would have created had they lived longer, but we can be thankful for all of the wonderful music they did create.

Paul Erdős did not die young but his career did not follow a usual path. Rather than having a career at one academic institution, he became noted as a traveling ambassador for innovative mathematical questions and ideas, particularly in the areas of number theory, combinatorics, and discrete geometry (including graph theory questions). Erdős’s creativity and brilliance created interest in how closely connected mathematicians were to Erdős via chains of joint publications (books and jointly edited projects do not count). This leads to the concept of the Erdős number. Erdős himself has Erdős number 0. Those who wrote a paper with Erdős have Erdős number 1, and those who wrote a paper with someone who wrote a paper with Erdős have Erdős number 2. Richard Guy’s Erdős number was 1, and some might argue that since he wrote 4 papers with Erdős his Erdős number is really 1/4. If you have written a joint paper with a mathematician and are curious to know what your Erdős number is, you can find out by using a free “service” of the American Mathematical Society: Find your Erdős number!
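
Computing an Erdős number is a shortest-path calculation in the “coauthorship graph,” whose vertices are authors and whose edges join people who have written a joint paper. Here is a minimal sketch of that calculation using breadth-first search; the little graph in the example is made up for illustration and is not real MathSciNet data.

```python
from collections import deque

def erdos_numbers(coauthors, source="Erdős"):
    """Breadth-first search over a coauthorship graph.

    `coauthors` maps each author to the set of people with whom they have
    written a joint paper.  The Erdős number of an author is the length of
    a shortest chain of coauthorships back to Erdős."""
    numbers = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for neighbor in coauthors.get(person, set()):
            if neighbor not in numbers:      # first time reached = shortest chain
                numbers[neighbor] = numbers[person] + 1
                queue.append(neighbor)
    return numbers

# A tiny made-up graph: Guy wrote with Erdős, so his Erdős number is 1.
graph = {
    "Erdős":     {"Guy", "Graham"},
    "Guy":       {"Erdős", "Conway", "Berlekamp"},
    "Graham":    {"Erdős", "Chung"},
    "Chung":     {"Graham"},
    "Conway":    {"Guy", "Berlekamp"},
    "Berlekamp": {"Guy", "Conway"},
}
print(erdos_numbers(graph))
# Guy and Graham get Erdős number 1; Conway, Berlekamp, and Chung get 2.
```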

Photo of Paul Erdos, Ron Graham and Fan Chung

Figure 7 (Photo of Paul Erdős, Ron Graham (recently deceased), who has Erdős number 1, and Fan Chung, Ron Graham’s wife, who also has Erdős number 1. Image courtesy of Wikipedia.)

One can compute an Erdős number for people who lived in the distant past. Thus, Carl Gauss has an Erdős number of 4. (Enter names at the link above in the form Euler, Leonhard. For this particular name, though, no path can be found!)

Martin Gardner (1914-2010) stands out as perhaps the most unusual promoter of mathematics for the general public, the science community, and research mathematicians in history. Not only did Gardner never earn a doctorate degree in mathematics, he did not earn any degree in mathematics. He did earn a degree in philosophy from the University of Chicago in 1936. Eventually he became involved with writing a mathematical recreations column for Scientific American magazine, and over a period of time these columns were published, anthologized, and augmented to create a constant stream of books that were well-received by the general public. Given the title of the column that Gardner wrote for Scientific American, Mathematical Games, it is not surprising that he drew on the work of Berlekamp, Conway, and Guy. These individuals made suggestions to Gardner for columns to write, and he in turn, when he had ideas for a column/article, often drew on their expertise to “flesh out” his ideas. Over the years, an army of Scientific American readers (including me) were stimulated by Gardner’s columns and books to make contributions to problems that Gardner wrote about or to pursue careers related to mathematics. Readers of Gardner’s books and columns also branched out and learned about mathematical topics that they had not seen in mathematics they had learned in school.

Photo of Martin Gardner

Figure 8 (Photo of Martin Gardner, courtesy of Wikipedia.)

Grundy’s game

It may be helpful to give an example of a combinatorial game that is easy to describe and which is perhaps new to some readers. One of my favorite combinatorial games is known as Grundy’s game, named for Patrick Michael Grundy (1917-1959). Like many combinatorial games it is played using a pile of identical coins, beans, or, as in Figure 9, line segments. The game is an example of an impartial game, which contrasts with partisan games where each player has his/her own pieces, as in chess or checkers.

Position for Grundy's game

Figure 9 (An initial configuration for Grundy’s game.)

Position for Grundy's game

Figure 10 (One can get to this position in Grundy’s game from the position in Figure 9 by a legal move.)

Grundy’s game works as follows. The game starts with a single pile of stones (here shown as short line segments in a column). The players alternate in making moves. A move consists of taking an existing pile of segments and dividing it into two piles of unequal size. Thus, for the pile in Figure 9, the player making a move from this position could move to a position where there would be one pile of 6 segments and another of 2 segments. This would be a legal move. A player having to make a move from the position in Figure 10 can NOT make a move based on the pile with two segments, because the only way to divide this pile would create two equal-sized piles, which is not allowed. However, there are several moves from this position: divide the first pile into piles of size 5 and 1 or of size 4 and 2. The game terminates when a player can’t make a move. John Conway promoted this rule of “normal” play in combinatorial games, namely, “if you can’t move, you lose.” If the termination rule is “if you can’t move, you win,” usually called misère play, then an analysis of optimal play is typically much harder.

Rather remarkably, whatever the rules of a two-person impartial game, it turns out that each of its positions is equivalent to a position in the game of Nim (see below) with a single pile. Charles Bouton in 1902 found a way, for any game of Nim with one or many piles, to determine from the sizes of the piles whether the player making the next move would win or lose. In the language of combinatorial game theory, any position can be classified as being either an N position or a P position. An N position is one in which the next player to play can win with perfect play, while a P position is one in which the previous player, the one who produced the position, can win with perfect play.

Note, however, that for a given game, knowing whether a position is an N or a P position does not by itself tell a player how to play optimally–that is, what moves to make to guarantee a win no matter what responses the opponent may make, or, if a position is hopeless, what moves to make to force the opponent to take as many moves as possible to win, perhaps provoking a mistake that turns a losing position into a win. Still, any position of a combinatorial game from which moves can be made is either an N position or a P position.

Nim is played starting from a position with one or more piles of identical objects. A move consists of selecting a single pile and removing any number of objects from it, including removing all of the objects in the pile. Thus, treating the segments of Figure 9 as a Nim position, one could remove 1, 2, 3, 4, 5, 6, 7, or 8 segments from the single pile. The position in Figure 9 is an N position: the player who moves from this position can win by removing all 8 segments. For Figure 10, can you see, with more thought, why this is also an N position? Bouton’s solution to checking a Nim position to see its status as an N position or P position involves representing the number of items in each pile as a binary number–binary notation represents any positive integer using only the symbols 0 and 1. Summarizing what Bouton showed (a small computational check follows the list below):

a. From a P position, every “move” leads to an N position

b. From an N position, there is some move that leads to a P position

c. P positions have a “value” of 0; N positions have a value that is positive.
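
Bouton’s criterion can be stated very concretely: write each pile size in binary and add the columns without carrying (in modern terms, take the bitwise XOR, often called the Nim sum); the position is a P position exactly when this sum is 0. Here is a small computational check of that rule; the particular pile configurations are just examples.

```python
from functools import reduce
from operator import xor

def nim_value(piles):
    """Bouton's rule: the value of a Nim position is the bitwise XOR
    (binary addition without carries) of the pile sizes."""
    return reduce(xor, piles, 0)

def classify(piles):
    return "P position (previous player wins)" if nim_value(piles) == 0 \
        else "N position (next player wins)"

print(classify([8]))        # Figure 9 as a Nim position: value 8, an N position
print(classify([6, 2]))     # 6 XOR 2 = 4, still an N position
print(classify([1, 2, 3]))  # 1 XOR 2 XOR 3 = 0, a P position
```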

The theorem that every position of a two-person impartial combinatorial game has a single Nim value is due independently to the German-born mathematician Roland Percival Sprague (1894-1967) and the English mathematician Patrick Michael Grundy (1917-1959). Remarkably, Richard Guy also showed the same result independently.

Let us come back to Grundy’s game. In light of the discussion above, every initial position of n segments (coins, beans), where n is a positive integer, has a Nim value, that is, a single number that represents the “value” of that position. Suppose one looks at the sequence of Nim values for Grundy’s game starting from a pile of n objects, where each computation can take a fair amount of work. Starting with n = 0 this sequence looks like:

0, 0, 0, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 3, 2, 1, 3, 2, 4, 3, …..

Remember that the meaning of a zero is that if it is your move from that position, then with optimal play by your opponent you will lose the game: no move available to you leads to a victory. If the value in the sequence is not zero, then you have a position from which you can move to a position of value zero, one from which your opponent cannot win!

What can one say about this sequence? The pattern seems rather irregular, and even if you saw a much larger swath of the sequence no pattern would jump out. However, Richard Guy conjectured that this sequence eventually becomes “periodic.” This means that after neglecting some (possibly long) initial piece of the sequence, at some point a block of numbers, perhaps large in size, repeats over and over!
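
For readers who would like to experiment, here is a minimal sketch that reproduces the nim-value sequence displayed above. It uses the mex (minimum excludant) rule together with the fact, from the Sprague-Grundy theory, that the value of a position made of two independent piles is the XOR of the values of the piles; the choice of Python and the memoization are implementation conveniences, not part of Guy’s own treatment.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def grundy(n):
    """Nim value of a single pile of n in Grundy's game.

    A move splits the pile into two unequal nonempty piles a and b.  The
    resulting position, a sum of two independent games, has value
    grundy(a) XOR grundy(b); the value of the pile of n is the mex
    (smallest nonnegative integer not reachable) of all such values."""
    reachable = {grundy(a) ^ grundy(n - a) for a in range(1, (n + 1) // 2)}
    value = 0
    while value in reachable:
        value += 1
    return value

print([grundy(n) for n in range(20)])
# [0, 0, 0, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 3, 2, 1, 3, 2, 4, 3]
```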

Perhaps the work Guy is best known for is the joint work with Berlekamp and Conway on combinatorial games, which resulted in a book through which all three of these men are known to the public–Winning Ways. Taking off from the many new and old games described in this book there have been hundreds of research papers loosely belonging to combinatorial game theory but reaching into every branch of mathematics. While much of Winning Ways is devoted to discrete mathematics (which includes both finite set ideas as well as those involving discrete infinities), the book also treats topics that lead into complex areas related to “infinitely small” and “infinitely large” numbers.

Eventually John Conway wrote about the remarkable ways that studying combinatorial games of various kinds leads to ideas of numbers that go beyond the integers, the rational numbers, and the real numbers–the most widely studied number systems, the latter two being “dense” in the sense that given any two numbers of the system there is another number of the system between them. This work of Conway is described in the book Surreal Numbers by Donald Knuth, which is aimed at a general audience, and in Conway’s more technical book On Numbers and Games. Guy’s work helped popularize many of Conway’s more technical contributions.

There are two major ways mathematics has contributed to understanding “games.” One strand of game theory concerns conflict situations that arise in various attempts to understand issues related to economics–different companies or countries take actions to get good outcomes for themselves in situations that involve other “players.” The other is combinatorial games like Nim, Grundy’s game, and checkers and chess.

Richard Guy’s mathematics

Richard Guy’s influence as a mathematician comes from a blend of the influence he had through broadcasting the ideas and accomplishments of his collaborators and his role in calling attention to problems which could be understood and attacked with more limited tools than the most abstract parts of mathematics. He did not invent dramatic new tools to attack old problems, or chart out new concepts that would create new fields of mathematics or new avenues for mathematical investigation. Nor was he awarded any of the growing number of prizes for dramatic accomplishments in an established field of mathematics (e.g. the work that earned James Maynard the Cole Prize in Number Theory in 2020) or for “lifetime” achievement, such as the Abel Prize awarded to Karen Uhlenbeck in 2019–the first woman to win the Abel Prize.

To get an idea of Guy’s work, it helps to look at a sample of the titles of the articles he authored and co-authored. As you can see, they reflect the cheerful and playful aspects of his personality that he showed in his personal interactions with people.

  • All straight lines are parallel
  • The nesting and roosting habits of the laddered parenthesis
  • What drives an aliquot sequence?
  • John Isbell’s game of beanstalk and John Conway’s game of beans-don’t-talk
  • Primes at a glance
  • The second strong law of small numbers
  • Graphs and the strong law of small numbers
  • Nu-configurations in tiling the square
  • The primary pretenders
  • Catwalks, sandsteps and Pascal pyramids
  • The number-pad game
  • Don’t try to solve these problems!
  • Some monoapparitic fourth order linear divisibility sequences
  • Richard Rick’s tricky six puzzle: S5 sits specially in S6
  • A dozen difficult Diophantine dilemmas
  • New pathways in serial isogons

His most important books were:

  • Berlekamp, E., J. Conway, and R. Guy, Winning Ways for Your Mathematical Plays, Two volumes, Academic Press, 1982.
  • Berlekamp, E., J. Conway, and R. Guy, Winning Ways for Your Mathematical Plays, Four volumes, AK Peters, 2004.
  • Guy, R., Unsolved Problems in Number Theory, Springer Science & Business Media, 2004 (with 2546 citations on Google Scholar as of July, 2020).
  • Croft, H. T., K. Falconer, and R. K. Guy, Unsolved Problems in Geometry, Springer Science & Business Media, 2012.

These problem books have been reissued from time to time to update progress on old problems and to add new ones. They have spurred enormous progress on a wide array of number theory and geometry problems.

Different countries have had varying “success stories” over the years in reaching a general public with the delights of mathematics. In America one of those success stories was Martin Gardner’s contributions, mentioned above. In the Soviet Union a different intriguing phenomenon emerged. In America, the typical way one reached students in schools was to provide them with materials written not by especially distinguished research mathematicians but by authors who “translated” what researchers had done or thought about down to the level of the students. In the Soviet Union, by contrast, MIR Publishers developed series of books written by very distinguished mathematicians and aimed directly at students. These books, originally published in Russian, were eventually made available to an English-speaking audience by a variety of publishers. There was the Little Mathematics Library, marketed through MIR; there was Popular Lectures in Mathematics, distributed by the University of Chicago (under an NSF project led by Izaak Wirszup); there was Topics in Mathematics, published by Heath Publishing, and another version published by Blaisdell Publishing (a division of Random House). Some of the Soviet books appeared in several of these series. While not up-to-date on some of the topics they treat, to this day these books are remarkable examples of exposition. They are books I keep coming back to and which always reward another look.

Perhaps my personal favorite is Equivalent and Equidecomposable Figures, by V. G. Boltyanskii. This book is an exposition of the amazing theorem now often called the Bolyai-Gerwien-Wallace Theorem. In brief, this theorem says that if one has two plane simple polygons of the same area, one can cut the first polygon with straight line cuts into a finite number of pieces which can be reassembled, in the style of a jigsaw puzzle, to form the second polygon. Figure 11 shows a way to cut a square into four pieces and reassemble the parts into an equilateral triangle of the same area as the original square! Much current work seeks the minimal number of pieces needed to carry out dissections between two interesting polygonal shapes of the same area.

Diagram illustrating the Bolyai-Gerwien-Wallace Theorem

Figure 11 (A square cut into 4 pieces which have been reassembled to form an equilateral triangle of the same area. Image courtesy of Wikipedia.)

In the mid-1980s, the Consortium for Mathematics and Its Applications (COMAP) approached the National Science Foundation about creating a series of books similar to those pioneered in the Soviet Union. Figure 12 shows the cover of one such book, developed and written by Richard Guy, aimed at pre-college students and devoted to combinatorial games. Guy’s book Fair Game was published in 1989. His manuscript was submitted as a handwritten text which had to be retyped prior to publication!

Cover of Richard Guy's book Fair Game

Figure 12 (Cover of the book Fair Game, by Richard Guy. Image courtesy of the Consortium for Mathematics and its Applications.)

Richard Guy sometimes referred to himself as an “amateur” mathematician.

While he wrote many papers and books, his legacy is not the theorems that he personally proved but his role as a mathematics enthusiast–someone who loved writing about, teaching about, and doing mathematics. His wonderful promotion of mathematics will live on long after those who knew him personally are gone.

References

Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society’s MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be found via the ACM Digital Library, which also provides bibliographic services.

Some books which Richard Guy authored, co-authored or edited:

Guy, R., The Book of Numbers (joint with John Horton Conway), Springer-Verlag, 1996.

Guy, R., Fair Game, Consortium for Mathematics and Its Applications, 1989.

Guy, R. and R. Woodrow, The Lighter Side of Mathematics, Proc. E. Strens Memorial Conf. on Recr. Math. and its History, Calgary, 1986.

Guy, Richard. Unsolved problems in number theory. Vol. 1. Springer Science & Business Media, 2004.

Other references:

Albers, D. and G. Alexanderson (eds.), Fascinating Mathematical People, Princeton U. Press, 2011. (Includes Richard Guy.)

Albert, M. and R. Nowakowski, (eds), Games of No Chance 3, Cambridge U. Press, NY, 2009.

Conway, J., On Numbers and Games, 2nd. edition, AK Peters, Natick, 2001.

Croft, H. and Kenneth Falconer, Richard K. Guy. Unsolved problems in geometry: unsolved problems in intuitive mathematics. Vol. 2. Springer Science & Business Media, 2012.

Erdős, Paul, and Richard K. Guy, Crossing number problems, The American Mathematical Monthly 80 (1973): 52-58.

Guy, R.K. and Smith, C.A., The G-values of various games. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 52, No. 3, pp. 514-526). Cambridge University Press, 1956.

Guy, R. K., & Nowakowski, R. J. (1995). Coin-weighing problems. The American mathematical monthly, 102(2), 164-167.

Nowakowski, R. (ed.), Games of No Chance, Cambridge U. Press, 1996.

Nowakowski, R. (ed)., More Games of No Chance, Cambridge U. Press, 2002.

AMS Feature Column banner

Just as a YouTube algorithm might recommend videos with more and more extremist views, machine learning techniques applied to crime data can magnify existing injustice. …

Ursula Whitcher
AMS | Mathematical Reviews, Ann Arbor, Michigan

What is predictive policing?

Predictive policing is a law enforcement technique in which officers choose where and when to patrol based on crime predictions made by computer algorithms. This is no longer the realm of prototype or thought experiment: predictive policing software is commercially available in packages with names such as HunchLab and PredPol, and has been adopted by police departments across the United States.

Algorithmic advice might seem impartial. But decisions about where and when police should patrol are part of the edifice of racial injustice. As the political scientist Sandra Bass wrote in an influential 2001 article, “race, space, and policing” are three factors that “have been central in forwarding race-based social control and have been intertwined in public policy and police practices since the earliest days” of United States history.

One potential problem with predictive policing algorithms is the data used as input. What counts as a crime? Who is willing to call the police, and who is afraid to report? What areas do officers visit often, and what areas do they avoid without a specific request? Who gets pulled over, and who is let off with a warning? Just as a YouTube algorithm might recommend videos with more and more extremist views, machine learning techniques applied to crime data can magnify existing injustice.

Measuring bias in predictive policing algorithms

William Isaac
Dr. William Isaac

In 2016, two researchers, the statistician Kristian Lum and the political scientist William Isaac, set out to measure the bias in predictive policing algorithms. They chose as their example a program called PredPol. This program is based on research by the anthropologist P. Jeffrey Brantingham, the mathematician Andrea Bertozzi, and other members of their UCLA-based team. The PredPol algorithm was inspired by efforts to predict earthquakes. It is specifically focused on spatial locations, and its proponents describe an effort to prevent “hotspots” of concentrated crime. In contrast to many other predictive policing programs, the algorithms behind PredPol have been published. Such transparency makes it easier to evaluate a program’s effects and to test the advice it would give in various scenarios.

Kristian Lum
Dr. Kristian Lum

Lum and Isaac faced a conundrum: if official data on crimes is biased, how can you test a crime prediction model? To solve this problem, they turned to a technique used in statistics and machine learning called the synthetic population.

The term “synthetic population” brings to mind a city full of robots, or perhaps Blade Runner-style androids, but the actual technique is simpler. The idea is to create an anonymized collection of profiles that has the same demographic properties as a real-world population.

Blade Runner costumes
A synthetic population isn’t about Blade Runner. (Photo by Zach Chisholm, CC BY 2.0.)

For example, suppose you are interested in correlations between choices for a major and favorite superhero movies in a university’s freshman class. A synthetic population for a ten-person freshman seminar might look something like this:

  1. Education major; Thor: Ragnarok
  2. Education major; Wonder Woman
  3. History major; Wonder Woman
  4. Math major; Black Panther
  5. Music major; Black Panther
  6. Music major; Black Panther
  7. Music major; Thor: Ragnarok
  8. Undeclared; Black Panther
  9. Undeclared; Thor: Ragnarok
  10. Undeclared; Wonder Woman

This is a toy model using just a couple of variables. In practice, synthetic populations can include much more detail. A synthetic population of students might include information about credits completed, financial aid status, and GPA for each individual, for example.

Lum and Isaac created a synthetic population for the city of Oakland. This population incorporated information about gender, household income, age, race, and home location, using data drawn from the 2010 US Census. Next, they used the 2011 National Survey on Drug Use and Health (NSDUH) to estimate the probability that somebody with a particular demographic profile had used illegal drugs in the past year, and randomly assigned each person in the synthetic population to the status of drug user or non-user based on this probabilistic model. They noted that this assignment included some implicit assumptions. For example, they were assuming that drug use in Oakland paralleled drug use nationwide. However, it’s possible that local public health initiatives or differences in regulatory frameworks could affect how and when people actually use drugs. They also pointed out that some people lie about their drug use on public health surveys; however, they reasoned that people have less incentive to lie to public health workers than to law enforcement.
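
A minimal sketch of this kind of probabilistic assignment is shown below. The demographic categories and the drug-use probabilities in it are hypothetical placeholders, not the NSDUH estimates Lum and Isaac actually used, and the real synthetic population carried far more attributes, but the mechanism, drawing each person’s status at random according to a probability attached to their profile, is the same.

```python
import random

# Hypothetical probabilities of past-year drug use by age bracket; the real
# study estimated probabilities from the 2011 NSDUH using a fuller profile
# (age, race, gender, household income), not just age.
USE_PROBABILITY = {"18-25": 0.20, "26-40": 0.15, "41-65": 0.10, "65+": 0.04}

def make_synthetic_person(rng):
    """Draw one anonymized profile and assign drug-use status at random."""
    person = {
        "age": rng.choice(list(USE_PROBABILITY)),
        "tract": rng.randrange(100),   # stand-in for a home Census tract
    }
    person["drug_use"] = rng.random() < USE_PROBABILITY[person["age"]]
    return person

rng = random.Random(0)
population = [make_synthetic_person(rng) for _ in range(100_000)]
overall_rate = sum(p["drug_use"] for p in population) / len(population)
print(f"estimated share of past-year users: {overall_rate:.3f}")
```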

West Oakland BART stop

A West Oakland transit stop. (Photo by Thomas Hawk, CC BY-NC 2.0.)

According to Lum and Isaac’s probabilistic model, individuals living anywhere in Oakland were likely to use illegal drugs at about the same rate. Though the absolute number of drug users was higher in some locations than others, this was due to greater population density: more people meant more potential drug users. Lum and Isaac compared this information to data about 2010 arrests for drug possession made by the Oakland Police Department. Those arrests were clustered along International Boulevard and in an area of West Oakland near the 980 freeway. The variations in arrest levels were significant: Lum and Isaac wrote that these neighborhoods “experience about 200 times more drug-related arrests than areas outside of these clusters.” These were also neighborhoods with higher proportions of non-white and low-income residents.

The PredPol algorithm predicts crime levels in grid locations, one day ahead, and flags “hotspots” for extra policing. Using the Oakland Police crime data, Lum and Isaac generated PredPol crime “predictions” for every day in 2011. The locations flagged for extra policing were the same locations that already had disproportionate numbers of arrests in 2010. Combining this information with their demographic data, Lum and Isaac found that Black people were roughly twice as likely as white people to be targeted by police efforts under this system, and people who were neither white nor Black were one-and-a-half times as likely to be targeted as white people. Meanwhile, estimated use of illegal drugs was similar across all of these categories (white people’s estimated drug use was slightly higher, at just a bit more than 15%).

International blvd poster
A poster reports police brutality on International Blvd. in Oakland. (Photo by Evan Hamilton, CC BY-NC 2.0.)

This striking disparity is already present under the assumption that increased police presence does not increase arrests. When Lum and Isaac modified their simulation to add arrests in targeted “hotspots,” they observed a feedback effect, in which the algorithm predicted more and more crimes in the same places. In turn, this leads to more police presence and more intense surveillance of just a few city residents.
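
The feedback effect can be illustrated with a deliberately crude toy simulation, not the PredPol algorithm itself: two regions have identical true incident rates, but incidents are only recorded where police are sent, and police are always sent to the region with the larger record. The record then piles up in whichever region happened to get attention first.

```python
import random

def runaway_feedback(days=365, true_rate=(0.5, 0.5), seed=1):
    """Toy illustration of the feedback loop: two regions with identical
    true incident rates, but incidents are only *recorded* where police
    are sent, and police are sent wherever the record is largest."""
    rng = random.Random(seed)
    recorded = [1, 1]                    # one recorded incident each to start
    for _ in range(days):
        patrolled = 0 if recorded[0] >= recorded[1] else 1
        if rng.random() < true_rate[patrolled]:
            recorded[patrolled] += 1     # only the patrolled region adds to the record
    return recorded

print(runaway_feedback())   # all new records pile up in one region
```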

In a follow-up paper, the computer scientists Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian worked with a pair of University of Utah undergraduate students to explore feedback effects. They found that if crime reports were weighted differently, with crime from areas outside the algorithm’s “hotspots” given more emphasis, intensified surveillance on just a few places could be avoided. But such adjustments to one algorithm cannot solve the fundamental problem with predictions based on current crime reports. As Lum and Isaac observed, predictive policing “is aptly named: it is predicting future policing, not future crime.”

Further reading

  • Sandra Bass, “Policing Space, Policing Race: Social Control Imperatives and Police Discretionary Decisions,” Social Justice, Vol. 28, No. 1 (83), Welfare and Punishment In the Bush Era (Spring 2001), pp. 156-176 (JSTOR.)
  • Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger and Suresh Venkatasubramanian, Runaway Feedback Loops in Predictive Policing, Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 2018.
  • Kristian Lum and William Isaac, To predict and serve? Significance, October 10, 2016. (The Royal Statistical Society.)
  • Cathy O’Neil, Weapons of Math Destruction.


AMS Feature Column banner

Transmitting Data with Polar Codes

This then is the significance of Arikan’s polar codes: they provide encodings for an important class of channels that enable us to transmit information at the greatest possible rate and with an arbitrarily small error rate. …

David Austin
Grand Valley State University

Introduction

Wireless communication is currently transitioning to a new 5G standard that promises, among other advantages, faster speeds. One reason for the improvement, as we’ll explain here, is the use of polar codes, which were first introduced by Erdal Arikan in 2009 and which are optimal in a specific information-theoretic sense.

This column is a natural continuation of an earlier column that described Shannon’s theory of information. To quickly recap, we considered an information source $X$ to be a set of symbols $\cx$, such as letters in an alphabet, that are generated with a specific frequency $p(x)$. The amount of information generated by this source is measured by the Shannon entropy: $$ H(X) = -\sum_x p(x)\lg p(x). $$

A simple example is an information source consisting of two symbols, such as 0, which is generated with probability $p$, and 1, generated with probability $1-p$. In this case, the Shannon entropy is $$ H(p) = -p\lg(p) -(1-p)\lg(1-p). $$

Illustration of Shannon entropy

We described $H(X)$ as a measure of the freedom of expression created by the source. If $p=1$, then the source can only generate a string of 0’s, so there is only one possible message created by this source; it therefore conveys no information. On the other hand, when $p=1/2$, each symbol is equally possible so any string of 0’s and 1’s is equally likely. There are many possible messages, each of which is equally likely, so a given message can contain a specific meaning, and we think of this as a high-information source.
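
Here is a small numerical check of the binary entropy function $H(p)$ defined above; the sample values of $p$ are arbitrary and the code is only a sketch.

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p lg p - (1-p) lg(1-p), in bits per symbol."""
    if p in (0.0, 1.0):          # by convention 0 lg 0 = 0
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (1.0, 0.9, 0.5):
    print(p, round(binary_entropy(p), 3))
# 1.0 -> 0.0    only one possible message, no information
# 0.9 -> 0.469
# 0.5 -> 1.0    maximal freedom of expression: one full bit per symbol
```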

Entropy is measured in units of bits per symbol. However, if we imagine that one symbol is generated per unit time, we may also think of entropy–measured in bits per unit time–as measuring the rate at which information is generated. We earlier saw how this interpretation helped us determine the maximum rate at which messages generated by the source could be transmitted over a noiseless channel, a fact that is particularly relevant for us here.

Polar codes were created to maximize the rate at which information can be transmitted through a noisy channel, one that may introduce errors in transmission. So before we introduce these codes, we will first describe how Shannon’s theory tells us the maximum rate at which information can be transmitted over a noisy channel.

Transmission over a noisy channel

Let’s consider a channel $W$ through which we send a symbol $x$ and receive a symbol $y$. Ideally, we receive the same symbol that we send, or at least, the sent symbol can be uniquely recovered from the received symbol. In practice, however, this is not always the case.

For example, the following channel sends symbols from the set $\cx=\{0,1\}$ and receives symbols in $\cy=\{0,1\}$. However, there is a chance that the symbol is corrupted in transmission. In particular, the probability is $p$ that we receive the same symbol we send and $1-p$ that we receive the other symbol. Such a channel is called a binary symmetric channel, which we’ll denote as $BSC(p)$.

Our central problem is to understand the maximum rate at which information can be transmitted through such a channel and to find a way to do so that minimizes errors.

To this end, Shannon generalizes the concept of entropy. If $X$ is an information source whose symbols are sent through $W$ and $Y$ is the information source of received symbols, we have joint probabilities $p(x,y)$ that describe the frequency with which we send $x$ and receive $y$. There are also the conditional probabilities $p(y|x)$, the probability that we receive $y$ assuming we have sent $x$, and $p(x|y)$, the probability that we sent $x$ assuming we received $y$. These give conditional entropies $$ \begin{aligned} H(Y|X) = & -\sum_{x,y} p(x,y)\lg p(y|x) \\ H(X|Y) = & -\sum_{x,y} p(x,y)\lg p(x|y). \end{aligned} $$

We are especially interested in $H(X|Y)$, which Shannon calls the channel’s equivocation. To understand this better, consider a fixed $y$ and form $$ H(X|y) = -\sum_x p(x|y) \lg p(x|y), $$ which measures our uncertainty in the sent symbol $x$ if we have received $y$. The equivocation is the average of $H(X|y)$ over the received symbols $y$: $$ H(X|Y) = \sum_y p(y) H(X|y). $$ Thinking of entropy as a measure of uncertainty, equivocation measures the uncertainty we have in reconstructing the sent symbols from the received. We can also think of it as the amount of information lost in transmission.

For example, suppose that our channel is $BSC(1)$, which means we are guaranteed that $y=x$, so the received symbol is the same as the transmitted symbol. Then $p(x|y) = 1$ if $y=x$ and $p(x|y) = 0$ otherwise, which leads us to conclude that $H(X|Y) = 0$. In this case, the equivocation is zero, and no information is lost.

On the other hand, working with $BSC(1/2)$, we are as likely to receive a 0 as we are a 1, no matter which symbol is sent. It is therefore impossible to conclude anything about the sent symbol so all the information is lost. A simple calculation shows that the equivocation $H(X|Y) = H(X)$.

The difference $H(X) - H(X|Y)$, which describes the amount of information generated minus the amount lost in transmission, measures the amount of information transmitted by the channel. Shannon therefore defines the capacity of a noisy channel $W$ to be the maximum $$ I(W) = \max_X[H(X) - H(X|Y)] $$ over all information sources using the set of symbols $\cx$.

  • For example, if $W = BSC(1)$, we have $H(X|Y) = 0$ so $$ I(BSC(1)) = \max_X[H(X)] = 1, $$ which happens when the symbols 0 and 1 appear with equal frequency. The capacity of this channel is therefore 1 bit per unit time.
  • However, if $W=BSC(1/2)$, we have $H(X|Y) = H(X)$ so $$ I(BSC(1/2)) = \max_X[H(X) - H(X|Y)] = 0, $$ meaning this channel has zero capacity. No information can be transmitted through it.

The term capacity is motivated by what Shannon calls the Fundamental Theorem of Noisy Channels:

Suppose $X$ is an information source whose entropy is no more than the capacity of a channel $W$; that is, $H(X) \leq I(W)$. Then the symbols of $X$ can be encoded into $\cx$, the input symbols of $W$, so that the source can be transmitted over the channel with an arbitrarily small error rate. If $H(X) \gt I(W)$, there is no such encoding.

This result may initially appear surprising: how can we send information over a noisy channel, a channel that we know can introduce errors, with an arbitrarily small rate of errors? As an example, consider the channel $W$ with inputs and outputs $\cx=\cy=\{a,b,c,d\}$ and whose transmission probabilities are as shown:

Remembering that $I(W) = \max_X[H(X) - H(X|Y)]$, we find with a little work that the maximum occurs when the symbols of $X$ appear equally often so that $H(X) = 2$. It turns out that the equivocation $H(X|Y) = 1$, which implies that the capacity $I(W) = 1$ bit per unit time.

Suppose now that we have the information source whose symbols are $\{0,1\}$, where each symbol occurs with equal frequency. We know that $H(X) = 1$ so Shannon’s theorem tells us that we should be able to encode $\{0,1\}$ into $\{a,b,c,d\}$ with an arbitrarily small error rate. In fact, the encoding $$ \begin{aligned} 0 & \to a \\ 1 & \to c \end{aligned} $$ accomplishes this goal. If we receive either $a$ or $b$, we are guaranteed that $0$ was sent; likewise, if we receive either $c$ or $d$, we know that $1$ was sent. It is therefore possible to transmit the information generated by $X$ through $W$ without errors.

This example shows that we can use a noisy channel to send error-free messages by using a subset of the input symbols. Another technique for reducing the error rate is the strategic use of redundancy. As a simple example, suppose our channel is $BSC(0.75)$; sending every symbol three times and decoding by majority vote allows us to reduce the error rate from 25% to roughly 16%.
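
The repetition figure is easy to check if we assume the received symbols are combined by majority vote (the decoding rule itself is an assumption made here for concreteness):

```python
from math import comb

def repetition_error_rate(p_correct, n):
    """Probability that majority vote over n independent transmissions
    through BSC(p_correct) decodes the wrong symbol (n odd)."""
    p_err = 1 - p_correct
    return sum(comb(n, k) * p_err**k * p_correct**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(repetition_error_rate(0.75, 1))   # 0.25
print(repetition_error_rate(0.75, 3))   # 0.15625, roughly 16%
print(repetition_error_rate(0.75, 5))   # about 0.104
```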

Unfortunately, the proof of Shannon’s theorem tells us that an encoding with an arbitrarily small error rate exists, but it doesn’t provide a means of constructing it. This then is the significance of Arikan’s polar codes: they provide encodings for an important class of channels that enable us to transmit information at the greatest possible rate and with an arbitrarily small error rate. As we will see, the strategy for creating these codes is a creative use of redundancy.

Symmetric, binary-input channels

Before describing Arikan’s polar codes, let’s take a moment to clarify the type of channels $W:\cx\to\cy$ we will be working with. The inputs are $\cx=\{0,1\}$ and we require the channel to be symmetric, which means there is a permutation $\pi:\cy\to\cy$ such that $\pi^{-1} = \pi$ and $p(\pi(y)|1) = p(y|0)$. Such a channel is called a symmetric, binary-input channel. The symmetry condition simply ensures that the result of transmitting 0 is statistically equivalent to transmitting 1.

A typical example is the binary symmetric channel $BSC(p):\{0,1\}\to\{0,1\}$ that we described earlier. The symmetry condition is met with the permutation $\pi$ that simply interchanges 0 and 1.

There are two quantities associated to a symmetric channel $W$ that will be of interest.

  • The first, of course, is the capacity $I(W)$, which measures the rate at which information can be transmitted through the channel. It’s not hard to see that the capacity $$ I(W) = \max_X[H(X) - H(X|Y)] $$ is found with the information source $X$ that generates 0 and 1 with equal probability. This means that $p(x) = 1/2$ for both choices of $x$.

    Remembering that conditional probabilities are related by $$ p(x|y)p(y) = p(x,y) = p(y|x)p(x), $$ we have $$ \frac{p(x)}{p(x|y)} = \frac{p(y)}{p(y|x)}. $$ This implies that $$ \begin{aligned} -\sum_{x,y} p(x,y) & \lg p(x) + \sum_{x,y}p(x,y)\lg p(x|y) = \\ & -\sum_{x,y} p(x,y) \lg p(y) + \sum_{x,y}p(x,y)\lg p(y|x) \\ \end{aligned} $$ and so $$ H(X) - H(X|Y) = H(Y) - H(Y|X). $$

    Since $p(x) = \frac12$, we also have that $p(y) = \frac12 p(y|0) + \frac12 p(y|1)$ and $p(x,y) = \frac12 p(y|x)$. This finally leads to the expression $$ I(W) = \sum_{x,y} \frac12 p(y|x) \lg \frac{p(y|x)}{\frac12 p(y|0) + \frac12p(y|1)}. $$ This is a particularly convenient form for computing the capacity since it is expressed only in terms of the conditional probabilities $p(y|x)$.

    In particular, for the binary symmetric channel, we find $$ I(BSC(p)) = 1-H(p) = 1 + p\lg(p) + (1-p)\lg(1-p), $$ reinforcing our earlier observation that $H(p)$ is a measure of the information lost in the noisy channel.

    Since the capacity of a symmetric, binary-input channel is computed as $\max_X[H(X) - H(X|Y)]$ where $H(X) \leq 1$, it follows that $0\leq I(W) \leq 1$. When $I(W) = 1$, the equivocation $H(X|Y) = 0$ so no information is lost and this is a perfect channel. In the example above, this happens when $p=0$ or $1$.

    At the other extreme, a channel for which $I(W) = 0$ is useless since no information is transmitted.

  • A second quantity associated to a symmetric channel $W$ measures the reliability with which symbols are transmitted. Given a received symbol $y$, we can consider the product $$ p(y|0)~p(y|1) $$ as a reflection of the confidence we have in deducing the sent symbol when $y$ is received. For instance, if this product is 0, one of the conditional probabilities is zero, which means we know the sent symbol when we receive $y$.

    We therefore define the Bhattacharyya parameter $$ Z(W) = \sum_y \sqrt{p(y|0)~p(y|1)} $$ as a measure of the channel’s reliability. It turns out that $0\leq Z(W)\leq 1$, and lower values of $Z(W)$ indicate greater reliability. For instance, when $Z(W) = 0$, then every product $p(y|0)~p(y|1) = 0$, which means the symbols are transmitted without error.

    We expect $Z(W)$, the channel’s reliability, to be related to $I(W)$, the rate of transmission. Indeed, Arikan shows that $Z(W)\approx 0$ means that $I(W)\approx 1$ and that $Z(W)\approx 1$ means that $I(W)\approx 0$. So a high-quality channel will have $I\approx 1$ and $Z\approx 0$, and a poor channel will have $I\approx 0$ and $Z\approx 1$.

    For the binary symmetric channel $BSC(p)$, we see the following relationship:

Besides the binary symmetric channel, another example of a symmetric, binary-input channel is the binary erasure channel $W:\{0,1\}\to\{0,*,1\}$ as shown below. We think of the received symbol $*$ as an erasure, meaning the symbol was received yet we do not know what it is, so $\ep$ represents the probability that a sent symbol is erased.

Once again, the permutation $\pi$ that interchanges 0 and 1 while fixing $*$ confirms this as a symmetric channel that we denote as $BEC(\ep)$. The diagram gives the conditional probabilities that we need to find the capacity and Bhattacharyya parameter $$ \begin{aligned} I(BEC(\ep)) & = 1-\ep \\ Z(BEC(\ep)) & = \ep. \end{aligned} $$ Notice that if $\ep=0$, there is no possibility of an erasure and the capacity $I(BEC(0)) = 1$ bit per unit time. On the other hand, if $\ep=1$, every symbol is erased so we have $I(BEC(1)) = 0$ telling us this channel can transmit no information.
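
Both of these quantities can be computed directly from a table of the conditional probabilities $p(y|x)$. The sketch below does this for $BSC(p)$ and $BEC(\ep)$, using the capacity formula for symmetric, binary-input channels derived above and the definition of the Bhattacharyya parameter; the particular parameter values are arbitrary.

```python
from math import log2, sqrt

def capacity(W):
    """I(W) for a symmetric binary-input channel, given as
    W[x][y] = probability of receiving y when x is sent."""
    total = 0.0
    for x in (0, 1):
        for y, p_yx in W[x].items():
            if p_yx > 0:
                avg = 0.5 * W[0].get(y, 0) + 0.5 * W[1].get(y, 0)
                total += 0.5 * p_yx * log2(p_yx / avg)
    return total

def bhattacharyya(W):
    """Z(W) = sum over y of sqrt(W(y|0) W(y|1))."""
    ys = set(W[0]) | set(W[1])
    return sum(sqrt(W[0].get(y, 0) * W[1].get(y, 0)) for y in ys)

def BSC(p):    # receive the sent symbol with probability p
    return {0: {0: p, 1: 1 - p}, 1: {1: p, 0: 1 - p}}

def BEC(eps):  # erase the sent symbol with probability eps
    return {0: {0: 1 - eps, "*": eps}, 1: {1: 1 - eps, "*": eps}}

print(capacity(BSC(0.9)), bhattacharyya(BSC(0.9)))    # about 0.531 and 0.6
print(capacity(BEC(0.25)), bhattacharyya(BEC(0.25)))  # 0.75 and 0.25
```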

Polarization: A first step

Polar codes are constructed through a recursive process, the first step of which we’ll now describe. Beginning with a channel $W$, we’ll bundle together two copies of $W$ into a vector channel $W_2:\cx^2\to\cy^2$. Then we’ll pull the vector channel apart into two symmetric, binary-input channels $\chan{1}{2}$ and $\chan{2}{2}$ and observe how the capacity of these two channels is distributed.

In what follows, we will be considering a number of different channels. We will therefore denote the conditional probabilities of a particular channel using the same symbol as the channel. For instance, if $W=BEC(\ep)$, we will write, say, the conditional probability $W(*|1)=\ep$.

Our vector channel $W_2:\cx^2\to\cy^2$ begins with a two-dimensional input $(u_1,u_2)$ and forms $(x_1,x_2) = (u_1\oplus u_2, u_2)$, where $\oplus$ denotes integer addition modulo 2. Then $x_1$ and $x_2$ are transmitted through $W$, as shown below.

This gives the transmission probabilities $$ W_2((y_1,y_2) | (u_1,u_2)) = W(y_1|u_1\oplus u_2)W(y_2|u_2). $$ It is fairly straightforward to see that $I(W_2) = 2I(W)$ so that we have not lost any capacity.

From here, we obtain two channels: $$ \begin{aligned} \chan{1}{2}&: \cx\to\cy^2 \\ \chan{2}{2}&: \cx\to\cy^2\times \cx. \end{aligned} $$ whose transition probabilities are defined by $$ \begin{aligned} \chan{1}{2}((y_1,y_2)|u_1) & = \frac12 \left(W_2((y_1,y_2)|(u_1,0)) + W_2((y_1,y_2)|(u_1,1))\right) \\ \chan{2}{2}(((y_1,y_2),u_1)|u_2) & = \frac12 W_2((y_1,y_2)|(u_1,u_2)). \end{aligned} $$ This may look a little daunting, but we’ll soon look at an example illustrating how this works.

The important point is that we can verify that the total capacity is preserved, $$ I(\chan{1}{2})+I(\chan{2}{2}) = I(W_2) = 2I(W), $$ and redistributed advantageously $$ I(\chan{1}{2})\leq I(W) \leq I(\chan{2}{2}). $$ That is, $\chan{1}{2}$ surrenders some of its capacity to $\chan{2}{2}$, pushing $\chan{1}{2}$ toward a useless channel and $\chan{2}{2}$ toward a perfect channel.

Let’s consider what happens when our channel is a binary erasure channel $BEC(\ep)$.

Working out the transition probabilities for $\chan{1}{2}$, we have $$ \begin{array}{c||c|c} (y_1,y_2) \backslash x & 0 & 1 \\ \hline (0,0) & \frac12(1-\ep)^2 & 0 \\ (0,1) & 0 & \frac12(1-\ep)^2 \\ (0,*) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (1,0) & 0 & \frac12(1-\ep)^2 \\ (1,1) & \frac12(1-\ep)^2 & 0 \\ (1,*) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (*,0) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (*,1) & \frac12(1-\ep)\ep & \frac12(1-\ep)\ep \\ (*,*) & \ep^2 & \ep^2 \\ \end{array} $$ Rather than focusing on the individual probabilities, notice that five of the nine received symbols satisfy $\chan{1}{2}(y|0)~\chan{1}{2}(y|1) \neq 0$. More specifically, if either $y_1=*$ or $y_2=*$, it is equally likely that 0 or 1 was sent.

This observation is reflected in the fact that the capacity and reliability have both decreased. More specifically, we have $$ \begin{aligned} I(\chan{1}{2}) = (1-\ep)^2 & \leq (1-\ep) = I(W) \\ Z(\chan{1}{2}) = 2\ep-\ep^2 & \geq \ep = Z(W), \\ \end{aligned} $$ where we recall that larger values of the Bhattacharyya parameter indicate less reliability.

Let’s compare this to the second channel $\chan{2}{2}$. The received symbols are now $\cy^2\times \cx$. If we work out the conditional probabilities for half of them, the ones having the form $((y_1,y_2),0)$, we find $$ \begin{array}{c||c|c} ((y_1,y_2),0) \backslash x & 0 & 1 \\ \hline ((0,0),0) & \frac12(1-\ep)^2 & 0 \\ ((0,1),0) & 0 & 0 \\ ((0,*),0) & \frac12(1-\ep)\ep & 0 \\ ((1,0),0) & 0 & \frac12(1-\ep)^2 \\ ((1,1),0) & 0 & 0 \\ ((1,*),0) & 0 & \frac12(1-\ep)\ep \\ ((*,0),0) & \frac12(1-\ep)\ep & 0 \\ ((*,1),0) & 0 & \frac12(1-\ep)\ep \\ ((*,*),0) & \frac12\ep^2 & \frac12\ep^2 \\ \end{array} $$ This channel behaves much differently. For all but one received symbol, we can uniquely determine the sent symbol $x$. In particular, we can recover the sent symbol even if one of the received symbols has been erased. This leads to an increase in the capacity and a decrease in the Bhattacharyya parameter: $$ \begin{aligned} I(\chan{2}{2}) = 1-\ep^2 & \geq 1-\ep = I(W) \\ Z(\chan{2}{2}) = \ep^2 & \leq \ep = Z(W). \\ \end{aligned} $$

The capacities of these channels, as a function of $\epsilon$, are shown below.

It’s worth thinking about why the second channel $\chan{2}{2}$ performs better than the original. While the vector channel $W_2:\cx^2\to\cy^2$ transmits the pair $(u_1,u_2)$, $\chan{2}{2}$ assumes that $u_1$ is transmitted without error so that the received symbol is $((y_1,y_2), u_1)$. Since we know $u_1$ arrives safely, we are essentially transmitting $u_2$ through $W$ twice. It is this redundancy that causes the capacity and reliability to both improve. In practice, we will need to deal with our assumption that $u_1$ is transmitted correctly, but we’ll worry about that later.
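
The single polarization step can also be checked numerically: build the transition tables of $\chan{1}{2}$ and $\chan{2}{2}$ from those of $W$ using the defining formulas, then compute capacity and reliability as before. The sketch below does this for $BEC(0.25)$ and recovers the values $(1-\ep)^2$, $2\ep-\ep^2$, $1-\ep^2$ and $\ep^2$ quoted above.

```python
from math import log2, sqrt
from itertools import product

def I(W):   # capacity of a symmetric binary-input channel, W[x][y] = prob
    ys = set(W[0]) | set(W[1])
    return sum(0.5 * W[x][y] *
               log2(W[x][y] / (0.5 * W[0].get(y, 0) + 0.5 * W[1].get(y, 0)))
               for x in (0, 1) for y in ys if W[x].get(y, 0) > 0)

def Z(W):   # Bhattacharyya parameter
    ys = set(W[0]) | set(W[1])
    return sum(sqrt(W[0].get(y, 0) * W[1].get(y, 0)) for y in ys)

def polarize_once(W):
    """Build the two derived channels from W using the defining formulas."""
    ys = set(W[0]) | set(W[1])
    W1 = {u1: {} for u1 in (0, 1)}     # outputs (y1, y2)
    W2 = {u2: {} for u2 in (0, 1)}     # outputs ((y1, y2), u1)
    for u1, u2, y1, y2 in product((0, 1), (0, 1), ys, ys):
        p = W[u1 ^ u2].get(y1, 0) * W[u2].get(y2, 0)   # W_2((y1,y2)|(u1,u2))
        W1[u1][(y1, y2)] = W1[u1].get((y1, y2), 0) + 0.5 * p
        W2[u2][((y1, y2), u1)] = 0.5 * p
    return W1, W2

eps = 0.25
BEC = {0: {0: 1 - eps, "*": eps}, 1: {1: 1 - eps, "*": eps}}
W1, W2 = polarize_once(BEC)
print(I(W1), Z(W1))   # (1-eps)^2 = 0.5625 and 2*eps - eps^2 = 0.4375
print(I(W2), Z(W2))   # 1 - eps^2 = 0.9375 and eps^2 = 0.0625
```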

If we had instead considered a binary symmetric channel $W = BSC(p)$, we find a similar picture:

Polarization: The recursive step

We have now taken the first step in the polarization process: We started with two copies of $W$ and created two new polarized channels, one with improved capacity, the other with diminished capacity. Next, we will repeat this step recursively so as to create new channels, some of which approach perfect channels and some of which approach useless channels.

When $N$ is a power of 2, we form the vector channel $W_{N}:\cx^{N}\to\cy^{N}$ from two copies of $W_{N/2}$. The recipe to create $W_4$ from two copies of $W_2$ is shown below.

From the inputs $(u_1, u_2, \ldots, u_{N})$, we form $(s_1,\ldots, s_{N})$ as $s_{2i-1} = u_{2i-1}\oplus u_{2i}$ and $s_{2i} = u_{2i}$. We then use a reverse shuffle permutation to form $$ (v_1,v_2,\ldots,v_{N}) = (s_1, s_3, \ldots, s_{N-1}, s_2, s_4,\ldots, s_{N}), $$ which serve as the inputs for two copies of $W_{N/2}$.

While that may sound a little complicated, the representation of $W_8$ shown below illustrates the underlying simplicity.

We now form new symmetric, binary-input channels $$ \chan{i}{N}:\cx\to \cy^N\times \cx^{i-1} $$ as we did before: $$ \begin{aligned} \chan{i}{N}(((y_1,\ldots,y_N),& (u_1,\ldots,u_{i-1}))~|~u_i) = \\ & \frac1{2^{N-1}}\sum_{(u_{i+1},\ldots,u_N)} W_N((y_1,\ldots,y_N)|(u_1,\ldots,u_N)). \end{aligned} $$

Arikan then proves a remarkable result:

If we choose $\delta\in(0,1)$, then as $N$ grows through powers of two, the fraction of channels satisfying $I(\chan{i}{N}) \in (1-\delta, 1]$ approaches $I(W)$. Likewise, the fraction of channels with $I(\chan{i}{N}) \in [0,\delta)$ approaches $1-I(W)$.

Suppose, for instance, that we begin with a channel whose capacity $I(W) = 0.75$. As we repeat the recursive step, we find that close to 75% of the channels are nearly perfect and close to 25% are nearly useless.

More specifically, with the binary erasure channel $BEC(0.25)$, whose capacity is 0.75, and $N=1024$, the channel capacities are as shown below. About three-quarters of the channels are close to perfect and about one-quarter are close to useless.

For comparison, here’s the result with $BEC(0.75)$, whose capacity is 0.25.

It should be no surprise that the channels $\chan{i}{N}$ with small $i$ are close to useless while the ones with large $i$ are close to perfect. Remember that $\chan{i}{N}$ assumes that the symbols $(u_1,\ldots,u_{i-1})$ are transmitted without error. When $i$ is large, we know that many of the symbols are reliably transmitted so the channel transmits $u_i$ with a lot of redundancy. This redundancy causes an increase in channel capacity $I(\chan{i}{N})$ and a corresponding decrease in our reliability measure $Z(\chan{i}{N})$.
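
For the binary erasure channel the whole recursion can be followed exactly because, as the single-step computation above suggests, each step sends a channel with erasure probability $\ep$ to a pair of channels with erasure probabilities $2\ep-\ep^2$ and $\ep^2$. The sketch below uses this to generate the $N=1024$ polarized capacities for $BEC(0.25)$ and to count how many are nearly perfect or nearly useless; the cutoff $\delta$ is an arbitrary choice.

```python
def bec_polarized_capacities(eps, n_levels):
    """Erasure probabilities of the 2**n_levels channels polarized from
    BEC(eps): each step maps eps -> 2*eps - eps**2 (the degraded channel)
    and eps -> eps**2 (the upgraded one).  Returns capacities I = 1 - eps."""
    erasure = [eps]
    for _ in range(n_levels):
        erasure = [e for x in erasure for e in (2 * x - x * x, x * x)]
    return [1 - e for e in erasure]

caps = bec_polarized_capacities(0.25, 10)    # N = 1024 channels
delta = 0.01
near_perfect = sum(c > 1 - delta for c in caps) / len(caps)
near_useless = sum(c < delta for c in caps) / len(caps)
print(near_perfect, near_useless)   # fractions approaching 0.75 and 0.25 as N grows
```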

Encoding and decoding

Now that we have a large number of nearly perfect channels, we can describe how data is encoded and decoded to enable transmission through them. For some large value of $N$, we have approximately $NI(W)$ nearly perfect channels and $N-NI(W)$ nearly useless channels. Our goal is to funnel all the data through the nearly perfect channels and ignore the others.

To illustrate, consider the channel $W = BEC(0.5)$ with $I(W) = 0.5$, and suppose we have created $N=8$ channels. This is a relatively small value of $N$, but we expect $NI(W) = 4$ nearly perfect channels and $N-NI(W)=4$ nearly useless channels. The capacity of the 8 channels is as shown here.

We will choose channels 4, 6, 7, and 8 to transmit data. To do so, we will fix values for $u_1$, $u_2$, $u_3$, and $u_5$ and declare these to be “frozen” (these are polar codes, after all). Arikan shows that any choice for the frozen bits works equally well; this shouldn’t be too surprising since their corresponding channels transmit hardly any information. The other symbols $u_4$, $u_6$, $u_7$, and $u_8$ will contain data we wish to transmit.

Encoding the data means evaluating the $x_i$ in terms of the $u_j$. As seen in the figure, we perform $N/2=4$ additions $\lg(N) = 3$ times, which illustrates why the complexity of the encoding operation is $O(N\log N)$.
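
Here is a recursive sketch of the encoder following the description above: combine adjacent inputs, apply the reverse shuffle, and encode each half with the length-$N/2$ transform. The indexing is 0-based, and the sample input uses the frozen positions of the $N=8$ example with arbitrary data bits.

```python
def polar_encode(u):
    """Map u = (u_1, ..., u_N), N a power of 2, to the channel inputs
    x = (x_1, ..., x_N): form s_{2i-1} = u_{2i-1} XOR u_{2i}, s_{2i} = u_{2i},
    apply the reverse shuffle, then encode each half recursively."""
    n = len(u)
    if n == 1:
        return list(u)
    s = [u[i] ^ u[i + 1] if i % 2 == 0 else u[i] for i in range(n)]
    v = s[0::2] + s[1::2]              # reverse shuffle permutation
    return polar_encode(v[:n // 2]) + polar_encode(v[n // 2:])

# N = 8 with frozen bits u1 = u2 = u3 = u5 = 0 (positions 0, 1, 2, 4 in
# 0-based indexing) and data bits 1, 0, 1, 1 in the remaining positions.
u = [0, 0, 0, 1, 0, 0, 1, 1]
print(polar_encode(u))   # prints [1, 0, 1, 0, 0, 1, 0, 1]
```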

Now we come to the problem of decoding: if we know ${\mathbf y} = (y_1,\ldots, y_N)$, how do we recover ${\mathbf u} = (u_1,\ldots, u_N)$? We decode ${\mathbf y}$ into the vector $\hat{\mathbf u} = (\hat{u}_1,\ldots,\hat{u}_N)$ working from left to right.

First, we declare $\hat{u}_i = u_i$ if $u_i$ is frozen. Once we have found $(\hu_1, \ldots, \hu_{i-1})$, we define the ratio of conditional probabilities $$ h_i = \frac{\chan{i}{N}(({\mathbf y}, (\hu_1,\ldots,\hu_{i-1}))|0)} {\chan{i}{N}(({\mathbf y}, (\hu_1,\ldots,\hu_{i-1}))|1)}. $$ If $h_i \geq 1$, then $u_i$ is more likely to be 0 than 1 so we define $$ \hu_i = \begin{cases} 0 & \mbox{if}~ h_i \geq 1 \\ 1 & \mbox{else.} \end{cases} $$

Two observations are important. First, suppose we begin with ${\mathbf u}=(u_1,\ldots,u_N)$ and arrive at $\hat{\mathbf u} = (\hu_1,\ldots,\hu_N)$ after the decoding process. If $\hat{\mathbf u} \neq {\mathbf u}$, we say that a block error has occurred. Arikan shows that, as long as the fraction of non-frozen channels is less than $I(W)$, then the probability of a block error approaches 0 as $N$ grows. This makes sense intuitively: the channels $\chan{i}{N}$ we are using to transmit data are nearly perfect, which means that $Z(\chan{i}{N})\approx 0$. In other words, these channels are highly reliable so the frequency of errors will be relatively small.

Second, Arikan demonstrates a recursive technique for finding the conditional probabilities $\chan{i}{N}$ in terms of $\chan{j}{N/2}$, which means we can evaluate $h_i$ with complexity $O(\log N)$. Therefore, the complexity of the decoding operation is $O(N\log N)$, matching the complexity of the encoding operation.

The complexity of the encoding and decoding operations means that they can be practically implemented. So now we find ourselves with a practical means to transmit data at capacity $I(W)$ and with an arbitrarily small error rate. We have therefore constructed an encoding that is optimal according to the Fundamental Theorem of Noisy Channels.

Summary

Arikan’s introduction of polar codes motivated a great deal of further work that brought the theory to a point where it could be usefully implemented within the new 5G standard. One obstacle to overcome was that of simply constructing the codes by determining which channels should be frozen and which should transmit data.

In our example above, we considered a specific channel, a binary erasure channel $W = BEC(\ep)$, for which it is relatively easy to explicitly compute the capacity and reliability of the polarized channels $\chan{i}{N}$. This is an exception, however, as there is not an efficient method for finding these measures for a general channel $W$. Fortunately, later work by Tal and Vardy provided techniques to estimate the reliability of the polarized channels in an efficient way.

Additional effort went into improving the decoding operation, which was, in practice, too slow and error prone to be effective. With this hurdle overcome, polar codes have now been adopted into the 5G framework, only 10 years after their original introduction.

 

References

AMS Feature Column banner

Mathematics and the Family Tree of SARS-Cov-2

It's a remarkable process and, if it weren't so dangerous to humanity, would be deserving of admiration. …

Bill Casselman
University of British Columbia, Vancouver, Canada

Introduction

There are two major ways in which mathematics has contributed to our understanding of the disease CoVid-19 and the coronavirus SARS-Cov-2 that causes it. One that is very prominent is through mathematical modelling, which attempts to predict the development of the epidemic in various circumstances. In earlier Feature Columns I wrote about modelling measles epidemics, and much of what I wrote there remains valid in the new environment. With the appearance of the extremely dangerous CoVid-19, mathematical modelling has become even more important, but the complexity of this work has also become apparent, and maybe not suitable for discussion here.

Another relevant application of mathematics attempts to track the evolution of the virus responsible. The core of every virus is a small string of replicating genetic material, either DNA or RNA, which sits enclosed in a structure of some kind. The structure serves to keep the virus in one piece, but also, and most crucially, facilitates the insertion of the genetic material into target cells. (It is in this respect that SARS-Cov-2 outdoes SARS-Cov-1.) The genetic material then takes over the machinery of the invaded cell in order to reproduce itself. It's a remarkable process and, if it weren't so dangerous to humanity, would be deserving of admiration.

In larger animals, the basic genetic material amounts to double helices made up of two complementary strands of nucleotides—that is, chains of the four nucleotides A (adenine), G (guanine), C (cytosine), and T (thymine). In the virus that produces CoVid-19 the genetic material is RNA, which in the cells of larger organisms plays little role in reproduction itself but is involved in translating genetic instructions into the construction of proteins. It is a single strand of about 30,000 nucleotides, in which T is replaced by U (uracil). I'll ignore this distinction.

There is some variation in these strands among different populations of viruses, because strands that differ slightly can often still work well. The variation is caused by random mutations, which substitute one nucleotide for another, and a few other possibly damaging events such as deletions and insertions of genetic material. I do not know what, exactly, causes these, but some viruses are much more prone than others to mutate. For example, what makes the common influenza viruses so difficult to control is that they mutate rapidly, particularly in ways that make it more difficult for antibodies to recognize them from one year to the next. It's easy to believe that this "strategy" is itself the product of natural selection.

The good news is that coronaviruses apparently do not mutate rapidly. They do mutate enough, however, that RNA sequences of viruses from different parts of the world may be interpreted as a kind of geographical fingerprint. Keeping track of these different populations is important in understanding the spread of the disease. Similar techniques are also used to try to understand how SARS-Cov-2 originated in wild animal populations.

Where's the mathematics?

I'll try to give some idea of how things work by looking at a very small imaginary animal, with just four nucleotides in its genome. Every generation, it divides in two. In the course of time each of the progeny can suffer mutations at one site of the genome, and eventually each of these in turn divides. The process can be pictured in a structure that mathematicians call a tree. In the following example, which displays the process through three generations, time is the vertical axis. The animal starts off with genome ACGT. The colored edges mark mutations.

 

A structure like this is called a rooted tree by mathematicians and biologists. It is a special kind of graph.
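
The toy process just described is easy to simulate. Here is a rough sketch in Python; the mutation probability per child is my own assumption, introduced only to make the simulation concrete, and the output is the list of genomes in the final generation rather than the tree itself.

```python
import random

NUCLEOTIDES = "ACGT"

def mutate_one_site(genome, rng):
    """Return a copy of genome with one randomly chosen site changed to a
    different nucleotide (a toy model of a point mutation)."""
    site = rng.randrange(len(genome))
    new = rng.choice([n for n in NUCLEOTIDES if n != genome[site]])
    return genome[:site] + new + genome[site + 1:]

def evolve(genome, generations, mutation_prob, rng):
    """Every generation each genome divides in two, and each child may
    suffer a mutation at one site with probability mutation_prob."""
    population = [genome]
    for _ in range(generations):
        next_gen = []
        for g in population:
            for _ in range(2):
                child = mutate_one_site(g, rng) if rng.random() < mutation_prob else g
                next_gen.append(child)
        population = next_gen
    return population

rng = random.Random(0)
print(evolve("ACGT", 3, 0.5, rng))   # the eight genomes after three generations
```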

And now I can formulate the basic problem of phylogenetics: Suppose we are given just the genetics of the current population. Can we deduce the likely history from only these data? The answer is that we cannot, but there are tools that will at least approximate the history. They are all based on constructing a graph much like the one above in which genomes that are similar to each other are located close to one another. But how to measure similarity? How to translate a measure of similarity to a graph? You can already see one difficulty in the example above. The genome ACCT occurs three times in the final population, but while two of these copies are related, in the sense that they descend from a common ancestor carrying the mutation and are therefore tied to the history, the third is the consequence of an independent mutation. It is difficult to see how this could be captured by any analysis of the final population.

It is this problem that the website NextStrain is devoted to. It is most distinguished for its wonderful graphical interpretation of the evolution of different populations of SARS-Cov-2. But, as far as I can see, it tells you next to nothing about the underlying theory.

In the rest of this column, I'll try to give some rough idea of how things go. But before I continue, I want to come back to an earlier remark, and modify the basic problem a bit. I am no longer going to attempt to reconstruct the exact history. In reality, this is never the right question, anyway—for one thing, no process is as simple as the one described above. Reproduction is irregular, and one generation does not occupy steps of a fixed amount of time. Some organisms simply die without reproducing. Instead, we are going to interpret a tree as a graphical way to describe similarity, with the point being that the more similar two genomes are, the closer they are likely to be in our tree, and the more recent their common ancestor. In particular, not all branches in our tree will be of the same length. The tree

 

illustrates only that A and B are more closely related than A and C or B and C. The actual lengths of edges in a tree will not be important, or even their orientation, as long as we know which node is the root. It is just its topological structure that matters.

Listing trees

All the phylogenetic methods I know of deal with trees that are rooted and binary. This means that (i) they all start off from a single node at the bottom and go up; (ii) whenever they branch, they do so in pairs. The endpoints of branches are called (naturally) leaves. We shall also be concerned with labeled trees in which a label of some kind is attached to every leaf. The labels we are really interested in are genomes, but if we number the genomes we might as well use integers as labels.

In order to understand some of the methods used to assemble relationship trees, we should know how to classify rooted binary trees (both unlabeled and labeled) with a given number of leaves.

There are two types of classification involved. First one lists all rooted binary trees of a given shape, and then one looks at how to label them. For a small number of leaves both these steps are very simple, if slightly tedious.

Two leaves. They all look like the figure on the left:

 

And there is essentially one way to label them, shown on the right. What do I mean, "essentially"? I'll call two trees essentially the same if there exists a transformation of one of them into the other that takes edges to edges and the root to the root. I'll call two labelings the same if such a transformation changes one into the other. In the tree above the transformation just swaps branches.

Three leaves. There is only essentially one unlabeled tree. To start with, we plant a root. Going up from this are two branches. One of them must be a leaf, and the other must branch off into two leaves. So the tree looks essentially like this:

 

As for labelings, there are three. One must first choose a label for the isolated leaf, then the others are `essentially' determined:

 

(There are three other possible labelings, but each yields a tree essentially the same as one of those above.)

Four leaves. There are two types of unlabeled trees: one in which the root branches into an isolated leaf and an unlabeled tree with three leaves, the other with two branches of two leaves each.

 

There are 15 labelings. There is a systematic way to list all rooted trees with $n$ leaves, given a list of all those with $n-1$, but I won't say anything about it, other than to hint at it in the following diagram:

 

I will say, however, that there are 105 labeled trees with 5 leaves, and that in general, there are $1\cdot 3 \cdots (2n-3)$ labeled trees with $n$ leaves. (I refer for this to Chapter 2 of the lecture notes by Allman and Rhodes.) This number grows extremely rapidly. It is feasible to make lists for small values of $n$, but not large ones. This impossibility plays a role in reconstructing the phylogeny of a large set of viruses.
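
The systematic listing hinted at above can also be expressed in a few lines of code: every labeled rooted binary tree on $n$ leaves arises exactly once by attaching the new leaf $n$ either above the root or along one of the edges of a tree on $n-1$ leaves, and counting those attachment points is where the factor $2n-3$ comes from. A sketch, using my own representation of trees as nested pairs of leaf labels:

```python
def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching leaf to tree.

    A tree is either a leaf label or a pair (left, right); attaching above a
    subtree replaces that subtree by (subtree, leaf)."""
    yield (tree, leaf)                     # attach above the root of this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def labeled_rooted_binary_trees(n):
    """All labeled rooted binary trees with leaves 1, ..., n (n >= 2)."""
    trees = [(1, 2)]
    for leaf in range(3, n + 1):
        trees = [t for old in trees for t in insert_leaf(old, leaf)]
    return trees

for n in range(2, 7):
    print(n, len(labeled_rooted_binary_trees(n)))   # 1, 3, 15, 105, 945
```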

Reconstruction

Reconstructing the entire history is, for all practical purposes, impossible. What one attempts is just to find an approximation to it. The structure of the candidate graph should reflect at least how close the items are to each other. As explained in the lecture notes by Allman and Rhodes, there are many different ways to measure closeness, and many different techniques of reconstruction. None is perfect, and indeed there is no perfect method possible.

The techniques come in two basic flavours. Some construct a candidate tree directly from the data, while others search through a set of trees looking for the one that is best, according to various criteria. The ones that construct the tree directly don't seem to do well, and the methods most often used examine a lot of trees to find good ones. I'll look at the one that Allman and Rhodes call parsimony. It is not actually very good, but it is easy to explain, and gives some idea of what happens in general. It tries to find the tree that is optimal in a very simple sense—roughly, it minimizes the total number of mutations implied by the connections of the tree. (The term, like many in modern biology, has taken on a technical meaning close to, but not quite the same as, that in common usage. According to Wikipedia, parsimony refers to the quality of economy or frugality in the use of resources.)

In this method, one scans through a number of rooted trees whose labels are the given genomes, and for each one determines a quantitative measure (which I'll call its defect) of how closely the genomes relate to each other. One then picks out the tree with the minimum defect as the one we want. In fact, in this method, as in all of this type, there may be several more or less equivalent candidates.

How is the defect of a candidate labeled tree computed? I'll illustrate how this goes for only one set of genomes and one tree:

 

Step 1. The defect of the tree is the sum of the defects of each of its nucleotide sites (in our case, four). This is done by calculating at every node of the tree a certain subset of nucleotides and a certain defect, according to this rule: (i) if the node is a leaf, assign it just the nucleotide in question, with defect 0; (ii) progress from the leaves down: at an internal node whose branches have already been dealt with, assign either (a) the intersection of the subsets just above it, if that intersection is non-empty, or (b) otherwise, their union. In case (a), assign as defect the sum of the defects just above; in case (b), assign that sum plus 1. These four diagrams show what happens for our example.

 

 

The total defect for this labelled tree is therefore 4.
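
The rule in Step 1 is essentially what the phylogenetics literature calls Fitch's algorithm for small parsimony, and it translates directly into code. In the sketch below the tree (written as nested pairs of leaf labels) and the four-letter genomes are hypothetical, chosen only to show the mechanics; they are not the ones in the figures above.

```python
def site_defect(tree, genomes, site):
    """Return (nucleotide_set, defect) for one site of one labeled tree.

    tree is a leaf label or a pair (left, right); genomes maps leaf labels
    to strings such as 'ACGT'."""
    if not isinstance(tree, tuple):                    # a leaf: its nucleotide, defect 0
        return {genomes[tree][site]}, 0
    set_l, d_l = site_defect(tree[0], genomes, site)
    set_r, d_r = site_defect(tree[1], genomes, site)
    common = set_l & set_r
    if common:                                         # non-empty intersection: no new mutation
        return common, d_l + d_r
    return set_l | set_r, d_l + d_r + 1                # empty intersection: union, defect + 1

def tree_defect(tree, genomes):
    """Total defect of a labeled tree: the sum of the defects of all sites."""
    length = len(next(iter(genomes.values())))
    return sum(site_defect(tree, genomes, i)[1] for i in range(length))

genomes = {1: "ACCT", 2: "ACGT", 3: "ACCT", 4: "GCGT"}     # hypothetical genomes
print(tree_defect(((1, 2), (3, 4)), genomes))              # total defect 3 for this example
```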

Step 2. In principle, you now want to scan through all the trees with $n$ leaves, and for each one, calculate its defect. There will be one or more trees with minimal defect, and these will be the candidates for the phylogeny. In practice, there will be too many trees to look at them all, so you examine a small collection of sample trees and then look at a few of their neighbours to see if you can make the defect go down. The trick here, as with many similar methods, is to choose samples well. I refer to Chapter 9 of Allman-Rhodes for details.

Dealing with SARS-Cov-2

You can test a method for reconstructing an evolution by running some evolutionary processes and then applying the method to see how its output compares with the true history. In practice, the structures parsimony produces are generally not satisfactory, and the method used most often is "Maximum Likelihood." Allman-Rhodes say something about that method, too. As with every one of the methods developed so far, there are both theoretical and practical difficulties with it. But in most situations its results seem to be satisfactory.

With SARS-Cov-2, there are at least two very satisfactory sources of data — GenBank and GISAID. I say something about these in the reference list to follow. Both of these websites offer many thousands of genomes for SARS-Cov-2 as well as other organisms. These have been contributed by medical researchers from all over the world, and they are well documented.

Reading further

Mathematics

Biology

  • University of Washington software list

    To a mathematician, the sheer volume of publications on medical topics is astonishing. I list this site to illustrate this even for the relatively arcane topic of phylogeny. Incidentally, the original reason for developing phylogenetic tools was to track the evolutionary development of life on Earth, which was a matter of interest even before Darwin's theory of evolution appeared. For many years, it was based on morphological aspects rather than genomes. But now that gene sequencing is so simple, it has also become an important part of medical investigation.


Points, Lines, and Incidences

To honor Mathematics and Statistics Awareness Month, I will treat some remarkable developments concerning a basic intuitive idea in geometry concerning points and lines. …

Joseph Malkevitch
York College (CUNY)

Introduction

When people think about mathematics they think of numbers and shapes. "Numbers" covers the interest of mathematics in arithmetic and algebra, and "shape" relates to mathematics' concern with geometry. However, in many ways a better way to capture the essence of mathematics is to say that it is concerned with understanding patterns of all kinds, whether they be numerical or geometrical patterns or patterns in more general senses.

To honor Mathematics and Statistics Awareness Month, I will treat some remarkable developments concerning a basic intuitive idea in geometry concerning points and lines. For our discussion I will be concerned with points, lines and segments (portions of lines between two points) that are drawn in the Euclidean plane in contrast to points and lines on another surface such as a sphere or ellipsoid or in the hyperbolic (many parallel lines) or projective planes (no parallel lines). Many of these ideas have ancient roots in many cultures, but often mathematics reinvents itself by trying to understand ideas that at first glance might seem to have been resolved in the distant past.

Configurations of points and lines

Consider the diagram in Figure 1 that shows two triangles being viewed by an eye. While this diagram has infinitely many points on the lines, we will only be interested in the points accentuated with dots. In this diagram there are many segments but some of these lie along the same line. In drawings of this kind we can show only a part of a line. Note that the corresponding vertices of the triangles determine three lines: AA', BB' and CC' that meet at the point called Eye. Note also that we will consider EyeA', EyeA and AA' as three different segments even though EyeA' is made up of two smaller (shorter) segments.

Partial Desargues configuration

Figure 1 (The vertices of two triangles viewed from the point named Eye.)
 

If you count the number of points in Figure 1, there are 7, and the number of lines is 9. Three of the lines have three points on them and the other 6 lines have two points on them. Note that while the segment EyeC and the segment AB meet at a point, we will not be concerned with that point or the one where segment A'B' meets segment CC'. We need to decide how to report the information we see. The segment EyeA' consists of the segments EyeA and AA', and the points Eye, A and A' all lie on one line.

The set of labeled points shown in Figure 1 accounts for some two-point lines, and some of the points lie on 3-point lines, but note that the diagram does not show all of the lines that the 7 points might determine. If 7 points were in "general position" they would determine $21 = (7\cdot 6)/2$ different lines. You are probably aware that any two distinct points in the Euclidean plane determine a unique line. However, if one starts, for example, with 4 points on a line, then any pair of these points determines the same line.

Figure 1 is the starting point for a famous theorem that I will return to in a moment, but to set the stage for what is to come let us also look at the points and lines in Figure 2. This diagram calls attention to 7 points also, but with only 3 lines. One line has 2 points, another 3 points, and the third line has 4 points. Five of the points have only a single line passing through them, and two points have two lines passing through them. However, perhaps you are wondering if the two "vertical" lines actually intersect (meet) somewhere in the distance. Many people would say the diagram suggests that the lines are parallel, that is, that they do not have a point in common.

Diagram illustrating incidence concept; 7 points, 3 lines
Figure 2 (A collection of 7 points and three lines.)
 

Though in many ways the result I am about to discuss belongs more naturally to projective geometry (where there are no parallel lines: every pair of distinct lines meets in exactly one point) than to Euclidean geometry, it is stated here as a theorem for the Euclidean plane.

There is much more that can be said about the diagram that we started with in Figure 1. Figure 3 shows another diagram that helps illustrate the mathematical result involved. The point labeled "Eye" in Figure 1 corresponds to the point labeled O in Figure 3. Furthermore, the two triangles in Figure 1 did not overlap but they do in Figure 3. What happens (Figure 3) when we look at the lines determined by the corresponding vertices of the two triangles, the lines determined by AA', by BB' and by CC'? Of course it might be possible that some pair of these three lines would be parallel and would not meet. However, let us assume that in fact every pair of these lines do intersect, and they intersect in the points P, Q and R. (P and Q are shown in Figure 3 but not R.) Can anything be said about this triple of points? Ordinarily if one picks three points at "random," one would not expect them to lie on a line but one would expect them to be the vertices of a triangle. However, the remarkable fact here is that when the two initial triangles have the property that the lines connecting their corresponding vertices meet at a point, the corresponding edges of the triangles (when no pair is parallel) meet at points which lie on a single line!

Full Desargues configuration

 

Figure 3 (A Desargues configuration but not showing the point where AB and A'B' intersect at R, which would be on the line with P and Q.)
 

This theorem was not noticed by the great Greek geometers but was discovered by Girard Desargues (1591-1661). It is an example of a situation in mathematics where something that one would not expect to happen, as if by magic, does happen. This ability of mathematics to make surprising connections in unexpected situations is part of what attracts so many people to the subject. For those who suspect that Figure 3 might just be an accident, you can take a look at another copy of a Desargues configuration (Figure 4) highlighting the two triangles in color, the point from which they are in "perspective" (indicated in red, where the blue lines intersect), and the line from which they are in "perspective".

Color-coded Desargues configuration
Figure 4 (Another completed Desargues configuration showing 10 points and 10 lines. Courtesy of Wikipedia.)

 

Here is one formal statement of Desargues's Theorem:

If the lines joining the corresponding vertices of two distinct triangles ABC and abc in the Euclidean plane pass through a single point O, then the corresponding sides of these triangles intersect at points which lie on a single line.

Sometimes this theorem is stated as follows: if two triangles are "in perspective" from a point, then they are "in perspective" from a line.

Theorems like Desargues's Theorem have come to be known as configuration theorems; there are other theorems of this kind that predate Desargues.

In addition to the "configuration" consisting of points and lines that is represented by the 10 points and 10 lines implicit in Figure 4, there is another amazing configuration theorem which is due to the great mathematician Pappus of Alexandria (lived approximately 290-350).

Pappus configuration

Figure 5 (A diagram illustrating Pappus's Theorem.)
 

Pappus's Theorem states that given six points, A, B, C on one line and a, b, c on another line that meets it (none of the six points being the point where the two lines intersect, as shown in Figure 5), the three points where Ab intersects aB, where Ac intersects aC, and where Bc intersects bC are collinear (lie on a single line)!

Whereas Desargues's Theorem can be thought of as 10 points and 10 lines with 3 points on each line and 3 lines passing through each point, the Pappus configuration can be thought of as having 9 points and 9 lines where three points lie on each of the specified lines and triples of the specified lines pass through each of the specified points.

Pappus's Theorem can be considered as a special case of another remarkable configuration theorem that was discovered by Blaise Pascal (1623-1662), the French philosopher, author and mathematician.

Portrait of Blaise Pascal

Figure 6 (Portrait of Blaise Pascal.)
 

Pappus's Theorem involves 6 points which lie on two lines and the way these points determine lines that create points which lie on a line as well. But since each of the two lines is given by an equation of degree 1 (an equation of the form $ax + by + c = 0$ where $a, b,$ and $c$ are real numbers, with $a$ and $b$ not both zero), if we multiply these two equations together we get an equation which is of degree two in the variables $x$ and $y$. This would be a quadratic equation in two variables. The "shapes" you are probably most familiar with that satisfy such equations are the conic sections: circles, ellipses, parabolas, and hyperbolas. But two intersecting straight lines or two parallel straight lines can be thought of as "degenerate" conic sections. (A conic section gets its name because it can be thought of as what results from cutting a cone with a plane.) What Pascal realized was that one could generalize Pappus's Theorem to planar conics.
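
As a quick illustration with two arbitrarily chosen lines (the particular equations are mine, picked only for the example): multiplying the degree-1 equations $x - y = 0$ and $x + y - 2 = 0$ gives

$$(x-y)(x+y-2) = x^2 - y^2 - 2x + 2y = 0,$$

a degree-two equation whose solution set is exactly the pair of crossing lines, one of the "degenerate" conic sections just mentioned.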

Pascal's Theorem:

If 6 points form a hexagon whose vertices lie on a conic section, then the three pairs of opposite sides of this hexagon meet at points which lie on a straight line.

Pascal's Theorem diagram

Figure 7 (A diagram illustrating Pascal's Theorem, involving 6 points inscribed in a conic section, here taken to be an ellipse. Diagram courtesy of Wikipedia.)
 

The rise of interest in configuration theorems such as those of Pappus, Desargues, and Pascal was related to attempts to put an understanding of geometry in a richer context. Attempts to show that Euclid's 5th Postulate, in a modern formulation:

If P is a point not on line l, then there is a unique line parallel to l through the point P

could be deduced from the other axioms (starting assumptions) were not successful. We know today why they weren't: there are logically consistent geometries where Euclid's 5th Postulate fails to hold. Furthermore, people were investigating point/line questions on the sphere, a curved surface rather than a flat one, but one where there were "natural" suggestions for what might be "lines" and points on such a surface. In the end, spherical geometry evolved into two different approaches. In the more interesting approach a point was considered to be a pair of points which were at opposite ends of a diameter. It may seem strange to think of two things being identified so that they are considered one, but this point of view is very fruitful because now "two points" determine a unique line, a great circle of the sphere, which is a circle centered at the center of the sphere.

Many people who know some geometry are puzzled by the difference between longitude lines and latitude lines. Longitude lines are great circles which cut through the idealized sphere that represents the Earth at the south and north poles. The circle that is "halfway" between the poles, lying in a plane perpendicular to the line joining the poles, is a great circle too, and it is the equator of the system. But all the other latitude lines that are "parallel" to the equator (in the sense that they don't meet the equator) are not great circles. They are circles which represent the intersection of a plane parallel to the equator, with the sphere.

Incidences

Suppose we are given a "configuration" C of points and lines in the Euclidean plane; we have a finite set P containing $m$ points in the Euclidean plane and a finite set L of $n$ lines which contain some of the points of P. It is possible that the lines of L intersect at points not in P (see Figure 8) and it is possible that L does not contain all of the lines determined by pairs of points in P (Figure 2). One might be interested in counting how many point-line pairs there are in the configuration C. A point-line pair, a point $p$ of the set P which lies on a line $l$ of L, is called an incidence. We are not concerned with lengths of segments or with angles between the lines.

The point-line configuration in Figure 8 has 7 straight lines (so the set L has 7 elements) and 5 points (so the set P has 5 elements). Note that the lines are displayed to suggest they are infinite. The section between two points will be referred to as a segment, and when we have in mind the "whole" line that a segment lies in we will call that a line. Some pairs of the points shown do not lie on any line of L. One might consider adding lines between points not already joined, and adding to our configuration points that are obtained from the intersection of such a line with an existing line (assuming that it meets an existing line in a point that is not already in P). However, in what follows we are only concerned with the points in the set P given to us and the lines in the set L given to us. For Figure 8, of the 5 points (dark dots), one has 2 lines going through it, three points have 3 lines going through them, and one point has 4 lines going through it. For the lines, 6 lines have two points on them and one line has 3 points on it.

Configuration of points and lines, 15 incidences

Figure 8 (An irregular configuration of points and lines starting with a point set P with 5 elements and a line set L with 7 elements.)
 

For practice you may want to verify by direct count that this configuration has 15 point-line incidences and 10 different segments. (The line with three points has 3 segments in all.) You can count the number of points, lines, segments, the number of lines with s points, the number of points through which t lines pass, and the incidences you see in Figure 8. You should find 15 incidences, and notice that you can count these either by concentrating on each line in turn and finding how many points lie on it, or by concentrating on each point and seeing how many lines pass through it! Given the diagram shown in Figure 2, where we have 7 points in the set P and 3 lines in the set L, we can count 9 incidences.
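
Counting incidences in small configurations like these is easy to automate. Here is a minimal sketch in Python; the example configuration is a hypothetical one of my own, not the one in Figure 8, since in code a line must be specified by its equation.

```python
def count_incidences(points, lines):
    """Count the incidences: pairs (p, l) with p in P lying on the line l in L.

    A point is a pair (x, y) of integers; a line is a triple (a, b, c) of
    integers standing for the equation a*x + b*y + c = 0."""
    return sum(1 for (x, y) in points
                 for (a, b, c) in lines
                 if a * x + b * y + c == 0)

# A hypothetical configuration: the nine points of a 3 x 3 grid together with
# the three horizontal lines y = 0, y = 1 and y = 2.
P = [(x, y) for x in range(3) for y in range(3)]
L = [(0, 1, -y) for y in range(3)]
print(count_incidences(P, L))   # 9 incidences: three points on each of three lines
```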

Shortly we will discuss a remarkable theorem about counting incidences–the Szemerédi-Trotter Theorem. But to fully appreciate the circle of ideas involved in this result I will discuss some ideas that seem initially not related to incidences. When most people hear the word graph they think of the graph of a function. When Descartes extended the "synthetic" geometry of Euclid to the analytical geometry of today, it became possible to graph algebraic expressions. Historically this helped with the rapid and fruitful development of the Calculus.

However, there is another more recent use of the word graph. It belongs to the domain of discrete mathematics rather than the domain of continuous mathematics and involves geometric structures consisting of dots and lines or dots and curves. Graphs of this kind consist of a number of points called vertices (singular vertex) and of lines or curves, called edges, which join up pairs of vertices. Figure 9 shows a sample of some graphs.

Three graphs of different kinds

Figure 9 (A sample of three different graphs, each with one connected piece.)
 

Each of these three graphs consists of a single piece, and such a graph is called connected–intuitively there is a way of taking any pair of vertices in the graph and getting between them by moving along the edges of the graph. In principle we might have many different lines/curves which connect up a pair of vertices, and we might have vertices that are joined to themselves by a curve (or broken line) but these structures, sometimes called multi-graphs, pseudographs, or graphs with loops and multiple edges will not be considered here. Some graphs have the property that they have been drawn in the plane so that edges meet only at vertices. Such graphs are called plane graphs, and graphs which have the potential to be drawn in the plane so that edges meet only at vertices are known as planar graphs. Figure 10 shows an example of a planar graph on the left and a structurally equivalent graph, an isomorphic graph, on the right; the single crossing has been eliminated.

Planar graph and a plane drawing of this graph

 

Figure 10 (A graph (left) drawn in the plane showing one crossing. This graph is planar because it can be redrawn as shown on the right to get the plane graph with the one crossing eliminated.)
 

While graph theory is a relatively recent subarea of geometry, an even more recent subarea of graph theory, with mathematical and computer science implications, is the subject of graph drawing. This domain of ideas concerns questions about the nature and appearance of the different ways a graph might be drawn in the plane.

When a graph is drawn in the plane there may be many different drawings to represent isomorphic copies of the same graph. In particular one might ask for a drawing with the minimum number of crossings for that graph. This number is called the crossing number of the graph. In fact, the crossing number in itself has interesting variants because there can be a difference in the number of crossings when all of the edges are drawn with straight lines versus the number of crossings when the edges are allowed to be curves.

The Szemerédi-Trotter Theorem

A remarkable accomplishment in understanding the behavior of counting the number of incidences for a collection of points and lines was achieved by what has come to be called the Szemerédi-Trotter Theorem. This result is named after William Tom Trotter and Endre Szemerédi.

Photo of Endre Szemerédi Photo of W. T. Trotter

Figure 11 (Photos of Endre Szemerédi and William (Tom) Trotter. Courtesy of Wikipedia and Tom Trotter respectively.)
 

We will first set the stage for the remarkable result of Szemerédi and Trotter, which was proved in 1983.

Szemerédi-Trotter Theorem:

Suppose that we have a collection (set) of points P with m distinct points and a collection L of n distinct lines in the Euclidean plane. As we saw above, we will call a point-line pair (p,l), where p belongs to P, l belongs to L, and the point p lies on the line l, an incidence. What is the maximum number of possible incidences, calculated in terms of the values of m and n? If we denote by I(P,L) the number of incidences for the particular given sets P and L, then we will denote by I(m,n) the largest value of I(P,L) over all choices of P with m elements and all choices of L with n elements. Notice that in the statement of the theorem below, m and n appear in a symmetrical way. This suggests that perhaps one can get insights into points and lines by interchanging their roles. Such a "duality" of points and lines holds fully in projective geometry, where any pair of distinct lines always intersects.

Theorem (Szemerédi-Trotter, 1983): 

$$I(m, n) = O\left(m^{2/3}n^{2/3} + m + n\right)$$

Informally, this theorem gives the order of magnitude that is possible for the largest number of incidences that can occur for a collection of points P with $m$ points and a collection of lines L with $n$ lines.

It is possible to show that this result is "best possible" except that the exact constant implicit in the expression above is not yet known (see below).

Before looking in more detail at the content of this result, here are a few remarks about the "big O" notation used in the description of the Szemerédi-Trotter theorem. The notation has a story of its own. It was popularized by the German mathematician Edmund Landau (1877-1938), who is most famous as a number theorist.

Photo of Edmund Landau

Figure 12 (Photo of Edmund Landau)
 

In recent times big O notation is often discussed in the context of trying to determine how long it will take to run (how many steps are required) a computer algorithm in terms of some parameter of the size of the problem, typically denoted with the letter $n$. However, Landau was interested in the more general issue of comparing the growth of different functions expressed using the variable, say $n$. While sometimes in the popular press it is stated that some phenomenon is showing exponential growth, to some extent what is meant is that the phenomenon in the short run is growing faster than other functions that people may have studied in a mathematics class, like $n^2$.

The "formal" definition of "$g(x)$ is $O(f(x))$" is that $g$ obeys this property: if $x$ is larger than some fixed integer N, there is a constant C such that:

$$g(x) \leq C(f(x)) \textrm{ for } x \geq N $$

This definition says nothing about how large C or N might be. For some results of this type C and N can be huge and, thus, the result is sometimes not of "practical" computational interest.

The use of big O notation created interest in other ways to notate the growth of functions or the number of steps in running an algorithm. While notation may not seem an important part of the practice of mathematics, it is part of the craft by which new mathematical ideas are born and new theorems are developed. So you might want to look at the meaning of little o and big Omega as notations for the growth of functions.

Now that we know the meaning of big O notation we can see that the Szemerédi-Trotter Theorem says that for large enough values of $m$ and $n$, the maximum number of incidences is less than some constant times:

$$m^{2/3}n^{2/3} + m + n.$$

There is another way that the Szemerédi-Trotter Theorem can be stated that gives equivalent information but has a seemingly different flavor. Suppose we are given $m$ points in the plane and an integer $i \geq 2$; then the number of lines which contain at least $i$ of these points is given by:

$$O(m^2/i^3 + m/i).$$

In retrospect, it is common for the first proofs of important results not to be as insightful and as "clear" as later proofs. The original approach that Szemerédi and Trotter used to prove their result is not usually discussed because proofs that came after theirs were noticeably simpler. There are two simpler approaches that are now widely reported as approaches to this theorem. One approach exploits configurations that involve a grid of points, and is associated with the work of György Elekes (1949-2008), who sadly died at a relatively young age; the other, due to Laszlo Szekely, involves the notion of crossing number and is hinted at below.

One tool that has increasingly emerged as useful for putting geometrical insights to work in discrete geometry is graph theory. As we saw, graph theory concerns diagrams such as those in Figure 9.

Kuratowski graphs, contained in any non-planar graph

 

Figure 13 (Two graphs, sometimes called the Kuratowski graphs, which require some crossings when drawn in the plane and which serve as "obstacles" to drawing a graph in the plane. Any graph which can't be drawn in the plane without edges crossing at points other than vertices must, in a certain sense, contain a "copy" of one of these two graphs (or both) within it.)

 

Photo of K. Kuratowski

Figure 14 (Kazimierz Kuratowski, 1896-1980)
 

Both of the graphs shown in Figure 13 can be regarded as collections of points in the Euclidean plane. Note that we could have joined some of the vertices with curves to avoid the crossings that occur in the diagram, but we can also interpret this diagram as a "geometric graph" in which the drawn lines cross at points that are not vertices of the original graph. One approach to connecting graph theory to questions about point and line incidences starts from the diagram representing a point set P and a line set L: construct a graph whose vertices correspond to the points in P, and join two vertices $u$ and $v$ of this graph by an edge if there is a line segment in the point/line diagram between $u$ and $v$. This leads to a graph that may be isomorphic to a plane graph, one that can be drawn in the plane without any edges meeting or crossing at points other than vertices, or it may be a non-planar graph; typically it will have edges meeting at points which are called crossings–points that are not vertices of the graph involved.

Given a graph, there are many properties of the graph which one can inquire about. Thus, one can ask if a graph is planar, has a simple circuit that visits each vertex once and only once (useful in designing package delivery routes), has a tour of its edges that uses each edge once and only once (useful in designing snowplowing routes), etc. A natural question to ask is: over the different ways to draw a given graph in the plane with straight-line edges (so that the different drawings are of isomorphic copies of the same graph), what are the smallest and largest numbers of crossings of edges that can occur?

Given a graph G, cr(G) is often defined as the minimum number of crossings of edges not at vertices over all drawings of the graph G. Both graphs in Figure 13 can be redrawn with 1 crossing and, thus, have crossing number 1. The first graph is shown in a drawing with 5 crossings, and if edges are not allowed to intersect themselves or to cross other edges more than once, 5 is the largest number of crossings possible for this graph. Depending on what "concepts" one is trying to understand, one often puts rules on the allowed drawings. The bottom graph (Figure 13) shows three edges that pass through a single point; for some purposes such a drawing would not be allowed, and one is required to use drawings where at most two edges meet at a crossing.

Given a particular graph G and an integer $c$, deciding whether the crossing number satisfies $\textrm{cr}(G) \leq c$ is known to be NP-complete. This means that it is very unlikely that there is a polynomial-time algorithm for determining the answer to this question. (There are hundreds of NP-complete problems, and if any one of them could be solved in polynomial time, then it would be possible to solve them all in polynomial time–but no one knows whether this can be done!)

Over many years there has been a progression of results concerning bounds on the crossing number of a graph. The result known as Euler's Formula states that for any planar connected (one piece) graph, $V + F - E = 2$, where V, F and E denote the numbers of vertices, faces and edges of the graph. It is not difficult to see from this result that the crossing number of a general graph with V vertices (where V is at least 3) and E edges satisfies $$\textrm{cr}(G) \geq E - 3V + 6.$$

However, by making various assumptions about the edge density of a graph (how rich in edges the graph is), one can get more "interesting" bounds on the crossing number.

Theorem:

$$ \textrm{If } E \geq 4V \textrm{ then cr}(G) \geq (1/64)\left(E^3/V^2\right) $$

Various variants of this result are being explored where the "edge density" is changed, in attempts to understand the value of the constant (above 1/64) in the crossing-number expression. Thinking of a configuration of points and lines in the plane as a graph drawn in the plane with crossings allows one to get the expression for incidences in the Szemerédi-Trotter Theorem.
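
To give a rough idea of how the crossing-number approach mentioned above works (this is only a sketch, with no attempt to optimize constants): given P and L, form the graph G whose vertices are the $m$ points of P, with an edge joining two points whenever they are consecutive points of P along some line of L. A line containing $k$ points of P contributes $k-1$ edges, so the number of edges satisfies $E \geq I - n$, where $I$ is the number of incidences. The drawing of G provided by the lines themselves has at most $n(n-1)/2 < n^2/2$ crossings, since two distinct lines cross at most once, and so $\textrm{cr}(G) < n^2/2$. Now either $I - n < 4m$, or $E \geq 4V$ and the theorem above gives

$$\frac{1}{64}\cdot\frac{(I-n)^3}{m^2} \leq \textrm{cr}(G) < \frac{n^2}{2},$$

so that $I < n + (32)^{1/3}\, m^{2/3} n^{2/3}$. Either way, $I = O\left(m^{2/3}n^{2/3} + m + n\right)$, which is the Szemerédi-Trotter bound.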

What else can we say about the crossing number of interesting graphs? One particularly interesting family of graphs is the graphs with $n$ vertices, every pair of vertices being joined by exactly one edge–graphs known as complete graphs. Traditionally this graph is denoted $K_n$.

$K_n$ has $n(n-1)/2$ edges: each vertex is joined to all of the other $n-1$ vertices, and each edge gets counted twice, once at each of its endpoints, which gives the result $n(n-1)/2$.

Thus, $\textrm{cr}(K_n) \geq n(n-1)/2 - 3n + 6,$ which simplifies to $n^2/2 - (7/2)n + 6$.

It might seem that it would be easy to find the exact value of the crossing number of $K_n$ which is very symmetrical combinatorially, but perhaps rather surprisingly the exact value of this crossing number for every $n$ is not yet known. It has been conjectured to be Z(n):

$$(1/64)(n)(n-2)^2(n-4), \textrm{ when n is even, and}$$

$$(1/64)(n-1)^2(n-3)^2, \textrm{ when n is odd.}$$

You can check that this formula gives the crossing number of the complete graph with 5 vertices to be 1. Various people (Paul Turán and the recently deceased Richard Guy) have been involved with the "proper" guess for the minimum number of crossings of complete graphs drawn in the plane, but the "formulas" above are sometimes associated with the name of the Polish mathematician Kazimierz Zarankiewicz. In the spirit of mathematics, which generalizes interesting ideas and concepts, many kinds of crossing numbers have now been explored.
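
As a quick check, here is a short Python sketch that evaluates the conjectured value $Z(n)$ from the two formulas displayed above and compares it with the Euler-formula lower bound $E - 3V + 6$ for small complete graphs; the function names are my own.

```python
def euler_lower_bound(v, e):
    """The bound cr(G) >= E - 3V + 6 coming from Euler's Formula (V >= 3)."""
    return max(0, e - 3 * v + 6)

def conjectured_crossing_number(n):
    """Z(n): the conjectured crossing number of K_n, from the formulas above."""
    if n % 2 == 0:
        return n * (n - 2) ** 2 * (n - 4) // 64
    return (n - 1) ** 2 * (n - 3) ** 2 // 64

for n in range(5, 11):
    edges = n * (n - 1) // 2
    print(n, euler_lower_bound(n, edges), conjectured_crossing_number(n))
# n = 5 gives (1, 1), n = 6 gives (3, 3), n = 7 gives (6, 9): the Euler bound
# falls behind the conjectured value as n grows.
```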

Photo of K. Zarankiewicz

 

Figure 15 (Kazimierz Zarankiewicz (1902-1959))
 

The Szemerédi-Trotter Theorem initially was concerned with incidences between points and lines. Over a period of time it resulted in new theorems related to incidences between points and pseudolines (lines are to uncooked spaghetti as pseudolines are to cooked spaghetti), points and circles of the same radius, points and circles with different radii, points on parallel lines, etc. In this line of work the result was generalized from points and lines to points and other collections of geometric sets. But if one can discuss points and lines in 2-dimensional space, why not look at points and lines in higher-dimensional spaces, or lines and planes in higher-dimensional spaces? You should not be surprised that all of these have come to be looked at and much more.

While these are seemingly natural extensions of the ideas related to the Szemerédi-Trotter Theorem, there is again the element of surprise in mathematics. The Szemerédi-Trotter Theorem seems rooted in combinatorial geometry–counting things–but it also has metrical, distance-related ties. While Paul Erdős played a part in the history of the Szemerédi-Trotter Theorem (not directly discussed above), he also had a part in this other thread of results. Given a set of points P in the plane, with different conditions on where the points are arranged with respect to each other and perhaps their pattern on lines that pass through the points, one can measure the Euclidean distance (or other distances) between points in the set P. Now one is interested in studying how few or how many distances can occur. There are many papers about counting distinct distances, repeated distances, etc.!

From little acorns great oaks grow. Incidences may seem like an "acorn" compared with other issues in mathematics but it has been an acorn that has grown impressive results. I hope you enjoy Mathematics and Statistics Awareness Month and try your hand at learning and doing some new mathematics.

References

Those who can access JSTOR can find some of the papers mentioned above there. For those with access, the American Mathematical Society's MathSciNet can be used to get additional bibliographic information and reviews of some of these materials. Some of the items above can be found via the ACM Digital Library, which also provides bibliographic services.

Beck, J., On the lattice property of the plane and some problems of Dirac, Motzkin and Erdős in combinatorial geometry, Combinatorica 3 (1983), 281–297.

Brass, P. and W. Moser, J. Pach, Research Problems in Discrete Geometry, Springer, New York, 2005.

Clarkson K. and H. Edelsbrunner, L. Guibas, M. Sharir, E. Welzl, Combinatorial complexity bounds for arrangements of curves and spheres, Discrete Comput. Geom. 5 (1990) 99–160.

Dvir, Z., On the size of Kakeya sets in finite fields. J. Amer. Math Soc. 22 (2009) 1093–1097.

Edelsbrunner, H., Algorithms in Combinatorial Geometry, Springer-Verlag, Heidelberg, 1987.

Elekes, G., On linear combinatorics I, Concurrency—an algebraic approach, Combinatorica 17 (4) (1997), 447–458.

Elekes, G., A combinatorial problem on polynomials, Discrete Comput. Geom. 19 (3) (1998), 383–389.

Elekes, G., Sums versus products in number theory, algebra and Erdős geometry–A survey. in Paul Erdős and his Mathematics II, Bolyai Math. Soc., Stud. 11, Budapest, 241–290 (2002)

Elekes, G., and H. Kaplan, M. Sharir, On lines, joints, and incidences in three dimensions, J. Combinat. Theory, Ser. A 118 (2011) 962–977.

P. Erdős, Problems and results in combinatorial geometry, in Discrete Geometry and Convexity (New York, 1982), vol. 440 of Ann. New York Acad. Sci., 1985, pp. 1–11.

Felsner, S., Geometric Graphs and Arrangements, Vieweg, Wiesbaden, 2004.

Grünbaum, B., Configurations of Points and Lines, Graduate Studies in Mathematics Volume 103, American Mathematical Society, Providence, 2009.

Guth, L. and N. Katz, Algebraic methods in discrete analogs of the Kakeya problem. Advances Math. 225, 2828–2839 (2010)

Guth, L. and N. Katz, On the Erdős distinct distances problem in the plane. Annals Math. 181, 155–190 (2015), and in arXiv:1011.4105.

Kaplan, H., and M. Sharir, E. Shustin, On lines and joints, Discrete Comput. Geom. 44 (2010) 838–843.

Kollar, J., Szemerédi-Trotter-type theorems in dimension 3, Adv. Math. 271 (2015), 30–61.

Matousek, J., Lectures in Discrete Geometry, Springer, New York, 2002.

Pach, J. (ed), Towards a Theory of Geometric Graphs, Contemporary Mathematics 342, American Mathematical Society, Providence, 2004.

Pach, J., and P. Agarwal, Combinatorial Geometry, Wiley, New York, 1995.

Pach, J., and R. Radoicic, G. Tardos, and G. Toth, Improving the Crossing Lemma by finding more crossings in sparse graphs, Discrete Comput. Geom. 36 (2006) 527–552.

Pach, J. and M. Sharir, Repeated angles in the plane and related problems, J. Comb. Theory, Ser. A 59 (1992) 12–22.

Pach, J. and M. Sharir, On the number of incidences between points and curves, Combinatorics, Probability and Computing (1998), 121–127.

Pach, J. and M. Sharir, Geometric incidences, pp. 185–224, in Towards a theory of geometric graphs, vol. 342 of Contemporary Mathematics, AMS, Providence, RI, 2004.

Pach, J. and G. Toth, Graphs drawn with few crossings per edge, Combinatorica 17 (1997) 427–439.

Quilodran, R., The joints problem in $R^n$, SIAM J. Discrete Math. 23 (2010) 2211–2213.

Sharir, M. and E. Welzl, Point-line incidences in space, Combinat. Prob. Comput. 13 (2004) 203–220.

Sheffer, A., Distinct Distances: Open Problems and Current Bounds
https://arxiv.org/abs/1406.1949

Solymosi, J., On the number of sums and products, Bulletin of the London Mathematical Society, 37 (2005) 491–494

Solymosi, J. and Tao, T.: An incidence theorem in higher dimensions. Discrete Comput. Geom. 48 (2012) 255–280

Szekely, L., Crossing numbers and hard Erdős problems in discrete geometry, Combinatorics, Probability and Computing 6 (1997) 353–358.

Szemerédi, E., and W. T. Trotter, Extremal problems in discrete geometry, Combinatorica 3 (1983) 381-392.

Tamassia, R. (ed.), Handbook of Graph Drawing and Visualization, CRC Press, Boca Raton, 2014.

Tamassia, R. and I. Tollis (eds.), Graph Drawing, Lecture Notes in Computer Science 894, Springer, New York, 1995.

Toth, C., The Szemerédi–Trotter theorem in the complex plane, Combinatorica, 35 (2015) 95–126

Zahl, J., A Szemerédi–Trotter type theorem in $R^4$, Discrete Comput. Geom. 54 (2015) 513–572.


From Strings to Mirrors

To tell you where mirror symmetry came from, I have to tell you about string theory. And to do that, I have to tell you why you should care about string theory in the first place. That story starts with an old question: what is the smallest piece of the universe? …

Ursula Whitcher
AMS | Mathematical Reviews, Ann Arbor, Michigan

Introduction

Scientists often use mathematical ideas to make discoveries. The area of research mathematics known as mirror symmetry does the reverse: it uses ideas from theoretical physics to create mathematical discoveries, linking apparently unconnected areas of pure mathematics.

 

Maxim Kontsevich
Maxim Kontsevich has won many recognitions, including the Fields Medal, for his research in mirror symmetry. (Photo by Mathematisches Institut Oberwolfach, CC BY-SA 2.0 DE.)

 

String theory

To tell you where mirror symmetry came from, I have to tell you about string theory. And to do that, I have to tell you why you should care about string theory in the first place. That story starts with an old question: what is the smallest piece of the universe?

Here's a rapid summary of a couple of thousand years of answers to this question in the West. The ancient Greeks theorized that there were four elements (earth, air, fire, water), which combined in different ways to create the different types of matter that we see around us. Later, alchemists discovered that recombination was not so easy: though many chemicals can be mixed to create other chemicals, there was no way to mix other substances and create gold. Eventually, scientists (now calling themselves chemists) decided that gold was itself an element, that is, a collection of indivisible components of matter called gold atoms. Continued scientific experimentation prompted longer and longer lists of elements. By arranging these elements in a specific way, Dmitri Mendeleev produced a periodic table that captured common properties of the elements and suggested new, yet-to-be discovered ones. Why were there so many different elements? Because (scientists deduced) each atom was composed of smaller pieces: protons, neutrons, and electrons. Different combinations of these sub-atomic particles produced the chemical properties that we ascribe to different elements.

 

TRIUMF particle accelerator
The TRIUMF particle accelerator is used in high-energy physics research.

 

This story is tidy and satisfying. But there are still some weird things about it: for example, protons and neutrons are huge compared to electrons. Also, experiments around the beginning of the twentieth century suggested that we shouldn't just be looking for components of matter. The electromagnetic energy that makes up light has its own fundamental component, called a photon. The fact that light sometimes acts like a particle, the photon, and sometimes like a wave is one of the many weird things about quantum physics. (The word "quantum" is related to "quantity"—the idea that light might be something we can count!)

Lots and lots of work by lots and lots of physicists trying to understand matter and energy particles, over the course of the twentieth century, produced the "Standard Model." Protons and neutrons are made up of even smaller components, called quarks. Quarks are held together by the strong force, whose particle is a gluon. The weak force, which is responsible for certain kinds of radioactive decay, has its own force particles. The full Standard Model includes seventeen different fundamental particles.

 

Artist's impression of the quark zoo
A zoo of quarks: up, down, charm, strange, top, and bottom (Art by Levi Qışın).

 

There are two theoretical issues with the Standard Model. One is essentially aesthetic: it's really complicated. Based on their experience with the periodic table, scientists suspect that there should be some underlying principle or structure relating the different types of particles. The second issue is more pressing: there's no gravity particle. Every other force in the universe can be described by one of the "force carrier" particles in the Standard Model.

 

Artist's impression of the Higgs boson
The Higgs boson is the most recently discovered particle in the standard model (Art by Levi Qışın).

 

Why is gravity different? The best description we have of gravity is Einstein's theory of general relativity, which says gravitational effects come from curvature in the fabric of spacetime. This is an excellent way to describe the behaviors of huge objects, such as stars and galaxies, over large distances. But at small distance scales (like atoms) or high energies (such as those seen in a particle accelerator or in the early universe), this description breaks down.

People have certainly tried to create a quantum theory of gravity. This would involve a force carrier particle called a graviton. But the theory of quantum physics and the theory of general relativity don't play well together. The problem is the different ways they treat energy. Quantum physics says that when you have enough energy in a system, force-carrier particles can be created. (The timing of their appearance is random, but it's easy to predict what will happen on average, just as we know that when we flip a coin over and over, we'll get tails about half the time.) General relativity says that the shape of spacetime itself contains energy. So why aren't we detecting random bursts of energy from outer space, as gravitons are created and destroyed?

String theory is one possible answer to this question. String theory says that the smallest things in the universe are not point particles. They extend in one dimension, like minuscule loops or squiggles—hence the name string. Strings with different amounts of energy correspond to the particles with different properties that we can detect in a lab.

 

A string-like loop
A diagram of a string

 

The simplicity of this setup is compelling. Even better, it solves the infinite energy problem. Interactions that would occur at a particular moment in spacetime, in the point particle model, are smoothed out over a wider area of spacetime if we model those interactions with strings. But string theory does pose some conceptual complications. To explain them, let's look at the underlying mathematical ideas more carefully.

In general relativity, we think of space and time together as a multidimensional geometric object, four-dimensional spacetime. Abstractly, the evolution of a single particle in time is a curve in spacetime that we call its worldline. If we start with a string instead of a point particle, over time it will trace out something abstractly two-dimensional, like a piece of paper or a floppy cylinder. We call this the worldsheet. One can imagine embedding that worldsheet into higher-dimensional spacetime. From there, we have a standard procedure to create a quantum theory, called quantization.

 

A worldsheet diagram
A worldsheet

 

If we work with four-dimensional spacetime, we run into a problem at this point. In general relativity, the difference between time and the other, spatial dimensions is encoded by a negative sign. Combine that negative sign with the standard quantization procedure, and you end up predicting quantum states—potential states of our universe, in this model—whose probability of occurring is the square root of a negative number. That's unphysical, which is a nice way of saying "completely ridiculous."

Since every spatial dimension gives us a positive sign, we can potentially cancel out the negatives and remove the unphysical states if we allow our spacetime to have more than four dimensions. If we're trying to build a theory that is physically realistic, in the sense of having both bosonic and fermionic states (things like photons and things like electrons), it turns out that the magic number of spacetime dimensions is ten.

If there are ten dimensions in total, we have six extra dimensions! Since we see no evidence of these dimensions in everyday life, they must be tiny (on a scale comparable to the string length), and compact or curled up. Since this theory is supposed to be compatible with general relativity, they should be "flat" in a precise mathematical sense, so their curvature doesn't contribute extra gravitational energy. And to allow for both bosons and fermions, they should be highly symmetric. Such six-dimensional spaces do exist. They're called Calabi-Yau manifolds: Calabi for the mathematician who predicted their existence, Yau for the mathematician who proved they really are flat.

A slice of a Calabi-Yau manifold

String dualities

One of the surprising things about string theory, and one of the most interesting from a mathematical perspective, is that fundamentally different assumptions about the setup can produce models of the universe that look identical. These correspondences are called string dualities.

The simplest string duality is called T-duality (T is for torus, the mathematical name for doughnut shapes and their generalizations). Suppose the extra dimensions of the universe amounted to just a circle (a one-dimensional torus). A string's energy is proportional to its length; we can't directly measure the length of a string, but we can measure the energy it has. However, a string wrapped once around a big circle and a string wrapped many times around a small circle can have the same length! So a universe where the extra circle has radius 2 and a universe where it has radius ½ look the same to us. The same holds for radii 3 and $1/3$, 10 and $1/10$, or in general $R$ and $1/R$.
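The bookkeeping behind this radius swap can be written down explicitly; the following formula is standard in string theory textbooks, though this column hasn't derived it. For a closed string on a circle of radius $R$, the energy (squared) of a state carrying $n$ units of momentum around the circle and wrapping it $w$ times includes the terms

$$M^2 \;\sim\; \left(\frac{n}{R}\right)^2 + \left(\frac{w R}{\alpha'}\right)^2,$$

where $\alpha'$ sets the string length scale. Exchanging $R \leftrightarrow \alpha'/R$ while swapping $n \leftrightarrow w$ leaves this expression unchanged, which is exactly the statement that the big-circle and small-circle universes are indistinguishable.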

T-duality

Strings wrapped once or many times

But what if we want a more physically realistic theory, where there are six extra dimensions of the universe? Well, we assume that the two-dimensional string worldsheet maps into these six extra dimensions. Our theory will have various physical fields, similar to the electromagnetic field.

To keep track of what a particular field is doing back on the worldsheet, we use coordinates $x$ and $y$. We can combine those coordinates into a single complex number $z = x + iy$. That $i$ there is an imaginary number. When I first learned about imaginary numbers, I was certain they were the best numbers, since they used the imagination; I know that "Why are you wasting my time with numbers that don't even exist?" is a more typical reaction. In this case, though, $i$ is standing in for a very concrete concept, direction: changing $x$ moves right or left, while changing $y$ moves up or down. If we simultaneously increase $x$ a little bit and $y$ a little bit, we'll move diagonally right and up; we can think of that small diagonal shift as a little change in $z = x + iy$. If you want to be able to move all around the plane, just increasing or decreasing $z$ like this isn't enough. Mathematicians use $\bar{z} = x - iy$ to talk about motion that goes diagonally right and down.
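A little arithmetic makes this concrete; nothing below goes beyond the definitions $z = x + iy$ and $\bar{z} = x - iy$. A small step that changes $x$ by $dx$ and $y$ by $dy$ changes $z$ by $dz = dx + i\,dy$ and $\bar{z}$ by $d\bar{z} = dx - i\,dy$. Running that backwards,

$$dx = \frac{dz + d\bar{z}}{2}, \qquad dy = \frac{dz - d\bar{z}}{2i},$$

so any motion in the plane can be rebuilt out of a change in $z$ together with a change in $\bar{z}$, which is why both coordinates are needed.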

z and z conjugate in the complex plane

$z$ and $\bar{z}$

Now, back to building our string theory. The fields depend on $x$ and $y$, but they're highly symmetric: to figure out how they act on the whole worldsheet, it's enough to know either how they change based on a little change in $z$, or how they change based on a little change in $\bar{z}$ (so we don't have to measure right-and-up and left-and-down changes separately). If you have two fields like this, they might change in similar ways (both varying a lot due to small changes in $z$, say), or they might change in different ways (one depending on $z$ and the other on $\bar{z}$).
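In the standard notation of complex analysis (which this column doesn't otherwise use), "depends on $z$" and "depends on $\bar{z}$" can be made precise by packaging the $x$ and $y$ derivatives into two combinations,

$$\frac{\partial}{\partial z} = \frac{1}{2}\left(\frac{\partial}{\partial x} - i\,\frac{\partial}{\partial y}\right), \qquad \frac{\partial}{\partial \bar{z}} = \frac{1}{2}\left(\frac{\partial}{\partial x} + i\,\frac{\partial}{\partial y}\right).$$

A field $\phi$ that changes only with $z$ satisfies $\partial\phi/\partial\bar{z} = 0$ (mathematicians call such fields holomorphic), while one that changes only with $\bar{z}$ satisfies $\partial\phi/\partial z = 0$.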

From the physics point of view, this choice is not a big deal. You're just choosing either two plus signs (this choice is called the B-model) or a plus and a minus sign (the A-model). Either way, you can carry on from there and start working out all the physical characteristics of these fields, trying to understand predictions about gravity, and so on and so forth. Because this choice really doesn't matter, it shouldn't make any difference to your eventual predictions. In particular, any type of universe you can describe by choosing two plus signs and working out the details should also be a type of universe you can describe by choosing one plus and one minus, then working out those details.

How do we match up those two types of universes? By choosing different shapes for the six extra dimensions. Using this logic, physicists predicted that if you picked a specific shape for the extra dimensions of the universe and worked out the details of the A-model, you should be able to find a different shape that would give you the same physical details once you worked out its B-model theory.

Now, I said the sign choice wasn't a big deal from the physical perspective. But it's a huge deal from the mathematical perspective. If you only choose plus signs, you can rewrite everything that happens in terms of just powers of $z$, and start doing algebra. Algebra is great! You can program your computer to do algebra, and find lots of information about your six-dimensional space really fast! On the other hand, if you choose one plus and one minus sign, you're stuck doing calculus (a very special kind of multivariable, multidimensional calculus, where experts engage in intense arguments about what sorts of simplifying assumptions are valid).
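As a rough illustration of why the all-plus choice leads to algebra (this is a generic example, not a formula from the column): a field that depends only on $z$ can be written as a series in powers of $z$,

$$\phi(z) = \sum_{n} a_n z^n,$$

and questions about the field become questions about the coefficients $a_n$, which can be added, multiplied, and handed to a computer algebra system. When a field depends on both $z$ and $\bar{z}$, this trick no longer reduces everything to coefficient bookkeeping, and extracting information typically means doing integrals instead, the "very special kind of calculus" referred to above.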

Thus, when physicists came along and said, "Hey, these two kinds of math give you the same kinds of physical predictions," that told mathematicians they could turn incredibly difficult calculus problems into algebra problems (and thereby relate two branches of mathematics that had previously seemed completely different). Mathematicians call this insight, and the research it inspired, "mirror symmetry."

Xenia de la Ossa

Xenia de la Ossa lectures on mirror symmetry. (Photo by Joselen Pena, CC BY-SA 4.0.)

Further reading

If you would like to learn more about current research in mirror symmetry, here are some resources!