Galton's most important statistical insights

After reading Adam Rutherford's book about the history of eugenics (see here for my review) I found myself motivated to try to understand exactly what Francis Galton (the English polymath and father of eugenics) had contributed to the development of statistics. In my review of the Rutherford book I argued that it is possible to consider someone's scientific contributions without that being any sort of endorsement of, or mitigation for, their social and political views. As we will see, Galton's work in the last quarter of the 19th century was absolutely critical in understanding and formalising the nature and strength of relationships between two variables (what we would now call the correlation between them). This is such a fundamental aspect of measurement and data analysis in so many areas of science that its conceptual origins are worth revisiting from time to time. As I will argue in a later blog entry, some central and very useful parts of Galton's thinking on these statistical issues have been slightly overlooked/forgotten in the ensuing century.

Galton's insights came about through visualising various datasets in a way which made clear a fundamental feature of the covarying relationship between a pair of variables; we will see that his visualisation revealed something we now call the bivariate normal distribution.

I already had some idea of his many statistical contributions, but lacked a clear memory of the details. Therefore, I started reading some of the original papers he wrote in the 1870s and 1880s. This is physically easy to do, as his voluminous written output is available via an extensive online archive. However, Galton was so prolific, with many papers having similar or identical titles, that I felt I needed a sense of the historical sequence of his statistical works. So, I turned to my usual source on the history of statistics: Professor Stephen Stigler from the University of Chicago. He had, as I expected, written at length on Galton. I found two pieces (Stigler, 1989; Stigler, 2010) particularly enlightening. In addition, a clear timeline of the key stages in the history of the statistical concept of correlation is given in a lovely paper by Rodgers and Nicewander (1988).

Stigler (2010) suggests that Galton's (1877) paper was the first big step. Galton was trying to resolve a fundamental puzzle which might be seen as undermining Darwin's theory of natural selection. It is worth mentioning that Darwin was Galton's half-cousin. Evolution by natural selection requires variation within a species in order to operate and yet, from generation to generation, a species seems largely stable, to all intents and purposes. Galton claimed to have resolved this puzzle in 1877. His argument was built on two planks: the quincunx and the phenomenon of reversion.

The quincunx was a mechanical simulator of the normal distribution which Galton had devised, constructed, and presented a few years earlier. There are many online renderings of this incredible device: this one is especially nice. Galton chose the name because its construction was based upon many adjacent sets of pins each arranged in the 5-dot pattern, known as a quincunx. Galton used the quincunx to generate ideas and illustrate his arguments about the operation of statistical processes, especially in connection with heredity.

Reversion, as Galton then termed it, was a phenomenon which had been observed for centuries by animal breeders and horticulturalists: the tendency of features in each new generation to revert towards ancestral values. Darwin was aware of reversion and considered it an obvious problem for his theories, as it seemed likely to operate against the forces of natural selection. It is therefore slightly paradoxical to learn that Galton saw reversion as a way to address potential issues with evolution by natural selection.

Galton had studied a particular aspect of reversion by collecting data on the size of seeds in successive generations of sweet peas. He was interested in what would happen if he bred from groups of pea seeds which had been selected so that they were larger (or smaller) than average, by varying amounts. Galton observed that the offspring pea seeds tended to be closer in size to the average of all pea seeds than were the seeds of their pre-selected parents. This kind of reversion we now call regression to the mean (yet another statistical term which Galton coined). The methods by which Galton gathered his information about sweet pea seeds are an early example of crowd-sourced, citizen science data collection (see Galton, 1877; 1886b for details).

Galton made two key observations from the sweet pea data. First, he found that the degree of reversion (regression) was proportional to the distance of the parent pea seed size from the mean. Let's call that proportion q. Naturally, because q is a proportion, it is in the range 0 to 1. If the parent pea seeds were x 100ths of an inch bigger than the mean of all pea seeds then, on average, the offspring pea seeds would be q*x 100ths of an inch above the mean. If the parent peas were y 100ths of an inch smaller than the mean of all pea seeds then the offspring peas would, on average, be q*y 100ths of an inch below the mean. In 1877 Galton did not offer an estimate, based on his pea seed study, for the value of what I have called q above. In Galton (1886b) he elaborated on this, first by saying that the proportionality of the regression towards the mean "was based on so many plantings, conducted for me by friends living in various parts of the country, from Nairn in the north to Cornwall in the south, during one, two, or even three generations of the plants, that I could entertain no doubt of the truth of my conclusion [p. 246]". In Appendix I of that paper he gave the pea data in some detail and stated his conclusion that q was one third for peas: "It will be seen that for each increase of one unit on the part of the parent seed, there is a mean increase of only one-third of a unit in the filial seed [p. 259]".
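In modern notation (the symbols here are mine, not Galton's), this first observation says that the expected deviation of the offspring from the overall mean is a fixed fraction q of the parental deviation:

\[ \mathrm{E}\left[\, d_{\text{offspring}} \mid d_{\text{parent}} \,\right] \;=\; q \, d_{\text{parent}}, \qquad 0 \le q \le 1, \]

with q equal to about 1/3 for the sweet peas.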

Second, Galton found that, although the mean size of the offspring reverted towards the mean of previous generations, the variation in pea seed sizes in the offspring generation remained constant. It did not matter how far the parent seeds were above or below the mean seed size: the resulting sample of offspring seeds would vary in size by the same amount around their mean. Galton was obviously surprised to find this result so clearly, saying (1877, p. 291): "I was certainly astonished to find the family variability of the produce of the little seeds to be equal to that of the big ones".
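In the same notation, this second observation is what we would now call homoscedasticity: the spread of the offspring values about their (regressed) expectation is the same whatever the parental deviation,

\[ \operatorname{Var}\left(\, d_{\text{offspring}} \mid d_{\text{parent}} \,\right) \;=\; \text{constant, for all values of } d_{\text{parent}}, \]

which, as we will see, is exactly what a bivariate normal relationship implies.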

Galton put these two observations together to offer a resolution to conceptual issues with the operation of natural selection. Despite evolution by natural selection tending to lead to the best-suited outcome for some features of a species (e.g., leading to an optimal average pea size), "something else" had to be at work to maintain the constant amount of variation in those features in successive generations. This continuing variation, now confirmed empirically by Galton, would be able to provide the basis for future potential changes by natural selection, if environmental circumstances were to change. Galton suggested that reversion was that "something else"; reversion was the mystery ingredient maintaining the constant degree of variation within each successive generation.

It was typical of Galton that he sketched out variants of the quincunx which illustrated these two processes (natural selection, reversion) working in opposing directions to maintain the variation of a feature in successive generations (see Stigler, 2010, for full details, including Galton's sketches of the revised quincunxes). Stigler points out that this "solution" was a fudge, and suggests that Galton realised pretty quickly that it was flawed. The 1877 solution required that the opposing forces of natural selection and reversion were (generally) in perfect balance so that the variation in the ensuing generations stayed constant; and this exact balance was highly implausible.

Galton’s attempt to fix the problems with his 1877 model led to the breakthroughs that he first presented as his Presidential Address to the anthropology section of the British Association for the Advancement of Science (BAAS; Galton, 1885).  Stigler (2010) calls this point in 1885 the launch of the "statistical enlightenment".  In particular, he says this work was "Galton’s single most important and best known contribution to statistics: he introduced ... the idea of thinking of the [data from] two generations as a bivariate normal pair, with two different and conceptually distinct lines of conditional expectation, the ‘regression’ lines, to use the term that he introduced here in place of reversion [p. 476, emphasis added]".

In a series of shorter papers in Nature, Science, and elsewhere, Galton reproduced parts of his address to the BAAS, although the fullest exposition was in a pair of longer papers (Galton, 1886a,b).

Galton's insights for the 1885 address were based on analyses of different data. He stated that it was important to corroborate the sweet pea observations in an entirely independent dataset. After some practical struggles, he eventually managed to get data on the height of 928 children and their 205 pairs of parents. He plotted heights of the children against the average ("mid-parental") heights of their respective mothers and fathers, multiplying each female height by 1.08 to allow for sex differences in the heights of men and women. The data were published as Table 1 in Galton (1886b), and then graphed in revealing ways. The first children vs. parents graph (from Galton, 1886b; Plate IX) is shown below, illustrating the same “regression to the mean” phenomenon he had observed in the sweet peas.  


The graph above plots the data in 9 height bands: the median of the children's heights within that band (open circles) in relation to the median of their parents' heights (short horizontal lines), along with an approximate line of best fit. As noted, the plot reveals regression to the mean similar to that for the pea seeds. However, in this case the proportion of regression, q, is equal to 2/3 rather than the 1/3 observed for the peas. Galton noted that the same height data revealed another regression relationship. He specifically stated that the data showed "when it is read [...] in vertical columns instead of in horizontal lines, that the most probable mid-parentage of a man is one that deviates only one-third as much as the man does." In other words, if an adult child is above or below mean height, then their parents' heights will show regression towards the mean but by a smaller q value. The above quote shows that Galton estimated this at 1/3, compared with the 2/3 value when moving from parents to children. 

The fullest insights from the human stature data came, however, from a new form of visualisation. Galton plotted out, in two dimensions, the "smoothed" frequency data for the children's heights against the mid-parental heights, writing the number of observations found at each position on the graph. The main part of that figure (Galton, 1886b, Plate X) is shown below:



Galton's own words describe what this visualisation allowed him to see, something that had eluded generations of great mathematicians and scientists who had previously considered the mathematics underpinning the relationship between two variables. He noticed "that lines drawn through entries of the same value formed a series of concentric and similar ellipses. Their common centre lay at the intersection of the vertical and horizontal lines, that corresponded to 68.25 inches. Their axes were similarly inclined [p. 254-5]." One of those ellipses is drawn in the above figure. Galton's earlier contributions in meteorology (he created the first modern weather maps) probably helped him to conceive this way of plotting the data: he noted later in the paper (Galton, 1886b, p. 263) that these ellipses were just like "isobars".

Galton went further: "The points where each ellipse in succession was touched by a horizontal tangent, lay in a straight line inclined to the vertical in the ratio of 2/3; those where they were touched by a vertical tangent lay in a straight line inclined to the horizontal in the ration [sic] of 1/3. These ratios confirm the values of average regression already obtained by a different method, of 2/3 from mid-parent to offspring, and of 1/3 from offspring to mid-parent [p. 255, emphasis added]".

He drew a figure to illustrate this rather wordy description, as shown below (Galton, 1886b; Plate X, Fig a).


The above figure depicts a segment of the ellipse from the main plate, along with its major axis. Line ON is inclined to the vertical in the ratio 2/3, and line OM is inclined to the horizontal in the ratio 1/3. The dotted line OL is the line which would reflect the complete absence of any regression to the mean. If one extends the horizontal tangential line YN to meet line OL, one can see the 2/3 ratio again; similarly, by extending the vertical tangential line XM to meet OL, one can see the 1/3 ratio once more.

Galton commented on the compelling geometry, which he had empirically uncovered, in the following way: "These and other relations were evidently a subject for mathematical analysis and verification [p. 255]". But he did not carry out this mathematical analysis himself (see below). Galton seems to me to have been an intuitive thinker: his mathematical and statistical insights came from graphing his data very carefully, and by conceiving and building physical machines (such as the quincunx) that could generate and/or explain interesting data patterns. In Galton (1886b; Plate IX, right-hand panel) he sketched a mechanical device with pulleys and cords which would predict the height of a child from the heights of the parents. 

The level of Galton's formal maths skill seems more uncertain to me: he had studied maths as an undergraduate at Trinity College, Cambridge, but he did not complete an honours degree. In several papers he recruited the help of better mathematicians to show his conclusions in a more formal way. In this series of papers his mathematical collaborator was James Hamilton Dickson, whose brief Wikipedia page doesn't even mention the contributions he made to statistics through his collaboration with Galton. Stigler says Hamilton Dickson helped "a little with the mathematics at a late stage [p. 475]". I think his contribution was rather more than that. Hamilton Dickson wrote out his mathematical treatment in full in the Appendix to Galton (1886a). Even if posterity has rather forgotten about Hamilton Dickson's role in statistics, it is obvious that Galton really appreciated his efforts, stating in Galton (1886b), and in several other places using identical phrasing, that he had "never felt such a glow of loyalty and respect towards the sovereignty and magnificent sway of mathematical analysis as when [Hamilton Dickson's] answer reached me, confirming, by purely mathematical reasoning, my various and laborious statistical conclusions with far more minuteness than I had dared to hope [p. 255]". Whatever one might think about Galton, given his advocacy of eugenics, for me this is an eloquent and touching statement about the power of maths in science.

With Hamilton Dickson's help, Galton squeezed yet more juice out of his stature data and presented the details in Galton (1886a). Before turning to consider that, however, it is worth noting that Galton also reported further empirical observations in this second 1886 paper. These data revealed that the regression relationship between the heights of adult men and their brothers was very similar to the one he had reported for adult children and their parents. This meant that the observed relationships were not something specific to the processes involved in the transmission of a feature between generations, as he had previously thought. Stigler (2010) notes that the 1886 papers removed the need for the implausibly balanced natural selection processes that were part of the 1877 account. Stigler says: "Galton stated that he could now ‘get rid of all these complications’. In supreme irony, in what had started out as an attempt to mathematize the framework of the Origin of Species ended with the essence of that great work being discarded as unnecessary! [p. 475]"

Galton's remarkable graphs, shown above, are fairly simple to grasp; however, the mathematical details of these papers are hard work. This is, in no small part, because mathematicians of that era generally wrote about mathematical topics in a highly stylised way that doesn't map comfortably onto modern terminology and concepts. And some of the terminology that Galton employed has largely disappeared from use. What, for example, was this thing called "probable error" that he kept on referring to? Here I was assisted greatly by an Appendix from an as yet unpublished monumental biography of Galton, by Gavan Tredoux (he curates the Galton archive mentioned above). Tredoux's Appendix led me through the antiquarian maths of Galton and Hamilton Dickson so I was able to see clearly what they had concluded. Tredoux's Appendix can be read here. There are several juicy morsels in the maths (honestly), to which I will probably return in later blog entries.

Galton's (1886a) article, and especially Hamilton Dickson's Appendix to that paper, gives numerous important results about regression. Translating them into modern terminology with the help of Tredoux, I will focus on just the key points. Firstly, Hamilton Dickson showed that the frequency "isobars" in Galton's data plots were indeed expected to be ellipses, under the assumptions that the mid-parent heights followed a normal distribution and that, for a given mid-parent height, the offspring heights were normally distributed about the regression line with a constant spread. Hamilton Dickson gave the equations for the ellipses, and for the horizontal and vertical tangents to the ellipses. He plugged in three values estimated by Galton, who had derived his estimates by careful plotting. Hamilton Dickson was then able to compute the remaining numerical values that Galton had estimated. The details of the close correspondence between Galton's estimated and Hamilton Dickson's computed values are very clearly stated in Galton (1886b, p. 263); it was the closeness of the match between data and mathematical theory that led to Galton's fulsome tribute, quoted above.

Having eventually digested the mathematics, I wrote some computer code (currently in Matlab; available here) which allowed me to verify Hamilton Dickson's calculations and plot out the ellipses and tangents, using the formulae for a general ellipse (see here and here for details). A typical output figure from my code is shown below:


The above plot is essentially the same (apart from changes of aspect ratio) as the figures from Galton's papers, reproduced above. I should add that my scale is arbitrary. I have plotted 4 ellipses, as well as the red line connecting the origin at (0,0) to the points where the horizontal tangents (also in red) touch the ellipses. The green line connects the origin to the points where the green vertical tangents touch the ellipses.

The ellipse in magenta was drawn using Equation 7 of Hamilton Dickson's Appendix to Galton (1886a), with the three values estimated by Galton plugged into that equation. One estimate (1.22 inches) was a measure of the variation in the mid-parent height data (denoted a in Hamilton Dickson's equations). Another was a measure of the conditional variation in the offspring heights after allowing for the influence of parental height (1.5 inches, denoted b in Hamilton Dickson's equations). The third of Galton's estimates, denoted tan(theta) by Hamilton Dickson, was the slope (relative to the y-axis) of the red line joining the red tangents in the above figure. This reflects the regression of the offspring heights towards the mean of the parental heights; as already discussed, Galton estimated this value at 2/3.

As already mentioned, Hamilton Dickson used his ellipse equations, plugging in Galton's estimated values, to compute various other measures. These included: the ratio of the lengths of the long and short axes of the ellipses (1.87, according to Hamilton Dickson); and the angle that the long axes of the ellipses make with respect to the x-axis (around 27 degrees, according to Hamilton Dickson). Galton's graph-based estimates of these values were 1.96 and 25 degrees respectively.

My computer code used the Hamilton Dickson values (1.87 and 27 degrees) to draw the 3 blue ellipses in the above figure. These are scaled, relative to the magenta ellipse, by constants which increase (or decrease) the lengths of the axes of the ellipses. The blue ellipses are nicely concentric with the magenta ellipse (which was drawn using Hamilton Dickson's ellipse equation 7). This shows that his calculations of the 1.87 ratio and the 27 degree angle were accurate! Note also that the green line from the origin does go through the contact points between the ellipses and the green vertical tangents. The gradient of the green line from the origin, relative to the x-axis, estimates the regression of the parent heights towards the mean of the offspring heights. The value from my computer code is 0.34 (matching Hamilton Dickson again), which is also very close to the value of 1/3 that Galton estimated, as noted earlier in this blog entry.
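For anyone who wants to check these numbers without wading through my full plotting code, here is a minimal Matlab/Octave sketch of the calculation (my own reconstruction, not Hamilton Dickson's Equation 7 itself). It treats Galton's probable errors as if they were standard deviations, which is harmless here because the common conversion factor cancels out of every ratio involved, and it assumes the orientation of my figure, with offspring deviations on the x-axis and mid-parent deviations on the y-axis:

```matlab
% Reconstruct Hamilton Dickson's computed values from Galton's three estimates.
a    = 1.22;   % variation ("probable error") of the mid-parent heights
b    = 1.50;   % conditional variation of offspring heights, given the mid-parents
beta = 2/3;    % Galton's regression slope of offspring on mid-parent

varP  = a^2;                 % marginal variance of mid-parent deviations
varC  = beta^2*varP + b^2;   % implied marginal variance of offspring deviations
covCP = beta*varP;           % implied covariance between the two
Sigma = [varC covCP; covCP varP];   % covariance matrix (offspring on x, mid-parent on y)

beta_rev = covCP/varC;       % reverse regression, mid-parent on offspring: approx 0.34

[V, D]    = eig(Sigma);                    % ellipse geometry from the eigen-decomposition
[lam, ix] = sort(diag(D), 'descend');
axisRatio = sqrt(lam(1)/lam(2));           % long axis / short axis: approx 1.87
angleDeg  = atand(V(2,ix(1))/V(1,ix(1)));  % tilt of the long axis from the x-axis: approx 27

fprintf('reverse slope %.2f, axis ratio %.2f, angle %.1f degrees\n', ...
        beta_rev, axisRatio, angleDeg)
```

Running this prints a reverse slope of 0.34, an axis ratio of 1.87 and an angle of about 27 degrees, matching the Hamilton Dickson values quoted above.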

Stigler (2010), as usual, eloquently sums up what Galton had shown: "Galton had conceived of pairs of generations or other pairs of quantities as a true multivariate statistical object that could be sliced and diced, and examined both marginally and conditionally from any point of view, and this had a signal effect on the theory and practice of inference. This was a radically new statistical perspective, and it gave us a new type of question to ask, and a new way to think about statistical association: and, more fundamentally, a new way to think about inference [p. 480]".

Above, we have gone through the development of the concept of regression in gory detail, highlighting Galton's visualisation of the ellipses. These ellipses are the “isofrequency contours” of the bivariate normal distribution. From here it is a short step to the modern correlation coefficient itself. I find it a bit surprising (see also below) that Galton did not take that step in the 1886 papers. Within a decade, work by Edgeworth and by Pearson brought into mainstream awareness the derivation and formula for what is now known as Pearson's product-moment correlation coefficient. In fact, the derivation of the coefficient had been given some 50 years earlier by a French physicist and crystallographer called Auguste Bravais. As regularly happens in science and maths, the first proponent of a major idea is not the person whose name becomes synonymous with that idea. This wry observation is called Stigler's Law. It amuses me (at least) that Stigler named this law himself when he described it, presumably to stop a later historian of science getting the namecheck.

Stigler (1989) reflects on Galton's own account of how he happened upon the key ideas relating to the correlation coefficient (which he wrote in Galton, 1890). Galton had actually first coined the term correlation, for the statistical measure as we currently conceive it, in an earlier paper presented to the Royal Society (Galton, 1888), although he spelled it "co-relation" initially. In 1888 he also symbolised it with the letter r, which to this day is ubiquitously used to denote the sample value of the correlation. In all subsequent papers (e.g., Galton, 1890) Galton adopted the more familiar spelling.

Stigler (1989) argues that Galton lacked three further insights in 1886, and that these gaps prevented him from grasping the steps required to move from regression to correlation. As we have seen, Galton had stressed the observation that the regression slope for predicting children's heights from those of their parents (2/3) was different from the slope for predicting parents' heights from the heights of their children (1/3). If he had standardised the variables then the regression slopes would have been the same in either direction, and would correspond to the formula for the correlation coefficient (see Rodgers and Nicewander, 1988, equation 3.1), as the formulas below make explicit. Galton also did not realise in the mid 1880s that the regression slope between two standardised variables would be the neatest way to summarise the strength of the relationship between them. Stigler feels (as I do) that these two blockages in Galton's thinking were surprising, as all the pieces were available to him in 1885/6 and were implicit in the methods he used for regression. Stigler feels the third and most important blockage derived from Galton's focus on data from heredity problems, which inevitably involved relating a variable in one generation to the same variable in another generation. Galton failed at that time to see the wider applicability of his method of regression.
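In modern notation (again, mine rather than Galton's or Stigler's), with standard deviations \(s_x\) and \(s_y\) and correlation \(r\), the two regression slopes are

\[ b_{y \mid x} = r\,\frac{s_y}{s_x}, \qquad b_{x \mid y} = r\,\frac{s_x}{s_y}, \]

so that after standardising both variables (making \(s_x = s_y = 1\)) each slope reduces to \(r\) itself; equivalently, the product of the two unstandardised slopes is \(r^2\).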

Galton's 1890 paper opens with another great quote, which describes the effect of the removal of the third conceptual blockage identified by Stigler. The quote states the joy he felt in 1888 when he saw that he could break out from the confines of working with problems of heredity. Galton wrote: "Few intellectual pleasures are more keen than those enjoyed by a person who, while he is occupied in some special inquiry, suddenly perceives that it admits of a wide generalization, and that his results hold good in previously-unsuspected directions [p. 419]."

As usual, Galton's inspiration came from working with datasets he had painstakingly collected. He was using the data to investigate two problems that he had come across through his interests in anthropology and forensic science: how to estimate the height of an unknown man from the length of one of his bones; and how to capture the relationship between the bodily dimensions of the same person. Galton realised that these two new problems, and the kinship problems he had already analysed, were "no more than cases of a much more general problem--namely, that of Correlation [p. 420]."

Galton's 1890 paper was an invited piece written for a generalist publication, the North American Review (NAR). Stigler (1989) gives a lovely account of the protracted exchange of letters between the editor of the NAR, Thorndike Rice, and Galton, who was very reluctant to write a piece for the NAR. Rice's persistence won through, and the article eventually appeared. The fact that it was written for the general reader might explain why it is (for me) Galton's clearest statistical paper. He illustrates the concept of correlation through "a succession of examples, rather than by a formal definition". My favourite is the scenario of two clerks who leave work at the same time and catch the same "somewhat unpunctual omnibus". They get off at the same stop and walk to their respective homes. Galton argues that their arrival times at home will be related, because both depend in part upon the lateness and slowness of the bus, and that the relationship will be especially strong if they both live close to the stop where they were dropped off. This is a nice example of a latent common cause (bus travel) producing a correlation between two observables (the “getting home” times of clerk A and clerk B), as the toy simulation below illustrates. I may return to this example in a subsequent blog entry.
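To make the point concrete, here is a toy Matlab/Octave simulation of the omnibus scenario (all of the numbers are invented purely for illustration): a shared, variable bus journey plus two independent walks home produces correlated arrival times for the two clerks.

```matlab
% Toy simulation of Galton's omnibus example: a latent common cause (the bus)
% induces a correlation between the two clerks' arrival times at home.
n     = 10000;                 % number of simulated evenings
bus   = 10 + 5*randn(n,1);     % shared bus lateness/slowness, in minutes
walkA = 5 + 2*randn(n,1);      % clerk A's walk home, independent of the bus
walkB = 8 + 2*randn(n,1);      % clerk B's walk home, independent of the bus
homeA = bus + walkA;           % clerk A's time of arrival home
homeB = bus + walkB;           % clerk B's time of arrival home

R = corrcoef(homeA, homeB);
R(1,2)                         % about 0.86 with these settings
```

Shrinking the variability of the two walks relative to that of the bus pushes the correlation towards 1, which is Galton's point about the clerks who live close to the stop.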

Galton (1890) referred explicitly to variables that conform to a normal distribution (ones which show "the normal form of variability") and stressed how a measure of dispersion ("probable error" in Galton's day, standard deviation or variance today) was vitally important in characterising such variables. Galton also continued to work with centred variables, i.e. ones with the sample mean subtracted from each value. He put everything together explicitly in the 1890 paper using an example of trying to relate the length of a person's left middle finger to their height ("stature"). He reported his analyses: specifically that a 1 inch deviation from the mean in finger length corresponded to an 8.19 inch deviation from the mean in height, and a 1 inch deviation in height was associated with a 0.06 inch deviation in finger length. In summary, Galton said: "There is no numerical reciprocity in these figures, because the scales of dispersion of the lengths of the finger and of the stature differ greatly, being in the ratio of 15 to 175. But the 6 hundredths multiplied into the fraction of 175 divided by 15, and the 819 hundredths multiplied into that of 15 divided by 175, concur in giving the identical value of 7 tenths, which is the index of their correlation [p.430, emphasis added]." In this moment of inspiration Galton conceived of standardising the variables (into what we would now call z-scores) and generating the correlation coefficient as we know it today. His calculations reflect one of the many formulae one can use to compute a correlation coefficient (specifically Equation 3.1 in Rodgers & Nicewander, 1988).
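Just to check the arithmetic in that quote, here is the same calculation written out in Matlab/Octave:

```matlab
% Galton's two regression slopes, rescaled by the ratio of the dispersions
% (15 for finger length, 175 for stature, in Galton's units), give the same
% "index of correlation" of about 0.7.
slope_stature_on_finger = 8.19;    % inches of stature per inch of finger length
slope_finger_on_stature = 0.06;    % inches of finger length per inch of stature
dispersion_ratio        = 15/175;  % finger dispersion relative to stature dispersion

r1 = slope_stature_on_finger * dispersion_ratio   % 8.19 * 15/175 = approx 0.70
r2 = slope_finger_on_stature / dispersion_ratio   % 0.06 * 175/15 = 0.70
```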

Doing this research on the history of regression and correlation has been quite revelatory to me, even though I knew something of it already. I have always had admiration for the facility possessed by some people, such as Galton, through which they are able to conceive of mathematical concepts visually (or even mechanically, in Galton's case). I am more of a verbal plodder who likes to work through sets of equations line by line to get to the final result (and then check them with a bespoke computer programme, as above). However, there have been widely-used visualisations in statistics that I have found to be unhelpful and even misleading. One of these "visualisation bugbears" of mine relates directly to these historical insights into regression and correlation. In a later blog entry I will illustrate how this particular duff data visualisation, known as a ballantine, might never have come about if Galton's insights had been kept in mind.



References

Galton, F. (1877). Typical laws of heredity. Proceedings of the Royal Institution of Great Britain, 8, 282–301.

Galton, F. (1885). Presidential Address, Section H, Anthropology. Report of the British Association for the Advancement of Science, 55, 1206–1214.

Galton, F. (1886a). Family likeness in stature. Proceedings of the Royal Society, 40, 42–63.

Galton, F. (1886b). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.

Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society, 45, 135–145.

Galton, F. (1890). Kinship and correlation. The North American Review, 150, 419–431.

Ozer, D. J. (1985). Correlation and the coefficient of determination. Psychological Bulletin, 97(2), 307–315.

Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66.

Stigler, S. M. (1989). Francis Galton’s account of the invention of correlation. Statistical Science, 4(2), 73–79.

Stigler, S. M. (2010). Darwin, Galton and the statistical enlightenment. Journal of the Royal Statistical Society: Series A, 173(3), 469–482.

Tredoux, G. (2019). The Mathematics of Natural Inheritance (1889). https://galton.org/tredoux/tredoux-2019-math-nat-inheritance.pdf
 





Comments

  1. My computer code is currently available only in Matlab, but I will create a version for the free Matlab clone known as Octave.
