Seeing History
Correlation

Correlation has a common meaning: how well things match up with each other. In statistics, the concept acquires greater precision, and also greater interpretive power. It works like this: given two measurable quantities recorded for each of n different objects, correlation tells us how closely changes in one quantity are predictable from changes in the other. If we measure the height and the annual income of 18 people, we can calculate how far income is a function of height in that sample. Take another sample:

You are the Mayor of Islington. The Mothers of Islington come to you in a body, demanding that more open space be created, since out of 18 districts of London, Islington ranks at the bottom for percentage of open space (variable x, Sp), but near the top for percentage of accidents that are accidents to children (variable y, Ac). They thrust in your face these figures, for Islington and the other 17 districts.

1                 2       3       4       5       6              7             8
District          x (Sp)  y (Ac)
Bermondsey        0.050   0.463
Camberwell        0.052   0.336
Deptford          0.022   0.434
Finsbury          0.020   0.388
Fulham            0.042   0.422
Hammersmith       0.122   0.283
Hampstead         0.148   0.171
Islington         0.013   0.429
Marylebone        0.236   0.178
Paddington        0.072   0.336
Poplar            0.045   0.370
Shoreditch        0.014   0.400
Southwark         0.031   0.333
Stepney           0.025   0.374
Stoke Newington   0.063   0.308
Wandsworth        0.146   0.238
Westminster       0.275   0.108
Woolwich          0.070   0.382

How to make sense of them? You mentally calculate the average value, or mean, of the x column, and then subtract that average (mx, the mean of x) from each individual value of x to make the high and low values more obvious. The result is a column of (x-mx) figures. You then do the same to y. You have:

1                 2       3       4       5       6              7             8
District          x (Sp)  y (Ac)  (x-mx)  (y-my)
Bermondsey        0.050   0.463   -0.030  +0.132
Camberwell        0.052   0.336   -0.028  +0.005
Deptford          0.022   0.434   -0.058  +0.103
Finsbury          0.020   0.388   -0.060  +0.057
Fulham            0.042   0.422   -0.038  +0.091
Hammersmith       0.122   0.283   +0.042  -0.048
Hampstead         0.148   0.171   +0.068  -0.160
Islington         0.013   0.429   -0.067  +0.098
Marylebone        0.236   0.178   +0.156  -0.153
Paddington        0.072   0.336   -0.008  +0.005
Poplar            0.045   0.370   -0.035  +0.039
Shoreditch        0.014   0.400   -0.066  +0.069
Southwark         0.031   0.333   -0.049  +0.002
Stepney           0.025   0.374   -0.055  +0.043
Stoke Newington   0.063   0.308   -0.017  -0.023
Wandsworth        0.146   0.238   +0.066  -0.093
Westminster       0.275   0.108   +0.195  -0.223
Woolwich          0.070   0.382   -0.010  +0.051
SUM (Σ)           1.446   5.953
mean (Σ/18)       0.080   0.331

Hmm, you say to yourself, look at that zigzag pattern. The minus or below-average values of x do seem to match the plus or above-average values of y. There is clearly something going on here.
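For readers who would rather not do the subtraction mentally, the step above can be sketched in Python. This is only an illustrative sketch, not part of the original worked example: the names figures, mx, my, deviations and opposite are assumptions made here, and the means are rounded to three decimal places so that the results match the table.

```python
# Figures transcribed from the table above:
# district -> (x = open space, y = accidents to children).
figures = {
    "Bermondsey": (0.050, 0.463),
    "Camberwell": (0.052, 0.336),
    "Deptford": (0.022, 0.434),
    "Finsbury": (0.020, 0.388),
    "Fulham": (0.042, 0.422),
    "Hammersmith": (0.122, 0.283),
    "Hampstead": (0.148, 0.171),
    "Islington": (0.013, 0.429),
    "Marylebone": (0.236, 0.178),
    "Paddington": (0.072, 0.336),
    "Poplar": (0.045, 0.370),
    "Shoreditch": (0.014, 0.400),
    "Southwark": (0.031, 0.333),
    "Stepney": (0.025, 0.374),
    "Stoke Newington": (0.063, 0.308),
    "Wandsworth": (0.146, 0.238),
    "Westminster": (0.275, 0.108),
    "Woolwich": (0.070, 0.382),
}

n = len(figures)

# Means, rounded to three places to match the table (an assumption of this sketch).
mx = round(sum(x for x, _ in figures.values()) / n, 3)
my = round(sum(y for _, y in figures.values()) / n, 3)

# Deviation of each district from the two means.
deviations = {
    name: (round(x - mx, 3), round(y - my, 3))
    for name, (x, y) in figures.items()
}

# Districts whose two deviations have opposite signs: the "zigzag".
opposite = [name for name, (dx, dy) in deviations.items() if dx * dy < 0]

for name, (dx, dy) in deviations.items():
    print(f"{name:<16} {dx:+.3f} {dy:+.3f}")
print(f"{len(opposite)} of {n} districts have deviations of opposite sign")
```

On these figures, every district except Stoke Newington shows deviations of opposite sign, which is the zigzag the eye picks up.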

We now want to define the relationship more exactly. To do this, we could calculate what is called the correlation coefficient, which is not intuitively meaningful, or we could go for the coefficient of determination (D), which is intuitively meaningful. Let's go for D. To get it, we first multiply together each pair of (x-mx) and (y-my) terms, and put the result in column 6. Then we square each (x-mx) term and each (y-my) term separately, and put the results in columns 7 and 8 respectively. Finally we add up each of those three columns. The result, accomplished by an unobtrusive secretary while you distract the Mothers of Islington with small talk, would look like this:

1                 2       3       4       5       6              7             8
District          x (Sp)  y (Ac)  (x-mx)  (y-my)  (x-mx)(y-my)   (x-mx)²       (y-my)²
Bermondsey        0.050   0.463   -0.030  +0.132  -0.003960      0.000900      0.017424
Camberwell        0.052   0.336   -0.028  +0.005  -0.000140      0.000784      0.000025
Deptford          0.022   0.434   -0.058  +0.103  -0.005974      0.003364      0.010609
Finsbury          0.020   0.388   -0.060  +0.057  -0.003420      0.003600      0.003249
Fulham            0.042   0.422   -0.038  +0.091  -0.003458      0.001444      0.008281
Hammersmith       0.122   0.283   +0.042  -0.048  -0.002016      0.001764      0.002304
Hampstead         0.148   0.171   +0.068  -0.160  -0.010880      0.004624      0.025600
Islington         0.013   0.429   -0.067  +0.098  -0.006566      0.004489      0.009604
Marylebone        0.236   0.178   +0.156  -0.153  -0.023868      0.024336      0.023409
Paddington        0.072   0.336   -0.008  +0.005  -0.000040      0.000064      0.000025
Poplar            0.045   0.370   -0.035  +0.039  -0.001365      0.001225      0.001521
Shoreditch        0.014   0.400   -0.066  +0.069  -0.004554      0.004356      0.004761
Southwark         0.031   0.333   -0.049  +0.002  -0.000098      0.002401      0.000004
Stepney           0.025   0.374   -0.055  +0.043  -0.002365      0.003025      0.001849
Stoke Newington   0.063   0.308   -0.017  -0.023  +0.000391      0.000289      0.000529
Wandsworth        0.146   0.238   +0.066  -0.093  -0.006138      0.004356      0.008649
Westminster       0.275   0.108   +0.195  -0.223  -0.043485      0.038025      0.049729
Woolwich          0.070   0.382   -0.010  +0.051  -0.000510      0.000100      0.002601
SUM (Σ)           1.446   5.953                   a = -0.118446  b = 0.099146  c = 0.170173
mean (Σ/18)       0.080   0.331

The boxes are now all filled in (note that mx means "the mean of the x values" and so on), and we can proceed to calculate our answer. It's very easy. All we need are the three sums here called a, b, and c. We substitute them into the following formula:

D = a² / (b × c)

getting

D = 0.014029454 / (0.099146)(0.170173) = 0.014029454 / 0.016871972 = 0.8315

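And we can then prefix the sign of "a" (which here is minus) to the final result, to remind us that the correlation is inverse: as one quantity goes up, the other tends to go down. The sign is only a label, since D itself, being a ratio of squared quantities, can never be negative. Thus, finally,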

D = -0.8315
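The secretary's arithmetic can be checked with a short Python sketch. It is offered only as an illustration under stated assumptions: the variable names are invented here, and the means and deviations are rounded to three decimal places, as in the table, so that the sums a, b and c come out the same.

```python
# x = percentage of open space, y = percentage of accidents that are
# accidents to children, in the district order of the table above.
x = [0.050, 0.052, 0.022, 0.020, 0.042, 0.122, 0.148, 0.013, 0.236,
     0.072, 0.045, 0.014, 0.031, 0.025, 0.063, 0.146, 0.275, 0.070]
y = [0.463, 0.336, 0.434, 0.388, 0.422, 0.283, 0.171, 0.429, 0.178,
     0.336, 0.370, 0.400, 0.333, 0.374, 0.308, 0.238, 0.108, 0.382]

n = len(x)

# Means rounded to three places, as in the table (an assumption of this sketch).
mx = round(sum(x) / n, 3)
my = round(sum(y) / n, 3)

# Columns 4 and 5: deviations from the means.
dx = [round(xi - mx, 3) for xi in x]
dy = [round(yi - my, 3) for yi in y]

# Sums of columns 6, 7 and 8.
a = sum(p * q for p, q in zip(dx, dy))   # sum of (x-mx)(y-my)
b = sum(p * p for p in dx)               # sum of (x-mx) squared
c = sum(q * q for q in dy)               # sum of (y-my) squared

# Coefficient of determination.
D = a * a / (b * c)

print(f"a = {a:.6f}, b = {b:.6f}, c = {c:.6f}")
print(f"D = {D:.4f}")
print("inverse relationship" if a < 0 else "direct relationship")
```

Carrying the unrounded means through instead changes the sums only in the sixth decimal place and leaves D at 0.8315 to four places.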

The D value tells us, as a fraction, how much of the variation in one of our quantities can be predicted from the other. In this case, it is 83%. That is, the proportion of accidents which are accidents to children is 83% associated with the amount of open space. It is a pretty high level, but still, it leaves some room for politics. Here come the politics:

"Ladies," you then say, "I congratulate you on gathering these figures; they do indeed show a relationship between open space and proportion of accidents to children. About five-sixths of accidents to children can be accounted for by the proportion of open space. But one-sixth cannot be so explained, and surely (eyeing briefly every sixth mother in the room) we would not wish to abandon one-sixth of our children to their fate. Perhaps we should look at it even more closely. If you will ask the Police Commander for a map of the borough showing where accidents to children have occurred, and I will have my secretary request his cooperation, we can pinpoint the exact locations of these lamentable accidents, and take precise and effective steps to prevent them."

And as they head for the doorway, you add silently, Or you could move to Westminster.

Just for the record, if you want instead the classical correlation coefficient C for the example above, it is the square root of D (taken without its reminder sign), given the sign of "a". In this case that would be:

C = -0.9119

The standard textbooks will tell you, by means of a series of tables in the back, how to interpret this.
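As a cross-check, the correlation coefficient can also be computed directly from the three sums as a divided by the square root of b times c, which gives the same number as the signed square root of D. A minimal Python sketch, using the sums a, b and c from the table as given (the variable names are illustrative assumptions):

```python
import math

# The three column sums from the worked table.
a = -0.118446   # sum of (x-mx)(y-my)
b = 0.099146    # sum of (x-mx) squared
c = 0.170173    # sum of (y-my) squared

# Coefficient of determination, always non-negative.
D = a ** 2 / (b * c)

# Correlation coefficient: square root of D, carrying the sign of a.
C_from_D = math.copysign(math.sqrt(D), a)

# The same coefficient computed directly from the sums.
C_direct = a / math.sqrt(b * c)

print(f"D = {D:.4f}")
print(f"C = {C_from_D:.4f} (from D), {C_direct:.4f} (direct)")
```

The two routes agree, so nothing is lost by working with D first and taking the root afterwards.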