Seeing History
Correlation

Correlation has a common meaning: how well things match up with each other. In statistics, the concept acquires greater precision, and also greater interpretive power. It works like this: given two measurable quantities recorded for each of n different objects, correlation tells us how closely changes in one quantity can be predicted from changes in the other. If we measure the height and the annual income of 18 people, we can calculate how far income can be predicted from height in that sample. Take another sample:

The Streets of Islington

You are the Mayor of Islington. The Mothers of Islington come to you in a body, demanding that more open space be created, since out of 18 districts of London, Islington ranks at the bottom for proportion of open space (variable x, Sp), but near the top for proportion of accidents that are accidents to children (variable y, Ac). They thrust in your face these figures, expressed as fractions, for Islington and the other 17 districts.

1                 2         3
District          x (Sp)    y (Ac)
Bermondsey        0.050     0.463
Camberwell        0.052     0.336
Deptford          0.022     0.434
Finsbury          0.020     0.388
Fulham            0.042     0.422
Hammersmith       0.122     0.283
Hampstead         0.148     0.171
Islington         0.013     0.429
Marylebone        0.236     0.178
Paddington        0.072     0.336
Poplar            0.045     0.370
Shoreditch        0.014     0.400
Southwark         0.031     0.333
Stepney           0.025     0.374
Stoke Newington   0.063     0.308
Wandsworth        0.146     0.238
Westminster       0.275     0.108
Woolwich          0.070     0.382

How to make sense of them? You mentally calculate the average value, or mean, of the x column, and then subtract that average (mx, the mean of x) from each individual value of x to make the high and low values more obvious. The result is a column of (x-mx) figures. You then do the same to y. You have:

1                 2         3         4         5
District          x (Sp)    y (Ac)    (x-mx)    (y-my)
Bermondsey        0.050     0.463     -0.030    +0.132
Camberwell        0.052     0.336     -0.028    +0.005
Deptford          0.022     0.434     -0.058    +0.103
Finsbury          0.020     0.388     -0.060    +0.057
Fulham            0.042     0.422     -0.038    +0.091
Hammersmith       0.122     0.283     +0.042    -0.048
Hampstead         0.148     0.171     +0.068    -0.160
Islington         0.013     0.429     -0.067    +0.098
Marylebone        0.236     0.178     +0.156    -0.153
Paddington        0.072     0.336     -0.008    +0.005
Poplar            0.045     0.370     -0.035    +0.039
Shoreditch        0.014     0.400     -0.066    +0.069
Southwark         0.031     0.333     -0.049    +0.002
Stepney           0.025     0.374     -0.055    +0.043
Stoke Newington   0.063     0.308     -0.017    -0.023
Wandsworth        0.146     0.238     +0.066    -0.093
Westminster       0.275     0.108     +0.195    -0.223
Woolwich          0.070     0.382     -0.010    +0.051
SUM (S)           1.446     5.953
mean (S/18)       0.080     0.331

Hmm, you say to yourself, look at that zigzag pattern. The minus or below-average values of x do seem to match the plus or above-average values of y. There is clearly something going on here.
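The centering step can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `centre` and the choice to round the means to three decimal places (to match the hand-worked table) are not part of the original account.

```python
# Figures copied from the table: x = open space, y = child-accident share,
# listed in the same district order (Bermondsey ... Woolwich).
districts = ["Bermondsey", "Camberwell", "Deptford", "Finsbury", "Fulham",
             "Hammersmith", "Hampstead", "Islington", "Marylebone",
             "Paddington", "Poplar", "Shoreditch", "Southwark", "Stepney",
             "Stoke Newington", "Wandsworth", "Westminster", "Woolwich"]
xs = [0.050, 0.052, 0.022, 0.020, 0.042, 0.122, 0.148, 0.013, 0.236,
      0.072, 0.045, 0.014, 0.031, 0.025, 0.063, 0.146, 0.275, 0.070]
ys = [0.463, 0.336, 0.434, 0.388, 0.422, 0.283, 0.171, 0.429, 0.178,
      0.336, 0.370, 0.400, 0.333, 0.374, 0.308, 0.238, 0.108, 0.382]


def centre(values, places=3):
    """Hypothetical helper: return the mean (rounded, as in the table)
    and each value's deviation from that rounded mean."""
    mean = round(sum(values) / len(values), places)
    return mean, [round(v - mean, places) for v in values]


mx, dx = centre(xs)
my, dy = centre(ys)

# Print the (x-mx) and (y-my) columns with explicit signs.
for name, ex, ey in zip(districts, dx, dy):
    print(f"{name:<16} {ex:+.3f} {ey:+.3f}")

# Count the districts where the two deviations have opposite signs.
opposite = sum(1 for ex, ey in zip(dx, dy) if ex * ey < 0)
print(f"opposite signs in {opposite} of {len(districts)} districts")
```

Counting sign mismatches is a crude way of putting a number on the zigzag: in these figures, 17 of the 18 districts have deviations of opposite sign, Stoke Newington being the exception.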

We now want to define the relationship more exactly. To do this, we could calculate what is called the correlation coefficient, which is not intuitively meaningful, or we could go for the coefficient of determination (D), which is intuitively meaningful. Let's go for D. To get it, we first multiply together each pair of (x-mx) and (y-my) terms, and put the result in column 6. Then we square each (x-mx) term and each (y-my) term, and put the results in columns 7 and 8 respectively. The result, accomplished by an unobtrusive secretary while you distract the Mothers of Islington with small talk, would look like this:

1                 2         3         4         5         6                7               8
District          x (Sp)    y (Ac)    (x-mx)    (y-my)    (x-mx)(y-my)     (x-mx)²         (y-my)²
Bermondsey        0.050     0.463     -0.030    +0.132    -0.003960        0.000900        0.017424
Camberwell        0.052     0.336     -0.028    +0.005    -0.000140        0.000784        0.000025
Deptford          0.022     0.434     -0.058    +0.103    -0.005974        0.003364        0.010609
Finsbury          0.020     0.388     -0.060    +0.057    -0.003420        0.003600        0.003249
Fulham            0.042     0.422     -0.038    +0.091    -0.003458        0.001444        0.008281
Hammersmith       0.122     0.283     +0.042    -0.048    -0.002016        0.001764        0.002304
Hampstead         0.148     0.171     +0.068    -0.160    -0.010880        0.004624        0.025600
Islington         0.013     0.429     -0.067    +0.098    -0.006566        0.004489        0.009604
Marylebone        0.236     0.178     +0.156    -0.153    -0.023868        0.024336        0.023409
Paddington        0.072     0.336     -0.008    +0.005    -0.000040        0.000064        0.000025
Poplar            0.045     0.370     -0.035    +0.039    -0.001365        0.001225        0.001521
Shoreditch        0.014     0.400     -0.066    +0.069    -0.004554        0.004356        0.004761
Southwark         0.031     0.333     -0.049    +0.002    -0.000098        0.002401        0.000004
Stepney           0.025     0.374     -0.055    +0.043    -0.002365        0.003025        0.001849
Stoke Newington   0.063     0.308     -0.017    -0.023    +0.000391        0.000289        0.000529
Wandsworth        0.146     0.238     +0.066    -0.093    -0.006138        0.004356        0.008649
Westminster       0.275     0.108     +0.195    -0.223    -0.043485        0.038025        0.049729
Woolwich          0.070     0.382     -0.010    +0.051    -0.000510        0.000100        0.002601
SUM (S)           1.446     5.953                         a = -0.118446    b = 0.099146    c = 0.170173
mean (S/18)       0.080     0.331

The boxes are now all filled in (note that mx means "the mean of the x values" and so on), and we can proceed to calculate our answer. It's very easy. All we need are the three sums here called a, b, and c. We substitute them into the following formula:

D = a² / (b × c)

getting

D = 0.014029454 / (0.099146 × 0.170173) = 0.014029454 / 0.016871972 = 0.8315
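For readers who would rather let a machine play the unobtrusive secretary, the whole calculation of a, b, c, and D can be sketched in Python. The function name `determination` is hypothetical. The means are rounded to three decimal places on purpose, an assumption made so that the sums reproduce the hand-worked table; unrounded means would change only the last digits.

```python
# Figures copied from the table, in district order (Bermondsey ... Woolwich).
xs = [0.050, 0.052, 0.022, 0.020, 0.042, 0.122, 0.148, 0.013, 0.236,
      0.072, 0.045, 0.014, 0.031, 0.025, 0.063, 0.146, 0.275, 0.070]
ys = [0.463, 0.336, 0.434, 0.388, 0.422, 0.283, 0.171, 0.429, 0.178,
      0.336, 0.370, 0.400, 0.333, 0.374, 0.308, 0.238, 0.108, 0.382]


def determination(xs, ys, places=3):
    """Hypothetical helper: return (a, b, c, D) as defined in the text."""
    # Means rounded as in the table (0.080 and 0.331 for these figures).
    mx = round(sum(xs) / len(xs), places)
    my = round(sum(ys) / len(ys), places)
    dx = [x - mx for x in xs]               # column 4
    dy = [y - my for y in ys]               # column 5
    a = sum(p * q for p, q in zip(dx, dy))  # sum of column 6
    b = sum(p * p for p in dx)              # sum of column 7
    c = sum(q * q for q in dy)              # sum of column 8
    return a, b, c, a * a / (b * c)


a, b, c, D = determination(xs, ys)
print(f"a = {a:.6f}  b = {b:.6f}  c = {c:.6f}  D = {D:.4f}")
```

Writing the denominator as `(b * c)` makes explicit that a² is divided by the product of b and c, not divided by b and then multiplied by c.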

We can then prefix the sign of a (which here is minus) to the final result, to remind us that the correlation is inverse: as one quantity goes up, the other tends to go down. The minus sign is only a label; D itself, being a ratio of squared quantities, can never be negative. Thus, finally,

D = -0.8315

The D value tells us, as a fraction, how much one of our categories can be predicted from the other category. In this case, it is 83%. That is, about 83% of the variation in the proportion of accidents that are accidents to children is associated with variation in the amount of open space. It is a pretty high level, but still, it leaves some room for politics. Here come the politics:
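One way to make "can be predicted from" concrete is to predict each y from its x with a straight line, and then ask what fraction of the spread in y the predictions account for. The least-squares line used below is an assumption brought in for illustration; the account above does not fit a line. The sketch uses unrounded means, so its result can differ from the hand-worked figure in the last digits.

```python
# Figures copied from the table, in district order (Bermondsey ... Woolwich).
xs = [0.050, 0.052, 0.022, 0.020, 0.042, 0.122, 0.148, 0.013, 0.236,
      0.072, 0.045, 0.014, 0.031, 0.025, 0.063, 0.146, 0.275, 0.070]
ys = [0.463, 0.336, 0.434, 0.388, 0.422, 0.283, 0.171, 0.429, 0.178,
      0.336, 0.370, 0.400, 0.333, 0.374, 0.308, 0.238, 0.108, 0.382]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

a = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
b = sum((x - mx) ** 2 for x in xs)
c = sum((y - my) ** 2 for y in ys)

# Least-squares line through the point (mx, my): slope is a / b.
slope = a / b
predicted = [my + slope * (x - mx) for x in xs]

# Squared prediction error left over after using x to predict y.
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))

# Fraction of the total spread in y (c) that the line accounts for.
explained_fraction = 1 - unexplained / c
print(f"slope = {slope:.4f}, explained fraction = {explained_fraction:.4f}")
```

The explained fraction is algebraically the same quantity as a² / (b × c), which is why D can be read as a share of the variation and not as a share of the accidents themselves.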

[Photograph: Borough Commander Barry Norman]

"Ladies," you then say, "I congratulate you on gathering these figures; they do indeed show a relationship between open space and proportion of accidents to children. About five-sixths of accidents to children can be accounted for by the proportion of open space. But one-sixth cannot be so explained, and surely (eyeing briefly every sixth mother in the room) we would not wish to abandon one-sixth of our children to their fate. Perhaps we should look at it even more closely. If you will ask the Police Commander for a map of the borough showing where accidents to children have occurred, and I will have my secretary request his cooperation, we can pinpoint the exact locations of these lamentable accidents, and take precise and effective steps to prevent them."

And as they head for the doorway, you add silently, Or you could move to Westminster.


Just for the record, if you want instead the classical correlation coefficient C for the example above, it is the square root of D, given the same sign as a. In this case that would be:

C = -0.9119

The standard textbooks will tell you, by means of a series of tables in the back, how to interpret this.
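As a small check on the arithmetic, the coefficient can be recovered from the three sums a, b, and c quoted earlier. The sketch below assumes the usual convention that the coefficient carries the sign of a; the function name `signed_coefficient` is hypothetical.

```python
import math


def signed_coefficient(a, b, c):
    """Hypothetical helper: square root of D = a^2 / (b * c),
    carrying the sign of a."""
    d = a * a / (b * c)
    return math.copysign(math.sqrt(d), a)


# The three sums from the worked table.
a, b, c = -0.118446, 0.099146, 0.170173

C = signed_coefficient(a, b, c)
print(f"C = {C:.4f}")

# The same number in one step, without going through D.
C_direct = a / math.sqrt(b * c)
```

Both routes give the same value, so nothing is lost by computing D first and taking the root afterwards, provided the sign of a is carried along.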


17 Dec 2006