The Pearson product-moment correlation coefficient (PMCC) is a quantity between -1.0 and 1.0 that estimates the strength of the linear relationship between two random variables.

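The PMCC in its usual form is somewhat cumbersome to calculate. Using simple algebra, I have rearranged it into an expression that requires fewer calculations and only a single pass over the data. One caution: the rearranged form subtracts large, nearly equal sums, so it can lose precision to cancellation when the values have a large mean relative to their spread.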

*Disclaimer:* This page is primarily for my own reference. I am a
programmer without formal training in statistics, and I don't even
*feel* like I know what I'm doing. There are probably a ton of
assumptions that I am unwittingly making, I am almost certainly misusing the
terminology, and I could simply be flat-out wrong here. My apologies to
statisticians and to people like Zed Shaw who
have a far greater understanding of this stuff than I do. If you blindly
trust this page while building something important---even after what I just
told you---then the blame is all yours. Use your brain. Don't believe
everything you read. This is not even a very interesting article: It's
mostly just algebra. Don't read this page; it's a waste of your time.

*On a more serious note:* In an attempt to make this article less
cringe-worthy, I made an effort to find the original peer-reviewed article(s)
where the Pearson correlation might be defined precisely, but nothing I read
cited primary references (MathWorld
just cited textbooks, for example) and I don't have the money to buy
expensive journal articles for every little web page I write. After
searching for most of a day, I finally gave up in frustration and decided to
post this article anyway, flaws and all. If this article makes you cringe,
please consider doing something to advance the principle of open access.

Imagine we have two populations *X* and *Y*. Then
ρ_{X,Y} represents the *product-moment coefficient of correlation* between
them.

Various websites and textbooks describe the correlation coefficient in several equivalent ways:

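- As the ratio between the covariance of *X* and *Y* and the product of their standard deviations:

  $$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$$

- As the sum of the products of each pair of standard scores of the *X* and *Y* values, all divided by the number of degrees of freedom:

  $$r = \frac{1}{n-1} \sum_{i=1}^{n} z_{X_i} z_{Y_i}$$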

We often can't work with populations directly, so we can't determine the
exact value of ρ_{X,Y}. However, we can estimate it by
selecting a random sample of (x,y) pairs. This estimate is often labelled
*r*. Since we can use the same formula for either case, I call the
general formula *PMCC*(*X*,*Y*).

Let *PMCC*(*X*,*Y*) be the Pearson product-moment correlation
coefficient of two *n*-dimensional vectors *X* = {*X*_{1},*X*_{2},...,*X*_{n}} and
*Y* = {*Y*_{1},*Y*_{2},...,*Y*_{n}}.

Let $\bar{X}$ and $\bar{Y}$ be the arithmetic
means of the elements in *X* and *Y*, respectively.

Let *s*_{X} and *s*_{Y} be the standard deviations of
*X* and *Y*, respectively.

Then, the following relations apply. Note how we define *N* in
order to avoid having to do two separate analyses for population and
sample data:

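$$N = \begin{cases} n & \text{for population data} \\ n - 1 & \text{for sample data} \end{cases}$$

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

$$s_X = \sqrt{\frac{1}{N} \sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad s_Y = \sqrt{\frac{1}{N} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

$$\mathit{PMCC}(X,Y) = \frac{1}{N} \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{s_X}\right) \left(\frac{Y_i - \bar{Y}}{s_Y}\right)$$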
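As a rough illustration, the definition can be turned into code more or less literally. This is only a sketch under my own assumptions: the function name `pmcc_two_pass` and the `sample` flag are made up for this example, and the inputs are assumed to be plain sequences of numbers that are not all identical.

```python
import math


def pmcc_two_pass(xs, ys, sample=True):
    """Pearson correlation computed straight from the definition.

    sample=True uses N = n - 1; sample=False uses N = n.
    The name and the flag are illustrative, not from any library.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    big_n = n - 1 if sample else n

    # First pass: arithmetic means.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Second pass: standard deviations, both using the same divisor N.
    # (A constant input gives a zero deviation and a ZeroDivisionError below.)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / big_n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / big_n)

    # Sum of the products of the standard scores, divided by N.
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return total / big_n
```

Note that this version walks over the data more than once, which is part of what the rest of the derivation tries to avoid.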

Given the vectors *X* and *Y*, there are a few things we can calculate right away:

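The sums of all the elements in each vector:

$$A_X = \sum_{i=1}^{n} X_i, \qquad A_Y = \sum_{i=1}^{n} Y_i$$

The squares of the sums of all the elements in each vector:

$$B_X = \left(\sum_{i=1}^{n} X_i\right)^2 = A_X^2, \qquad B_Y = \left(\sum_{i=1}^{n} Y_i\right)^2 = A_Y^2$$

The sums of the squares of all the elements in each vector:

$$C_X = \sum_{i=1}^{n} X_i^2, \qquad C_Y = \sum_{i=1}^{n} Y_i^2$$

The sum of the products of the corresponding elements from each vector:

$$D = \sum_{i=1}^{n} X_i Y_i$$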
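All of these quantities can be accumulated in one loop over the paired data. The sketch below shows one way to do that; the function name `running_sums` and the dictionary keys are my own labels for the quantities described in words above, not established notation.

```python
def running_sums(xs, ys):
    """Accumulate the basic sums in a single pass over paired data.

    Returns a dict whose keys are illustrative names for the
    quantities listed in the text.
    """
    n = 0
    sum_x = sum_y = 0.0          # sums of the elements
    sum_sq_x = sum_sq_y = 0.0    # sums of the squares
    sum_xy = 0.0                 # sum of the element-wise products

    for x, y in zip(xs, ys):
        n += 1
        sum_x += x
        sum_y += y
        sum_sq_x += x * x
        sum_sq_y += y * y
        sum_xy += x * y

    return {
        "n": n,
        "sum_x": sum_x,
        "sum_y": sum_y,
        "sq_sum_x": sum_x ** 2,  # square of the sum
        "sq_sum_y": sum_y ** 2,
        "sum_sq_x": sum_sq_x,
        "sum_sq_y": sum_sq_y,
        "sum_xy": sum_xy,
    }
```

Because only running totals are kept, this works on any iterable of pairs and does not need the whole data set in memory.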

We can now reduce the arithmetic means in terms of our previous calculations:

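$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{A_X}{n}, \qquad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i = \frac{A_Y}{n}$$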

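To simplify the standard deviations, we first reduce the sum of the squares of the deviations, using $\bar{X} = A_X / n$, $B_X = A_X^2$ and $C_X = \sum X_i^2$:

$$\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})^2 &= \sum_{i=1}^{n} X_i^2 - 2\bar{X} \sum_{i=1}^{n} X_i + n\bar{X}^2 \\
&= C_X - \frac{2 A_X^2}{n} + \frac{A_X^2}{n} \\
&= C_X - \frac{B_X}{n} = \frac{n C_X - B_X}{n}
\end{aligned}$$

Then, we simplify the variance:

$$s_X^2 = \frac{1}{N} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{n C_X - B_X}{nN}$$

Therefore, the reduced standard deviations are:

$$s_X = \sqrt{\frac{n C_X - B_X}{nN}}, \qquad s_Y = \sqrt{\frac{n C_Y - B_Y}{nN}}$$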
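A quick way to gain some confidence in the reduced standard deviation is to compare it against Python's `statistics.pstdev` (divisor *n*) and `statistics.stdev` (divisor *n* - 1). The helper name `reduced_stdev` and the sample data are assumptions made for this sketch.

```python
import math
import statistics


def reduced_stdev(values, sample=True):
    """Standard deviation from the sum and the sum of squares only.

    Computes sqrt((n * sum_of_squares - sum**2) / (n * N)),
    with N = n - 1 for a sample and N = n for a population.
    """
    n = len(values)
    big_n = n - 1 if sample else n
    total = sum(values)
    total_of_squares = sum(v * v for v in values)
    return math.sqrt((n * total_of_squares - total ** 2) / (n * big_n))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population form should agree with statistics.pstdev.
print(reduced_stdev(data, sample=False), statistics.pstdev(data))

# Sample form should agree with statistics.stdev.
print(reduced_stdev(data, sample=True), statistics.stdev(data))
```

For well-behaved data like this the two columns match to within rounding; with a large mean and a small spread the reduced form can drift because of the subtraction under the square root.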

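Recall the PMCC formula:

$$\mathit{PMCC}(X,Y) = \frac{1}{N} \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{s_X}\right) \left(\frac{Y_i - \bar{Y}}{s_Y}\right) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{N s_X s_Y}$$

and the results we derived (with $A_X = \sum X_i$, $B_X = A_X^2$, $C_X = \sum X_i^2$, and likewise for *Y*):

$$\bar{X} = \frac{A_X}{n}, \qquad \bar{Y} = \frac{A_Y}{n}$$

$$s_X = \sqrt{\frac{n C_X - B_X}{nN}}, \qquad s_Y = \sqrt{\frac{n C_Y - B_Y}{nN}}$$

Performing substitution, we get:

$$\begin{aligned}
\mathit{PMCC}(X,Y) &= \frac{\sum_{i=1}^{n} \left(X_i - \frac{A_X}{n}\right)\left(Y_i - \frac{A_Y}{n}\right)}{N \sqrt{\frac{n C_X - B_X}{nN}} \sqrt{\frac{n C_Y - B_Y}{nN}}} \\
&= \frac{n \sum_{i=1}^{n} \left(X_i - \frac{A_X}{n}\right)\left(Y_i - \frac{A_Y}{n}\right)}{\sqrt{(n C_X - B_X)(n C_Y - B_Y)}}
\end{aligned}$$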

Notice how the formula no longer depends on whether the data is from a population or from a sample.
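The cancellation of *N* is easy to check numerically. In the sketch below, `pmcc_with_divisor` is a hypothetical helper that takes the divisor as an explicit argument, and the data values are arbitrary.

```python
import math


def pmcc_with_divisor(xs, ys, big_n):
    """Correlation with an explicit divisor N for the standard deviations.

    Uses sum((x - mean_x) * (y - mean_y)) / (N * s_x * s_y).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / big_n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / big_n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (big_n * s_x * s_y)


xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 6.0, 8.0]
n = len(xs)

as_population = pmcc_with_divisor(xs, ys, n)      # N = n
as_sample = pmcc_with_divisor(xs, ys, n - 1)      # N = n - 1

# The two results should agree up to floating-point rounding.
print(as_population, as_sample)
```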

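Let's reduce the summation from the previous section, writing $A_X = \sum X_i$, $A_Y = \sum Y_i$ and $D = \sum X_i Y_i$:

$$\begin{aligned}
\sum_{i=1}^{n} \left(X_i - \frac{A_X}{n}\right)\left(Y_i - \frac{A_Y}{n}\right) &= \sum_{i=1}^{n} X_i Y_i - \frac{A_Y}{n} \sum_{i=1}^{n} X_i - \frac{A_X}{n} \sum_{i=1}^{n} Y_i + n \cdot \frac{A_X A_Y}{n^2} \\
&= D - \frac{A_X A_Y}{n} - \frac{A_X A_Y}{n} + \frac{A_X A_Y}{n} \\
&= \frac{nD - A_X A_Y}{n}
\end{aligned}$$

Substituting, we get:

$$\mathit{PMCC}(X,Y) = \frac{nD - A_X A_Y}{\sqrt{(n C_X - B_X)(n C_Y - B_Y)}}$$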

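Given two populations or samples *X* = {*X*_{1},*X*_{2},...,*X*_{n}} and
*Y* = {*Y*_{1},*Y*_{2},...,*Y*_{n}} (and subject to some assumptions about the distributions of the data), the Pearson product-moment correlation coefficient of the two is given by:

$$\mathit{PMCC}(X,Y) = \frac{nD - A_X A_Y}{\sqrt{(n C_X - B_X)(n C_Y - B_Y)}}$$

where the following variables are defined:

$$A_X = \sum_{i=1}^{n} X_i, \qquad A_Y = \sum_{i=1}^{n} Y_i$$

$$B_X = A_X^2, \qquad B_Y = A_Y^2$$

$$C_X = \sum_{i=1}^{n} X_i^2, \qquad C_Y = \sum_{i=1}^{n} Y_i^2$$

$$D = \sum_{i=1}^{n} X_i Y_i$$

Alternatively, we can use the expanded form (which is obtained by applying the previous substitutions):

$$\mathit{PMCC}(X,Y) = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{\left(n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2\right) \left(n \sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2\right)}}$$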
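The expanded form maps fairly directly onto a single loop. The following is a minimal sketch rather than a hardened implementation: the function name `pmcc` and the variable names are my own, and it makes no attempt to guard against the loss of precision that the subtractions can cause when the values have a large mean relative to their spread.

```python
import math


def pmcc(xs, ys):
    """Single-pass Pearson correlation using the expanded formula.

    (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx**2) * (n*Syy - Sy**2))
    """
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")

    n = len(xs)
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in zip(xs, ys):
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) *
                            (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        # At least one input has zero variance, so the ratio is undefined.
        raise ValueError("correlation is undefined for constant input")
    return numerator / denominator


# For this data the sums are 15, 15, 55, 55 and 53,
# giving (5*53 - 225) / sqrt(50 * 50) = 40 / 50.
print(pmcc([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```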

When computing the sum of several floating-point values that vary widely in magnitude, you can often obtain a better approximation (less round-off error) by sorting the terms in ascending order of absolute value before adding them together. That way, the smaller numbers are added together before being added to larger numbers, rather than being rounded away immediately. The Wikipedia article on numerical stability has a bit more information about this.
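Here is a small demonstration of the effect, using a deliberately extreme list of values that I made up for the purpose. The two helper functions are hypothetical names for this sketch. It uses explicit loops rather than the built-in `sum`, because newer Python versions apply their own error compensation to float sums. As a different technique from sorting, the standard library's `math.fsum` tracks the lost low-order bits and is shown for comparison.

```python
import math


def sum_in_order(values):
    """Plain left-to-right floating-point accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_ascending_magnitude(values):
    """Add the terms from smallest to largest absolute value."""
    total = 0.0
    for v in sorted(values, key=abs):
        total += v
    return total


# One huge term followed by ten small ones. Near 1e16 adjacent doubles
# are 2 apart, so adding 1.0 to 1e16 rounds straight back to 1e16.
values = [1e16] + [1.0] * 10

print(sum_in_order(values))             # the ten 1.0 terms are lost
print(sum_ascending_magnitude(values))  # small terms combine to 10.0 first
print(math.fsum(values))                # correctly rounded sum
```

Sorting costs O(*n* log *n*) and needs all the terms up front, so for streaming data something like `math.fsum` or a compensated running sum may be the more practical choice.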