Average and median via optimization

Posted on June 8, 2021

This is certainly not a new observation, and I’m sure it is folklore and/or contained in various textbooks, but it was amusing to derive it independently.

Suppose we have a finite set of real numbers \(\{x_1, \dots, x_n\}\), and we want to pick a value \(m\) which is somehow “in the middle” of the \(x_i\). The punchline is that picking \(m\) to be the average (mean) of the \(x_i\) minimizes the sum of squared distances from \(m\) to the \(x_i\), whereas picking \(m\) to be the median minimizes the sum of absolute distances.

The first of these is tricky to understand intuitively but easy to derive; the second is intuitively straightforward but trying to derive it leads to an interesting twist.

Average = minimizing sum of squared distances

Let’s not worry about why we would want to minimize the sum of squared distances; there are good reasons and it’s not the point. I don’t know about you, but I find it difficult to reason intuitively about how and why to pick \(m\) to minimize this sum of squared differences. If you know of an intuitive way to explain this, I would love to hear about it! But in any case, it is easy to derive using some straightforward calculus.

Let \(\displaystyle S(m) = \sum_i(m - x_i)^2\) denote the sum of squared distances from a given \(m\) to each of the \(x_i\). Taking the derivative of \(S\) with respect to \(m\), we find

\(\displaystyle \frac{d}{dm} S(m) = \sum_i 2(m - x_i)\).

Setting the derivative equal to zero, we can first divide through by the factor of 2, yielding

\(\displaystyle 0 = \sum_i (m - x_i)\)

Since \(m\) does not depend on \(i\), the sum on the right is just \(n\) copies of \(m\) minus the sum of the \(x_i\), that is, \(0 = nm - \sum_i x_i\). Hence, solving for \(m\) yields

\(\displaystyle m = \frac{1}{n} \sum_i x_i\)

as expected: the value of \(m\) which minimizes the sum of squared distances to the \(x_i\) is their average, that is, the sum of the \(x_i\) divided by the size of the set.
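
Just to double-check this numerically, here is a small Haskell sketch (the helper names sqDist and mean are mine, chosen only for this example): sqDist computes the sum of squared distances, and evaluating it at the mean and at nearby values shows the minimum sitting at the mean.

    -- Sum of squared distances from m to each of the xs.
    sqDist :: [Double] -> Double -> Double
    sqDist xs m = sum [ (m - x)^2 | x <- xs ]

    -- The average (arithmetic mean) of the xs.
    mean :: [Double] -> Double
    mean xs = sum xs / fromIntegral (length xs)

    -- ghci> mean [1,2,4,7,11]
    -- 5.0
    -- ghci> map (sqDist [1,2,4,7,11]) [4, 5, 6]
    -- [71.0,66.0,71.0]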

Median = minimizing sum of absolute distances

Now suppose we want to minimize the sum of absolute distances instead, that is,

\(S(m) = \sum_i |m - x_i|\)

In this scenario, it is much easier to reason out the correct answer. Start with some arbitrary \(m\), and imagine nudging it by some small amount \(\Delta x\), say, to the right. \(m\)’s distances to any points on its left will each increase by \(\Delta x\), and its distances to any points on its right will each decrease by the same amount. Therefore, if there are more \(x_i\) to the left of \(m\), then the overall sum of distances will increase; if there are more \(x_i\) to the right, then the overall sum will decrease. So, to find the \(m\) which minimizes the sum of absolute differences, we want the same number of \(x_i\) on the left and the right, that is, we want the median. Note that if \(n\) is odd, then we must pick \(m\) to be exactly equal to the \(x_i\) in the middle; if \(n\) is even, then we can pick \(m\) to be anywhere inside the interval between the middle two \(x_i\).
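
The same kind of numeric sanity check works here. A minimal Haskell sketch, with absDist and median as my own helper names (this median simply takes the middle element of the sorted list):

    import Data.List (sort)

    -- Sum of absolute distances from m to each of the xs.
    absDist :: [Double] -> Double -> Double
    absDist xs m = sum [ abs (m - x) | x <- xs ]

    -- One valid median: the middle element of the sorted list.  (For
    -- even n this picks the upper of the two middle elements, which is
    -- still an endpoint of the minimizing interval.)
    median :: [Double] -> Double
    median xs = sort xs !! (length xs `div` 2)

    -- ghci> median [1,2,4,7,11]
    -- 4.0
    -- ghci> map (absDist [1,2,4,7,11]) [2, 4, 7]
    -- [17.0,15.0,18.0]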

Just for fun, can we derive this answer using calculus, like we did for minimizing squared differences? There is a wrinkle, of course, which is that the absolute value function is not differentiable everywhere: it has a sharp corner at zero. But we won’t let that stop us! Clearly the derivative of \(|x|\) is \(-1\) when \(x < 0\) and \(1\) when \(x > 0\). So it seems reasonable to just assign the derivative a value of \(0\) at \(x = 0\). Algebraically, we can define

\(\displaystyle \frac{d}{dx} |x| = [x > 0] - [x < 0]\)

where \([P]\) is equal to \(1\) when the proposition \(P\) is true, and \(0\) when it is false (this notation is called the Iverson bracket). So when \(x > 0\) we get \([x > 0] - [x < 0] = 1 - 0 = 1\); when \(x < 0\) we get \(0 - 1 = -1\); and when \(x = 0\) both propositions are false so we get \(0 - 0 = 0\).
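
This definition transcribes directly into Haskell; the names iverson and dAbs below are my own, and dAbs is really just the signum function in disguise:

    -- Iverson bracket: 1 if the proposition holds, 0 otherwise.
    iverson :: Bool -> Double
    iverson True  = 1
    iverson False = 0

    -- Derivative of |x|, with the value 0 assigned at x = 0.
    dAbs :: Double -> Double
    dAbs x = iverson (x > 0) - iverson (x < 0)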

Armed with this definition, we can differentiate \(S\) with respect to \(m\):

\(\displaystyle \frac{d}{dm} S(m) = \frac{d}{dm} \sum_i |m - x_i| = \sum_i [m > x_i] - \sum_i [m < x_i]\)

Clearly, this is zero when \(\displaystyle \sum_i [m > x_i] = \sum_i [m < x_i]\), that is, when there are the same number of \(x_i\) on either side of \(m\).
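
In code, this derivative is just a difference of two counts. Here is a small self-contained Haskell sketch (again with names of my own choosing), evaluated around the median of a sample list:

    -- Derivative of S(m) = sum_i |m - x_i| under the convention above:
    -- the number of x_i strictly below m, minus the number strictly above.
    dS :: [Double] -> Double -> Double
    dS xs m = fromIntegral (length (filter (< m) xs) - length (filter (> m) xs))

    -- ghci> map (dS [1,2,4,7,11]) [3, 4, 5]
    -- [-1.0,0.0,1.0]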

The curious thing to me is that even though the derivative of \(|x|\) is undefined when \(x = 0\), it seems like it “wants” to be 0 here. In general, if we assign the value \(k\) to the derivative at \(x = 0\), the derivative of \(S\) becomes

\(\displaystyle \frac{d}{dm} S(m) = \sum_i [m > x_i] + k \sum_i [m = x_i] - \sum_i [m < x_i]\)

When \(k\) is nonzero and \(n\) is odd, there are no values of \(m\) for which this derivative is zero, making it more difficult to find the minimum.
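
For instance, take \(\{x_i\} = \{1, 2, 3\}\) and \(k = 1\). At the median \(m = 2\) the derivative is \([2 > 1] + 1 \cdot [2 = 2] - [2 < 3] = 1 + 1 - 1 = 1\), and for any \(m\) not equal to one of the \(x_i\) the two counts \(\sum_i [m > x_i]\) and \(\sum_i [m < x_i]\) add up to the odd number \(3\), so they can never be equal. The derivative is therefore nowhere zero, even though \(S\) certainly attains its minimum at \(m = 2\).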