The Gini coefficient for distribution inequality

Posted by David Zaslavsky on November 25, 2012 11:41 PM — Edited January 28, 2013 8:14 PM

As we go out shopping for gifts this holiday season, given the state of the economy, a lot of people will be thinking about how to get the best value from their gift budget. A lot more people than usual, in fact, because as you’ll hear on TV or read online from time to time, the income gap in this country is exceptionally large.

It’s probably common knowledge that a large income gap means, roughly, a large difference between the richest and poorest income levels. But that’s not a very precise statement by itself. Suppose you have two tiny countries of six people each, and their incomes are distributed like this:

Omnomnomia	Lolistan
$15,093	$15,093
$21,259	$29,947
$27,425	$55,508
$33,591	$55,508
$57,129	$81,069
$95,923	$95,923

The difference between richest and poorest is the same in both countries, but the other values are significantly higher in Lolistan. We need to calculate something that takes into account everyone’s income, not just the extremes.

OK, how about the standard deviation? That’s the usual way to characterize how widely a bunch of numbers are distributed.

	Omnomnomia	Lolistan
Standard deviation	$27,608.81	$27,608.80

Huh. They’re the same. While you can’t argue with the math, it does suggest that the interpretation may not be right: Omnomnomia just intuitively seems to have a larger income gap than Lolistan, so perhaps standard deviation isn’t the right metric to measure that.
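To check the numbers in that table, here is a minimal sketch using only Python's standard library. The variable names are just labels for the two income lists above, and the sketch assumes the table reports the population standard deviation (dividing by $n$ rather than $n - 1$), which is what `statistics.pstdev` computes.

```python
from statistics import pstdev

# Incomes copied from the first table
omnomnomia = [15093, 21259, 27425, 33591, 57129, 95923]
lolistan = [15093, 29947, 55508, 55508, 81069, 95923]

# Population standard deviation (divides by n, not n - 1)
sd_omnomnomia = pstdev(omnomnomia)
sd_lolistan = pstdev(lolistan)

print(f"Omnomnomia: ${sd_omnomnomia:,.2f}")
print(f"Lolistan:   ${sd_lolistan:,.2f}")
```

Under that assumption the two results agree with the table to within a cent, even though the incomes in between the extremes are quite different.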

Fortunately, probability theory provides an answer. The Gini coefficient is a precise mathematical way to quantify the inequality of a distribution. It allows you to place any given distribution on a spectrum from many copies of the same value ($G = 0$) to one nonzero value and the rest zero ($G \to 1$ as the number of values grows; for a finite set, the formula below tops out at $1 - 1/n$). For a discrete set of $n$ values (like incomes), you can calculate it using the formula

$$G = 1 + \frac{1}{n} - \frac{2 \sum_k r_k x_k}{n\sum_k x_k}$$

where $x_k$ is the value and $r_k$ is its rank in decreasing order. The largest $x_k$ has a rank of $r_k = 1$, the second largest has $r_k = 2$, and so on. Tied values can take their ranks in either order, since swapping them doesn’t change the sum.

Rank	Income (Omnomnomia)	Income (Lolistan)
6	$15,093	$15,093
5	$21,259	$29,947
4	$27,425	$55,508
3	$33,591	$55,508
2	$57,129	$81,069
1	$95,923	$95,923

	Omnomnomia	Lolistan
Gini coefficient	0.345	0.279

Finally, this number shows a distinct difference between the two countries! As expected, Omnomnomia’s income gap, as measured by the Gini coefficient, is larger.
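The rank formula is short enough to try by hand, but here is a small sketch of it in plain Python for anyone who wants to play with the numbers. The function name `gini_by_rank` is made up for this example, and it assumes the incomes are non-negative with a nonzero total.

```python
def gini_by_rank(incomes):
    """Gini coefficient via the rank formula.

    Assumes non-negative incomes with a nonzero sum.
    """
    n = len(incomes)
    total = sum(incomes)
    # Sort in decreasing order so the largest income gets rank 1
    ranked = sorted(incomes, reverse=True)
    weighted = sum(rank * x for rank, x in enumerate(ranked, start=1))
    return 1 + 1 / n - 2 * weighted / (n * total)


omnomnomia = [15093, 21259, 27425, 33591, 57129, 95923]
lolistan = [15093, 29947, 55508, 55508, 81069, 95923]

print(round(gini_by_rank(omnomnomia), 3))
print(round(gini_by_rank(lolistan), 3))

# The two ends of the spectrum for six people
equal = gini_by_rank([100] * 6)                  # everyone identical
one_earner = gini_by_rank([0, 0, 0, 0, 0, 100])  # a single nonzero income
print(equal, one_earner)
```

The first two values should match the table. The last line shows the extremes: identical incomes give zero (up to floating-point rounding), while a single earner among six gives $1 - 1/6$ rather than exactly 1, because the formula is being applied to a finite set.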

I like this calculation because there are a couple of pretty intuitive (I think) ways to understand it that explain why it’s a good measure of inequality. First, the Gini coefficient is half of the relative mean difference, which in turn is just the average absolute difference between two values, divided by the mean of all the values. So overall, the larger the differences between values in a set of numbers (relative to their typical size), the larger its Gini coefficient will be.
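As a sanity check on the mean-difference interpretation, the sketch below computes the coefficient the slow way, straight from pairwise differences. The name `gini_by_mean_difference` is hypothetical, and the sketch assumes the average is taken over all $n^2$ ordered pairs, including each value paired with itself, which is the convention that agrees with the rank formula above.

```python
from itertools import product


def gini_by_mean_difference(incomes):
    """Half the relative mean difference; assumes a nonzero mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    # Average absolute difference over all n*n ordered pairs (i == j included)
    mean_abs_diff = sum(abs(a - b) for a, b in product(incomes, repeat=2)) / (n * n)
    # Dividing by the mean makes the measure independent of the currency unit
    relative_mean_diff = mean_abs_diff / mean
    return relative_mean_diff / 2


omnomnomia = [15093, 21259, 27425, 33591, 57129, 95923]
lolistan = [15093, 29947, 55508, 55508, 81069, 95923]

print(round(gini_by_mean_difference(omnomnomia), 3))
print(round(gini_by_mean_difference(lolistan), 3))
```

This takes time quadratic in the number of values, so it is only meant as an illustration. The rank-based formula gets the same answer with a single sort.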

Another way to interpret the coefficient, which makes more sense for continuous distributions, is in terms of the area under the Lorenz curve: the Gini coefficient is the amount by which that area falls short of the maximum possible area, as a fraction of that maximum.

The Lorenz curve is the plot of the function that tells you what fraction of the total value the bottom so-many percent of the population holds. For example, if you hear a statistic on the news saying the bottom 20% of people in America earn only 1% of the income, that means the value of the Lorenz curve at $x = 0.2$ is $L(0.2) = 0.01$. If everyone has the same income, like in a perfect socialist society, then the bottom x% of the population will always have x% of the income, which would give the maximum possible area under the Lorenz curve and therefore $G = 0$.
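For a discrete set of incomes, the Lorenz curve is a series of straight segments, so the area picture can be checked directly. The sketch below is one way to do it, with hypothetical helper names `lorenz_points` and `gini_from_lorenz`. It assumes non-negative incomes with a nonzero total, and it uses the fact that the line of perfect equality, $L(x) = x$, has area $1/2$ under it, so the shortfall relative to that maximum is $1 - 2 \times \text{area}$.

```python
def lorenz_points(incomes):
    """(population share, income share) pairs, starting at (0, 0)."""
    ordered = sorted(incomes)  # poorest first
    n = len(ordered)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for k, x in enumerate(ordered, start=1):
        running += x
        points.append((k / n, running / total))
    return points


def gini_from_lorenz(incomes):
    points = lorenz_points(incomes)
    # Trapezoid rule: exact here because the curve is piecewise linear
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    # Shortfall from the maximum area of 1/2, as a fraction of that maximum
    return 1 - 2 * area


omnomnomia = [15093, 21259, 27425, 33591, 57129, 95923]
lolistan = [15093, 29947, 55508, 55508, 81069, 95923]

print(round(gini_from_lorenz(omnomnomia), 3))
print(round(gini_from_lorenz(lolistan), 3))
```

For these piecewise-linear curves the result works out to be algebraically the same as the rank formula, so both countries should give the same coefficients as before.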

Bonus: enjoy this Python function which calculates the Gini coefficient of a NumPy ndarray:

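import numpy as np

def gini_coeff(x):
    # requires x to be a 1-D NumPy ndarray whose values are all zero or
    # positive numbers (with a nonzero sum), otherwise results are undefined
    n = len(x)
    s = x.sum()
    # argsort of argsort gives each element's zero-based rank in decreasing
    # order: the largest value gets rank 0
    r = np.argsort(np.argsort(-x))
    # same as the formula above with the one-based rank r_k = r + 1 substituted
    return 1 - (2.0 * (r*x).sum() + s)/(n*s)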