Goodhart's Law & Data Science

Much of data science is optimizing a metric, so it's no surprise that Goodhart's Law crops up from time to time in data science projects.

I first came across Goodhart's Law when I was looking for resources to help me understand more about metrics. There were several good reads, & most of them mention Goodhart's Law. This post is my attempt at organizing my thoughts on it (with sprinkles of reflection from my past & current work here & there).

What is Goodhart’s Law?

Goodhart's Law, named after British economist Charles Goodhart, is most often quoted in the generalized phrasing of anthropologist Marilyn Strathern:

“When a measure becomes a target, it ceases to be a good measure.”

The most often cited example is probably the cobra story, said to have taken place in British India: there were a lot of cobras in Delhi, so the government offered a bounty for every dead cobra. People went after the bounty, a number of snakes were killed, & there were fewer cobras. Then people realized they could simply breed cobras, kill them, & bring the dead cobras to the government for the money. Once the government became aware of this, they scrapped the program, the (intentionally bred) cobras were released, & Delhi ended up with even more cobras than before.

If you think about it, you're probably already aware of some other examples; you just might not have known that there's a law describing them, & that it's called Goodhart's Law:

  • We have long used exam scores to measure intelligence. Expectation: students actually study & internalize whatever they're studying. Reality: students end up memorizing things just well enough to last until the day of the exam. Consequence: exam scores are no longer useful for measuring a student's intelligence.
  • Universities prefer applicants from schools with better national exam scores. Expectation: schools improve their teaching quality. Reality: some schools end up gaming the exams (e.g. allowing cheating) so that they're considered good schools & are preferred by universities. Consequence: exam scores no longer tell you the quality of a school.
  • A news website has a success metric of X page views per day. Expectation: its writers write engaging, high-quality stories. Reality: they write sensationalist stories that may or may not be true. Consequence: due to the low-quality stories, the news website's brand is tarnished & people no longer take it seriously.

What it means for data science

Much of data science is optimizing a metric, from picking the model with the lowest error rate to minimizing cost in model training. Thus it's no surprise that we come across the effects of Goodhart's Law at various points in the data science process.

Say that we've deployed our model: yay! Our model does not perform as well as it did on our test data: sad face. There are a few possibilities:

  1. We didn’t have enough training data
  2. There are external changes (e.g. changing market conditions)
  3. The behavior of the data is altered due to our model’s presence

Point (3) is where Goodhart’s Law comes into play.

Regarding (3): we need to be aware of our model's feedback loops & watch out for them. I think it's important to document your model's feedback loops early on, from the loops we deliberately design to collect feedback & improve the model, to negative feedback loops we never intended. Think of any possible ways that users' behaviors (or whatever it is you're building your model for) might change once our model is in place; the sketch below shows one simple way to check for such a shift.
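
As a rough illustration, here is a minimal sketch of that kind of check, assuming we can sample the same behavioral signal before & after deployment. The data here is synthetic & the signal (say, session length per user) is a hypothetical placeholder; a two-sample Kolmogorov-Smirnov test then flags whether its distribution has shifted:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-in for a behavioral signal (e.g. session length
# per user) sampled before & after the model went live.
rng = np.random.default_rng(42)
pre_deploy = rng.lognormal(mean=3.0, sigma=0.5, size=5000)
post_deploy = rng.lognormal(mean=3.2, sigma=0.5, size=5000)

# A two-sample KS test compares the two empirical distributions.
# A small p-value suggests the signal has shifted since deployment,
# a hint that behavior may be reacting to the model itself.
stat, p_value = ks_2samp(pre_deploy, post_deploy)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.4g}")
if p_value < 0.01:
    print("Distribution shift detected; investigate possible feedback loops.")
```

A shift flagged this way doesn't prove the model caused it (possibility (2), external changes, can produce the same signature), but it tells you where to start looking.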

At this point, maybe Goodhart's Law is inevitable: people will adjust their behaviors somewhat. Despite that, it's still important to be aware of it (e.g. expect the performance on production data to be worse) & make plans for it.

Dealing with Goodhart’s Law in data science

The following are my takeaways:

  • We need to understand that a metric is merely a proxy. It should not trump common sense & your understanding of the data.
  • Resist the temptation of “the one true metric that rules all”. Consider using several metrics instead of a single metric. Having both Precision & Recall, depending on the use case, is one way to mitigate the shortcomings of Accuracy (see the first sketch after this list). Another example is having bounce rate as a supplement to CTR. Depending on your data & use case, you may want to extend the dimensionality further too.

    • I think it's important to think about this from the start, even in the development phase, because it can impact a) the granularity of data you need to process & track, & b) how you pick the best model.
    • A single metric may be tempting because it's easy to explain, & it's important to have metrics that are easy to explain to others. However, if you insist on using a single metric, you may be hiding important dimensions that people care about. A possible solution is to have a couple of important metrics that you often cite to stakeholders, while keeping track of more metrics in the background (e.g. in an internal monitoring dashboard).
    • Another argument for multiple metrics is that they are harder to game all at once.
  • Pair quantity measurements with quality measurements. I think much of this goes back to using common sense and also engaging with the users to understand what they actually care about.
  • Invest in good monitoring &, if possible, explainable models. In some cases Goodhart's Law might be inevitable, but even then it's important to be able to track how much drift is plausibly caused by it (good monitoring helps here) & which parts of your model are affected (an explainable model helps here). These two pieces of information can inform our next steps: do we need to refresh our model? How often? Are there particular features that contribute too much to the (perhaps negative) feedback loop, & does it make sense to keep them? The second sketch below shows one simple way to quantify drift.
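
To make the multiple-metrics point concrete, here's a small sketch with made-up labels showing how Accuracy alone can look reassuring on imbalanced data while Precision & Recall tell a different story:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up, heavily imbalanced ground truth: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A lazy model that predicts the majority class almost everywhere,
# catching only one of the five positives.
y_pred = [0] * 95 + [1] + [0] * 4

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.96, looks great
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 1.00, no false alarms
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.20, misses most positives
```

If "positive" here means something costly to miss (fraud, churn, disease), that 96% accuracy is exactly the kind of single number that hides a dimension people care about.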
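
And as one way to operationalize the monitoring point, the sketch below computes a Population Stability Index (PSI) between training-time & production score distributions. The scores are synthetic, & the thresholds in the final comment are common rules of thumb rather than universal constants:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline sample (e.g. training-time scores)
    and a production sample of the same quantity."""
    # Bin edges come from quantiles of the baseline distribution.
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    # Clip both samples into the baseline's range so out-of-range
    # production values still land in the edge bins.
    expected = np.clip(expected, edges[0], edges[-1])
    actual = np.clip(actual, edges[0], edges[-1])
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking the log.
    exp_frac = np.clip(exp_frac, 1e-6, None)
    act_frac = np.clip(act_frac, 1e-6, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

# Hypothetical model scores at training time vs. in production.
rng = np.random.default_rng(0)
train_scores = rng.beta(2, 5, size=10000)
prod_scores = rng.beta(2.5, 5, size=10000)

psi = population_stability_index(train_scores, prod_scores)
print(f"PSI: {psi:.3f}")
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major shift.
```

Tracked per feature as well as on the model's output, a number like this helps separate "the world changed" from "the world changed because of us".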

References

The following resources helped me a lot to understand Goodhart’s Law & metrics in general:
