Goodhart's Law & Data Science

— 8 minute read

I first came across Goodhart's Law when I was looking for resources to help me understand more about metrics. There were several good reads & most of them mention Goodhart's Law. This post is my attempt to organize my thoughts on it (with sprinkles of reflection from my past & current work here & there).

What is Goodhart's Law?

Goodhart's Law, named after British economist Charles Goodhart, in the phrasing of anthropologist Marilyn Strathern:

"When a measure becomes a target, it ceases to be a good measure."

The most often cited example is probably the cobra story said to have occurred in British India: there were a lot of cobras in Delhi, so the government offered a bounty for every dead cobra. People killed snakes for the bounty, and at first there were fewer cobras. Then people realized they could simply breed cobras, kill them, and bring the dead cobras to the government for the money. When the government became aware of this, it scrapped the program, and as the (intentionally bred) cobras were released, Delhi ended up with even more cobras than before.

If you think about it, you can probably come up with other examples yourself; you're just not aware that there is a law describing them, called Goodhart's Law:

  • We have long used exam scores to measure intelligence. Expectation: students actually study & internalize what they're studying. Reality: students study to memorize things just well enough to last until the day of the exam. Consequence: exam scores are no longer useful for measuring a student's intelligence.
  • Universities prefer schools with higher national exam scores. Expectation: schools improve teaching quality. Reality: some schools end up gaming the exams (e.g. by allowing cheating) so that they're considered good schools & are preferred by universities. Consequence: exam scores no longer tell you the quality of a school.
  • A news website has a success metric of X page views per day. Expectation: its writers write engaging, high-quality stories. Reality: they write sensationalist stories that may or may not be true. Consequence: due to the low-quality stories, the news website's brand is tarnished, & people no longer take it seriously.

What it means for data science

Much of data science is optimizing a metric, from picking the model with the lowest error rate to minimizing a cost function in model training. It's no surprise, then, that we run into Goodhart's Law at various points in the data science process.

Hyperparameter tuning & choosing the best model

Say we're doing multi-class classification with cross-validation and grid search for hyperparameter tuning. We optimize for a single metric (e.g. accuracy). Our script gives us the model whose parameters have the best mean accuracy across all folds. We celebrate: we've got our best model! Then we check the confusion matrix, and it turns out the categories we care about the most are not performing well. This is not reflected in the metric we chose to select the best parameters. The metric hides the fact that our model severely underperforms on the labels we care about, & the "best" hyperparameters that grid search has chosen for us may not be the best hyperparameters after all.
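
To make the scenario concrete, here's a minimal sketch using scikit-learn; the toy dataset, model, & parameter grid are all illustrative stand-ins:

```python
# A minimal sketch of the scenario above (toy dataset, model, & grid
# are illustrative stand-ins, not a recommendation).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Grid search optimizes a single number: mean accuracy across folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "n_estimators": [10, 100]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)
print("Best mean CV accuracy:", search.best_score_)

# The confusion matrix can tell a different story: per-class rows may
# reveal that the labels we care about most still perform poorly.
print(confusion_matrix(y_test, search.predict(X_test)))
```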

To avoid going back-and-forth on this (though perhaps sometimes it's inevitable), a good suggestion I came across is to define what you want to measure independently of the standard evaluation metrics (e.g. accuracy, precision, recall, etc.). This helps us decide what we actually want to measure before diving head first into evaluation metrics. With Goodhart's Law in mind, we might realize that a single metric is not enough to determine which model is best. If that's the case, we probably don't want to blindly automate hyperparameter tuning (e.g. grid searching everything and taking whatever comes out on top).
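
If more than one metric matters, we can at least keep them all visible during tuning. A small sketch, assuming scikit-learn and illustrative choices of dataset, model, & metrics:

```python
# GridSearchCV can score several metrics at once; `refit` decides which
# one picks the "best" estimator, while the rest stay visible in
# cv_results_ for manual inspection. Choices below are illustrative.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring={"acc": "accuracy", "f1_macro": "f1_macro"},
    refit="f1_macro",  # or refit=False to pick the model by hand
    cv=5,
).fit(X, y)

# Compare the metrics side by side before committing to a "best" model.
for name in ("acc", "f1_macro"):
    print(name, search.cv_results_[f"mean_test_{name}"])
```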

If we're not yet sure whether some dimensions of our data matter more than others, stop here, don't assume, & speak to our users. If we're building an internal tool, we should engage with our users to discover their pain points as well as which dimensions of the data matter most to them. If it's an external tool, or if there is a UX team that has done surveys or research on this, we can engage with them too.

Model drift

We've deployed our model: yay! Our model does not perform as well as it did on our test data: sad face. There are a few possibilities:

  1. We didn't have enough training data
  2. There are external changes (e.g. changing market condition)
  3. The behavior of the data is altered due to our model's presence

Point (3) is where Goodhart's Law comes into play: we need to be aware of our model's feedback loops & watch out for them. It's important to document these feedback loops early on, from the loops we deliberately design to collect feedback & improve the model, to negative feedback loops. Think of any possible way that users' behaviors (or the behavior of whatever it is we're modeling) might change once our model is in place.

At this point maybe Goodhart's Law is inevitable: people will adjust their behaviors somewhat. Despite that, it's still important to be aware of it (e.g. expect performance on production data to be worse) & make plans for it.
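
As one possible starting point for spotting such shifts, here's a minimal monitoring sketch, assuming we keep a reference sample of a feature from training time and periodically compare production data against it with a two-sample Kolmogorov-Smirnov test; the data & alert threshold are made up for illustration:

```python
# A minimal monitoring sketch: compare a production feature's
# distribution against a training-time reference sample with a
# two-sample Kolmogorov-Smirnov test. The data & alert threshold
# below are made up for illustration.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # reference
prod_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)   # shifted

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:  # hypothetical alert threshold
    print(f"Possible drift: KS statistic={stat:.3f}, p={p_value:.2e}")
```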

Dealing with Goodhart's Law in data science

From those two examples, the following are my takeaways:

  • We need to understand that a metric is merely a proxy. It should not trump common sense & your understanding of the data.
  • Resist the temptation of "the one true metric to rule them all". Consider using several metrics instead of a single one. You could say that having both precision and recall, depending on the use case, is one way to mitigate the shortcomings of accuracy. Another example is having bounce rate as a supplement to CTR. Depending on your data and use case, you may want to extend the dimensionality further.
    • I think it's important to think about this from the start, even in the development phase, because it can impact a) the granularity of the data you need to process & track, and b) how you pick the best model.
    • A single metric may be tempting because it's easy to explain, and metrics that are easy to explain to others do matter. However, if you insist on using a single metric, you may be hiding important dimensions that people care about. A possible solution is to have a couple of headline metrics that you often cite to stakeholders, while keeping track of more metrics in the background (e.g. an internal monitoring dashboard); a small sketch of this idea appears after this list.
    • Another argument for multiple metrics is that they are harder to game at once.
  • Pair quantity measurements with quality measurements. I think much of this goes back to using common sense and also engaging with the users to understand what they actually care about.
  • Invest in good monitoring &, if possible, explainable models. In some cases, Goodhart's Law might be inevitable. Even then, it's still important to be able to track how much drift is possibly caused by Goodhart's Law (good monitoring helps here) & which parts of your model are affected (an explainable model helps here). These two pieces of information can help inform our next steps: do we need to refresh our model? How often? Are there particular features that contribute too much to the (perhaps negative) feedback loop, & does it make sense to keep them?
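
To sketch the "cite a couple, track many" idea from above: a headline metric for stakeholders plus a wider bundle logged to an internal dashboard. The metric choices, helper name, & toy labels are purely illustrative:

```python
# A sketch of "cite a couple, track many": report a headline metric to
# stakeholders, but log a wider bundle for internal monitoring. The
# metric choices, helper name, & toy labels are purely illustrative.
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score
)

def metric_bundle(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # headline metric
        "precision_macro": precision_score(y_true, y_pred, average="macro"),
        "recall_macro": recall_score(y_true, y_pred, average="macro"),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 2, 1, 0]
print(metric_bundle(y_true, y_pred))
```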

References

The following resources helped me a lot to understand Goodhart's Law & metrics in general: