Evaluating Statistical Models: Why You are too Specific and Should be More Sensitive
Derek Tolbert
Published on Thu Apr 08 2021


Following the widespread availability of cloud computing resources, artificial intelligence has become a popular component of modern businesses. Due to this popularity, it is increasingly common to see a promising Twitter or LinkedIn article claiming impressive performance on a machine learning task. Frequently, these articles have a catchy title that follows the format, "So-and-so trained an AI model that can predict X with 99% accuracy!". While 99% accuracy can sound impressive, further statistical analysis may reveal concerning shortcomings that end users should be aware of.

An example

To demonstrate why accuracy can often be an inappropriate measurement of performance, let's quickly construct a model to detect if a person in the USA is infected with the flu. Here is the implementation of our model in Python.

from typing import Any

# Performs a complex statistical analysis to detect the presence of the common flu
def doesPersonHaveFlu(person: Any) -> bool:
    return False

It should be clear that our "model" is a bit concerning. No matter what data we feed in about the person, it simply returns False! But what happens when we evaluate our model's accuracy on the entire population of the USA? We can do some quick and dirty approximations to get an idea of how accurate our model might be. According to the CDC, the estimated annual prevalence of the flu between 2010-2020 is between 9.3 million and 45 million cases. With a population of ~331 million people in 2020, we get a prevalence of around 2.8% to 13.6%. If we assume that we tested every person in the USA with our model while they were actively sick, we would still end up with an accuracy between 86.4% and 97.2%. Realistically, these people were not all sick at the same time, and the model's accuracy would be even higher.
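These back-of-the-envelope numbers are easy to check with a few lines of Python. This is just a sketch of the arithmetic above; the population and prevalence figures are the rough approximations quoted, not exact CDC values.

```python
# Rough accuracy bounds for a model that always predicts False.
# Figures are the approximations quoted above, not exact CDC values.
US_POPULATION = 331_000_000
FLU_CASES_LOW, FLU_CASES_HIGH = 9_300_000, 45_000_000

def always_false_accuracy(population: int, sick: int) -> float:
    """Accuracy of a model that predicts False for every person."""
    return (population - sick) / population

low = always_false_accuracy(US_POPULATION, FLU_CASES_HIGH)   # worst case
high = always_false_accuracy(US_POPULATION, FLU_CASES_LOW)   # best case
print(f"accuracy between {low:.1%} and {high:.1%}")  # 86.4% and 97.2%
```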

The core problem here is an imbalanced dataset. At any point in time, there are far more people without the flu than there are with the flu. By always predicting the majority class (that a person does not have the flu), our accuracy metric remains high. However, this is not what most people have in mind when they think of a model that can accurately predict the flu. To better evaluate a model against an imbalanced dataset, we need to use different statistics.

Sensitivity and Specificity

Why so sensitive?

Statistically, sensitivity can be interpreted as the probability of correctly predicting True when the person actually has the flu. This can be computed in our flu example as follows

sensitivity = True_Positives / Total_People_With_Flu

If we plug in the numbers from our model and assume 45 million people had the flu in 2020, we get the following (note: we never predicted True, so we do not have any True_Positives):

sensitivity = 0.0 / 45000000.0

Our model has a sensitivity of 0. Ouch. Because sensitivity gives us the probability of correctly identifying a sick person (0%), we can see that our model will never correctly identify a sick person!

Could you be more specific?

Statistically, specificity can be interpreted as the probability of correctly predicting False when the person does not have the flu. We can compute it like so

specificity = True_Negatives / Total_People_Without_Flu

And if we plug in our numbers once again (note: we correctly predicted False for all ~286 million people who did not have the flu, i.e. 331 million minus the 45 million who did):

specificity = 286000000.0 / 286000000.0

Our model has a specificity of 1.0; it is good at something! Because specificity gives us the probability of correctly identifying a non-sick person (100%), we can now see that our model will never incorrectly tell someone that they have the flu.
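The two formulas above can be packaged into small helpers. This is a generic sketch (the function names are mine, not from a library), plugging in the counts from our always-False flu model:

```python
def sensitivity(true_positives: int, total_with_flu: int) -> float:
    """Probability of correctly predicting True for a sick person."""
    return true_positives / total_with_flu

def specificity(true_negatives: int, total_without_flu: int) -> float:
    """Probability of correctly predicting False for a healthy person."""
    return true_negatives / total_without_flu

# Our always-False model: zero true positives, every healthy person correct.
print(sensitivity(0, 45_000_000))             # 0.0
print(specificity(286_000_000, 286_000_000))  # 1.0
```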

Oftentimes it is necessary to bias models to keep the sensitivity and specificity in an acceptable balance. Common approaches for this include weighting the classes (i.e. categories) in the dataset, under-sampling the majority class, over-sampling the minority class, or generating synthetic data.

Putting it all together

As we have seen, metrics can sometimes be misleading when blindly evaluating statistical models. It is important that the development team and the end user collaborate on what the model will actually need to do once it is out in production. For some circumstances it may be acceptable to have a high sensitivity, and for others a high specificity may be desired. During this collaboration step, a clean, held-out test dataset should be agreed upon. By fixing this test set prior to development, it is easier to determine which metrics will best capture the desired outcome.

When comparing multiple models, it can often be difficult to evaluate which one is best using sensitivity and specificity alone. For these cases, it is often helpful to use an aggregated measure of performance such as AUC-ROC (my personal favorite!) or F1.
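For reference, F1 is the harmonic mean of precision and recall (recall is the same quantity as sensitivity). A quick sketch, assuming precision and recall have already been computed:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (a.k.a. sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Our always-False flu model has recall (sensitivity) of 0, so its F1 is 0
# despite its high accuracy.
print(f1_score(0.0, 0.0))  # 0.0
print(f1_score(0.9, 0.6))  # about 0.72
```

This is why F1 punishes our always-False model: a single zero in either precision or recall drags the harmonic mean to zero, no matter how good the other number is.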

We are a solutions architecture company. Our team is comprised of engineers, developers, and physicists with strong problem-solving and software development skills. We can provide a wide variety of services as required and do everything necessary to take an idea from its conceptual phase, to a prototype, and on to deployment! We'd love to talk about your specific needs and ideas. Contact us now!