Anomaly Detection: A Simple and Useful Approach
If you have an adequate sample of your data and combine it with the probabilistic modeling of the Gaussian curve, you get a powerful statistical inference tool that can help you in your daily work.
Why is it important? Let's look at some use cases:
- You may have measurements coming from devices and want to detect a defect or a change in location.
- You may want to monitor stock prices in the financial market to help decide the best time to buy or sell.
- You may want to detect fraud, or flag a heartbeat that is out of the ordinary.
The idea of this type of anomaly detection is to identify rare and different observations.
As you can see, we could list several use cases here. You can use it to help solve different problems in your context.
Anomaly detection — Practical use
A population can be thought of not only as a physical group of individuals, but also as the provider of the probability distribution for a random observation.
This is the foundation of the method: we take a population (or a representative sample), obtain its distribution, and use it to decide whether a newly collected observation belongs to that population or not.
Alert: The main premise of this method is that the distribution of the reference population or sample must be normal!
With this assumption met, we can follow the empirical rule, also called the 68–95–99.7 rule. The rule tells us that, for a normal distribution, there's a:
- 68 % chance a data point falls within 1 standard deviation of the mean
- 95 % chance a data point falls within 2 standard deviations of the mean
- 99.7 % chance a data point falls within 3 standard deviations of the mean
And we can draw all kinds of conclusions from this information. For example, if a new observation falls more than 2 or 3 standard deviations away from the mean, that observation may be an anomaly. The exact threshold depends on each person's context: in some settings it is more acceptable to have more false positives than false negatives, so a tighter cutoff makes sense.
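As a quick sanity check, here is a minimal sketch showing how these probabilities can be reproduced with scipy, assuming a standard normal distribution:

from scipy import stats

# Probability of falling within k standard deviations of the mean:
# P(|Z| <= k) = 2 * Phi(k) - 1 for a normal distribution
for k in (1, 2, 3):
    prob = 2 * stats.norm.cdf(k) - 1
    print(f"within {k} std: {prob:.4f}")  # ~0.6827, 0.9545, 0.9973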
In Python, this can be modeled as per the code below:
'''
Component to develop functions related to anomaly detection
'''
# import necessary packages
import numpy as np
from scipy import stats
def detect_outliers_iqr(data: np.ndarray, k: float = 1.5, return_thresholds: bool = False) -> np.ndarray:
    '''
    Detect outliers in a dataset using the interquartile range (IQR) method.
    Parameters:
    data (array-like): Input data to detect outliers from.
    k (float): Multiplier to control the outlier cutoff (default: 1.5).
    return_thresholds (bool): Whether to return the lower and upper bounds (default: False).
    Returns:
    outliers (array-like or tuple): Boolean mask of outliers or lower and upper bounds.
    '''
    # Calculate quartiles
    q25, q75 = np.percentile(data, [25, 75])
    # Calculate the IQR
    iqr = q75 - q25
    # Calculate the outlier cutoff
    cutoff = iqr * k
    # Calculate the lower and upper bounds
    lower_bound, upper_bound = q25 - cutoff, q75 + cutoff
    if return_thresholds:
        return lower_bound, upper_bound
    else:
        # Identify outliers
        outliers = np.logical_or(data < lower_bound, data > upper_bound)
        return outliers
class AnomalyTransformer:
    def __init__(self, data: np.ndarray):
        '''
        AnomalyTransformer class for outlier elimination.
        Parameters:
        data (array-like): Input data to be transformed.
        '''
        self.data = data
        self.transformed_data = None

    def fit_transform(self) -> np.ndarray:
        '''
        Fit the data and transform it using outlier elimination.
        '''
        # Eliminate outliers
        outliers = detect_outliers_iqr(self.data)
        data_with_nan = np.where(outliers, np.nan, self.data)
        data_without_nan = data_with_nan[~np.isnan(data_with_nan)]
        # Transformed data
        self.transformed_data = data_without_nan
        return self.transformed_data
class AnomalyDetector:
    def __init__(self, transformed_data: np.ndarray, mean: float, std: float, threshold: float):
        '''
        AnomalyDetector class for detecting anomalies based on transformed data and
        generating final reports to help decision making.
        Parameters:
        transformed_data (array-like): Transformed data for anomaly detection.
        mean (float): Mean of the transformed data.
        std (float): Standard deviation of the transformed data.
        threshold (float): Number of standard deviations from the mean beyond which a value is flagged.
        '''
        self.transformed_data = transformed_data
        self.mean = mean
        self.std = std
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        '''
        Check if a value is an anomaly based on the mean, standard deviation, and threshold.
        Parameters:
        value (float): The value to be checked.
        Returns:
        is_anomaly (bool): True if the value is an anomaly, False otherwise.
        '''
        # Flag values lying more than `threshold` standard deviations from the mean
        if value > (self.mean + self.threshold * self.std) or value < (self.mean - self.threshold * self.std):
            return True
        else:
            return False

    def anomaly_report(self, value: float) -> float:
        '''
        Generate a report for a specific value: its p-value.
        Parameters:
        value (float): The value to generate the report for.
        Returns:
        p_value (float): The p-value of the statistical test.
        '''
        # Calculate the z-score and the two-sided p-value
        z_score = (value - self.mean) / self.std
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
        return p_value
Let’s break down this code:
1. First, we clean outliers from our distribution using the interquartile range (IQR) method with the “detect_outliers_iqr” function.
2. Then we apply this outlier cleaning with the “AnomalyTransformer” class. This class exists so that we can perform transformations on the data; depending on its original distribution, it may be necessary to add further transformations, such as log or Box-Cox, so that your data becomes normally distributed if it is not already.
See the code example below, which adds a Box-Cox transformation.
class AnomalyTransformer:
    def __init__(self, data):
        '''
        AnomalyTransformer class for outlier elimination and data transformation using Box-Cox.
        Parameters:
        data (array-like): Input data to be transformed.
        '''
        self.data = data
        self.fitted_data = None
        self.fitted_lambda = None
        self.transformed_data = None

    def fit_transform(self):
        '''
        Fit the data and transform it using outlier elimination and Box-Cox transformation.
        '''
        # Eliminate outliers
        outliers = detect_outliers_iqr(self.data)
        data_with_nan = np.where(outliers, np.nan, self.data)
        data_without_nan = data_with_nan[~np.isnan(data_with_nan)]
        # Transform data with Box-Cox
        self.fitted_data, self.fitted_lambda = stats.boxcox(data_without_nan)
        self.transformed_data = self.fitted_data

    def transform_value(self, value):
        '''
        Transform a single value using the fitted Box-Cox transformation.
        Parameters:
        value: The value to be transformed.
        Returns:
        transformed_value: The transformed value.
        '''
        return stats.boxcox(value, self.fitted_lambda)
3. After that, we run the “AnomalyDetector” class, which simply compares the new observation with the distribution cleaned of outliers and/or transformed, to check whether this observation falls more than 2 or 3 standard deviations (for example) away from the mean and should therefore be considered an anomaly or not.
4. Finally, we can generate a report that computes the z-score and, from it, a p-value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. (A short end-to-end usage sketch follows this list.)
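Putting the pieces together, here is a minimal, hypothetical usage sketch. The reference sample, the seed, and the threshold of 3 standard deviations are made-up examples, and the sketch assumes the first (IQR-only) version of AnomalyTransformer:

import numpy as np

# Hypothetical reference sample (e.g., historical sensor readings)
rng = np.random.default_rng(42)
reference_data = rng.normal(loc=50.0, scale=5.0, size=1000)

# 1. Clean outliers from the reference data
transformer = AnomalyTransformer(reference_data)
transformer.fit_transform()
clean_data = transformer.transformed_data

# 2. Build the detector from the cleaned distribution (threshold in standard deviations)
detector = AnomalyDetector(
    transformed_data=clean_data,
    mean=np.mean(clean_data),
    std=np.std(clean_data),
    threshold=3.0,
)

# 3. Check a new observation
new_value = 72.0
print(detector.is_anomaly(new_value))      # True: more than 3 standard deviations away
print(detector.anomaly_report(new_value))  # two-sided p-value for the z-score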
The p-value is a measure of the incompatibility between the observed data and a null hypothesis. This gives us an extra layer of security when determining whether a new observation is an anomaly or not.
Speaking of hypothesis testing, we could also frame this method in terms of a null hypothesis and an alternative hypothesis, like this:
- H0: the data tested is not an anomaly
- H1: the data tested is an anomaly
We often create hypothesis tests to compare means or to check whether a given sample came from a given population. In this case we only have a single observation to compare, not a mean, which leaves room for further discussion.
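Under this framing, the decision rule is the usual one: reject H0 when the p-value falls below a chosen significance level. A minimal sketch, reusing the detector and new_value from the usage example above (the 0.05 level is just an assumed example):

ALPHA = 0.05  # significance level (assumed; choose according to your context)

p_value = detector.anomaly_report(new_value)
if p_value < ALPHA:
    print("Reject H0: the observation is likely an anomaly")
else:
    print("Fail to reject H0: no evidence that the observation is an anomaly")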
Well, that’s it! The final message I’d like to leave is:
- Check your assumptions and make it clear when this was not possible.
- Statistical methods should enable data to answer scientific questions: Ask “why am I doing this?” rather than focusing on which technique to use.
- Signals always arrive with noise: Trying to separate one from the other is what makes things interesting. The variability is significant, and probability models are useful as an abstraction.
- Worry about data quality: It all depends on the data.
- Statistical analysis is more than a set of computations: Don’t be content with plugging in formulas or running procedures in software without knowing why you’re doing it.
- Keep it simple: Main communication should be as basic as possible — don’t show off complex modeling skills unless it’s really necessary.
If you cannot get your data to be normally distributed, this method will not be very useful, as the results may be incorrect. In that case, try other methods, perhaps more complex ones, for which the normality premise does not need to hold.
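One way to check the normality assumption in practice is a Shapiro-Wilk test; here is a small sketch using scipy (the cleaned data from the earlier example and the 0.05 level are assumptions):

from scipy import stats

# Shapiro-Wilk test: H0 = the data comes from a normal distribution
stat, p_value = stats.shapiro(clean_data)
if p_value < 0.05:
    print("Normality rejected: consider another transformation or a different method")
else:
    print("No evidence against normality: the empirical-rule approach applies")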
PS: you may see this method called an outlier detector in other content. It is one of the most common and simplest methods, which, for me, is exactly what makes it so good at solving problems.
If you want to see a complete and well-documented project of this method being implemented in practice from end to end, follow this link.