Anomaly Detection: A Simple and Useful Approach
If you have an adequate sample of your data and combine it with the probabilistic modeling of the Gaussian curve, you get a powerful statistical inference tool that can help you in your daily work.
Why is it important? Let's look at some use cases:
- You may have measurements coming from devices and want to detect a defect or a change in location.
- You may want to monitor stock prices in the financial market to help decide the best time to buy or sell.
- You may want to detect fraud, or flag a heartbeat that is out of the ordinary.
The idea of this type of anomaly detection is to identify rare and different observations.
As you can see, we could list several use cases here. You can use it to help solve different problems in your context.
Anomaly detection — Practical use
A population can be thought of not only as a physical group of individuals, but also as the provider of the probability distribution for a random observation.
This is the foundation of the method: we take a population (or a representative sample), obtain its distribution, and use it to decide whether a newly collected observation belongs to that population or not.
Alert: The main premise of this method is that the distribution of the reference population or sample must be normal!
With this assumption met, we can follow the empirical rule, also called the 68–95–99.7 rule. The rule tells us that, for a normal distribution, there's a:
- 68 % chance a data point falls within 1 standard deviation of the mean
- 95 % chance a data point falls within 2 standard deviations of the mean
- 99.7 % chance a data point falls within 3 standard deviations of the mean
And we can draw all kinds of conclusions from this information. For example, if a new observation falls more than 2 or 3 standard deviations away from the mean, that observation may be an anomaly. The exact threshold depends on each person's context: in some settings it is more acceptable to have more false positives than false negatives, so a tighter cutoff makes sense.
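As a quick sanity check, here is a minimal sketch showing how these probabilities can be reproduced with scipy, assuming a standard normal distribution:

from scipy import stats

# Probability of falling within k standard deviations of the mean:
# P(|Z| <= k) = 2 * Phi(k) - 1 for a normal distribution
for k in (1, 2, 3):
    prob = 2 * stats.norm.cdf(k) - 1
    print(f"within {k} std: {prob:.4f}")  # ~0.6827, 0.9545, 0.9973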
In Python, this can be modeled as per the code below:
'''
Component to develop functions related to anomaly detection
'''
# import necessary packages
import numpy as np
from scipy import stats
def detect_outliers_iqr(data: np.ndarray, k: float = 1.5, return_thresholds: bool = False) -> np.ndarray:
    '''
    Detect outliers in a dataset using the interquartile range (IQR) method.
    Parameters:
    data (array-like): Input data to detect outliers from.
    k (float): Multiplier to control the outlier cutoff (default: 1.5).
    return_thresholds (bool): Whether to return the lower and upper bounds (default: False).
    Returns:
    outliers (array-like or tuple): Boolean mask of outliers or lower and upper bounds.
    '''
    # Calculate quartiles
    q25, q75 = np.percentile(data, [25, 75])
    # Calculate the IQR
    iqr = q75 - q25
    # Calculate the outlier cutoff
    cutoff = iqr * k
    # Calculate the lower and upper bounds
    lower_bound, upper_bound = q25 - cutoff, q75 + cutoff
    if return_thresholds:
        return lower_bound, upper_bound
    else:
        # Identify outliers
        outliers = np.logical_or(data < lower_bound, data > upper_bound)
        return outliers
class AnomalyTransformer:
    def __init__(self, data: np.ndarray):
        '''
        AnomalyTransformer class for outlier elimination.
        Parameters:
        data (array-like): Input data to be transformed.
        '''
        self.data = data
        self.transformed_data = None

    def fit_transform(self) -> np.ndarray:
        '''
        Fit the data and transform it using outlier elimination.
        '''
        # Eliminate outliers
        outliers = detect_outliers_iqr(self.data)
        data_with_nan = np.where(outliers, np.nan, self.data)
        data_without_nan = data_with_nan[~np.isnan(data_with_nan)]
        # Transformed data
        self.transformed_data = data_without_nan
        return self.transformed_data
class AnomalyDetector:
    def __init__(self, transformed_data: np.ndarray, mean: float, std: float, threshold: float):
        '''
        AnomalyDetector class for detecting anomalies based on transformed data and
        generating final reports to help decision making.
        Parameters:
        transformed_data (array-like): Transformed data for anomaly detection.
        mean (float): Mean of the transformed data.
        std (float): Standard deviation of the transformed data.
        threshold (float): Number of standard deviations from the mean beyond which a value is flagged.
        '''
        self.transformed_data = transformed_data
        self.mean = mean
        self.std = std
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        '''
        Check if a value is an anomaly based on the mean, standard deviation, and threshold.
        Parameters:
        value (float): The value to be checked.
        Returns:
        is_anomaly (bool): True if the value is an anomaly, False otherwise.
        '''
        # Flag values lying more than `threshold` standard deviations from the mean
        if value > (self.mean + self.threshold * self.std) or value < (self.mean - self.threshold * self.std):
            return True
        else:
            return False

    def anomaly_report(self, value: float) -> float:
        '''
        Generate a report for a specific value: its p-value.
        Parameters:
        value (float): The value to generate the report for.
        Returns:
        p_value (float): The p-value of the statistical test.
        '''
        # Calculate the z-score and the two-sided p-value
        z_score = (value - self.mean) / self.std
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
        return p_value
Let’s break down this code:
1. First, we clean outliers from our distribution using the interquartile range (IQR) method with the “detect_outliers_iqr” function.
2. Then we apply this outlier cleaning with the “AnomalyTransformer” class. This class exists so that we can perform transformations on the data; depending on its original distribution, it may be necessary to add further transformations, such as log or Box-Cox, so that your data becomes normally distributed if it is not already.
See the code example below, which adds a Box-Cox transformation.
class AnomalyTransformer:
    def __init__(self, data):
        '''
        AnomalyTransformer class for outlier elimination and data transformation using Box-Cox.
        Parameters:
        data (array-like): Input data to be transformed.
        '''
        self.data = data
        self.fitted_data = None
        self.fitted_lambda = None
        self.transformed_data = None

    def fit_transform(self):
        '''
        Fit the data and transform it using outlier elimination and Box-Cox transformation.
        '''
        # Eliminate outliers
        outliers = detect_outliers_iqr(self.data)
        data_with_nan = np.where(outliers, np.nan, self.data)
        data_without_nan = data_with_nan[~np.isnan(data_with_nan)]
        # Transform data with Box-Cox
        self.fitted_data, self.fitted_lambda = stats.boxcox(data_without_nan)
        self.transformed_data = self.fitted_data

    def transform_value(self, value):
        '''
        Transform a single value using the fitted Box-Cox transformation.
        Parameters:
        value: The value to be transformed.
        Returns:
        transformed_value: The transformed value.
        '''
        return stats.boxcox(value, self.fitted_lambda)
3. After that, we run the “AnomalyDetector” class, which simply compares the new observation with the distribution cleaned of outliers and/or transformed, to check whether this observation falls more than 2 or 3 standard deviations (for example) away from the mean and should therefore be considered an anomaly or not.
4. Finally, we can generate a report that computes the z-score and, from it, a p-value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. (A short end-to-end usage sketch follows this list.)
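Putting the pieces together, here is a minimal, hypothetical usage sketch. The reference sample, the seed, and the threshold of 3 standard deviations are made-up examples, and the sketch assumes the first (IQR-only) version of AnomalyTransformer:

import numpy as np

# Hypothetical reference sample (e.g., historical sensor readings)
rng = np.random.default_rng(42)
reference_data = rng.normal(loc=50.0, scale=5.0, size=1000)

# 1. Clean outliers from the reference data
transformer = AnomalyTransformer(reference_data)
transformer.fit_transform()
clean_data = transformer.transformed_data

# 2. Build the detector from the cleaned distribution (threshold in standard deviations)
detector = AnomalyDetector(
    transformed_data=clean_data,
    mean=np.mean(clean_data),
    std=np.std(clean_data),
    threshold=3.0,
)

# 3. Check a new observation
new_value = 72.0
print(detector.is_anomaly(new_value))      # True: more than 3 standard deviations away
print(detector.anomaly_report(new_value))  # two-sided p-value for the z-score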
The p-value is a measure of the incompatibility between the observed data and a null hypothesis. This gives us an extra layer of security when determining whether a new observation is an anomaly or not.
Speaking of hypothesis testing, we could also frame this method in terms of a null hypothesis and an alternative hypothesis, like this:
- H0: the data tested is not an anomaly
- H1: the data tested is an anomaly
We often create hypothesis tests to compare means or to check whether a given sample came from a given population. In this case we only have a single observation to compare, not a mean, which leaves room for further discussion.
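Under this framing, the decision rule is the usual one: reject H0 when the p-value falls below a chosen significance level. A minimal sketch, reusing the detector and new_value from the usage example above (the 0.05 level is just an assumed example):

ALPHA = 0.05  # significance level (assumed; choose according to your context)

p_value = detector.anomaly_report(new_value)
if p_value < ALPHA:
    print("Reject H0: the observation is likely an anomaly")
else:
    print("Fail to reject H0: no evidence that the observation is an anomaly")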
Well, that’s it! The final message I’d like to leave is:
- Check your assumptions and make it clear when this was not possible.
- Statistical methods should enable data to answer scientific questions: Ask “why am I doing this?” rather than focusing on which technique to use.
- Signals always arrive with noise: Trying to separate one from the other is what makes things interesting. The variability is significant, and probability models are useful as an abstraction.
- Worry about data quality: It all depends on the data.
- Statistical analysis is more than a set of computations: Don’t be content with plugging in formulas or running procedures in software without knowing why you’re doing it.
- Keep it simple: Main communication should be as basic as possible — don’t show off complex modeling skills unless it’s really necessary.
If you cannot get your data to be normally distributed, this method will not be very useful, as the results may be incorrect. In that case, try other methods, perhaps more complex ones, for which the normality premise does not need to hold.
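One way to check the normality assumption in practice is a Shapiro-Wilk test; here is a small sketch using scipy (the cleaned data from the earlier example and the 0.05 level are assumptions):

from scipy import stats

# Shapiro-Wilk test: H0 = the data comes from a normal distribution
stat, p_value = stats.shapiro(clean_data)
if p_value < 0.05:
    print("Normality rejected: consider another transformation or a different method")
else:
    print("No evidence against normality: the empirical-rule approach applies")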
PS: you may see this method called an outlier detector in other content. It is one of the most common and simplest methods, which, for me, is exactly what makes it so good at solving problems.
If you want to see a complete and well-documented project of this method being implemented in practice from end to end, follow this link.