Keeping 0.1% of the samples as outliers in HuberRegressor

Question

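When I run a Huber regression on a dataset with HuberRegressor() from sklearn.linear_model, I want 0.1% of all samples to be treated as outliers. As far as I know, HuberRegressor() controls the outlier threshold through the parameter epsilon, but epsilon cannot be given as a percentage, and my response variable is not normally distributed.

The result I am after looks like this: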

import numpy as np
from sklearn.linear_model import HuberRegressor

X = np.random.rand(100, 3)
y = np.random.rand(100)  # 1-D target, as fit() expects
model = HuberRegressor(epsilon=?, fit_intercept=True, alpha=0)  # which epsilon?
model.fit(X, y)

>>> model.outliers_.sum() / len(y)
0.001

Also, is there a general way to make this kind of adjustment for any model that uses the Huber loss, for example GLS + Huber or GBRT + Huber?

python machine-learning scikit-learn linear-regression
1 Answer

The only way I can see to do this is to estimate the right epsilon value with an iterative procedure.

In a toy example, that would look like this:

import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.datasets import make_regression

# Toy data: 200 samples, 2 features, Gaussian noise
X, y = make_regression(n_samples=200, n_features=2, noise=4.0, random_state=0)

def find_epsilon(X, y, target_percentage, epsilon=1, tolerance=0.001, max_iter=100):
    for _ in range(max_iter):
        model = HuberRegressor(epsilon=epsilon, fit_intercept=True, alpha=0)
        model.fit(X, y.ravel())

        # Calculate the percentage of outliers
        residuals = np.abs(y - model.predict(X).reshape(-1, 1))
        median_abs_deviation = np.median(residuals)
        outlier_mask = residuals > epsilon * median_abs_deviation
        outlier_percentage = np.mean(outlier_mask)

        # Check if the percentage is close to the target
        if np.abs(outlier_percentage - target_percentage) < tolerance:
            return epsilon

        # Adjust epsilon
        if outlier_percentage > target_percentage:
            epsilon *= 1.1  # Increase epsilon if too many outliers
        else:
            epsilon *= 0.9  # Decrease epsilon if too few outliers

    return epsilon

# Target 0.1% outliers
target_percentage = 0.001
epsilon = find_epsilon(X, y, target_percentage)

Now we have an estimate of epsilon to fit the model with:

# Fit the model with the calculated epsilon
model = HuberRegressor(epsilon=epsilon, fit_intercept=True, alpha=0)
model.fit(X, y.ravel())

# Check the final percentage of outliers
residuals = np.abs(y - model.predict(X).reshape(-1, 1))
median_abs_deviation = np.median(residuals)
outlier_mask = residuals > epsilon * median_abs_deviation
outlier_percentage = np.mean(outlier_mask)

print("Epsilon:", epsilon)
print("Outlier Percentage:", outlier_percentage)

This prints:

Epsilon: 4.594972986357222
Outlier Percentage: 0.00115

That is very close to the desired 0.1%!
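One way to avoid a hand-written outlier rule is to read the mask scikit-learn computes itself. The sketch below is not part of the original answer: it bisects on epsilon and measures the share of True entries in the fitted model's outliers_ attribute. The helper names outlier_fraction and bisect_epsilon, the search interval [1, 10], and the 2,000-sample toy data are assumptions made for illustration; it also assumes that epsilon = 10 already flags no more than the target share.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor


def outlier_fraction(X, y, epsilon):
    """Fit a HuberRegressor and return the share of samples flagged in outliers_."""
    model = HuberRegressor(epsilon=epsilon, alpha=0.0, max_iter=1000).fit(X, y)
    return model.outliers_.mean()


def bisect_epsilon(X, y, target, lo=1.0, hi=10.0, n_steps=20):
    """Bisect on [lo, hi] for a small epsilon whose outlier share is <= target.

    Assumes the share at the initial `hi` is already <= target.
    """
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if outlier_fraction(X, y, mid) > target:
            lo = mid  # too many outliers: epsilon must grow
        else:
            hi = mid  # target met: try a smaller epsilon
    return hi


# Hypothetical toy data: 0.1% of 2,000 samples is 2 samples.
X, y = make_regression(n_samples=2000, n_features=2, noise=4.0, random_state=0)
target = 0.001
eps = bisect_epsilon(X, y, target)
print("Epsilon:", eps)
print("Outlier share:", outlier_fraction(X, y, eps))
```

The outlier share is not guaranteed to be monotone in epsilon, because scale_ is re-estimated on every fit, so the result is worth checking with outlier_fraction rather than trusting the search blindly. The share can also only move in steps of 1/n_samples.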

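Note: the expression

outlier_mask = residuals > epsilon * median_abs_deviation
is only a proxy based on the logic of the Huber loss, not the rule HuberRegressor applies. The estimator flags a sample when its absolute residual exceeds epsilon * model.scale_, and it exposes that mask as model.outliers_ after fitting. Also, make_regression returns y with shape (200,), so y - model.predict(X).reshape(-1, 1) broadcasts to a 200 x 200 matrix and the printed percentage is computed over all pairs of samples; y - model.predict(X) gives one residual per sample. With 200 samples a per-sample share can only be a multiple of 0.5%, so a 0.1% target needs at least 1,000 samples.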
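On the second part of the question, scikit-learn's GradientBoostingRegressor already takes a quantile rather than a multiplier: with loss="huber", its alpha parameter is the quantile of the absolute residuals at which the loss switches from squared to linear. A minimal sketch, assuming toy data from make_regression (the data, the variable names and the final check are illustrative, not part of the original answer):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical toy data: 0.1% of 2,000 samples is 2 samples.
X, y = make_regression(n_samples=2000, n_features=2, noise=4.0, random_state=0)

# alpha=0.999 puts the Huber threshold at the 99.9% quantile of the absolute
# residuals, so roughly the largest 0.1% of residuals get the linear penalty.
gbrt = GradientBoostingRegressor(
    loss="huber", alpha=0.999, n_estimators=50, random_state=0
)
gbrt.fit(X, y)

# Illustration only: recompute the same quantile on the final residuals.
# The estimator recomputes its threshold internally at each boosting stage.
abs_res = np.abs(y - gbrt.predict(X))
delta = np.quantile(abs_res, 0.999)
share = np.mean(abs_res > delta)
print("Share of residuals above the 99.9% quantile:", share)
```

Because the threshold here is a quantile of the residuals themselves, no assumption about the distribution of the response is needed to hit a target percentage.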
