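When I run Huber regression on a dataset with HuberRegressor() from sklearn.linear_model, I want 0.1% of all samples to be treated as outliers. As far as I know, HuberRegressor() controls the outlier threshold through its epsilon parameter, but epsilon cannot be given as a percentage, and my response variable is not normally distributed.
The desired result would look like this: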
import numpy as np
from sklearn.linear_model import HuberRegressor
X = np.random.rand(100, 3)
y = np.random.rand(100)
model = HuberRegressor(epsilon=?, fit_intercept=True, alpha=0)
model.fit(X, y)
>>> model.outliers_.sum() / len(y)
0.001
Also, is there a way to make this kind of adjustment in general for models that use the Huber loss function, such as GLS + Huber or GBRT + Huber?
The only way I can see to do this is to estimate the right epsilon value with an iterative procedure.
In a toy example, it would look like this:
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression
from sklearn.datasets import make_regression
X, y, coef = make_regression(n_samples=200, n_features=2, noise=4.0, coef=True, random_state=0)
def find_epsilon(X, y, target_percentage, epsilon=1, tolerance=0.001, max_iter=100):
    for _ in range(max_iter):
        model = HuberRegressor(epsilon=epsilon, fit_intercept=True, alpha=0)
        model.fit(X, y.ravel())
        # Calculate the percentage of outliers
        # (y is 1-D here, so subtracting the column vector broadcasts to a
        # (200, 200) array of pairwise differences, not 200 per-sample residuals)
        residuals = np.abs(y - model.predict(X).reshape(-1, 1))
        median_abs_deviation = np.median(residuals)
        outlier_mask = residuals > epsilon * median_abs_deviation
        outlier_percentage = np.mean(outlier_mask)
        # Check if the percentage is close to the target
        if np.abs(outlier_percentage - target_percentage) < tolerance:
            return epsilon
        # Adjust epsilon
        if outlier_percentage > target_percentage:
            epsilon *= 1.1  # Increase epsilon if too many outliers
        else:
            epsilon *= 0.9  # Decrease epsilon if too few outliers
    return epsilon
# Target 0.1% outliers
target_percentage = 0.001
epsilon = find_epsilon(X, y, target_percentage)
Now we have an estimate to fit the model with:
# Fit the model with the calculated epsilon
model = HuberRegressor(epsilon=epsilon, fit_intercept=True, alpha=0)
model.fit(X, y.ravel())
# Check the final percentage of outliers
# (y is 1-D, so this broadcasts to a (200, 200) array of pairwise differences)
residuals = np.abs(y - model.predict(X).reshape(-1, 1))
median_abs_deviation = np.median(residuals)
outlier_mask = residuals > epsilon * median_abs_deviation
outlier_percentage = np.mean(outlier_mask)
print("Epsilon:", epsilon)
print("Outlier Percentage:", outlier_percentage)
This prints:
Epsilon: 4.594972986357222
Outlier Percentage: 0.00115
Very close to the desired 0.1%!
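A variant of the same search, sketched under a few assumptions: it counts outliers with the fitted model's own outliers_ mask (absolute residual above epsilon * scale_), works on per-sample residuals, and replaces the multiplicative updates with bisection. The helper names outlier_count and find_epsilon_bisect, the search interval [1, 10], and the 2000-sample dataset are illustrative choices, not part of the code above. Bisection also assumes that the number of flagged samples does not grow as epsilon grows, which is plausible but not guaranteed, because the fit itself changes with epsilon.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor


def outlier_count(X, y, epsilon):
    """Fit a HuberRegressor and count the samples flagged in outliers_."""
    model = HuberRegressor(epsilon=epsilon, alpha=0.0, max_iter=1000)
    model.fit(X, y)
    return int(model.outliers_.sum())


def find_epsilon_bisect(X, y, target_fraction, lo=1.0, hi=10.0, n_steps=20):
    """Bisect on epsilon; hi is assumed large enough to meet the target."""
    target_count = int(round(target_fraction * len(y)))
    for _ in range(n_steps):
        mid = 0.5 * (lo + hi)
        if outlier_count(X, y, mid) > target_count:
            lo = mid  # too many flagged samples: a larger epsilon is needed
        else:
            hi = mid  # at or below the target: try a smaller epsilon
    return hi


# 2000 samples, so 0.1% corresponds to 2 flagged samples (illustrative data)
X, y = make_regression(n_samples=2000, n_features=2, noise=4.0, random_state=0)
eps = find_epsilon_bisect(X, y, target_fraction=0.001)
n_flagged = outlier_count(X, y, eps)
print("epsilon:", eps, "flagged fraction:", n_flagged / len(y))
```

With n samples the achievable fractions are multiples of 1/n, so a 0.1% target needs at least 1000 samples. The lower bound of 1.0 is there because HuberRegressor rejects epsilon values below 1.0.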
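Note: the expression outlier_mask = residuals > epsilon * median_abs_deviation is a proxy based on the logic of the Huber loss function. scikit-learn's own mask, model.outliers_, instead flags the samples whose absolute residual exceeds epsilon * model.scale_.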
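On the more general part of the question, GBRT in scikit-learn may not need a search at all: for GradientBoostingRegressor with loss="huber", the alpha parameter is documented as the quantile of the Huber loss, so the switch point between squared and absolute error is already expressed as a share of the residuals. The following is a minimal sketch under that reading. The dataset and n_estimators are arbitrary, and because GradientBoostingRegressor exposes no outliers_ mask, the final check on the training residuals is only an illustration, not something the estimator reports.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative data: 2000 samples, so 0.1% corresponds to 2 samples
X, y = make_regression(n_samples=2000, n_features=2, noise=4.0, random_state=0)

# alpha=0.999: the Huber threshold is taken at the 99.9% quantile of the
# absolute residuals, so about 0.1% of samples fall in the linear regime
gbrt = GradientBoostingRegressor(
    loss="huber", alpha=0.999, n_estimators=50, random_state=0
)
gbrt.fit(X, y)

# Illustration only: share of training residuals above their own 99.9% quantile
abs_res = np.abs(y - gbrt.predict(X))
delta = np.quantile(abs_res, 0.999)
frac_above = np.mean(abs_res > delta)
print("share of residuals above the 99.9% quantile:", frac_above)
```

For estimators that take only an absolute threshold, as HuberRegressor does with epsilon, a search over that threshold like the one in this answer seems to be the remaining option.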