Deriving a new continuous variable from logistic regression coefficients


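I have a set of independent variables X and the values of a dependent variable Y. The task is binary classification: predicting whether a debtor will default (1) or not (0). After filtering out statistically insignificant variables and variables that cause multicollinearity, I am left with the following logistic regression model summary: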

Accuracy ~0.87
Confusion matrix [[1038 254]
                  [72 1182]]
Parameters Coefficients
intercept  -4.210
A          5.119
B          0.873
C          -1.414
D          3.757

Now I convert these coefficients into a new continuous variable "default_probability" via the log-odds, i.e.

import math
e = math.e
power = (-4.210*1) + (A*5.119) + (B*0.873) + (C*-1.414) + (D*3.757)
default_probability = (e**power)/(1+(e**power))
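The expression above is the logistic (sigmoid) function, and e**power/(1+e**power) is algebraically the same as 1/(1+e**(-power)). As a minimal sketch, the same calculation can be wrapped in a function that avoids the OverflowError that e**power raises for very large values of power. The coefficients are the ones from the summary; the function name and the example feature values are made up for illustration.

```python
import math

# Coefficients copied from the model summary above.
INTERCEPT = -4.210
COEFS = {"A": 5.119, "B": 0.873, "C": -1.414, "D": 3.757}


def default_probability(row):
    """Logistic transform of the linear predictor for one debtor (hypothetical helper)."""
    power = INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)
    # Branch on the sign so math.exp never receives a large positive argument.
    if power >= 0:
        return 1.0 / (1.0 + math.exp(-power))
    z = math.exp(power)
    return z / (1.0 + z)


# Hypothetical debtor; these feature values are invented for the example.
example = {"A": 0.5, "B": 1.0, "C": 0.2, "D": 0.3}
p = default_probability(example)
print(f"default_probability = {p:.4f}")
```

When every feature is zero, the result reduces to the sigmoid of the intercept alone, which is a quick sanity check on the sign conventions.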

When I split the original dataset into quartiles of this new continuous variable "default_probability", I get:

1st quartile contains 65% of defaulted debts (577 out of 884 incidents)
2nd quartile contains 23% of defaulted debts (206 out of 884 incidents)
3rd quartile contains 9% of defaulted debts (77 out of 884 incidents)
4th quartile contains 3% of defaulted debts (24 out of 884 incidents)

At the same time:

overall quantity of debtors in 1st quartile - 1145
overall quantity of debtors in 2nd quartile - 516
overall quantity of debtors in 3rd quartile - 255
overall quantity of debtors in 4th quartile - 3043
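To judge how "surgical" a cutoff is, it helps to put the two listings side by side. The sketch below uses only the counts quoted above and assumes the four debtor totals correspond to quartiles 1 through 4 in the order given. It computes the default rate inside each quartile, the share of all defaults that a rule catches, and the number of non-defaulting clients it turns away.

```python
# Counts copied from the two listings above.
# Assumption: the four debtor totals are quartiles 1..4 in the order given.
defaults = [577, 206, 77, 24]
debtors = [1145, 516, 255, 3043]

total_defaults = sum(defaults)
total_good = sum(debtors) - total_defaults

# Default rate within each quartile.
rates = [d / n for d, n in zip(defaults, debtors)]
for q, rate in enumerate(rates, start=1):
    print(f"quartile {q}: default rate {rate:.1%}")

# Effect of the rule "no credit to the 1st quartile".
caught_share = defaults[0] / total_defaults
good_lost = debtors[0] - defaults[0]
good_lost_share = good_lost / total_good
print(f"defaults caught: {caught_share:.1%}")
print(f"good clients lost: {good_lost} ({good_lost_share:.1%} of all good clients)")
```

Whether that trade-off is acceptable is a business decision that depends on the loss from a default compared with the profit from a good client, which the post does not state.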

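I want to use "default_probability" to surgically remove the most problematic credits by imposing the business rule "no credit to the 1st quartile". But now I wonder whether this is really "surgical" (under this rule I would lose 1145 - 577 = 568 "good" clients), and whether it is mathematically and logically correct in general to derive a new continuous variable for the dataset from logistic regression coefficients by the line of reasoning above.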

python-3.x machine-learning scikit-learn classification logistic-regression
1 Answer

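You forgot the intercept when computing power. But assuming that was just a typo, as you said in the comments, your approach is valid. However, you may want to use scikit-learn's predict_proba function, which will save you the trouble. Example: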

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np

data = load_breast_cancer()
X = data.data
y = data.target

lr = LogisticRegression()

lr.fit(X,y)

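Suppose I want to compute the probability that a given observation (say observation i) belongs to class 1. I can do what you have done, essentially using the regression coefficients and the intercept: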

i = 0
1/(1+np.exp(-X[i].dot(lr.coef_[0])-lr.intercept_[0]))

Or just do:

lr.predict_proba(X)[i][1]

which is faster.
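As a quick check that the two routes agree, the following sketch compares the manual sigmoid calculation with predict_proba for every observation at once. It reuses the breast cancer dataset from the example; the max_iter value is an arbitrary choice made here to give the solver more iterations on unscaled features and is not part of the original answer.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# max_iter is raised from the default as an assumption for this sketch.
lr = LogisticRegression(max_iter=10000).fit(X, y)

# Manual route: sigmoid of the linear predictor, for all rows at once.
linear_predictor = X @ lr.coef_[0] + lr.intercept_[0]
manual = 1.0 / (1.0 + np.exp(-linear_predictor))

# Built-in route: column 1 holds the probability of class 1.
builtin = lr.predict_proba(X)[:, 1]

print(np.allclose(manual, builtin))
```

If the comparison holds, either column of probabilities can be used to rank observations, so the choice between the two is mostly about convenience.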
