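EDIT: As Dwin pointed out in the comments, the code below does not produce a valid ROC curve. A ROC curve must be indexed over variation in the cut-point t, not over variation in lambda (as is done below). I will edit the code below when I get the chance.
Below is my attempt to create a ROC curve for a glmnet model that predicts a binary outcome. The code simulates a matrix that approximates glmnet output. As some of you know, given an n x p input matrix, glmnet outputs an n x 100 matrix of predicted probabilities [$\Pr(y_i = 1)$], one column for each of 100 different values of lambda. If further changes in lambda stop improving predictive power, the output has fewer than 100 columns. The simulated matrix of glmnet predicted probabilities below is 250 x 69.
First, is there an easier way to plot a glmnet ROC curve? Second, if not, is the approach below correct? Third, do I care about plotting (1) the probability of false/true positives or (2) simply the observed rate of false/true positives?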
set.seed(06511)
# Simulate predictions matrix
phat = as.matrix(rnorm(250,mean=0.35, sd = 0.12))
lambda_effect = as.matrix(seq(from = 1.01, to = 1.35, by = 0.005))
phat = phat %*% t(lambda_effect)
#Choose a cut-point
t = 0.5
#Define a predictions matrix
predictions = ifelse(phat >= t, 1, 0)
##Simulate y matrix
y_phat = apply(phat, 1, mean) + rnorm(250,0.05,0.10)
y_obs = ifelse(y_phat >= 0.55, 1, 0)
#percentage of 1 observations in the validation set,
p = length(which(y_obs==1))/length(y_obs)
# dim(testframe2_e2)
#probability of the model predicting 1 while the true value of the observation is 0,
apply(predictions, 1, sum)
## Count false positives for each model
## False pos ==1, correct == 0, false neg == -1
error_mat = predictions - y_obs
## Define a matrix that isolates false positives
error_mat_fp = ifelse(error_mat ==1, 1, 0)
false_pos_rate = apply(error_mat_fp, 2, sum)/length(y_obs)
# Count true positives for each model
## True pos == 2, mistakes == 1, true neg == 0
error_mat2 = predictions + y_obs
## Isolate true positives
error_mat_tp = ifelse(error_mat2 ==2, 1, 0)
true_pos_rate = apply(error_mat_tp, 2, sum)/length(y_obs)
## Do I care about (1) this probability OR (2) simply the observed rate?
## (1)
#probability of false-positive,
p_fp = false_pos_rate/(1-p)
#probability of true-positive,
p_tp = true_pos_rate/p
#plot the ROC,
plot(p_fp, p_tp)
## (2)
plot(false_pos_rate, true_pos_rate)
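Following the edit note at the top, a ROC curve is traced by holding one lambda fixed and sweeping the cut-point t. Below is a minimal base R sketch of that idea on simulated data. `phat_col` stands in for a single column of the predicted-probability matrix, and `roc_points` is a hypothetical helper written for this illustration, not part of glmnet.

```r
set.seed(6511)

# Simulated stand-ins (assumptions): predicted probabilities for ONE lambda
# and the matching observed 0/1 outcomes
phat_col <- runif(250)
y_obs    <- rbinom(250, size = 1, prob = phat_col)

# Hypothetical helper: true/false positive rates at each cut-point t
roc_points <- function(prob, y, cuts = seq(1, 0, by = -0.01)) {
  # TPR = true positives / all actual positives
  tpr <- sapply(cuts, function(t) sum(prob >= t & y == 1) / sum(y == 1))
  # FPR = false positives / all actual negatives
  fpr <- sapply(cuts, function(t) sum(prob >= t & y == 0) / sum(y == 0))
  data.frame(t = cuts, FPR = fpr, TPR = tpr)
}

roc <- roc_points(phat_col, y_obs)
plot(roc$FPR, roc$TPR, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # chance line for reference
```

On the third question, the axes of a ROC curve are rates conditioned on the true class (counts divided by the number of actual positives or actual negatives), which corresponds to option (1) in the code above rather than counts divided by the total sample size.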
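There is an existing question on this topic, but the answer is rough and not entirely correct: glmnet lasso ROC charts
An option for computing the AUC and plotting the ROC curve using ROCR: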
library(ROCR)
library(glmnet)
library(caret)
df <- data.matrix(… ) # dataframe w/ predictor variables & a response variable
# col1 = response var; # cols 2:10 = predictor vars
# Create training subset for model development & testing set for model performance testing
# df is a matrix, so the response is selected by column index rather than with $
inTrain <- createDataPartition(df[, 1], p = .75, list = FALSE)
Train <- df[ inTrain, ]
Test <- df[ -inTrain, ]
# Run model over training dataset
lasso.model <- cv.glmnet(x = Train[,2:10], y = Train[,1],
family = 'binomial', type.measure = 'auc')
# Apply model to testing dataset
# Test is a matrix, so the probabilities are kept in their own object
lasso.prob <- predict(lasso.model, type = "response",
newx = Test[,2:10], s = 'lambda.min')
pred <- prediction(as.vector(lasso.prob), Test[,1])
# calculate probabilities for TPR/FPR for predictions
perf <- performance(pred,"tpr","fpr")
performance(pred,"auc") # shows calculated AUC for model
plot(perf,colorize=FALSE, col="black") # plot ROC curve
lines(c(0,1),c(0,1),col = "gray", lty = 4 )
In the predict() call above, you can supply different lambda values through s to test the predictive power at each value.
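To compare lambdas, one approach is to compute an AUC per column of a predicted-probability matrix that has one column per lambda (as far as I know, passing a numeric vector as `s` to predict returns such a matrix). The sketch below uses base R only and simulated data. `prob_mat`, `y_test` and the helper `auc_rank` are hypothetical names for this illustration, and the rank-based (Mann-Whitney) formula is swapped in here in place of ROCR.

```r
set.seed(1)

# Simulated stand-ins (assumptions): test outcomes and a probability matrix
# with one column per lambda value
y_test   <- rbinom(200, size = 1, prob = 0.4)
signal   <- plogis(2 * y_test - 1 + rnorm(200))
prob_mat <- cbind(strong = signal,
                  weak   = plogis(qlogis(signal) + rnorm(200, sd = 3)),
                  noise  = runif(200))

# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_rank <- function(prob, y) {
  r  <- rank(prob)          # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# One AUC per column, i.e. per lambda
auc_by_lambda <- apply(prob_mat, 2, auc_rank, y = y_test)
print(round(auc_by_lambda, 3))
```

Each column still gets its own full ROC curve over all cut-points; lambda only selects which model's probabilities are being evaluated.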
Given predictions and labels, here is how to create a basic ROC curve:
# randomly generated data for example, binary outcome
predictions = runif(100, min=0, max=1)
labels = as.numeric(predictions > 0.5)
labels[1:10] = abs(labels[1:10] - 1) # randomly make some labels not match predictions
# source: https://blog.revolutionanalytics.com/2016/08/roc-curves-in-two-lines-of-code.html
labels_reordered = labels[order(predictions, decreasing=TRUE)]
roc_dat = data.frame(TPR=cumsum(labels_reordered)/sum(labels_reordered), FPR=cumsum(!labels_reordered)/sum(!labels_reordered))
# plot the roc curve
plot(roc_dat$FPR, roc_dat$TPR)
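The same cumulative-sum construction can be extended to estimate the area under the curve with the trapezoidal rule. This is a minimal sketch on freshly simulated data. The variable names `ord`, `tpr`, `fpr` and `auc` are introduced here for illustration, and the origin (0, 0) is prepended so that the curve starts where no observation is classified as positive.

```r
set.seed(42)

# Simulated example data, mirroring the setup above
predictions <- runif(100)
labels <- as.numeric(predictions > 0.5)
labels[1:10] <- 1 - labels[1:10]  # flip the first ten labels to add errors

# Sort by decreasing score and accumulate rates, starting from (0, 0)
ord <- order(predictions, decreasing = TRUE)
tpr <- c(0, cumsum(labels[ord]) / sum(labels))
fpr <- c(0, cumsum(1 - labels[ord]) / sum(1 - labels))

# Trapezoidal rule: width in FPR times the average height in TPR
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```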