我正在尝试使用对数丢失作为丢失功能,使用Kaggle的Kobe Bryant shot selection competition中的数据进行Caret训练。
这是我的脚本:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
这是错误的回溯:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
"center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
[当我将defaultFunction
用作summaryFunction且未在train中指定指标时,它可以工作,但对于mnLogLoss
不起作用。我猜想它期望的数据格式与我传递的格式不同,但是我找不到错误所在。
从defaultSummary
的帮助文件:
要使用twoClassSummary和/或mnLogLoss,trainControl的classProbs参数应为TRUE。 multiClassSummary可以在没有类别概率的情况下使用,但是某些统计信息(例如,总体日志损失和ROC曲线下每个类别的平均面积)将不在结果集中。
因此,我认为您需要将trainControl()
更改为以下内容:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)
如果执行此操作并运行代码,则会出现以下错误:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
您只需要将shot_made_flag
的0/1级别更改为可以是有效的R变量名称:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
通过上述更改,您的代码将如下所示:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 3)
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)