caret and summaryFunction mnLogLoss error: columns consistent with 'lev'


I am trying to train a model with caret using log loss as the loss function, on the data from Kaggle's Kobe Bryant shot selection competition.

Here is my script:

library(caret)
data <- read.csv("./data.csv")

data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL

train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]

inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]

folds <- createFolds(train$shot_made_flag, k = 10)

ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)

Here is the traceback of the error:

7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, 
       metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm", 
       preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", 
       verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv", 
       "center", "scale"), trControl = ctrl, metric = "logLoss", 
       verbose = FALSE)

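When I use the default summary function (defaultSummary) as the summaryFunction and do not specify a metric in train, it works, but it does not work with mnLogLoss. My guess is that it expects the data in a different format from the one I am passing, but I cannot find where the mistake is.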

r caret
1 Answer

From the help file for defaultSummary:

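To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities, but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.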
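To illustrate why the class probabilities matter, here is a minimal base R sketch of a log-loss summary. It is a simplified stand-in, not caret's actual implementation: the function name log_loss_summary and the toy data are assumptions made for illustration. Log loss scores the probability assigned to the observed class, so it needs one probability column per class level. When only obs and pred are available, a column check like the one below fails with the same message as in the traceback.

```r
# Simplified stand-in for a log-loss summary function (NOT caret's code).
# 'data' holds obs/pred and, if available, one probability column per level.
log_loss_summary <- function(data, lev, eps = 1e-15) {
  # Without class probabilities there is nothing to score, hence this check.
  if (!all(lev %in% colnames(data)))
    stop("'data' should have columns consistent with 'lev'")
  probs <- as.matrix(data[, lev, drop = FALSE])
  # Clip probabilities away from 0 and 1 so log() stays finite.
  probs <- pmin(pmax(probs, eps), 1 - eps)
  # For each row, pick the probability assigned to the observed class.
  picked <- probs[cbind(seq_len(nrow(data)),
                        match(as.character(data$obs), lev))]
  c(logLoss = -mean(log(picked)))
}

lev <- c("made", "miss")
obs <- factor(c("made", "miss", "made"), levels = lev)

# Analogous to classProbs = FALSE: only observed and predicted classes.
no_probs <- data.frame(obs = obs, pred = obs)
err <- tryCatch(log_loss_summary(no_probs, lev),
                error = function(e) conditionMessage(e))

# Analogous to classProbs = TRUE: one probability column per class level.
with_probs <- data.frame(obs = obs, pred = obs,
                         made = c(0.9, 0.2, 0.6),
                         miss = c(0.1, 0.8, 0.4))
ll <- log_loss_summary(with_probs, lev)
```

Here err holds the "columns consistent with 'lev'" message, while ll is the mean negative log of the probabilities 0.9, 0.8 and 0.6 given to the observed classes.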

So I think you need to change your trainControl() call to the following:

ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)

If you make that change and run your code, you will get the following error:

Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to  X0, X1 . Please use factor levels that can be used as valid R variable names  (see ?make.names for help).

You just need to change the 0/1 levels of shot_made_flag to something that is a valid R variable name:

data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
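As a quick way to see what the error message is complaining about, the following base R sketch checks factor levels against make.names(). The vector flag is toy data standing in for data$shot_made_flag, and the factor(..., labels = ) call is an alternative spelling of the relabelling, not the code used in this answer. Either way, NA values (the Kaggle test rows) stay NA.

```r
# Toy stand-in for data$shot_made_flag, including an unlabelled (NA) row.
flag <- c(0, 1, NA, 1, 0)

# Levels "0" and "1" are not syntactically valid R names.
old_levels <- levels(factor(flag))
converted  <- make.names(old_levels)     # what R would turn them into
valid_before <- old_levels == converted  # FALSE for both levels

# Relabel while building the factor; NA stays NA.
relabelled <- factor(flag, levels = c(0, 1), labels = c("miss", "made"))
valid_after <- levels(relabelled) == make.names(levels(relabelled))
n_missing <- sum(is.na(relabelled))
```

Note that this variant orders the levels as "miss", "made", whereas ifelse() followed by factor() sorts them alphabetically ("made", "miss"). Log loss itself does not depend on the level order.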

With the changes above, your code will look like this:

library(caret)
data <- read.csv("./data.csv") 

data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL

train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]

inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]

# returnTrain = TRUE: the index argument of trainControl expects training rows
folds <- createFolds(train$shot_made_flag, k = 3, returnTrain = TRUE)

ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
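One more thing worth checking is the index argument. trainControl documents index as the rows used for training in each resample, while createFolds returns the held-out rows unless returnTrain = TRUE. A supplied index also determines the resamples, so repeats does not add repetitions by itself. A possible sketch for repeated cross-validation, assuming the caret package is installed and using a toy factor y in place of train$shot_made_flag, is to build the index with createMultiFolds:

```r
# Assumes caret is installed; 'y' is toy data standing in for
# train$shot_made_flag.
library(caret)

set.seed(1)
y <- factor(rep(c("made", "miss"), each = 50))

# Training-row indices for 10-fold CV repeated 3 times.
idx <- createMultiFolds(y, k = 10, times = 3)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     index = idx, classProbs = TRUE,
                     summaryFunction = mnLogLoss)

n_resamples <- length(idx)   # one element per fold and repeat
train_sizes <- lengths(idx)  # each holds most of the rows, not one fold
```

The resulting ctrl can then be passed to train() as trControl, with metric = "logLoss" as in the code above.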