Stratified split into training and test sets


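I am working with this dataset, but I think stratified sampling would be better, because the binary outcome variable consists of 90% == 1 and 10% == 0. The code below works, but it uses simple random sampling: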

library(glmnet)  # provides glmnet() and cv.glmnet()

x = model.matrix(BirthWeightOz ~ ., ncbirths)
y = as.vector(ncbirths$BirthWeightOz)

x[,c(3,4,8,10)] <- scale(x[,c(3,4,8,10)]) # convert the non-factor variables to z-scores
x <- x[,-1] # drop the intercept column

set.seed(1)
train <- sample(1:nrow(x), nrow(x)*0.8)   # random sample of 80% of the rows for the training set
test <- (1:nrow(x))[-train]  # the remaining 20% form the test set
y.test = y[test]
y.train = y[train]



# 10-fold cross validation for ridge regression
result.ridge.cv <- cv.glmnet(x[train,], y[train], alpha = 0, lambda = 10^seq(-2, 5, length.out = 50), nfolds = 10)
print(result.ridge.cv$lambda.min) # Best cross validated lambda
print(result.ridge.cv$lambda.1se) # Conservative est. of best lambda (1 stdev)

## Take square roots so the plot shows Root Mean Squared Error (RMSE), on the same scale as y.train:
result.ridge.cv$cvm <- result.ridge.cv$cvm^0.5
result.ridge.cv$cvup <- result.ridge.cv$cvup^0.5
result.ridge.cv$cvlo <- result.ridge.cv$cvlo^0.5
plot(result.ridge.cv, ylab = "Root Mean-Squared Error")
print(c(result.ridge.cv$lambda.min, result.ridge.cv$lambda.1se))

result.ridge.best <- glmnet(x[train,], y[train], alpha = 0, lambda = result.ridge.cv$lambda.1se, standardize = TRUE, intercept = TRUE)
result.ridge.best$beta

ridge.pred = predict(result.ridge.best, newx = x[test,])
ridge.pred_train = predict(result.ridge.best, newx = x[train,])
ridge_mse_test <- mean((ridge.pred-y.test)^2)
ridge_mse_train <- mean((ridge.pred_train-y.train)^2)
# MSE for the training set = 217.77 and for the test set = 217.912

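If I try something like the following instead, the rest of the code, from the 10-fold cross-validation onward, no longer works properly. Can you help me stratify the sample so that the result is still an index vector?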

library(caret)  # provides createDataPartition()

train.index <- createDataPartition(ncbirths$BirthWeightOz, p = .8, list = FALSE)
train <- ncbirths[ train.index,]   # train is now a data frame, not a vector of row numbers
test  <- ncbirths[-train.index,]
y.train = as.vector(train$BirthWeightOz)
y.test = as.vector(test$BirthWeightOz)
r sampling
1 answer

I cannot reproduce your code because I don't have the dataset. However, you can perform stratified random sampling by following this answer.
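
Since the dataset is not available, here is a minimal base-R sketch of one way to get a stratified 80/20 split that is still a plain index vector. The data are simulated, and the names `outcome`, `x1`, `x2`, and `rows_by_class` are hypothetical stand-ins, not columns of `ncbirths`. It samples 80% of the row numbers within each class of the binary variable, so `x[train, ]` and `y[train]` can be used exactly as in the question.

```r
# Simulated stand-in for the real data (hypothetical): a binary variable
# with roughly 90% ones and 10% zeros, two predictors, and a numeric response.
set.seed(1)
n <- 1000
outcome <- rbinom(n, size = 1, prob = 0.9)
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rnorm(n, mean = 116, sd = 15)

# Group the row numbers by class, then draw 80% of the rows inside each class.
# Indexing with sample.int() avoids the sample(idx) pitfall when a class
# contains only a single row.
rows_by_class <- split(seq_len(n), outcome)
train <- sort(unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}), use.names = FALSE))
test <- setdiff(seq_len(n), train)

# train and test are integer vectors of row numbers, as with sample().
y.train <- y[train]
y.test  <- y[test]

# The class proportions should be nearly identical in both sets.
print(prop.table(table(outcome[train])))
print(prop.table(table(outcome[test])))
```

If you prefer to stay with caret, note that `createDataPartition(..., list = FALSE)` returns a one-column matrix of row numbers. Wrapping the call in `as.vector()` and using that result directly as `train`, rather than subsetting the data frame with it, should give the same kind of index vector. To stratify on the binary variable, pass that variable (as a factor) instead of `BirthWeightOz`.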