I am considering building three predictive models (logistic regression, SVM, and a decision tree) to predict default for a dataset.
The dataset consists of 15 features and a binary default response. The data are imbalanced, so I undersampled the majority class to create a balanced dataset, and I standardised the features (mean = 0, sd = 1). I then used LASSO regularisation to reduce the number of features.
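(For reference, the balancing and standardisation steps were roughly along these lines; the simulated data frame here is just a stand-in for the real one, with the response in a column named default.)

```r
# Toy stand-in for the real data: 15 features plus an imbalanced binary response
set.seed(1)
data <- as.data.frame(matrix(rnorm(200 * 15), ncol = 15))
data$default <- rbinom(200, 1, 0.2)

# Undersample the majority class so both classes have equal counts
minority <- data[data$default == 1, ]
majority <- data[data$default == 0, ]
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
balanced <- rbind(minority, majority_down)

# Standardise the features (mean = 0, sd = 1)
X <- scale(as.matrix(balanced[, setdiff(names(balanced), "default")]))
Y <- balanced$default
```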
I did the following:
# Create training and testing sets (80%, 20% split)
set.seed(q)
train_index <- sample(nrow(data), floor(nrow(data) * 0.8))
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
Y_train <- Y[train_index, ]   # assumes Y is a one-column matrix; use Y[train_index] if Y is a vector
Y_test <- Y[-train_index, ]
# Perform LASSO regularisation (requires the glmnet package)
library(glmnet)
cvfit <- cv.glmnet(X_train, Y_train, alpha = 1, family = "binomial",
                   nfolds = 5)
plot(cvfit)
best_lambda <- cvfit$lambda.1se
lasso_fit <- glmnet(X_train, Y_train, alpha = 1, family = "binomial",
lambda = best_lambda)
# Find selected features (non-zero coefficients)
selected_features <- predict(lasso_fit, type = "coefficients",
                             s = best_lambda)
selected_features
# Position 1 of the coefficient vector is the intercept, so subtracting 1
# maps the remaining positions to column indices of X (the intercept becomes
# index 0, which R silently drops when subsetting)
selected_features <- which(selected_features != 0)
selected_features <- selected_features - 1
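An equivalent way to pull the selected features out by name rather than by position, still using the fitted cvfit from above, would be:

```r
# coef() on a cv.glmnet fit returns the coefficients at the chosen lambda;
# drop the intercept row and keep the names of the non-zero entries
coefs <- as.matrix(coef(cvfit, s = "lambda.1se"))
selected_names <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
selected_names
```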
# Training and Testing Set with only selected features based on LASSO regularisation
# Training
X_train <- X_train[ ,selected_features]
DEFAULT <- Y_train
training_subset <- cbind(X_train, DEFAULT)
training_subset <- as.data.frame(training_subset)
X_train <- as.data.frame(X_train)
# Testing (Y_test was already created at the split above)
X_test <- X_test[, selected_features]
X_test <- as.data.frame(X_test)
DEFAULT <- Y_test
testing_subset <- cbind(X_test, DEFAULT)
testing_subset <- as.data.frame(testing_subset)
Using this reduced set of features, I then ran:
#--------------------------------------------------
# Logistic Regression
#--------------------------------------------------
# Define the training control (requires the caret package)
library(caret)
set.seed(q)
train_control <- trainControl(method = "cv", number = 5)
# Run logistic regression on the selected features with 5-fold cross-validation
logistic_model <- train(as.factor(DEFAULT) ~ ., data = training_subset,
                        method = "glm", family = "binomial",
                        trControl = train_control)
#--------------------------------------------------
# SVM (classification)
#--------------------------------------------------
# Define the training control
set.seed(q)
train_control <- trainControl(method="cv", number=5)
# Build SVM model with radial kernel
svm_model <- train(as.factor(DEFAULT) ~ ., data = training_subset,
method = "svmRadial", trControl = train_control)
#--------------------------------------------------
# Decision Tree
#--------------------------------------------------
# Create rpart control (requires the rpart package; xval = 5 is rpart's
# internal 5-fold cross-validation used for pruning)
library(rpart)
set.seed(q)
ctrl <- rpart.control(cp = 0.01, minsplit = 10, maxdepth = 10, xval = 5)
# Build decision tree model with 5-fold cross validation
set.seed(q)
dt_model <- rpart(as.factor(DEFAULT) ~ ., data = training_subset,
method = "class", control = ctrl)
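To get the test-set metrics reported below, I evaluate each fitted model on testing_subset in the same way; a sketch for the logistic model (the SVM and decision tree are analogous, and the AUC assumes the pROC package):

```r
# Test-set evaluation sketch (logistic model shown; SVM and tree analogous)
library(pROC)
pred <- predict(logistic_model, newdata = testing_subset)
actual <- as.factor(testing_subset$DEFAULT)
confusionMatrix(pred, actual, positive = "1")   # caret: accuracy, precision, recall
probs <- predict(logistic_model, newdata = testing_subset, type = "prob")[, "1"]
auc(roc(testing_subset$DEFAULT, probs))
```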
The results I obtain show:
       Test   Logistic        SVM  Decision_Tree
1  Accuracy  0.6666667  0.6585366      0.7235772
2 Precision  0.5909091  0.6229508      0.6666667
3    Recall  0.9122807  0.6666667      0.8070175
> Logistic
Area under the curve: 0.8025
> SVM
Area under the curve: 0.6591
> Decision_Tree
Area under the curve: 0.8259
Based on this, the best predictive model is the decision tree, followed by logistic regression and then the SVM. I would like to check whether what I have done above makes sense, both in my approach to reducing the features and in the code used to run each model.
I had expected the more complex models to perform better, i.e. SVM first, then the decision tree, then logistic regression. Why is that not the case here? Are there any mistakes in my approach or code, and is there a way to improve predictive performance so that the ordering of the best models becomes SVM, logistic regression, then decision tree?