How can I improve the accuracy of a random forest classifier?


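I have a random forest classifier. Its accuracy is about 61%. I would like to improve it, but what I have tried so far has not increased the accuracy much. The code is shown below: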

# importing time module to record the time of running the program
import time
begin_time = time.process_time()

# importing modules
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

# we will use random forest classifier as our classifier
logistic_regression = LogisticRegression()
forest_classifier = RandomForestClassifier(max_depth=4, random_state=0)

# reading in accelerometer data
time_train = pd.read_csv("https://courses.edx.org/assets/courseware/v1/b98039c3648763aae4f153a6ed32f38b/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_time_series.csv", index_col=0)
labels_train = pd.read_csv("https://courses.edx.org/assets/courseware/v1/d64e74647423e525bbeb13f2884e9cfa/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/train_labels.csv", index_col=0)
time_test = pd.read_csv("https://courses.edx.org/assets/courseware/v1/1ca4f3d4976f07b8c4ecf99cf8f7bdbc/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/test_time_series.csv", index_col=0)
labels_test = pd.read_csv("https://courses.edx.org/assets/courseware/v1/72d5933c310cf5eac3fa3f28b26d9c39/asset-v1:HarvardX+PH526x+3T2022+type@asset+block/test_labels.csv", index_col=0)

# making lists out of the x, y, z columns
x, y, z = time_train.iloc[3::10][['x', 'y', 'z']].T.values
labels_train[['x', 'y', 'z']] = np.stack([x, y, z], axis=1)

# doing the same with the test dataframe
x1, y1, z1 = time_test.iloc[9::10][['x', 'y', 'z']].T.values
labels_test[['x', 'y', 'z']] = np.stack([x1, y1, z1], axis=1)
labels_test.head(50)

# plotting the results on 3D graph

%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x, y, z, c=y) # to plot a scatter plot

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

# now splitting the dataframe into train (80%) and test data (20%) with random_state=1
X = labels_train[['x', 'y', 'z']]
y = labels_train['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)

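# now choosing the best classifier. The code is based on the Case Study 7 Part 2
from sklearn.metrics import r2_score  # needed by correlation() below

# cross_val_score passes in an estimator already fitted on the training folds,
# so each scorer only predicts on the held-out fold; refitting here would
# report training accuracy instead
def correlation(estimator, X, y):
    predictions = estimator.predict(X)
    return r2_score(y, predictions)

def accuracy(estimator, X, y):
    predictions = estimator.predict(X)
    return accuracy_score(y, predictions)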

regression_outcome = labels_train['label']
classification_outcome = labels_train['label']
covariates = labels_train[['x', 'y', 'z']]

logistic_regression_scores = cross_val_score(logistic_regression, covariates, classification_outcome, cv=10, scoring=accuracy)
forest_classification_scores = cross_val_score(forest_classifier, covariates, classification_outcome, cv=10, scoring=accuracy)

plt.axes().set_aspect('equal', 'box')
plt.scatter(logistic_regression_scores, forest_classification_scores)
plt.plot((0, 1), (0, 1), 'k-')

plt.xlim(0, 1)
plt.ylim(0, 1)
plt.xlabel("Logistic Regression Score")
plt.ylabel("Forest Classification Score")

plt.show()

np.mean(forest_classification_scores)

# tuning in Random Forest. The idea is taken from Katarina Pavlović - Predicting the type of physical activity from tri-axial smartphone accelerometer data
from sklearn.model_selection import RandomizedSearchCV

estimators = list(range(100, 1001, 10)) # the number of trees in our random forest

# Number of features to consider at every split
# ('auto' was deprecated in scikit-learn 1.1 and later removed; for classifiers it meant 'sqrt')
max_features = ['sqrt', None]

# Maximum number of levels in tree
max_depth = list(range(3, 31)) + [None]
print(max_depth)

# Minimum number of samples required to split a node
min_samples_split=[2, 5, 10]

# Minimum number of samples required at each leaf
min_samples_leaf=[1, 2, 3]

# Method of selecting samples for training each tree
bootstrap=[True, False]


random_grid = {'n_estimators': estimators, 'max_features': max_features,
               'max_depth': max_depth, 'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf, 'bootstrap': bootstrap}



# Find the best parameters for the Random Forest Classifier (for a better fit)
# and the score if those parameters were used
rf_random = RandomizedSearchCV(estimator=forest_classifier, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=1)
rf_random.fit(covariates, classification_outcome)
Best_params = rf_random.best_params_
print(Best_params)
print(rf_random.best_score_)


forest_classifier = RandomForestClassifier(n_estimators=300, min_samples_split=10, min_samples_leaf=3, max_features='sqrt', max_depth=20, bootstrap=True)


# Calculate the accuracy of the classifier on the test set created in Section B1.2
forest_classifier.fit(X_train, y_train)
forest_predictions = forest_classifier.predict(X_test)
accuracy_score(y_test, forest_predictions)

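I tried using RandomizedSearchCV, but it did not help much.

I know that someone working on the same problem reached an accuracy of 80%.

I cannot add more data. More specifically, the data was pulled from a database, and there is no additional data relevant to this problem.

I also have no missing or anomalous data.

I also tried a Gradient Boosting Classifier, which gave me 100% accuracy, but as I understand it, that figure is not reliable. Here is the code: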

from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, max_depth=5)

# fit the model with the training data
model.fit(X_train, y_train)

# predict the target on the train dataset
predict_train = model.predict(X_train)
print('\nTarget on train data', predict_train)

# Accuracy score on train dataset
accuracy_train = accuracy_score(y_train, predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)
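
A perfect score measured on the rows a model was fitted on mostly shows that the model can memorize them, so it says little about new data. A minimal sketch of the comparison follows, using synthetic data from `make_classification` as a stand-in for the accelerometer features. The data, sizes, noise level, and variable names are illustrative assumptions, not the course data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three accelerometer features (assumption:
# flip_y adds label noise, so a perfect held-out score is not attainable)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, flip_y=0.3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=1
)

model = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X_tr, y_tr)

# Accuracy on the rows the model has already seen
train_accuracy = accuracy_score(y_tr, model.predict(X_tr))
# Accuracy on rows held out from fitting; this is the number to compare models on
test_accuracy = accuracy_score(y_te, model.predict(X_te))

print("train accuracy:", train_accuracy)
print("test accuracy: ", test_accuracy)
```

With the question's own variables, the equivalent check would be `accuracy_score(y_test, model.predict(X_test))`.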

Can you give me some advice?

Update: I have already tried the methods suggested in the answer (see below). Are there perhaps other approaches?

python pandas classification random-forest
1 answer

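This website lists 8 methods (see it for details):

  1. Add more data
  2. Treat missing values and outliers
  3. Feature engineering
  4. Feature selection
  5. Multiple algorithms
  6. Algorithm tuning
  7. Ensemble methods
  8. Cross-validation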
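
For item 3, one direction that may be worth trying with this kind of data: the code in the question keeps a single accelerometer reading per label (every tenth row) and discards the other nine. Below is a minimal sketch that summarizes each block of readings instead, run on synthetic data. The assumption that each label covers a block of 10 consecutive rows, the helper name `window_features`, and the chosen statistics are all illustrative and would need checking against the real files.

```python
import numpy as np
import pandas as pd

def window_features(readings: pd.DataFrame, window: int = 10) -> pd.DataFrame:
    """Summarize each block of `window` consecutive rows (hypothetical helper)."""
    data = readings[["x", "y", "z"]].copy()
    # Overall acceleration magnitude, which does not depend on phone orientation
    data["magnitude"] = np.sqrt((data[["x", "y", "z"]] ** 2).sum(axis=1))
    # Drop trailing rows that do not fill a complete window
    n_windows = len(data) // window
    data = data.iloc[: n_windows * window]
    # Block id: rows 0..9 -> 0, rows 10..19 -> 1, and so on
    block = np.arange(len(data)) // window
    summary = data.groupby(block).agg(["mean", "std", "min", "max"])
    # Flatten the (column, statistic) MultiIndex into names like "x_mean"
    summary.columns = [f"{col}_{stat}" for col, stat in summary.columns]
    return summary

# Synthetic stand-in for the time series file: 50 rows give 5 windows
rng = np.random.default_rng(0)
fake_readings = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])
features = window_features(fake_readings)
print(features.shape)
```

The resulting table has one row per window and could replace the three raw columns as `X`, provided the windows really do line up with the rows of the labels file.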