Improving the performance of an ExtraTrees regression model


I want to build a regression model for the dataset given below. I have tried many ways to remove the influence of outliers in the dataset on model performance, but without success. When I widen the model's hyperparameter ranges, it overfits. What can I do to build a successful model?

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

url = 'https://raw.githubusercontent.com/ramazanunlu/RegressionModel/main/final_data_Jinit%20-%20Kopya.csv'
df = pd.read_csv(url, sep=";")

# Feature groups
MMR = df[['C3', 'C4', 'C5', 'C6']]
MP = df[['C7', 'C8']]
OMP = df[['C9', 'C10']]
DIF = df[['Age (years)', 'C1', 'C2', 'Male', 'Female', 'Target']]

MAE = []
MSE = []
RMSE = []

results = pd.DataFrame()

# Fit one model per MMR column, combined with the other feature groups
for i in range(len(MMR.columns)):
    data = pd.concat([MMR[MMR.columns[i]], MP, OMP, DIF], axis=1)

    # X = data.drop([' Kinit'], axis=1)
    X = data.drop(['Target'], axis=1)
    y = data['Target']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

    # Standardize the first 8 (numeric) columns; the Male/Female dummies stay unscaled
    sc = StandardScaler()
    X_train[X.columns[0:8]] = sc.fit_transform(X_train[X.columns[0:8]])
    X_test[X.columns[0:8]] = sc.transform(X_test[X.columns[0:8]])

    # Hyperparameter search space
    n_estimators = [int(x) for x in np.linspace(start=100, stop=600, num=6)]
    criterion = ["squared_error", "absolute_error", "friedman_mse", "poisson"]
    min_samples_split = [2, 5, 10, 12]
    min_samples_leaf = [2, 4, 6, 12]
    max_depth = [5, 10, 15, 20]
    max_features = ['sqrt', 'log2']

    random_grid = {'n_estimators': n_estimators,
                   'criterion': criterion,
                   'max_features': max_features,
                   'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf}

    rf = ExtraTreesRegressor()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                                   scoring='neg_mean_squared_error', n_iter=50,
                                   cv=5, verbose=1, random_state=42, refit=True)

    rf_random.fit(X_train, y_train)

    predictions = rf_random.predict(X_test)

    # predictions = boxcox_transformer_target.inverse_transform(predictions1.reshape(-1, 1))

    MAE.append(metrics.mean_absolute_error(y_test, predictions))
    MSE.append(metrics.mean_squared_error(y_test, predictions))
    RMSE.append(np.sqrt(metrics.mean_squared_error(y_test, predictions)))

    results = pd.concat([results, pd.DataFrame(predictions)], axis=1, ignore_index=True)
   
Tags: regression, hyperparameters

1 Answer

To build the regression model, I suggest you do the following:

  1. Tree models already perform implicit feature selection through the way they choose splits, so there is no need to do explicit feature selection.

  2. I suggest using a Random Forest model instead of Extra Trees, since your dataset is not "big data". Extra Trees is designed to save processing time by randomizing the splits during fitting, at the cost of introducing more variance into the errors.

  3. Standardize the features to avoid scaling problems.

  4. Always check whether the model's residuals are normally distributed. The regression assumptions state that the errors are random; if you find a pattern in the errors, there is a hidden feature you have not modeled.

  5. Choose the hyperparameters with a grid search so you are sure they optimize the cost function (a minimal sketch follows this list).
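
For point 5, here is a minimal GridSearchCV sketch, not a definitive recipe: it assumes the X_train_scaled and y_train variables created in the script below, and the parameter ranges are illustrative placeholders you should adapt to your data.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative (not tuned) parameter grid -- adjust the ranges to your data
param_grid = {'n_estimators': [100, 300, 500],
              'max_depth': [5, 10, None],
              'min_samples_leaf': [1, 2, 4]}

grid = GridSearchCV(estimator=RandomForestRegressor(random_state=42),
                    param_grid=param_grid,
                    scoring='neg_mean_squared_error',
                    cv=5)
grid.fit(X_train_scaled, y_train)  # variables defined in the script below

print(grid.best_params_)
rf_model = grid.best_estimator_  # could replace the fixed-parameter model below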

Try this code on your data:

import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

# Assuming df is your DataFrame and 'Target' is the target column
target_column = 'Target'

# Extract features and target
X = df[['C3', 'C7', 'C8', 'C9', 'C10', 'Age (years)', 'C1', 'C2', 'Male', 'Female']]
y = df[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create a Random Forest Regression model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test_scaled)

# Compute R2 score
r2 = r2_score(y_test, y_pred)
print(f'R2 Score: {r2}')

# Error Analysis: Compute residuals
residuals = y_test - y_pred

# Plot residuals vs. predicted values
plt.scatter(y_pred, residuals)
plt.title('Residuals vs. Predicted Values')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()

# Plot residuals distribution
plt.hist(residuals, bins=7)
plt.title('Distribution of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

# Perform a normality test on residuals
statistic, p_value = stats.normaltest(residuals)
print(f'Normality Test p-value: {p_value}')

# Q-Q Plot
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()

# Plot feature importances
feature_importances = rf_model.feature_importances_
sorted_idx = feature_importances.argsort()[::-1]

plt.bar(range(len(feature_importances)), feature_importances[sorted_idx])
plt.xticks(range(len(feature_importances)), X.columns[sorted_idx], rotation=45)
plt.xlabel('Feature')
plt.ylabel('Feature Importance')
plt.title('Random Forest Regression - Feature Importances')
plt.show()

You should see something like this (a residuals-vs-predictions scatter, the residual histogram, the Q-Q plot, and the feature-importance bar chart):

The residuals are normal (the normality-test p-value is above 0.05, so there is no evidence against normality), which means the regression assumptions hold. The R2 is poor (about 64%), so it looks like you need more features to reduce the prediction error.

The model itself appears correct, but there are further underlying phenomena that have not yet been modeled.
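
To confirm that the 64% figure is not an artifact of one particular train/test split, you can cross-validate the R2 score. A minimal sketch, assuming the same X and y defined above; the pipeline re-fits the scaler inside each fold to avoid leakage:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaler + model in one pipeline so scaling is fit per fold (no leakage)
pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(n_estimators=100, random_state=42))

# 5-fold cross-validated R2 on the full X / y defined above
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f'R2 per fold: {scores}')
print(f'Mean R2: {scores.mean():.3f} +/- {scores.std():.3f}')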
