I have sample house-price data and this simple code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
df = pd.read_csv('house_price_4.csv')  # read_csv already returns a DataFrame
df['Area'] = pd.to_numeric(df['Area'].str.replace(',', ''), errors='coerce')  # strip thousands separators and convert to numeric
df = df.dropna()
# Encoding the categorical feature 'Address'
df['Address'] = df['Address'].astype('category').cat.codes
df['Parking'] = df['Parking'].astype(int)
df['Warehouse'] = df['Warehouse'].astype(int)
df['Elevator'] = df['Elevator'].astype(int)
X = df.drop(columns=['Price(USD)','Price'])
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r_squared = r2_score(y_test, y_pred)
print(f'R^2 Score: {r_squared:.4f}')
My R² score is very low: 0.34.
How can I get a higher R² score?
Here is my sample data: https://drive.google.com/file/d/14Se90XbGJivftq3_VrtgRSalkCplduVX/view?usp=sharing
Besides linear regression, you can try other models to test whether the data can be modeled at all. By the way, R² is not the biggest problem with using linear regression here. Use my answer to study the residual behavior in both cases, because the residuals under the linear-regression assumption clearly hint at heteroscedasticity. See the comparison here:
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# assumes XTrain, XTest, yTrain, yTest from a train/test split as in the question
fig, axs = plt.subplots(nrows=1, ncols=2)  # define subplots
###################################################################################
lrModel = LinearRegression()  # linear regression
lrModel.fit(XTrain, yTrain)  # fit
lryPred = lrModel.predict(XTest)  # predict on the test set
lrRMSE = mean_squared_error(yTest, lryPred) ** 0.5  # RMSE (squared=False was removed in sklearn 1.6)
lrR2 = r2_score(yTest, lryPred)  # R²
axs[0].scatter(lryPred, yTest)  # predicted vs. actual
axs[0].set_title(f"Linear Regression\nR² = {lrR2:.2f}; RMSE = {lrRMSE:.0f}")
###################################################################################
dtModel = DecisionTreeRegressor(random_state=42)  # decision tree
dtModel.fit(XTrain, yTrain)  # fit
dtyPred = dtModel.predict(XTest)  # predict on the test set
dtRMSE = mean_squared_error(yTest, dtyPred) ** 0.5  # RMSE
dtR2 = r2_score(yTest, dtyPred)  # R²
axs[1].scatter(dtyPred, yTest)  # predicted vs. actual
axs[1].set_title(f"Decision Tree Regressor\nR² = {dtR2:.2f}; RMSE = {dtRMSE:.0f}")
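The comparison above plots predicted against actual values; a dedicated residual plot makes the heteroscedasticity claim easier to check. Below is a minimal sketch on synthetic data (the asker's CSV isn't reproduced here), where the noise deliberately grows with the feature so the plot shows the fan shape typical of price data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view the plot interactively
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(50, 300, size=500).reshape(-1, 1)  # synthetic "area" feature
noise = rng.normal(0, X.ravel() * 50)              # noise scale grows with X
y = 2000 * X.ravel() + noise                       # heteroscedastic target

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals, s=8)
plt.axhline(0, color="red", lw=1)
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.title("Residuals vs. fitted: a widening fan suggests heteroscedasticity")
plt.savefig("residuals.png")
```

If the spread of the residuals widens as the fitted values grow, the constant-variance assumption of ordinary least squares is violated; a log transform of the target or a tree-based model are common responses.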
The result of the comparison looks like this:
Choosing linear regression was wrong from the start; the predictions even go negative. Use a decision tree or a random forest instead; they should give a similarly good fit.
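As a sketch of the random-forest suggestion, here is a self-contained example on synthetic nonlinear data (the features and target are illustrative, not from the question's CSV):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))                    # three synthetic features
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(0, 1, 1000)   # nonlinear target + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
r2 = r2_score(y_test, rf.predict(X_test))
print(f"R^2: {r2:.3f}")
```

On the question's data, replace the synthetic `X`/`y` with the preprocessed feature matrix and the `Price` column from the cleaned DataFrame.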