Here is the head of my data:
       logger        year  avg_max_temp  avg_min_temp  tot_precipitation   yield
   0  072.txt  1985-01-01         15.37          4.33              77.43  225447
   1  187.txt  1985-01-01         19.24          7.88             146.40  225447
   2  338.txt  1985-01-01         14.43          2.97              95.16  225447
   3  280.txt  1985-01-01         16.98          6.51             114.02  225447
   4  436.txt  1985-01-01         17.13          6.78             124.63  225447
 ...      ...         ...           ...           ...                ...     ...
4786  552.txt  2014-01-01         13.60          3.29              88.02  361091
4787  769.txt  2014-01-01         15.17          2.11              89.00  361091
4788  822.txt  2014-01-01         13.49          2.37              82.22  361091
4789  830.txt  2014-01-01         17.09          4.31              84.66  361091
4790  312.txt  2014-01-01         14.70          2.88              99.43  361091
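In the rows above, the yield value is identical for every station within a year. To make that structure concrete, here is a minimal sketch that uses only the ten rows printed above (typed in by hand, not the full dataset). It counts the distinct yield values per year and collapses the station rows to one row per year. The names `sample`, `yields_per_year` and `yearly` are illustrative, and averaging across stations is just an assumption for the sake of the example, not something my actual script does.

```python
import pandas as pd

# The ten rows printed above, typed in by hand (not the full dataset).
sample = pd.DataFrame(
    {
        "logger": ["072.txt", "187.txt", "338.txt", "280.txt", "436.txt",
                   "552.txt", "769.txt", "822.txt", "830.txt", "312.txt"],
        "year": ["1985-01-01"] * 5 + ["2014-01-01"] * 5,
        "avg_max_temp": [15.37, 19.24, 14.43, 16.98, 17.13,
                         13.60, 15.17, 13.49, 17.09, 14.70],
        "avg_min_temp": [4.33, 7.88, 2.97, 6.51, 6.78,
                         3.29, 2.11, 2.37, 4.31, 2.88],
        "tot_precipitation": [77.43, 146.40, 95.16, 114.02, 124.63,
                              88.02, 89.00, 82.22, 84.66, 99.43],
        "yield": [225447] * 5 + [361091] * 5,
    }
)

# Number of distinct yield values within each year.
yields_per_year = sample.groupby("year")["yield"].nunique()

# One row per year: station readings averaged (an assumption made for
# illustration), yield taken from the first station of that year.
yearly = sample.groupby("year").agg(
    {
        "avg_max_temp": "mean",
        "avg_min_temp": "mean",
        "tot_precipitation": "mean",
        "yield": "first",
    }
)

print(yields_per_year)
print(yearly)
```

On this sample every year has exactly one distinct yield, so the per-year table has as many rows as there are distinct target values.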
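My PI has just asked me to model the relationship between the target (yield) and the three numeric features. Note that there is only one yield value per year, but each year has weather observations from 167 weather stations. I treated this as a time series analysis and did the following: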
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# df is the DataFrame shown above, already loaded
# Parse the year column as datetimes and use it as the index
df['year'] = pd.to_datetime(df['year'], format='%Y')
df = df.set_index('year')

# Set aside 2006 onward as the testing section
train = df.loc[df.index < '2006-01-01']
test = df.loc[df.index >= '2006-01-01']

# Create training and testing features
features = ['avg_max_temp', 'avg_min_temp', 'tot_precipitation']
target = 'yield'
X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

# Create the model, fit it on the training years, and score it on the held-out years
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
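Unfortunately, this gives a score of -10.55. I believe the score for Random Forest in scikit-learn is R-squared, so something must be wrong here. Any ideas about what went wrong would be greatly appreciated. Thanks in advance.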
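For reference, R-squared is defined as 1 - SS_res / SS_tot, so on held-out data it has no lower bound: it drops below zero whenever the predictions are worse than simply predicting the mean of the test targets. Here is a minimal pure-Python sketch of that definition. The helper `r2` and all the numbers are made up for illustration and are not taken from my data or from scikit-learn's source.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets, for illustration only.
y_true = [10.0, 12.0, 14.0]

perfect = r2(y_true, [10.0, 12.0, 14.0])    # predictions match exactly
mean_only = r2(y_true, [12.0, 12.0, 12.0])  # always predict the mean of y_true
too_low = r2(y_true, [2.0, 2.0, 2.0])       # predictions far below every target

print(perfect, mean_only, too_low)  # 1.0 0.0 -37.5
```

So a negative score is numerically possible; it says the test predictions are further from the true yields than a constant guess at their mean would be.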