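I recently started using Python for machine learning. Below are the dataset I picked as an example and the code I have written so far. I use [2000...2015] as training data and [2016, 2017] as test data.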
Dataset
Years Values
0 2000 23.0
1 2001 27.5
2 2002 46.0
3 2003 56.0
4 2004 64.8
5 2005 71.2
6 2006 80.2
7 2007 98.0
8 2008 113.0
9 2009 155.8
10 2010 414.0
11 2011 2297.8
12 2012 3628.4
13 2013 16187.8
14 2014 25197.8
15 2015 42987.8
16 2016 77555.5
17 2017 130631.9
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression  # needed for the linear model fitted below
df = pd.DataFrame([[i for i in range(2000,2018)],
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])
df = df.T
df.columns = ['Years', 'Values']
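The code above creates the DataFrame. One more important thing to keep in mind is that my Years column is a time series, not just a continuous value. I have not made any changes to accommodate this.
I want to fit a non-linear model that might work better, and plot the result as I did for my linear model example. Below is what I have tried with a linear model. Note that this example also treats the Years column as a plain continuous value rather than as a time series.
Once we have a model, I would like to use it to predict the values for at least the next few years.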
X = df.iloc[:, :-1].values
y = df.iloc[:, 1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
lm = LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, lm.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')
plt.ylabel('Values')
plt.show()
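Since Values grows by several orders of magnitude over the period, one simple non-linear option is to fit a straight line to log(Values) and exponentiate the result. The following is only a minimal sketch under the assumption that growth is roughly exponential; it is not part of the original post, and the names `years`, `values`, `future_years` and the five-year horizon are illustrative.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Shift the years so the fit works with small numbers (0..17).
t = years - years[0]

# Fit log(values) = a + b * t, i.e. values ~ exp(a) * exp(b * t).
# polyfit returns coefficients from lowest to highest degree.
a, b = poly.polyfit(t, np.log(values), 1)

# Extrapolate five years past the last observation (assumed horizon).
future_years = np.arange(2018, 2023, dtype=float)
log_forecast = poly.polyval(future_years - years[0], [a, b])
forecast = np.exp(log_forecast)

for yr, val in zip(future_years.astype(int), forecast):
    print(yr, round(val, 1))
```

Fitting in log space weights relative errors rather than absolute ones, so the small early values influence the fit as much as the large recent ones. Whether that is desirable depends on what the forecast is for.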
Try this. It also prints the predicted values, forecasting 5 years ahead.
import numpy.polynomial.polynomial as poly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame([[i for i in range(2000,2018)],
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])
df = df.T
df.columns = ['Year', 'Values']
df['Year'] = df['Year'].astype(int)
df['Values'] = df['Values'].astype(float)  # keep the decimals; astype(int) would truncate 27.5 to 27
no_of_predictions = 5
X = np.array(df.Year, dtype = float)
y = np.array(df.Values, dtype = float)
Z = np.arange(int(X[-1]) + 1, int(X[-1]) + 1 + no_of_predictions)  # the next 5 years: 2018..2022
coefs = poly.polyfit(X, y, 4)
X_new = np.linspace(X[0], X[-1]+no_of_predictions, num=len(X)+no_of_predictions)
ffit = poly.polyval(X_new, coefs)
pred = poly.polyval(Z, coefs)
predictions = pd.DataFrame({'Year': Z, 'Predicted': pred})  # one row per forecast year
print(predictions)
plt.plot(X, y, 'ro', label="Original data")
plt.plot(X_new, ffit, label = "Fitted data")
plt.legend(loc='upper left')
plt.show()
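A degree-4 polynomial can follow the observed points closely and still extrapolate poorly, so it may be worth checking it against the held-out years before trusting a forecast. The sketch below is an illustration rather than part of the answer: it refits on 2000 to 2015 and compares the predictions for 2016 and 2017 with the known values. The names `train_x`, `test_x`, `origin` and `abs_error` are made up for this example.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Hold out the last two years (2016, 2017), keeping the time order.
train_x, test_x = years[:-2], years[-2:]
train_y, test_y = values[:-2], values[-2:]

# Shift the years by the first one; powers of ~2000 are badly conditioned.
origin = train_x[0]
coefs = poly.polyfit(train_x - origin, train_y, 4)

# Predict the held-out years and compare with the observed values.
test_pred = poly.polyval(test_x - origin, coefs)
abs_error = np.abs(test_pred - test_y)

for yr, actual, pred, err in zip(test_x.astype(int), test_y, test_pred, abs_error):
    print(yr, "actual:", actual, "predicted:", round(pred, 1), "abs error:", round(err, 1))
```

If the error on the held-out years is large relative to the values themselves, forecasts further out are likely to be even less reliable.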
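Edit: my answer is wrong. I am used to classifiers rather than regressors. I am not deleting it because I am afraid of being banned from posting more answers. Do not use this answer.
Try this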
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
df = pd.DataFrame([[i for i in range(2000,2018)],
[23.0,27.5,46.0,56.0,64.8,71.2,80.2,98.0,113.0,155.8,414.0,2297.8,3628.4,16187.8,25197.8,42987.8,77555.5,130631.9]])
df = df.T
df.columns = ['Year', 'Values']
df['Year'] = df['Year'].astype(int)
df['Values'] = df['Values'].astype(int)
That is your DataFrame.
X = df[['Year']]
y = df[['Values']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0, shuffle = False)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, clf.predict(X_train), color = 'blue')
plt.title('Years vs Values (training set)')
plt.xlabel('Years')
plt.xticks(rotation=90)
plt.ylabel('Values')
plt.show()
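The edit note on this answer attributes the problem to using a classifier for a continuous target. For comparison, here is a minimal sketch of the regressor variant using scikit-learn's `RandomForestRegressor`. It is an illustration, not the answerer's code, and the names `reg`, `future_X` and `future_pred` are chosen for this example. It also shows a limitation relevant to the question: a tree ensemble averages training targets, so it cannot forecast above the largest value it was trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

years = np.arange(2000, 2018)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

X = years.reshape(-1, 1)   # scikit-learn expects a 2-D feature matrix
y = values                 # 1-D continuous target

# Regressor, not classifier: the target is a continuous quantity.
reg = RandomForestRegressor(n_estimators=10, random_state=0)
reg.fit(X, y)

# Every year after 2017 falls into the same leaf of each tree, so the
# forecast is flat and never exceeds the largest training value.
future_X = np.arange(2018, 2023).reshape(-1, 1)
future_pred = reg.predict(future_X)

print(future_pred)
print("max forecast <= max training value:", future_pred.max() <= y.max())
```

For a series with a strong upward trend like this one, that flat forecast makes a plain random forest on the raw year a poor choice for extrapolation, even with the right estimator type.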
I have also tried the following:
import numpy.polynomial.polynomial as poly
X = np.array(df.Years, dtype = float)
y = np.array(df.Values, dtype = float)
coefs = poly.polyfit(X, y, 4)
X_new = np.linspace(X[0], X[-1], num=17)
ffit = poly.polyval(X_new, coefs)
plt.plot(X, y, 'ro', label="Original data")
plt.plot(X_new, ffit, label = "Fitted data")
plt.legend(loc='upper left')
plt.show()
It does fit the data almost perfectly. But now I am not sure how to use this fit to predict the values for the next five years.
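One way to reuse a polynomial fit for forecasting is to evaluate the fitted coefficients at the future years with `poly.polyval`. The sketch below assumes the same degree-4 fit as in the snippet above; the names `future_years` and `future_values` are illustrative, and the years are shifted by the first year only to keep the numbers small.

```python
import numpy as np
import numpy.polynomial.polynomial as poly

years = np.arange(2000, 2018, dtype=float)
values = np.array([23.0, 27.5, 46.0, 56.0, 64.8, 71.2, 80.2, 98.0, 113.0,
                   155.8, 414.0, 2297.8, 3628.4, 16187.8, 25197.8, 42987.8,
                   77555.5, 130631.9])

# Degree-4 fit on shifted years (0..17) for better numerical behaviour.
origin = years[0]
coefs = poly.polyfit(years - origin, values, 4)

# Evaluate the fitted polynomial at the five years after the last one.
future_years = np.arange(years[-1] + 1, years[-1] + 6)
future_values = poly.polyval(future_years - origin, coefs)

for yr, val in zip(future_years.astype(int), future_values):
    print(yr, round(val, 1))
```

A high-degree polynomial is unconstrained outside the fitted range, so these numbers are best read as a continuation of the curve rather than as a reliable forecast.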