How to use my own dataset in sklearn - Python3 [closed]

Question

I am trying to do a linear regression, but I want to use my own data from some .txt files. The data is a table with 3 columns.

So I would like to know how to change the following code, which is the example from http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

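I made some changes to the code from that example and invented some data. Is this the right way to do it, using some made-up values like these? I would also like to know what effect [:2] has on my program in the statement x_train = x [:2]. I really did not understand that part.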

from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

#X has to be numpy array not list.

x=([0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10])
y=[5,3,8,3,4,5,5,7,8,9,10]

x_train = x [:2]
x_test = x [2:]

y_train = y[:2]
y_test = y[2:]

regr = linear_model.LinearRegression()
regr.fit (x_train,y_train)

y_pred = regr.predict(x_test)

#coefficient
print('Coefficients: \n', regr.coef_)

#the mean square error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))

plt.scatter(x_test, y_test,  color='black')
plt.plot(x_test, y_pred, color='blue', linewidth=3)
plt.axis([0, 20, 0, 20])
plt.show()

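Edit 1

With the help I received on this page, I tried to write some code to produce a fit of my own data, but I cannot get a correct fit. I would appreciate it if someone has time to help me further or to tell me whether I am doing something wrong.

This is the code I used to produce the picture I got: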

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

data = pd.read_csv('data.txt')
#x = data[['col1','col2']]
x = data[['col1']]
y = data['col3']

#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)

# define the KFolds 
kf = KFold(n_splits=2)

#define the model
regr = linear_model.LinearRegression()

# use cross validation and return the r2 score for each Fold 
#if you want to return other scores than r2, just change the scoring in cross_val_score
scores = cross_val_score(regr, x, y, cv= kf, scoring= 'r2')

print(scores)

for train_index, test_index in kf.split(x):
  print("TRAIN:", train_index, "TEST:", test_index)
  X_train, X_test = x[train_index], x[test_index]
  y_train, y_test = y[train_index], y[test_index]


plt.scatter (X_test, y_test)
plt.show()

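Here is a picture showing what my data looks like and what I get from the training and test sets.

Then I wrote some code to do the fit, but I am not sure whether it is correct: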

regr.fit (X_train, y_train)
y_pred = regr.predict(X_test)
print(y_pred)
plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.show()

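The result looks completely strange to me.

I do not understand why I get this, given that my fit works when I do it with MINUIT. Any hints would help.

Why does the program apparently not use my data in "y" for the training or test samples?

My data is available here: https://www.dropbox.com/sh/nbbsc0fqznkwxvt/AAD-u6lM4orJOGrgIyz0o8B9a?dl=0

The only columns that matter to me are col1 and col3; col2 should be ignored. I want to fit this data and extract the fitted values. I know that a line fits this data.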

Answer 1

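First, the main reason to split the data, training the model on one part and evaluating it on the other, is to avoid overfitting: a score computed on the training data itself says little about how the model behaves on new data. Usually we use KFold or LOO (leave-one-out) for cross-validation.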
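On the [:2] part of the question: that is ordinary Python slicing, so in the question's code x_train holds only the first two samples and x_test the remaining nine. Here is a minimal sketch that reuses the invented data from the question. The train_test_split call, with its test_size and random_state values, is an assumption added for comparison and is not part of the original code.

```python
from sklearn.model_selection import train_test_split

# Invented data from the question: 11 samples with one feature each.
x = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
y = [5, 3, 8, 3, 4, 5, 5, 7, 8, 9, 10]

# Slicing: [:2] keeps indices 0 and 1, [2:] keeps everything from index 2 on.
x_train, x_test = x[:2], x[2:]
y_train, y_test = y[:2], y[2:]
print(len(x_train), len(x_test))

# Assumed alternative: a shuffled split that holds out about 25% of the samples.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)
print(len(x_tr), len(x_te))
```

Training on only two points means the line is determined entirely by those two samples, which is probably not what you want.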

Here is an example with 30 samples and 3 variables, cross-validated with KFold.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import linear_model

#create artificial data with 30 lines (samples) and 3 columns (variables)
x = np.random.rand(30,3)

#create the target variable y
y = range(30)

# convert the list to numpy array (this is needed for fit method of sklearn)
y = np.asarray(y)

# define the KFolds (3 folds in this example)
kf = KFold(n_splits=3)

#define the model
regr = linear_model.LinearRegression()

# use cross validation and return the r2 score for each Fold (here we have 3). 
#if you want to return other scores than r2, just change the scoring in cross_val_score.
scores = cross_val_score(regr, x, y, cv= kf, scoring= 'r2')

print(scores)

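Result:

Here you can see the model's r2 score for each fold. We split the data 3 times and used 3 different training sets to obtain these values. sklearn does this automatically inside cross_val_score. The scores are negative because x is random and unrelated to y, and because y is sorted, so each unshuffled test fold covers a range of y that differs from the training folds.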

 array([-30.36184326,  -0.4149778 , -28.89110233])
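What cross_val_score does for each fold can also be written out by hand. The loop below is a sketch using the same setup as the example above (random x, y = 0..29, 3 unshuffled folds). The seed passed to np.random.default_rng is an assumption added only so the run is reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
x = rng.random((30, 3))
y = np.arange(30)

kf = KFold(n_splits=3)

# Manual version: fit on the training part of each fold, score on the test part.
manual_scores = []
for train_index, test_index in kf.split(x):
    regr = LinearRegression()
    regr.fit(x[train_index], y[train_index])
    y_pred = regr.predict(x[test_index])
    manual_scores.append(r2_score(y[test_index], y_pred))

# Automatic version: the same folds and the same metric in one call.
auto_scores = cross_val_score(LinearRegression(), x, y, cv=kf, scoring="r2")

print(manual_scores)
print(auto_scores)
```

Both print statements should show the same three numbers, one per fold.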

To see what KFold does, you can print the training and test indices like this:

for train_index, test_index in kf.split(x):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = x[train_index], x[test_index]
   y_train, y_test = y[train_index], y[test_index]

Result:

('TRAIN:', array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
   27, 28, 29]), 'TEST:', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 20, 21, 22, 23, 24, 25, 26,
   27, 28, 29]), 'TEST:', array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]))
('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19]), 'TEST:', array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29]))

Now you can see that for the first fold we trained on these samples:

10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29.

Next, for the second fold we trained on these samples:

0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29.

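Note: these numbers are indices into the x data. For example, 2 refers to the 3rd sample (row), because Python counts from 0. As you can see, we do not use exactly the same data (samples) in each fold.

Hope this helps.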
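The note on indices can be checked directly: across the folds, every index lands in exactly one test set. The sketch below does that check. It also shows shuffle=True, which is one way to avoid folds that each cover a different range of a sorted y; both shuffle=True and random_state=0 are assumptions added here, not part of the example above.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.zeros((30, 3))  # only the number of rows matters for the split

test_indices = []
for train_index, test_index in KFold(n_splits=3).split(x):
    # Within one fold, no sample is in both the training and the test part.
    assert set(train_index).isdisjoint(test_index)
    test_indices.extend(test_index)

# Collected over all folds, the test parts cover each sample once.
covered = sorted(int(i) for i in test_indices)
print(covered == list(range(30)))

# Assumed variant: shuffle the indices before splitting them into folds.
shuffled = KFold(n_splits=3, shuffle=True, random_state=0)
first_test_fold = next(iter(shuffled.split(x)))[1]
print(first_test_fold)
```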

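Edit 1

To answer your question about loading txt data: suppose you have a txt file with 3 columns. The first 2 columns are the features and the last column is y (the target).

In that case you can do the following with pandas: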

import pandas as pd
import numpy as np

data = pd.read_csv('data.txt')
x = data[['col1','col2']]
y = data['col3']

#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)

The txt file is here: https://ufile.io/eb5xl (choose the slow download).
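The read_csv call above expects a comma-separated file whose first row holds the column names. If your file is instead whitespace-separated with no header row (an assumption, since the layout of your file is not shown here), something like the sketch below may work. The inline text stands in for data.txt and its values are made up.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for a whitespace-separated data.txt with no header row (assumed layout).
text = """0.0 1.5 2.0
1.0 2.5 4.1
2.0 3.5 5.9
"""

data = pd.read_csv(
    io.StringIO(text),               # replace with 'data.txt' for a real file
    sep=r"\s+",                      # one or more spaces or tabs between columns
    header=None,                     # the file has no row of column names
    names=["col1", "col2", "col3"],  # so the names are supplied here
)

x = np.asarray(data[["col1"]])  # double brackets keep x two-dimensional
y = np.asarray(data["col3"])    # single brackets give a one-dimensional target
print(x.shape, y.shape)
```

It is worth printing data.head() after loading to confirm that the columns were parsed as numbers and ended up under the names you expect.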

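Edit 2

This is for visualization purposes only. I do not split the data. I fit the model on all the data, predict on the same data, and then plot the predicted values.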

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt

data = pd.read_csv('data.txt')

x = data[['col1']]
y = data['col3']

#convert to array to fit the model
x=np.asarray(x)
y=np.asarray(y)

regr = linear_model.LinearRegression()
regr.fit(x, y)

y_predicted = regr.predict(x)

plt.scatter(x, y,  color='black')
plt.plot(x, y_predicted, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

Result:

The data does not seem to follow a linear pattern. A different model should be used (e.g. an exponential fit).
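One possible way to do an exponential fit while still using LinearRegression is to fit a line to log(y). This is only a sketch on synthetic data: the values of a_true and b_true are made up, it assumes every y is positive, and whether an exponential actually describes your data is something you would have to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = a * exp(b * x); a_true and b_true are made-up values.
a_true, b_true = 2.0, 0.5
x = np.linspace(0, 5, 50).reshape(-1, 1)
y = a_true * np.exp(b_true * x.ravel())

# Taking logs turns the model into a line: log(y) = log(a) + b * x.
regr = LinearRegression()
regr.fit(x, np.log(y))

# Extract the fitted parameters and map the intercept back to a.
b_fit = regr.coef_[0]
a_fit = np.exp(regr.intercept_)

# Fitted curve on the original scale, ready to plot against x.
y_fit = a_fit * np.exp(b_fit * x.ravel())
print(a_fit, b_fit)
```

The same regr.coef_ and regr.intercept_ attributes give you the slope and intercept of a plain linear fit, which covers the part of the question about extracting the fitted values.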

Answer 2

Convert your text file to CSV, i.e. a text file with the following structure:

col1name;col2name;col3name
1;0;2
4;6;8
0;1;3

Then load the dataset into a pandas.DataFrame, using all columns except the last as the X dataset and the last column as the Y dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

file_path = "path/to/your/file.csv"

data = pd.read_csv(filepath_or_buffer = file_path, sep = ";")

X_train, X_test, y_train, y_test = train_test_split(data.loc[:, data.columns != "name_of_your_last_column"], data.loc[:, data.columns == "name_of_your_last_column"], test_size = 0.25)

classifier = LinearRegression()
classifier.fit(X_train, y_train)

print("Train set score :", classifier.score(X_train, y_train))
print("Test set score :", classifier.score(X_test, y_test))

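Note that this is a minimal example. You should use a cross-validation procedure to guard against overfitting and a grid search procedure to guard against underfitting.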
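As a rough sketch of what cross-validation combined with a grid search can look like: plain LinearRegression has no regularization strength to tune, so Ridge is swapped in here purely for illustration. The synthetic data, the alpha grid and the seeds are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data: a linear target with a little noise (made-up coefficients).
rng = np.random.default_rng(0)
X = rng.random((60, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.05, size=60)

# Candidate regularization strengths to try (assumed grid).
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored with 5-fold cross-validation on shuffled folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(Ridge(), param_grid, cv=cv, scoring="r2")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

After fitting, search.best_estimator_ is the model refitted on all the data with the best alpha, and it can be used with predict like any other estimator.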
