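I am working on a machine learning problem: predicting whether a person will deliver an order. The dataset is highly imbalanced. Here is a glimpse of my dataset: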
[{'order_id': '1bjhtj', 'Delivery Guy': 'John', 'Target': 0},
{'order_id': '1aec', 'Delivery Guy': 'John', 'Target': 0},
{'order_id': '1cgfd', 'Delivery Guy': 'John', 'Target': 0},
{'order_id': '1bceg', 'Delivery Guy': 'Tom', 'Target': 0},
{'order_id': '1a2fg', 'Delivery Guy': 'Tom', 'Target': 0},
{'order_id': '1cbsf', 'Delivery Guy': 'Tom', 'Target': 1},
{'order_id': '1bc5', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1a22', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1bzc5', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1av22', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1bsc5', 'Delivery Guy': 'Jay', 'Target': 1},
{'order_id': '1a2t2', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1bc5b', 'Delivery Guy': 'Jay', 'Target': 0},
{'order_id': '1a22a', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '1c5bv', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': 'vb2er', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '1bs5s', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '1a22n', 'Delivery Guy': 'Mary', 'Target': 0},
{'order_id': '122a', 'Delivery Guy': 'James', 'Target': 1},
{'order_id': '1cw5bv', 'Delivery Guy': 'James', 'Target': 0},
{'order_id': 'vb=er', 'Delivery Guy': 'James', 'Target': 0},
{'order_id': '1b5s', 'Delivery Guy': 'James', 'Target': 0},
{'order_id': '1a2n', 'Delivery Guy': 'James', 'Target': 1}]
Here is the same data as a table:
| order_id | Delivery Guy | Target |
|----------|--------------|--------|
| 1bjhtj | John | 0 |
| 1aec | John | 0 |
| 1cgfd | John | 0 |
| 1bceg | Tom | 0 |
| 1a2fg | Tom | 0 |
| 1cbsf | Tom | 1 |
| 1bc5 | Jay | 0 |
| 1a22 | Jay | 0 |
| 1bzc5 | Jay | 0 |
| 1av22 | Jay | 0 |
| 1bsc5 | Jay | 1 |
| 1a2t2 | Jay | 0 |
| 1bc5b | Jay | 0 |
| 1a22a | Mary | 0 |
| 1c5bv | Mary | 0 |
| vb2er | Mary | 0 |
| 1bs5s | Mary | 0 |
| 1a22n | Mary | 0 |
| 122a | James | 1 |
| 1cw5bv | James | 0 |
| vb=er | James | 0 |
| 1b5s | James | 0 |
| 1a2n | James | 1 |
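I want my machine learning model to learn each person's behavior and predict these two cases: will deliver ("0") and will not deliver ("1").
I want to split my data into training and test sets so that both sets keep a few rows for each name and a few rows for each target class, letting the model learn all the patterns.
So far I have only used this: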
from sklearn.model_selection import train_test_split

X = df.drop(columns="Target")
y = df["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, stratify=y)
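It does give me rows for every delivery guy, but it misses the case where "James" could be split so that one of his "1" rows lands in the training set and the other in the test set. Can anyone help me solve this in a different way?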
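Here is one approach that ensures:
- Every "Delivery Guy" is represented in both the training and test sets.
- Every "Target" class is adequately represented in both sets.

After the rows for each "Delivery Guy" and each "Target" class have been divided between the two sets, the pieces are combined back into the final training and test sets.

Here is how to implement this in Python: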
import pandas as pd
from sklearn.model_selection import train_test_split

# Collect the per-group pieces here and combine them once at the end
train_parts, test_parts = [], []

# Split within every (Delivery Guy, Target) group so both sets see each combination
for (name, target), group in df.groupby(['Delivery Guy', 'Target']):
    if len(group) < 2:
        # A single row cannot be split; keep it in the training set
        train_parts.append(group)
        continue
    # Here we decide the split size; adjust test_size to suit your dataset
    group_train, group_test = train_test_split(group, test_size=0.5, random_state=42)
    train_parts.append(group_train)
    test_parts.append(group_test)

# Combine the pieces with pd.concat (DataFrame.append no longer exists in pandas 2.0)
train = pd.concat(train_parts)
test = pd.concat(test_parts)

# Now 'train' and 'test' should have a more balanced representation