I have a DataFrame in Python Pandas like the one below:
Input data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'id' : [999, 999, 999, 185, 185, 185, 44, 44, 44],
'target' : [1, 1, 1, 0, 0, 0, 1, 1, 1],
'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
'event1': [1, 6, 11, 16, np.nan, 22, 74, 109, 52],
'event2': [2, 7, np.nan, 17, 22, np.nan, np.nan, 10, 5],
'event3': [3, 8, 13, 18, 23, np.nan, 2, np.nan, 99],
'event4': [4, 9, np.nan, np.nan, np.nan, 11, 8, np.nan, np.nan],
'event5': [5, np.nan, 15, 20, 25, 1, 1, 3, np.nan]
})
# Fill missing values with zeros
df = df.fillna(0)
df
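Requirements:
My real dataset has much more data, of course, but I need to split it into 2 separate datasets (train and test) according to the following requirements:
Example of the desired result (in the real data, of course, the ratio of unique IDs should be 70%/30%):
Train dataset: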
df = pd.DataFrame({
'id' : [999, 999, 999, 185, 185, 185],
'target' : [1, 1, 1, 0, 0, 0],
'event_date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
'event1': [1, 6, 11, 16, np.nan, 22],
'event2': [2, 7, np.nan, 17, 22, np.nan],
'event3': [3, 8, 13, 18, 23, np.nan],
'event4': [4, 9, np.nan, np.nan, np.nan, 11],
'event5': [5, np.nan, 15, 20, 25, 1]
})
df = df.fillna(0)
df
Test dataset:
df = pd.DataFrame({
'id' : [44, 44, 44],
'target' : [1, 1, 1],
'event_date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'event1': [74, 109, 52],
'event2': [ np.nan, 10, 5],
'event3': [2, np.nan, 99],
'event4': [8, np.nan, np.nan],
'event5': [1, 3, np.nan]
})
# Fill missing values with zeros
df = df.fillna(0)
df
Drop the duplicate ids, take a random sample of the unique ids, then use that sample to split the data:
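# Pick 70% of the unique ids at random
keep = df['id'].drop_duplicates().sample(frac=0.7)
# Boolean mask: True for rows whose id was picked
m = df['id'].isin(keep)
# Rows of picked ids go to train, all remaining rows go to test
train = df[m]
test = df[~m]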
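If the split needs to be repeatable, the same idea can be wrapped in a small helper. This is only a sketch: the function name split_by_id, its parameters, and the toy frame demo are made up for illustration and are not part of the question. It relies on the random_state argument of sample to fix the draw.

```python
import pandas as pd


def split_by_id(df, id_col="id", train_frac=0.7, seed=42):
    """Split df so that every id lands entirely in train or entirely in test.

    Hypothetical helper: the name and parameters are assumptions for this sketch.
    """
    # One entry per distinct id
    unique_ids = df[id_col].drop_duplicates()
    # Draw the train ids; random_state makes the draw repeatable
    train_ids = unique_ids.sample(frac=train_frac, random_state=seed)
    # True for rows whose id was drawn for train
    in_train = df[id_col].isin(train_ids)
    return df[in_train], df[~in_train]


# Toy data (illustrative only): 10 ids with 3 rows each
demo = pd.DataFrame({
    "id": [i for i in range(10) for _ in range(3)],
    "value": range(30),
})
train, test = split_by_id(demo)

# No id is shared between the two parts
assert set(train["id"]).isdisjoint(test["id"])
print(train["id"].nunique(), test["id"].nunique())
```

Note that the 70%/30% ratio applies to unique ids, not to rows. If ids have different numbers of rows, the row counts of train and test will generally not be in a 70/30 ratio.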