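I am working with a healthcare dataset from Kaggle (https://www.kaggle.com/joniarroba/noshowappointments) that contains information about medical appointments in Brazil and whether the patient showed up. The dataset has columns for the appointment ID, patient ID, appointment date and time, scheduled date and time, and several other features.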
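I want to add a column to the DataFrame that shows, for each appointment, the no-show rate based on the patient's previous appointments. For example, if a patient has had three appointments and showed up for two of them, the no-show rate for their fourth appointment would be 1/3. If it is the patient's first appointment, the no-show rate would be 0.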
This is what I have tried so far:
# convert appointment and scheduled dates to datetime format
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
# create a new column with the time difference between scheduled and
# appointment date
df['time_diff'] = (df['AppointmentDay'].dt.date
                   - df['ScheduledDay'].dt.date).dt.days
# group by PatientId and calculate no-show count and appointment count
# for each group
grouped = df.groupby('PatientId')['No-show'].apply(
    lambda x: x.eq('Yes').cumsum().shift().fillna(0))
df['no_show_count'] = grouped
df['appointment_count'] = grouped + df.groupby('PatientId').cumcount()
# calculate no-show rate for each patient
df['no_show_rate'] = df['no_show_count'] / df['appointment_count']
# replace NaN values in 'no_show_rate' column with 0
df['no_show_rate'] = df['no_show_rate'].fillna(0)
# print first 5 rows
print(df.head())
The problem with this code is that the calculated rate also counts the current appointment. For example, if you run
df[df['PatientId'] == 112397157856688.0].sort_values('AppointmentDay')
you will see more clearly what I mean.
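To make the expected result concrete, here is a minimal sketch of the calculation on a small made-up frame. The toy data and the helper name `add_prior_no_show_rate` are hypothetical and only for illustration; the column names `PatientId`, `AppointmentDay`, and `No-show` are the ones used above. It assumes "previous" means earlier by `AppointmentDay` within the same patient, and it leaves open how appointments that share the same `AppointmentDay` should be ordered.

```python
import pandas as pd


def add_prior_no_show_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: no-show rate from a patient's earlier appointments only."""
    # Sort so that "previous" means earlier in time within each patient.
    out = df.sort_values(['PatientId', 'AppointmentDay']).copy()

    # 1 if the patient missed this appointment, 0 otherwise.
    missed = out['No-show'].eq('Yes').astype(int)

    # Number of earlier appointments for the same patient (0 for the first one).
    prior_count = out.groupby('PatientId').cumcount()

    # Running total of no-shows including this row, minus this row itself,
    # which leaves only the no-shows that happened before it.
    prior_missed = missed.groupby(out['PatientId']).cumsum() - missed

    # A first appointment gives 0 / 0 = NaN, which is replaced by 0.
    out['no_show_rate'] = (prior_missed / prior_count).fillna(0)

    # Restore the original row order.
    return out.sort_index()


# Made-up data: patient 1 has four appointments and misses the second one.
toy = pd.DataFrame({
    'PatientId': [1, 1, 1, 1, 2],
    'AppointmentDay': pd.to_datetime(
        ['2016-05-01', '2016-05-02', '2016-05-03', '2016-05-04', '2016-05-01']),
    'No-show': ['No', 'Yes', 'No', 'No', 'Yes'],
})

result = add_prior_no_show_rate(toy)
print(result[['PatientId', 'AppointmentDay', 'No-show', 'no_show_rate']])
```

For patient 1, the fourth appointment gets a rate of 1/3 (one no-show out of three earlier appointments), and every first appointment gets 0, which matches the definition described above. The denominator here is just the count of earlier appointments, so the current row never enters either the numerator or the denominator.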