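I am working with a healthcare dataset from Kaggle (https://www.kaggle.com/joniarroba/noshowappointments) that contains information about medical appointments in Brazil and whether the patient showed up. The dataset has columns for appointment ID, patient ID, appointment date and time, scheduled date and time, and several other features.
I want to compute a no-show rate for each appointment based on the patient's previous appointments. For example, if a patient has had three appointments and showed up for two of them, the no-show rate for their fourth appointment would be 1/3. If it is the patient's first appointment, the no-show rate would be 0.
I tried the following code, but it computes the current show/no-show rate rather than the no-show rate based on previous appointments: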
# convert appointment and scheduled dates to datetime format
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
# sort the DataFrame by PatientId and AppointmentDay
df = df.sort_values(['PatientId', 'AppointmentDay'])
# create a new column with the time difference between scheduled and appointment date
df['time_diff'] = (df['AppointmentDay'].dt.date - df['ScheduledDay'].dt.date).dt.days
# create a new column to store the no-show rate for each appointment
df['no_show_rate'] = 0
# loop through each row of the DataFrame
for index, row in df.iterrows():
    # get the PatientId and AppointmentDay for the current row
    patient_id = row['PatientId']
    appointment_day = row['AppointmentDay']
    # select all previous appointments for the current patient
    previous_appointments = df.loc[(df['PatientId'] == patient_id) & (df['AppointmentDay'] < appointment_day)]
    # calculate the no-show rate based on the previous appointments
    appointment_count = len(previous_appointments)
    no_show_count = len(previous_appointments.loc[previous_appointments['No-show'] == 'Yes'])
    if appointment_count > 0:
        no_show_rate = no_show_count / appointment_count
    else:
        no_show_rate = 0
    # update the 'no_show_rate' column for the current row
    df.at[index, 'no_show_rate'] = no_show_rate
# print first 5 rows
print(df.head())
Try this. I think a column-wise pandas approach is simpler. To get the no-show rate based on previous appointments, we use pandas.DataFrame.shift on a groupby of the DataFrame by PatientId to get lagged values:
import pandas as pd
#import data
dat = ...
#sort
dat = dat.sort_values(['PatientId', 'AppointmentDay'])
#create cumcount for total appointments for each patient. cumcount numbers
#the rows within each group starting at 0, so if we want the first appointment
#to show 1 instead of 0, second 2 instead of 1... then we need to add 1
dat['appt_ncum'] = dat.groupby(['PatientId']).cumcount() + 1
#for this purpose, it's better to recode 'No-show' to Boolean (1/0)
#in a temp column and do cumsum on that
dat['cond'] = dat['No-show'] == 'Yes'
dat['noshow_ncum'] = dat.groupby('PatientId').cond.cumsum()
dat = dat.drop(['cond'], axis = 1)
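#Calculate no-show proportion based on Patient's previous appointment(s)
#using DataFrame.shift as described above to lag numerator and denominator.
#Note: a patient's first appointment has nothing to lag from, so it comes out
#as NaN; append .fillna(0) to the result if you want 0 there instead
dat['noshow_cum_prop'] = dat.groupby(['PatientId'])['noshow_ncum'].shift(1) / dat.groupby(['PatientId'])['appt_ncum'].shift(1)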
#A few patients you can check to make sure metrics are calculated correctly
#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 476861615941]
#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 1421991592826]
#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 933789553426785]
#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 416755661551767]
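If you want to sanity-check the logic without loading the full Kaggle file, here is a minimal sketch of the same steps on a tiny made-up frame. The toy data and the helper name add_prior_no_show_rate are hypothetical (not part of the dataset or of pandas), and the trailing fillna(0.0) is my assumption to match the "first appointment gets 0" rule from the question.

```python
import pandas as pd

# Hypothetical toy data: patient 1 has four appointments, patient 2 has two.
toy = pd.DataFrame({
    'PatientId': [1, 1, 1, 1, 2, 2],
    'AppointmentDay': pd.to_datetime([
        '2016-05-01', '2016-05-02', '2016-05-03', '2016-05-04',
        '2016-05-01', '2016-05-02',
    ]),
    'No-show': ['No', 'Yes', 'No', 'No', 'Yes', 'No'],
})


def add_prior_no_show_rate(frame):
    """Return a sorted copy with running counts and the lagged no-show rate."""
    out = frame.sort_values(['PatientId', 'AppointmentDay']).copy()
    # running appointment number per patient (1 for the first appointment)
    out['appt_ncum'] = out.groupby('PatientId').cumcount() + 1
    # running count of missed appointments, current row included
    out['missed'] = (out['No-show'] == 'Yes').astype(int)
    out['noshow_ncum'] = out.groupby('PatientId')['missed'].cumsum()
    out = out.drop(columns='missed')
    # lag both counts by one row within each patient so that only
    # earlier appointments feed into the rate
    by_patient = out.groupby('PatientId')
    lagged = by_patient['noshow_ncum'].shift(1) / by_patient['appt_ncum'].shift(1)
    # first appointments have no earlier rows (NaN after shift); use 0 there
    out['noshow_cum_prop'] = lagged.fillna(0.0)
    return out


result = add_prior_no_show_rate(toy)
print(result[['PatientId', 'No-show', 'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']])
```

On this toy frame, patient 1 missed one of their first three appointments, so their fourth row should get 1/3, which is the example from the question. Patient 2 missed their only earlier appointment, so their second row should get 1.0.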