How can I use Python and Pandas to calculate an appointment no-show rate based on a patient's previous appointments in a healthcare dataset?

Question

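I am working with a healthcare dataset from Kaggle (https://www.kaggle.com/joniarroba/noshowappointments) that contains information about medical appointments in Brazil and whether the patient showed up. The dataset has columns for the appointment ID, the patient ID, the appointment date and time, the scheduled date and time, and several other features.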

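I want to calculate a no-show rate for each appointment based on the patient's previous appointments. For example, if a patient has had three appointments and showed up for two of them, the no-show rate for their fourth appointment would be 1/3. If it is the patient's first appointment, the no-show rate would be 0.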
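To make the definition concrete, here is a minimal pure-Python sketch for one patient's history. The function name `prior_no_show_rates` and the boolean encoding (True means the appointment was missed) are illustrative assumptions, not part of the dataset.

```python
def prior_no_show_rates(missed):
    """Return, for each appointment, the share of *earlier* appointments missed.

    `missed` is a chronologically ordered list of booleans (True = no-show).
    """
    rates = []
    no_shows_so_far = 0
    for prior_count, was_missed in enumerate(missed):
        # Only appointments before the current one count;
        # a first appointment has no history, so its rate is 0.
        rates.append(no_shows_so_far / prior_count if prior_count else 0.0)
        no_shows_so_far += was_missed
    return rates


# Hypothetical patient: attended, missed, attended, then a fourth appointment.
history = [False, True, False, False]
print(prior_no_show_rates(history))
```

The last value printed corresponds to the 1/3 in the example above: one missed appointment out of three earlier ones.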

I tried the following code, but it calculates the current show/no-show rate rather than a no-show rate based on previous appointments:

import pandas as pd

# convert appointment and scheduled dates to datetime format
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])

# sort the DataFrame by PatientId and AppointmentDay
df = df.sort_values(['PatientId', 'AppointmentDay'])

# create a new column with the difference in whole days between scheduled and appointment date
# (normalize() truncates each timestamp to midnight and keeps a datetime dtype)
df['time_diff'] = (df['AppointmentDay'].dt.normalize() - df['ScheduledDay'].dt.normalize()).dt.days

# create a new float column to store the no-show rate for each appointment
df['no_show_rate'] = 0.0

# loop through each row of the DataFrame
for index, row in df.iterrows():
    # get the PatientId and AppointmentDay for the current row
    patient_id = row['PatientId']
    appointment_day = row['AppointmentDay']
    
    # select all previous appointments for the current patient
    previous_appointments = df.loc[(df['PatientId'] == patient_id) & (df['AppointmentDay'] < appointment_day)]
    
    # calculate the no-show rate based on the previous appointments
    appointment_count = len(previous_appointments)
    no_show_count = len(previous_appointments.loc[previous_appointments['No-show'] == 'Yes'])
    if appointment_count > 0:
        no_show_rate = no_show_count / appointment_count
    else:
        no_show_rate = 0
    
    # update the 'no_show_rate' column for the current row
    df.at[index, 'no_show_rate'] = no_show_rate
    
# print first 5 rows
print(df.head())

Answer 1

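Try this... I think pandas' column-wise approach is simpler. To get a no-show rate based on previous appointments, we apply shift to a groupby on PatientId (the grouped counterpart of pandas.DataFrame.shift) to get lagged values: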

import pandas as pd

#import data
dat = ...

#sort
dat = dat.sort_values(['PatientId', 'AppointmentDay'])

#running count of appointments for each patient. cumcount numbers the rows
#in each group starting at 0, so add 1 if the first appointment should
#show 1 instead of 0, the second 2 instead of 1, and so on
dat['appt_ncum'] = dat.groupby(['PatientId']).cumcount() + 1

#for this purpose, it's better to recode 'No-show' to Boolean (1/0)
#in a temp column and do cumsum on that
dat['cond'] = dat['No-show'] == 'Yes'
dat['noshow_ncum'] = dat.groupby('PatientId').cond.cumsum()
dat = dat.drop(['cond'], axis = 1)

#Calculate no-show proportion based on Patient's previous appointment(s)
#using shift within each PatientId group to lag numerator and denominator.
#A first appointment has no history, so its NaN is replaced with 0.
grp = dat.groupby('PatientId')
dat['noshow_cum_prop'] = (grp['noshow_ncum'].shift(1) / grp['appt_ncum'].shift(1)).fillna(0)

#A few patients you can check to make sure metrics are calculated correctly
#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 476861615941]

#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 1421991592826]

#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 933789553426785]

#dat[['PatientId', 'AppointmentDay', 'No-show' ,'appt_ncum', 'noshow_ncum', 'noshow_cum_prop']][dat['PatientId'] == 416755661551767]
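If you would rather check the logic without downloading the data, here is a small sketch of the same shift-based approach on a toy DataFrame. It reuses the dataset's column names, but the patient IDs, dates, and outcomes are made up for illustration. It also fills the first-appointment NaN with 0, as the question asks.

```python
import pandas as pd

# Hypothetical toy data: column names follow the dataset, values are invented.
toy = pd.DataFrame({
    "PatientId": [1, 1, 1, 1, 2, 2],
    "AppointmentDay": pd.to_datetime([
        "2016-05-02", "2016-05-09", "2016-05-16", "2016-05-23",
        "2016-05-03", "2016-05-10",
    ]),
    "No-show": ["No", "Yes", "No", "No", "Yes", "No"],
})
toy = toy.sort_values(["PatientId", "AppointmentDay"])

# Running appointment count (1-based) and running no-show count per patient.
toy["appt_ncum"] = toy.groupby("PatientId").cumcount() + 1
missed = toy["No-show"].eq("Yes").astype(int)
toy["noshow_ncum"] = missed.groupby(toy["PatientId"]).cumsum()

# Lag both running totals by one row within each patient, so every row
# only sees appointments that came before it.
grp = toy.groupby("PatientId")
toy["noshow_cum_prop"] = (
    grp["noshow_ncum"].shift(1) / grp["appt_ncum"].shift(1)
).fillna(0.0)  # first appointment: no history, so use 0 instead of NaN

print(toy)
```

For patient 1, the fourth row should come out as 1/3 (one missed appointment out of three earlier ones), matching the example in the question.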
