I created a set of conditions on a pandas DataFrame:
cat_A100 = df['Mr_Diag_Icd10_Code'].str.startswith(('A','B'))
sub_A101 = df['Mr_Diag_Icd10_Code'].str.startswith(tuple([f"A0{i}" for i in range(9)]))
sub_A102 = df['Mr_Diag_Icd10_Code'].str.startswith(tuple([f"A09"]))
sub_A103 = df['Mr_Diag_Icd10_Code'].str.startswith(tuple([f"A{i}" for i in range(15,20)])) | df['Mr_Diag_Icd10_Code'].str.startswith(tuple([f"B90"]))
sub_A104 = df['Mr_Diag_Icd10_Code'].str.startswith(tuple([f"A{i}" for i in range(40,42)]))
sub_A105 = df['Mr_Diag_Icd10_Code'].str.startswith(tuple([f"B24"]))
and then used those conditions to create new variables:
df.loc[cat_A100, 'diagcat'] = 'A100: Certain infectious and parasitic diseases'
df.loc[cat_A100 & sub_A101, 'diagsub'] = 'A101: Intestinal infectious diseases except diarrhoea'
df.loc[cat_A100 & sub_A102, 'diagsub'] = 'A102: Diarrhoea and gastroenteritis of presumed infectious origin'
df.loc[cat_A100 & sub_A103, 'diagsub'] = 'A103: Tuberculosis'
df.loc[cat_A100 & sub_A104, 'diagsub'] = 'A104: Septicaemia'
df.loc[cat_A100 & sub_A105, 'diagsub'] = 'A105: HIV disease'
df.loc[cat_A100 & ~sub_A101 & ~sub_A102 & ~sub_A103 & ~sub_A104 & ~sub_A105, 'diagsub'] = 'A106: Other infectious and parasitic diseases'
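Is there a way to make my code more concise? I would like to create a tuple or list of conditions and then refer to them when creating the variables (the second block of code).
Thanks!
Is this a viable approach? Or is there a cleaner way to format my code?
Any suggestions are welcome :)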
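You can use a dictionary to store the subcategory conditions together with their labels, then iterate over the dictionary to apply each condition and assign its label.
This approach cuts down on repetition and makes the conditions easier to update or extend later.
Here is how: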
# == Necessary imports =========================================================
import pandas as pd
# == Create example DataFrame ==================================================
df = pd.DataFrame({'Mr_Diag_Icd10_Code': ['A01', 'A09', 'A10', 'A16', 'A41', 'B24', 'C01']})
# == Define Conditions ==========================================================
# Define the category condition
cat_A100 = df['Mr_Diag_Icd10_Code'].str.startswith(('A', 'B'))
# Dictionary for subcategories and their corresponding conditions
subcat_conditions = {
    ('A', 'B'): 'A106: Other infectious and parasitic diseases',
    tuple(f"A{i:02d}" for i in range(9)): 'A101: Intestinal infectious diseases except diarrhoea',
    ('A09',): 'A102: Diarrhoea and gastroenteritis of presumed infectious origin',
    tuple(f"A{i}" for i in range(15, 20)) + ('B90',): 'A103: Tuberculosis',
    tuple(f"A{i}" for i in range(40, 42)): 'A104: Septicaemia',
    ('B24',): 'A105: HIV disease',
}
# == Create 'diagcat' and 'diagsub' columns ====================================
# Apply the category and subcategory conditions
df.loc[cat_A100, 'diagcat'] = 'A100: Certain infectious and parasitic diseases'
for condition, label in subcat_conditions.items():
    df.loc[cat_A100 & df['Mr_Diag_Icd10_Code'].str.startswith(condition), 'diagsub'] = label
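The order in which you define the conditions in the subcat_conditions dictionary matters. The loop applies them in insertion order, and each assignment overwrites whatever an earlier one wrote to the same rows, so when conditions overlap, the less specific condition should be defined first. For example, the catch-all 'A106' condition ('A', 'B') matches every row in the category, so it is defined first and the more specific subcategories then overwrite it. If it were defined last, it would overwrite all of them.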
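To make the ordering effect concrete, here is a minimal sketch that applies the same overwrite-in-order idea to a handful of codes. The helper assign_labels, the sample codes, and the shortened labels are hypothetical and exist only for illustration; they are not part of the original code.

```python
import pandas as pd

# Hypothetical sample codes, chosen so that every one matches the catch-all.
codes = pd.Series(["A01", "A09", "A10", "B24"])

def assign_labels(codes, rules):
    """Apply {prefixes: label} rules in dict order; later matches overwrite earlier ones."""
    labels = pd.Series(pd.NA, index=codes.index, dtype="object")
    for prefixes, label in rules.items():
        labels[codes.str.startswith(prefixes)] = label
    return labels

# Catch-all first: the specific rules overwrite it afterwards.
catch_all_first = {
    ("A", "B"): "A106",
    ("A09",): "A102",
    ("B24",): "A105",
}

# Catch-all last: it overwrites every specific label already assigned.
catch_all_last = {
    ("A09",): "A102",
    ("B24",): "A105",
    ("A", "B"): "A106",
}

good = assign_labels(codes, catch_all_first)
bad = assign_labels(codes, catch_all_last)

print(good.tolist())  # ['A106', 'A102', 'A106', 'A105']
print(bad.tolist())   # ['A106', 'A106', 'A106', 'A106']
```

With the catch-all first, 'A09' and 'B24' keep their specific labels. With it last, every row ends up as 'A106'.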