Is there a Python function that can identify boolean values in a large dataset with 30+ columns?
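The Beneficiary Summary File has multiple chronic illness columns for each member. These are boolean fields. 1) Convert these columns into a single categorical variable, concatenating the multiple True diagnoses. 2) If a member has 3 or more chronic conditions, categorize them as "Multiple".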
Here is the link to the dataset
Here are a few of the chronic illness columns:
SP_ALZHDMTA
SP_CHF
SP_CHRNKIDN
SP_CNCR
SP_COPD
SP_DEPRESSN
SP_DIABETES
SP_ISCHMCHT
SP_OSTEOPRS
SP_RA_OA
SP_STRKETIA
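I'm assuming a value of 2 corresponds to having the illness, and 1 otherwise. You can concatenate the boolean values of all illnesses into a single column by assigning each illness a unique bit position. You then "toggle" those bits depending on whether the given row has the illness, and combine them with the bitwise OR (|) operator. At the same time, you can count the number of illnesses for each row in a separate column.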
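To make the bit idea concrete, here is a tiny standalone sketch with three made-up flags. The constants and their bit positions are illustrative only and are not taken from the dataset.

```python
# Hypothetical three-illness example; names and bit positions are illustrative only.
CHF = 0b100
CNCR = 0b010
COPD = 0b001

flags = 0b000
flags |= CHF   # this member has CHF
flags |= COPD  # this member has COPD

# Test a single bit with bitwise AND
has_cncr = bool(flags & CNCR)

# Count the set bits to get the number of illnesses
num_illnesses = bin(flags).count("1")

print(format(flags, "03b"), has_cncr, num_illnesses)  # 101 False 2
```

The code for the actual columns follows the same pattern, just with one bit per illness column.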
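import numpy as np
import pandas as pd

# `df` is assumed to be the Beneficiary Summary File, already loaded as a pandas DataFrame
# Define the relevant column names and a unique bit for each illness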
col_bits = {
"SP_ALZHDMTA" : 0b10000000000,
"SP_CHF" : 0b01000000000,
"SP_CHRNKIDN" : 0b00100000000,
"SP_CNCR" : 0b00010000000,
"SP_COPD" : 0b00001000000,
"SP_DEPRESSN" : 0b00000100000,
"SP_DIABETES" : 0b00000010000,
"SP_ISCHMCHT" : 0b00000001000,
"SP_OSTEOPRS" : 0b00000000100,
"SP_RA_OA" : 0b00000000010,
"SP_STRKETIA" : 0b00000000001,
}
col_names = col_bits.keys()
# Assume 2 means having the illness
def has_illness(val):
    return int(val) == 2

def get_illness_bit(col_name, val):
    return col_bits[col_name] if val else 0b00000000000
# A pd Series containing the concatenation of bits representing relevant illnesses
# (built on df.index so the in-place updates below align with df row by row)
illnesses_bits_col = pd.Series(np.zeros(len(df), dtype=int), index=df.index)
# A pd Series containing the number of relevant illnesses had by each row
illnesses_counts_col = pd.Series(np.zeros(len(df), dtype=int), index=df.index)
for col_name in col_names:
    # pd Series containing bool value representations of the current illness `col_name`
    illness_col = df[col_name].apply(has_illness)
    # concatenate the bit representation of the current illness `col_name`
    illness_bit_col = illness_col.apply(lambda x: get_illness_bit(col_name, x))
    illnesses_bits_col |= illness_bit_col
    # add to counter the current illness `col_name`
    illness_count_col = illness_col.apply(lambda x: 1 if x else 0)
    illnesses_counts_col += illness_count_col
illnesses_counts_col = illnesses_counts_col.apply(lambda x: "Multiple" if x >= 3 else "-")
print(illnesses_bits_col)
print(illnesses_counts_col)
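If you would rather end up with a readable category than a bitmask, one possible alternative (a sketch of my own, not part of the approach above) is to join the names of the flagged columns directly. The toy DataFrame, the `ILLNESS_FLAG` constant, the "+" separator, the "None" label and the `chronic_conditions` column name are all assumptions for illustration. It keeps the same assumption that 2 means having the illness.

```python
import pandas as pd

# Assumption carried over from above: 2 means "has the illness"
ILLNESS_FLAG = 2

# Hypothetical toy data standing in for the real file
df = pd.DataFrame(
    {
        "SP_CHF": [2, 1, 2, 1],
        "SP_CNCR": [2, 1, 1, 1],
        "SP_COPD": [2, 1, 2, 1],
        "SP_DIABETES": [1, 1, 1, 2],
    }
)
illness_cols = list(df.columns)

# Boolean frame: True where the member has the illness
has = df[illness_cols].eq(ILLNESS_FLAG)

# Number of illnesses per row
counts = has.sum(axis=1)

# Join the names of the True columns, e.g. "SP_CHF+SP_COPD"
joined = has.apply(
    lambda row: "+".join(name for name, flag in row.items() if flag),
    axis=1,
)

# 3 or more illnesses -> "Multiple"; no illnesses -> "None"
label = joined.where(counts < 3, "Multiple").replace("", "None")
df["chronic_conditions"] = label.astype("category")

print(df["chronic_conditions"].tolist())
```

On the toy data this prints `['Multiple', 'None', 'SP_CHF+SP_COPD', 'SP_DIABETES']`. If it turns out that 1 is the "yes" code in your file, changing `ILLNESS_FLAG` is the only edit needed.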