Identifying boolean values in a large dataset in Python


Is there a Python function that can identify boolean values in a large dataset with more than 30 columns?

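The Beneficiary Summary File has several chronic condition columns for each member. These are boolean fields. 1) Convert these columns into a single categorical variable by concatenating the multiple True diagnoses. 2) If a member has 3 or more chronic conditions, categorize them as "Multiple".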

Here is a link to the dataset:

https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/Downloads/DE1_0_2009_Beneficiary_Summary_File_Sample_20.zip

Here are some of the chronic condition columns:
SP_ALZHDMTA
SP_CHF
SP_CHRNKIDN
SP_CNCR
SP_COPD
SP_DEPRESSN
SP_DIABETES
SP_ISCHMCHT
SP_OSTEOPRS
SP_RA_OA
SP_STRKETIA

python pandas concatenation boolean categorical-data
1 Answer

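I'm assuming that a value of 2 means the member has the condition and 1 means they do not. By assigning each illness a unique bit position, the boolean values for all illnesses can be concatenated into a single column. You "switch on" each bit depending on whether the given row has that illness, then combine the bits with the bitwise OR (|) operator. At the same time, you can count the number of illnesses for each row in a separate column.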
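As a toy illustration of the bit idea, here is a minimal sketch with two made-up flags. `FLAG_A` and `FLAG_B` are hypothetical names, not columns from the dataset.

```python
# Two hypothetical flags, each occupying its own bit position.
FLAG_A = 0b10
FLAG_B = 0b01

mask = 0          # start with no conditions set
mask |= FLAG_A    # "switch on" the bit for condition A
mask |= FLAG_B    # "switch on" the bit for condition B

has_a = bool(mask & FLAG_A)    # test a single bit with bitwise AND
n_set = bin(mask).count("1")   # count how many bits are set

print(format(mask, "02b"), has_a, n_set)  # prints: 11 True 2
```

Because every flag owns a distinct bit, the combined integer can always be decoded back into the individual conditions.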

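import numpy as np
import pandas as pd

# `df` is assumed to be the Beneficiary Summary File already loaded
# into a DataFrame (for example with pd.read_csv).

# Define the relevant column names and a unique bit for each illness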
col_bits = {
    "SP_ALZHDMTA"   : 0b10000000000,
    "SP_CHF"        : 0b01000000000,
    "SP_CHRNKIDN"   : 0b00100000000,
    "SP_CNCR"       : 0b00010000000,
    "SP_COPD"       : 0b00001000000,
    "SP_DEPRESSN"   : 0b00000100000,
    "SP_DIABETES"   : 0b00000010000,
    "SP_ISCHMCHT"   : 0b00000001000,
    "SP_OSTEOPRS"   : 0b00000000100,
    "SP_RA_OA"      : 0b00000000010,
    "SP_STRKETIA"   : 0b00000000001,
}
col_names = col_bits.keys()

# Assume 2 means having the illness
def has_illness(val):
    return int(val) == 2
def get_illness_bit(col_name, val):
    return col_bits[col_name] if val else 0b00000000000

# Series holding the OR-ed bits of the relevant illnesses for each row.
# It shares df's index so the in-place operations below align row by row.
illnesses_bits_col = pd.Series(np.zeros(len(df), dtype=int), index=df.index)
# Series holding the number of relevant illnesses for each row
illnesses_counts_col = pd.Series(np.zeros(len(df), dtype=int), index=df.index)
for col_name in col_names:
    # pd Series containing bool value representations of the current illness `col_name`
    illness_col = df[col_name].apply(has_illness)

    # concatenate the bit representation of the current illness `col_name`
    illness_bit_col = illness_col.apply(lambda x: get_illness_bit(col_name, x))
    illnesses_bits_col |= illness_bit_col
    
    # add to counter the current illness `col_name`
    illness_count_col = illness_col.apply(lambda x: 1 if x else 0)
    illnesses_counts_col += illness_count_col
illnesses_counts_col = illnesses_counts_col.apply(lambda x: "Multiple" if x >= 3 else "-")

print(illnesses_bits_col)
print(illnesses_counts_col)
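If the goal is the single categorical variable the question describes, a vectorized variant may be simpler than the bitmask. The following is only a sketch under stated assumptions: the small `df` is made-up sample data rather than the real file, `positive=2` follows the assumption above and should be checked against the dataset's codebook (the coding may be reversed), and the `label_conditions` helper, the "+" separator, the "None" label, and the `CONDITIONS` column name are illustrative choices.

```python
import pandas as pd

# Made-up stand-in for the real file: three members, three flag columns.
df = pd.DataFrame({
    "SP_CHF":      [2, 1, 2],
    "SP_CNCR":     [1, 1, 2],
    "SP_DIABETES": [2, 1, 2],
})
cols = ["SP_CHF", "SP_CNCR", "SP_DIABETES"]

def label_conditions(frame, cols, positive=2, threshold=3):
    """Return one categorical label per row.

    positive:  the value assumed to mean "has the condition".
    threshold: rows with at least this many conditions become "Multiple".
    """
    flags = frame[cols].eq(positive)   # boolean DataFrame, one column per condition
    counts = flags.sum(axis=1)         # number of True flags per row
    # Join the names of the True columns for each row.
    joined = flags.apply(lambda row: "+".join(c for c in cols if row[c]), axis=1)
    labels = joined.where(counts > 0, "None")            # rows with no condition
    labels = labels.where(counts < threshold, "Multiple")
    return labels.astype("category")

df["CONDITIONS"] = label_conditions(df, cols)
print(df["CONDITIONS"].tolist())
# prints: ['SP_CHF+SP_DIABETES', 'None', 'Multiple']
```

For the real data, `cols` would be the full list of chronic condition columns, and `positive` can be switched to 1 if that turns out to be the value that marks a diagnosis.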