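I have the following data table, and I want to get counts by applying some conditions to the existing columns. Any help with a solution would be much appreciated.
Input: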
Key1 id1-age id2-age id3-age id4-age id5-age id1-gender id2-gender id3-gender id4-gender id5-gender
0 a 6 32 61 22 23 M F M F F
1 b 36 25 52 16 33 M M F F M
2 c 12 21 45 15 66 F M M M F
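Problem statement
A single key has multiple ages and genders, one per individual (id) under that key. I want to create columns in Python that hold, for each row, the count of individuals in each age group, with respect to gender.
Expected output: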
Key1 id1-age id2-age id3-age id4-age id5-age age(02-15) age(16-21) age(21-30) age(31-40) age(41-50) age(51-60) age(61+)
0 a 6 32 61 22 23 1 0 2 1 0 0 1
1 b 36 25 52 16 33 0 1 1 2 0 1 0
2 c 12 21 45 15 66 2 1 0 0 1 0 1
I hope I have explained my problem statement clearly. Looking forward to a response. Thanks in advance.
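You can go through the age columns and count the age groups for each row. The counts can be collected in separate lists, which are added to the DataFrame as new columns once every row has been visited.
Here is my approach. It is not the shortest code and could be improved.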
import pandas as pd

df = pd.DataFrame([['a', 6, 32, 61, 22, 23],
                   ['b', 36, 25, 52, 16, 33],
                   ['c', 12, 21, 45, 15, 66],
                   ],
                  columns=['Key1', 'id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age'])

# One list per age group; each list gets one count per row.
age_15 = []
age_21 = []
age_30 = []
age_40 = []
age_50 = []
age_60 = []
age_61 = []

search_columns = ['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age']

for index, record in df.iterrows():
    # Reset the per-row counters.
    count_15 = 0
    count_21 = 0
    count_30 = 0
    count_40 = 0
    count_50 = 0
    count_60 = 0
    count_61 = 0

    for search_column in search_columns:
        age = record[search_column]
        if 2 <= age <= 15:
            count_15 += 1
        elif 16 <= age <= 21:
            count_21 += 1
        elif 21 < age <= 30:
            count_30 += 1
        elif 31 <= age <= 40:
            count_40 += 1
        elif 41 <= age <= 50:
            count_50 += 1
        elif 51 <= age <= 60:
            count_60 += 1
        elif age >= 61:
            count_61 += 1

    age_15.append(count_15)
    age_21.append(count_21)
    age_30.append(count_30)
    age_40.append(count_40)
    age_50.append(count_50)
    age_60.append(count_60)
    age_61.append(count_61)

# Attach the collected counts as new columns.
df['age(02-15)'] = age_15
df['age(16-21)'] = age_21
df['age(21-30)'] = age_30
df['age(31-40)'] = age_40
df['age(41-50)'] = age_50
df['age(51-60)'] = age_60
df['age(61+)'] = age_61

print(df[['age(02-15)', 'age(16-21)', 'age(21-30)', 'age(31-40)', 'age(41-50)', 'age(51-60)', 'age(61+)']])
Output:
age(02-15) age(16-21) age(21-30) age(31-40) age(41-50) age(51-60) age(61+)
0 1 0 2 1 0 0 1
1 0 1 1 2 0 1 0
2 2 1 0 0 1 0 1
There may be a less verbose solution, but applying conditional sums over the age columns at positions [1,6) and assigning the results to new columns, as shown below, should help:

import pandas as pd

df = pd.DataFrame({
    'Key1': ['a', 'b', 'c'],
    'id1-age': [6, 36, 12],
    'id2-age': [32, 25, 21],
    'id3-age': [61, 52, 45],
    'id4-age': [22, 16, 15],
    'id5-age': [23, 33, 66]
})

# df.columns[1:6] selects the five age columns (positions 1 to 5).
# Both bounds are inclusive; the 21-30 group uses > 21 so that an age of 21
# is counted only once, in the 16-21 group.
df['age(02-15)'] = ((df[df.columns[1:6]] >= 2) & (df[df.columns[1:6]] <= 15)).sum(1)
df['age(16-21)'] = ((df[df.columns[1:6]] >= 16) & (df[df.columns[1:6]] <= 21)).sum(1)
df['age(21-30)'] = ((df[df.columns[1:6]] > 21) & (df[df.columns[1:6]] <= 30)).sum(1)
df['age(31-40)'] = ((df[df.columns[1:6]] >= 31) & (df[df.columns[1:6]] <= 40)).sum(1)
df['age(41-50)'] = ((df[df.columns[1:6]] >= 41) & (df[df.columns[1:6]] <= 50)).sum(1)
df['age(51-60)'] = ((df[df.columns[1:6]] >= 51) & (df[df.columns[1:6]] <= 60)).sum(1)
df['age(61+)'] = (df[df.columns[1:6]] >= 61).sum(1)

print(df)
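If you prefer a list of column names to an index range, you can replace the df.columns[...] slice with ['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age'], or even store the selection in a variable to avoid repeating it over and over. The code might then become: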
import pandas as pd

df = pd.DataFrame({
    'Key1': ['a', 'b', 'c'],
    'id1-age': [6, 36, 12],
    'id2-age': [32, 25, 21],
    'id3-age': [61, 52, 45],
    'id4-age': [22, 16, 15],
    'id5-age': [23, 33, 66]
})

# Select the age columns once and reuse the selection for every group.
range_cols = df[['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age']]

# Both bounds are inclusive; the 21-30 group uses > 21 so that an age of 21
# is counted only once, in the 16-21 group.
df['age(02-15)'] = ((range_cols >= 2) & (range_cols <= 15)).sum(1)
df['age(16-21)'] = ((range_cols >= 16) & (range_cols <= 21)).sum(1)
df['age(21-30)'] = ((range_cols > 21) & (range_cols <= 30)).sum(1)
df['age(31-40)'] = ((range_cols >= 31) & (range_cols <= 40)).sum(1)
df['age(41-50)'] = ((range_cols >= 41) & (range_cols <= 50)).sum(1)
df['age(51-60)'] = ((range_cols >= 51) & (range_cols <= 60)).sum(1)
df['age(61+)'] = (range_cols >= 61).sum(1)

print(df)
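The repeated assignments can probably be shortened further by keeping the group boundaries in a lookup table and looping over it. The following is only a sketch of that idea: the names ages and age_groups are made up for illustration, and the 21-30 group is assumed to start at 22 so that an age of 21 lands only in 16-21, as in the expected output of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Key1': ['a', 'b', 'c'],
    'id1-age': [6, 36, 12],
    'id2-age': [32, 25, 21],
    'id3-age': [61, 52, 45],
    'id4-age': [22, 16, 15],
    'id5-age': [23, 33, 66],
})

ages = df[['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age']]

# Hypothetical lookup table: new column name -> (lower, upper), both inclusive.
# Assumption: the 21-30 group starts at 22 so that 21 is counted only once.
age_groups = {
    'age(02-15)': (2, 15),
    'age(16-21)': (16, 21),
    'age(21-30)': (22, 30),
    'age(31-40)': (31, 40),
    'age(41-50)': (41, 50),
    'age(51-60)': (51, 60),
    'age(61+)': (61, float('inf')),
}

for name, (low, high) in age_groups.items():
    # Boolean mask of ages inside the group, summed across each row.
    df[name] = (ages.ge(low) & ages.le(high)).sum(axis=1)

print(df[list(age_groups)])
```

Adding or changing a group then only means editing one entry of the table.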
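You can use pandas.cut(). Assuming your DataFrame is called df, apply it to the age columns only (Key1 is not numeric), with right-inclusive bin edges that match the age groups, like this:
df.filter(like='-age').apply(lambda r: pd.cut(r, [1, 15, 21, 30, 40, 50, 60, 1000]).value_counts(), axis=1)
Then merge the result with the original DataFrame.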
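To make the cut-and-merge idea concrete, here is one possible sketch rather than a definitive implementation. It reshapes the ages to long format, labels each age with pandas.cut(), counts the labels per key with pandas.crosstab(), and merges the counts back. The names age_cols, labels, edges, long_df, counts and result are made up for illustration, and the sketch assumes Key1 is unique per row.

```python
import pandas as pd

df = pd.DataFrame({
    'Key1': ['a', 'b', 'c'],
    'id1-age': [6, 36, 12],
    'id2-age': [32, 25, 21],
    'id3-age': [61, 52, 45],
    'id4-age': [22, 16, 15],
    'id5-age': [23, 33, 66],
})

age_cols = ['id1-age', 'id2-age', 'id3-age', 'id4-age', 'id5-age']
labels = ['age(02-15)', 'age(16-21)', 'age(21-30)', 'age(31-40)',
          'age(41-50)', 'age(51-60)', 'age(61+)']
# Right-inclusive edges: (1, 15], (15, 21], (21, 30], ..., (60, inf].
edges = [1, 15, 21, 30, 40, 50, 60, float('inf')]

# Long format: one row per (Key1, individual) pair.
long_df = df.melt(id_vars='Key1', value_vars=age_cols, value_name='age')

# Label every age with its group; plain strings keep the crosstab simple.
long_df['group'] = pd.cut(long_df['age'], bins=edges, labels=labels).astype(str)

# Count labels per key; reindex so every group appears as a column, in order.
counts = pd.crosstab(long_df['Key1'], long_df['group']).reindex(
    columns=labels, fill_value=0)

# Merge the counts back onto the original rows (assumes Key1 is unique).
result = df.merge(counts, left_on='Key1', right_index=True)
print(result)
```

Ages outside every bin (below 2 here) get no label and are not counted, which mirrors the lower bound used in the other answers.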