我是一个充满熊猫的初学者,我无法在任何地方找到解决这个问题的方法。
假设我有两个变量:variable1,variable2。
它们可以具有以下预定义值:
variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']
但是,当前数据集只包含其中一些值:
df = pd.DataFrame({variable1 : ['1', '9', '20'],
variable2 : ['2', '2', '6']})
跨越变量时:
pd.crosstab(df.variable1, df.variable2)
我明白了:
variable2 2 6
variable1
1 1 0
20 0 1
9 1 0
有没有办法将所有可能的分类值放在列和行中,即使当前数据集不包含所有这些值?目标是在运行具有更新数据集的脚本时具有相同大小的表,该更新数据集可具有先前数据集中不存在的值。
variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']
df = pd.DataFrame({'variable1' : ['1', '9', '20'],
'variable2' : ['2', '2', '6']})
print (df)
variable1 variable2
0 1 2
1 9 2
2 20 6
df = pd.crosstab(df.variable1, df.variable2)
df = df.reindex(index=variable1, columns=variable2, fill_value=0)
print (df)
variable2 2 5 6
variable1
1 1 0 0
4 0 0 0
9 1 0 0
15 0 0 0
20 0 0 1
from collections import OrderedDict
valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'],
['2', 'Hematology'],
['3', 'Hematology/Oncology'],
['4', 'Other']]),
('S9', [['1', 'Academic / Teaching Hospital'],
['2', 'Community-Based Solo Private Practice'],
['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'],
['5', 'Comprehensive Cancer Center'],
['6', 'Other (specify)']])])
#print (valuelabels)
df = pd.DataFrame({'variable1' : ['1', '2', '4'],
'variable2' : ['2', '3', '1']})
table = pd.crosstab(df.variable1, df.variable2)
print (table)
variable2 1 2 3
variable1
1 0 1 0
2 0 0 1
4 1 0 0
d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
print (d1)
{'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}
d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
print (d2)
{'1': 'Academic / Teaching Hospital',
'3': 'Community-Based Group Private Practice (record practice size )',
'4': 'Community Non-Teaching Hospital',
'6': 'Other (specify)',
'2': 'Community-Based Solo Private Practice',
'5': 'Comprehensive Cancer Center'}
table = table.reindex(index=[a[0] for a in valuelabels['S8']],
columns=[a[0] for a in valuelabels['S9'], fill_value=0)
print (table)
variable2 1 2 3 4 5 6
variable1
1 0 1 0 0 0 0
2 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
table.index = table.index.to_series().map(d1).values
table.columns = table.columns.to_series().map(d2).values
print (table)
Academic / Teaching Hospital \
Medical oncology 0
Hematology 0
Hematology/Oncology 0
Other 1
Community-Based Solo Private Practice \
Medical oncology 1
Hematology 0
Hematology/Oncology 0
Other 0
Community-Based Group Private Practice (record practice size ) \
Medical oncology 0
Hematology 1
Hematology/Oncology 0
Other 0
Community Non-Teaching Hospital \
Medical oncology 0
Hematology 0
Hematology/Oncology 0
Other 0
Comprehensive Cancer Center Other (specify)
Medical oncology 0 0
Hematology 0 0
Hematology/Oncology 0 0
Other 0 0
你可以使用reindex:
ct = pd.crosstab(df.variable1, df.variable2)
ct.reindex(index=variable1, columns=variable2).fillna(0).astype('int')
Out:
variable2 2 5 6
variable1
1 1 0 0
4 0 0 0
9 1 0 0
15 0 0 0
20 0 0 1
def TargetPercentByNominal (
targetVar, # target variable
predictor): # nominal predictor
countTable = pandas.crosstab(index = predictor, columns = targetVar, margins = True, dropna = True)
x = countTable.drop('All', 1)
percentTable = countTable.div(x.sum(1), axis='index')*100
print("Frequency Table: \n")
print(countTable)
print( )
print("Percent Table: \n")
print(percentTable)
return