Pandas crosstab - how to print rows/columns for values that do not exist in the dataset?

Question

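I'm a beginner with pandas and I couldn't find a solution to this problem anywhere.

Suppose I have two variables: variable1 and variable2.

They can take the following predefined values: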

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

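However, the current dataset contains only some of these values:

import pandas as pd

df = pd.DataFrame({'variable1' : ['1', '9', '20'],
                   'variable2' : ['2', '2', '6']})

When cross-tabulating the variables: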

pd.crosstab(df.variable1, df.variable2)

I get:

variable2  2  6
variable1      
1          1  0
20         0  1
9          1  0

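Is there a way to include all possible category values in the rows and columns, even when the current dataset does not contain all of them? The goal is to get a table of the same size when the script is run on an updated dataset, which may contain values that were absent from the previous one.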

Answer 1

Use DataFrame.reindex:

import pandas as pd

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']


df = pd.DataFrame({'variable1' : ['1', '9', '20'],
                  'variable2' : ['2', '2', '6']})

print (df)                  
  variable1 variable2
0         1         2
1         9         2
2        20         6

df = pd.crosstab(df.variable1, df.variable2)
df = df.reindex(index=variable1, columns=variable2, fill_value=0)
print (df)
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1
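
As an alternative to reindexing afterwards, the allowed values can be declared up front as categories. The sketch below assumes that pd.crosstab keeps unobserved categories when it receives categorical inputs and dropna=False, which is the behavior described in the pandas documentation; the names rows, cols and full are only illustrative.

```python
import pandas as pd

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

df = pd.DataFrame({'variable1': ['1', '9', '20'],
                   'variable2': ['2', '2', '6']})

# Declare the full set of allowed values as categories,
# in the order they should appear in the table.
rows = pd.Categorical(df['variable1'], categories=variable1)
cols = pd.Categorical(df['variable2'], categories=variable2)

# dropna=False keeps the categories that have no observations.
full = pd.crosstab(rows, cols,
                   rownames=['variable1'], colnames=['variable2'],
                   dropna=False)
print(full)
```

The reindex approach is the safer choice if the inputs cannot be converted to categoricals, since it works on any crosstab result.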

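The same approach works when the allowed codes come from a dictionary of value labels: reindex on the codes first, then map the codes to their labels.

from collections import OrderedDict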


valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'], 
                                   ['2', 'Hematology'], 
                                   ['3', 'Hematology/Oncology'], 
                                   ['4', 'Other']]), 
                           ('S9', [['1', 'Academic / Teaching Hospital'], 
                                   ['2', 'Community-Based Solo Private Practice'], 
                                   ['3', 'Community-Based Group Private Practice (record practice size )'], 
                                   ['4', 'Community Non-Teaching Hospital'], 
                                   ['5', 'Comprehensive Cancer Center'], 
                                   ['6', 'Other (specify)']])])
#print (valuelabels)


df = pd.DataFrame({'variable1' : ['1', '2', '4'],
                  'variable2' : ['2', '3', '1']})

table = pd.crosstab(df.variable1, df.variable2)      
print (table)
variable2  1  2  3
variable1         
1          0  1  0
2          0  0  1
4          1  0  0

d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
print (d1)
{'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}

d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
print (d2)
{'1': 'Academic / Teaching Hospital', 
'3': 'Community-Based Group Private Practice (record practice size )', 
'4': 'Community Non-Teaching Hospital', 
'6': 'Other (specify)', 
'2': 'Community-Based Solo Private Practice', 
'5': 'Comprehensive Cancer Center'}

table = table.reindex(index=[a[0] for a in valuelabels['S8']], 
                      columns=[a[0] for a in valuelabels['S9']], fill_value=0)
print (table)
variable2  1  2  3  4  5  6
variable1                  
1          0  1  0  0  0  0
2          0  0  1  0  0  0
3          0  0  0  0  0  0
4          1  0  0  0  0  0

table.index = table.index.to_series().map(d1).values
table.columns = table.columns.to_series().map(d2).values

print (table)
                     Academic / Teaching Hospital  \
Medical oncology                                0   
Hematology                                      0   
Hematology/Oncology                             0   
Other                                           1   

                     Community-Based Solo Private Practice  \
Medical oncology                                         1   
Hematology                                               0   
Hematology/Oncology                                      0   
Other                                                    0   

                     Community-Based Group Private Practice (record practice size )  \
Medical oncology                                                     0                
Hematology                                                           1                
Hematology/Oncology                                                  0                
Other                                                                0                

                     Community Non-Teaching Hospital  \
Medical oncology                                   0   
Hematology                                         0   
Hematology/Oncology                                0   
Other                                              0   

                     Comprehensive Cancer Center  Other (specify)  
Medical oncology                               0                0  
Hematology                                     0                0  
Hematology/Oncology                            0                0  
Other                                          0                0
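
The relabeling step can probably be written more compactly with DataFrame.rename, which accepts a dict for each axis. This is a minimal sketch rather than the answer's own method; row_labels, col_labels and labeled are illustrative names, and only the first three S9 labels are used to keep it short.

```python
import pandas as pd

# Code -> label mappings built directly from [code, label] pairs.
row_labels = dict([['1', 'Medical oncology'],
                   ['2', 'Hematology'],
                   ['3', 'Hematology/Oncology'],
                   ['4', 'Other']])
col_labels = dict([['1', 'Academic / Teaching Hospital'],
                   ['2', 'Community-Based Solo Private Practice'],
                   ['3', 'Community-Based Group Private Practice (record practice size )']])

df = pd.DataFrame({'variable1': ['1', '2', '4'],
                   'variable2': ['2', '3', '1']})

table = pd.crosstab(df.variable1, df.variable2)

# Add the missing codes first (dicts keep insertion order in Python 3.7+),
# then swap the codes for their labels on both axes.
table = table.reindex(index=list(row_labels), columns=list(col_labels), fill_value=0)
labeled = table.rename(index=row_labels, columns=col_labels)
print(labeled)
```

Unlike assigning to table.index directly, rename returns a new frame and keeps the axis names.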

Answer 2

You can use reindex:

ct = pd.crosstab(df.variable1, df.variable2)
ct.reindex(index=variable1, columns=variable2).fillna(0).astype('int')
Out: 
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1
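
The fillna(0).astype('int') step is needed here because reindex without fill_value inserts NaN for the new rows and columns, which turns the counts into floats. The following sketch illustrates this and checks that both routes give the same counts; the names with_nan, via_fillna and via_fill_value are illustrative.

```python
import pandas as pd

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

df = pd.DataFrame({'variable1': ['1', '9', '20'],
                   'variable2': ['2', '2', '6']})

ct = pd.crosstab(df.variable1, df.variable2)

# Without fill_value the added cells are NaN, so every column becomes float.
with_nan = ct.reindex(index=variable1, columns=variable2)
print(with_nan.dtypes)

# Route 1: fill the gaps afterwards and cast back to integers.
via_fillna = with_nan.fillna(0).astype('int')

# Route 2: let reindex fill the gaps directly.
via_fill_value = ct.reindex(index=variable1, columns=variable2, fill_value=0)

# Both routes produce the same counts.
print((via_fillna == via_fill_value).all().all())
```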

Answer 3

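import pandas

def TargetPercentByNominal(
    targetVar,       # target variable
    predictor):      # nominal predictor

    # Counts of each predictor level by target value, with 'All' margins
    countTable = pandas.crosstab(index=predictor, columns=targetVar, margins=True, dropna=True)
    # Drop the 'All' column so that the row sums equal the row totals
    x = countTable.drop('All', axis=1)
    # Row percentages: divide every row by its total
    percentTable = countTable.div(x.sum(axis=1), axis='index') * 100

    print("Frequency Table: \n")
    print(countTable)
    print()
    print("Percent Table: \n")
    print(percentTable)

    return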
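
The row-percentage idea in this answer can be combined with reindex so that the percent table also keeps a fixed size. This is a sketch under that assumption, not part of the original answer; predictor_levels, target_levels, counts and percent are illustrative names.

```python
import pandas as pd

predictor_levels = ['1', '4', '9', '15', '20']   # all allowed row values
target_levels = ['2', '5', '6']                  # all allowed column values

df = pd.DataFrame({'variable1': ['1', '9', '20'],
                   'variable2': ['2', '2', '6']})

# Fixed-size count table: values missing from the data get a count of 0.
counts = pd.crosstab(df.variable1, df.variable2)
counts = counts.reindex(index=predictor_levels, columns=target_levels, fill_value=0)

# Divide each row by its total. Rows with no observations give 0/0 = NaN,
# which is replaced by 0 so that the table stays fully numeric.
row_totals = counts.sum(axis=1)
percent = counts.div(row_totals, axis='index').fillna(0) * 100

print(counts)
print(percent)
```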
