Comparing two groups based on a categorical variable in R

Question

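I have created a data frame, df, containing more than 8,000 firm-years, with these columns:

gvkey = firm ID

fam = dummy (equals 1 if the firm is a family firm)

industry = categorical variable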

   gvkey   fam  industry
1   1004    0     6
2   1004    0     6
3   1004    0     6
4   1004    0     6
5   1004    0     6
6   1013    0     4
7   1013    0     4
8   1013    0     4
9   1013    0     4
10  1013    0     4
11  1013    0     4
12  1045    0     5
13  1045    0     5
14  1045    0     5
15  1045    0     5
16  1045    0     5
17  1045    0     5
18  1072    0     4
19  1072    0     4
20  1072    0     4
21  1072    0     4
22  1072    0     4
23  1076    1     9
24  1076    1     9
25  1076    1     9
26  1076    1     9
27  1076    1     9
28  1076    1     9
29  1078    0     4
30  1078    0     4
31  1078    0     4
32  1078    0     4
33  1078    0     4
34  1078    0     4
35  1121    1     6
36  1121    1     6
37  1121    1     6
38  1121    1     6
39  1121    1     6
40  1121    1     6
41  1161    0     4
42  1161    0     4
43  1161    0     4
44  1161    0     4
45  1161    0     4
46  1161    0     4
47  1209    0     4
48  1209    0     4
49  1209    0     4
50  1209    0     4
...

This is what the output should look like, where the industry description corresponds to industry.

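Verbal logic:

1) For all unique gvkey, create a column that counts the firms with fam = 0 in each industry.

2) For all unique gvkey, create a column that counts the firms with fam = 1 in each industry.

3) Create an output that shows the frequency of family and non-family firms per industry.

Maybe this can even be done in a single piece of code?!

Thank you very much!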

Answer 1

One dplyr option could be:

library(dplyr)

df %>%
 group_by(industry) %>%
 summarise(n_family = n_distinct(gvkey[fam == 1]),
           n_no_family = n_distinct(gvkey[fam == 0]),
           perc_family = n_family/n_distinct(gvkey)*100)

  industry n_family n_no_family perc_family
     <int>    <int>       <int>       <dbl>
1        4        0           5           0
2        5        0           1           0
3        6        1           1          50
4        9        1           0         100
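For readers without dplyr, the same per-firm counts can be cross-checked in base R. The following is only a sketch: it rebuilds the 50 sample rows compactly with rep(), and the names n_years, firms, tab and res are made up for illustration. It also assumes that fam and industry never change within a gvkey, which holds in the sample but may not hold in the full data.

```r
# Sample data from the question, rebuilt compactly:
# one entry per firm, repeated once per firm-year.
n_years <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004, 1013, 1045, 1072, 1076, 1078, 1121, 1161, 1209), n_years),
  fam      = rep(c(0, 0, 0, 0, 1, 0, 1, 0, 0), n_years),
  industry = rep(c(6, 4, 5, 4, 9, 4, 6, 4, 4), n_years)
)

# Keep one row per firm (assumes fam and industry are constant within gvkey).
firms <- unique(df[c("gvkey", "fam", "industry")])

# Cross-tabulate firms by industry and family status.
tab <- table(industry = firms$industry, fam = firms$fam)

# Arrange the counts in the same layout as the dplyr result.
res <- data.frame(
  industry    = as.integer(rownames(tab)),
  n_family    = as.vector(tab[, "1"]),
  n_no_family = as.vector(tab[, "0"])
)
res$perc_family <- 100 * res$n_family / (res$n_family + res$n_no_family)

print(res)
```

If a firm did switch industry or family status over time, unique() would keep one row per distinct combination, so that firm would be counted more than once.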

Answer 2

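Your verbal logic is not entirely clear to me (especially the statement about unique gvkey in the final output), so here are two results; you can pick the one you want: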

Result 1: counting with unique(df), i.e. one row per firm

# FUN returns, in order: non-family count (fam == 0), family count (fam == 1), family share in %
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~industry,
                                                     unique(df),
                                                     FUN = function(x) c(sum(x==0),sum(x==1),sum(x==1)/length(x)*100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

which gives

> dfout
  Industry NoFamCnt FamCnt FamPerc
1        4        5      0       0
2        5        1      0       0
3        6        1      1      50
4        9        0      1     100

Result 2: counting with df, i.e. one row per firm-year

# FUN returns, in order: non-family count (fam == 0), family count (fam == 1), family share in %
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~industry,
                                                     df,
                                                     FUN = function(x) c(sum(x==0),sum(x==1),sum(x==1)/length(x)*100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))

which gives

> dfout
  Industry NoFamCnt FamCnt   FamPerc
1        4       27      0   0.00000
2        5        6      0   0.00000
3        6        5      6  54.54545
4        9        0      6 100.00000
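The two results differ only in the unit of observation: firms versus firm-years. A minimal base R sketch of that difference follows. The sample data is rebuilt with rep(), and the names n_years, firm_level, pct_firms and pct_firm_years are illustrative. Because fam is a 0/1 dummy, its mean within an industry is the share of family observations.

```r
# Sample data from the question, rebuilt compactly:
# one entry per firm, repeated once per firm-year.
n_years <- c(5, 6, 6, 5, 6, 6, 6, 6, 4)
df <- data.frame(
  gvkey    = rep(c(1004, 1013, 1045, 1072, 1076, 1078, 1121, 1161, 1209), n_years),
  fam      = rep(c(0, 0, 0, 0, 1, 0, 1, 0, 0), n_years),
  industry = rep(c(6, 4, 5, 4, 9, 4, 6, 4, 4), n_years)
)

# One row per firm (the repeated firm-years collapse into a single row).
firm_level <- unique(df)

# Share of family observations per industry, in percent.
pct_firms      <- 100 * tapply(firm_level$fam, firm_level$industry, mean)
pct_firm_years <- 100 * tapply(df$fam, df$industry, mean)

# Side by side: the shares differ wherever firms have unequal numbers of years.
print(round(cbind(pct_firms, pct_firm_years), 2))
```

In industry 6, one of the two firms is a family firm, but it contributes 6 of the 11 firm-years, so the firm-year share is higher than the firm share. Which figure is appropriate depends on whether the comparison is meant to be across firms or across observations.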

Answer 3

Base R solution (note: keeping spaces in vector names is generally not good practice):

# Reshape / rename the input data (counts rows, i.e. firm-years, per fam/industry pair):
ir_df <- setNames(reshape(setNames(aggregate(.~fam+industry, df, length),
                                   c("fam", "industry", "count")),
                          direction = "wide",
                          idvar = "industry",
                          timevar = "fam"),
                  c("Industry", "Nonfamily Firms", "Family Firms"))

# Replace missing counts with 0 and add the percentage of family firms:
final_df <- transform(replace(ir_df, is.na(ir_df), 0),
                      `Percent Family Firms In Industry` =
                        round(`Family Firms` /
                                rowSums(ir_df[,grepl("family", tolower(names(ir_df)))], na.rm = TRUE)
                              * 100, 2))

Data:

df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L, 1004L, 1013L, 
1013L, 1013L, 1013L, 1013L, 1013L, 1045L, 1045L, 1045L, 1045L, 
1045L, 1045L, 1072L, 1072L, 1072L, 1072L, 1072L, 1076L, 1076L, 
1076L, 1076L, 1076L, 1076L, 1078L, 1078L, 1078L, 1078L, 1078L, 
1078L, 1121L, 1121L, 1121L, 1121L, 1121L, 1121L, 1161L, 1161L, 
1161L, 1161L, 1161L, 1161L, 1209L, 1209L, 1209L, 1209L), fam = c(0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), industry = c(6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L, 
9L, 4L, 4L, 4L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA, 
-50L))
