I created a data frame, df, that contains more than 8,000 firm-years:
gvkey = firm ID
fam = dummy (equals 1 if the firm is a family firm)
industry = categorical variable
gvkey fam industry
1 1004 0 6
2 1004 0 6
3 1004 0 6
4 1004 0 6
5 1004 0 6
6 1013 0 4
7 1013 0 4
8 1013 0 4
9 1013 0 4
10 1013 0 4
11 1013 0 4
12 1045 0 5
13 1045 0 5
14 1045 0 5
15 1045 0 5
16 1045 0 5
17 1045 0 5
18 1072 0 4
19 1072 0 4
20 1072 0 4
21 1072 0 4
22 1072 0 4
23 1076 1 9
24 1076 1 9
25 1076 1 9
26 1076 1 9
27 1076 1 9
28 1076 1 9
29 1078 0 4
30 1078 0 4
31 1078 0 4
32 1078 0 4
33 1078 0 4
34 1078 0 4
35 1121 1 6
36 1121 1 6
37 1121 1 6
38 1121 1 6
39 1121 1 6
40 1121 1 6
41 1161 0 4
42 1161 0 4
43 1161 0 4
44 1161 0 4
45 1161 0 4
46 1161 0 4
47 1209 0 4
48 1209 0 4
49 1209 0 4
50 1209 0 4
...
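This is what the output should look like (industry description = industry).
Verbal logic:
1) For all unique gvkey, create a column that counts the firms with fam = 0 in each industry.
2) For all unique gvkey, create a column that counts the firms with fam = 1 in each industry.
3) Create an output that shows the frequency of family and non-family firms in each industry.
Maybe this can even be done in a single piece of code?!
Thanks a lot!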
One dplyr option could be:
library(dplyr)

# Count distinct firms (gvkey) per industry, split by family status
df %>%
  group_by(industry) %>%
  summarise(n_family = n_distinct(gvkey[fam == 1]),
            n_no_family = n_distinct(gvkey[fam == 0]),
            perc_family = n_family / n_distinct(gvkey) * 100)
industry n_family n_no_family perc_family
<int> <int> <int> <dbl>
1 4 0 5 0
2 5 0 1 0
3 6 1 1 50
4 9 1 0 100
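If you would rather avoid a package dependency, the same per-firm count can be sketched in base R with a cross-tabulation. This is only a minimal illustration, not the approach used above: the `toy` data frame and its values are made up, and the names `firms`, `tab`, and `res` are hypothetical. It assumes fam and industry do not change within a gvkey.

```r
# Made-up toy data in the same shape as df (firm-year rows)
toy <- data.frame(
  gvkey    = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  fam      = c(0, 0, 1, 1, 0, 0, 1, 0, 0),
  industry = c(4, 4, 4, 4, 6, 6, 6, 4, 4)
)

# Collapse firm-years to one row per firm
firms <- unique(toy[c("gvkey", "fam", "industry")])

# Cross-tabulate industry by family status; fixing the factor levels
# keeps both columns even if one status never occurs
tab <- table(industry = firms$industry,
             fam = factor(firms$fam, levels = c(0, 1)))

res <- data.frame(
  industry    = as.integer(rownames(tab)),
  n_family    = as.vector(tab[, "1"]),
  n_no_family = as.vector(tab[, "0"])
)
res$perc_family <- res$n_family / (res$n_family + res$n_no_family) * 100
print(res)
```

Replacing `toy` with your own df should give the same counts as the dplyr version, as long as each firm keeps one family status and one industry.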
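Your verbal logic isn't entirely clear to me (especially the statement about unique gvkey in the final output), so I am providing two results here and you can see which one you want: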
Counting on unique(df), i.e. on distinct rows:
# FUN returns, in order: non-family count, family count, percent family
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
                                                     unique(df),
                                                     FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))
which gives
> dfout
  Industry NoFamCnt FamCnt FamPerc
1 4 5 0 0
2 5 1 0 0
3 6 1 1 50
4 9 0 1 100
Counting on df, i.e. on all firm-year rows:
# FUN returns, in order: non-family count, family count, percent family
dfout <- `colnames<-`(data.frame(as.matrix(aggregate(fam ~ industry,
                                                     df,
                                                     FUN = function(x) c(sum(x == 0), sum(x == 1), sum(x == 1) / length(x) * 100)))),
                      c("Industry", "NoFamCnt", "FamCnt", "FamPerc"))
which gives
> dfout
  Industry NoFamCnt FamCnt FamPerc
1 4 27 0 0.00000
2 5 6 0 0.00000
3 6 5 6 54.54545
4 9 0 6 100.00000
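Which of the two results is appropriate depends partly on whether a firm's family status is constant over its years. If a gvkey appears with both fam = 0 and fam = 1, the distinct-row count would include that firm in both columns. A minimal sketch of such a check, using made-up toy data and the hypothetical names `n_status` and `switchers`:

```r
# Made-up toy data: firm 2 switches from fam = 0 to fam = 1
toy <- data.frame(
  gvkey    = c(1, 1, 2, 2, 2),
  fam      = c(0, 0, 0, 1, 1),
  industry = c(4, 4, 6, 6, 6)
)

# Number of distinct fam values observed for each firm
n_status <- tapply(toy$fam, toy$gvkey, function(x) length(unique(x)))

# Firms that appear with more than one family status
switchers <- names(n_status)[n_status > 1]
print(switchers)
```

If `switchers` is empty for your real data, counting distinct rows counts each firm once, provided industry is also constant within a firm and df has no further columns such as a year.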
Base R solution (note: keeping spaces in column names is generally not good practice):
# Reshape / Rename the input data:
ir_df <- setNames(reshape(setNames(aggregate(.~fam+industry, df, length),
c("fam", "industry", "count")),
direction = "wide",
idvar = "industry",
timevar = "fam"), c("Industry", "Nonfamily Firms", "Family Firms"))
# Transform the data frame to contain the final equation:
final_df <- transform(replace(ir_df, is.na(ir_df), 0),
`Percent Family Firms In Industry` =
round(`Family Firms` /
rowSums(ir_df[,grepl("family", tolower(names(ir_df)))], na.rm = TRUE)
* 100, 2))
Data:
df <- structure(list(gvkey = c(1004L, 1004L, 1004L, 1004L, 1004L, 1013L,
1013L, 1013L, 1013L, 1013L, 1013L, 1045L, 1045L, 1045L, 1045L,
1045L, 1045L, 1072L, 1072L, 1072L, 1072L, 1072L, 1076L, 1076L,
1076L, 1076L, 1076L, 1076L, 1078L, 1078L, 1078L, 1078L, 1078L,
1078L, 1121L, 1121L, 1121L, 1121L, 1121L, 1121L, 1161L, 1161L,
1161L, 1161L, 1161L, 1161L, 1209L, 1209L, 1209L, 1209L), fam = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L), industry = c(6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L, 4L, 4L, 4L, 4L, 4L, 9L, 9L, 9L, 9L, 9L,
9L, 4L, 4L, 4L, 4L, 4L, 4L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L)), class = "data.frame", row.names = c(NA,
-50L))