I have a dataset that looks like this:
data <- data.frame(
Col1 = c("id1", "id2", "id3", "id4","id5", "id6", "id7", "id8"),
Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)
data
Col1 Col2 Col3 Col4 Col5 Col6
1 id1 A BK CA Ao Bc
2 id2 Bc AB XB Bu Bc
3 id3 A BsC CA Ai Bc
4 id4 As BX SC Ayy Bc
5 id5 As BK CA Ao Bc
6 id6 Bs AsB CA Byu Bc
7 id7 A BC CA Aiy Be
8 id8 A BX SC Ay Bd
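If a single category is over-represented in a column, that column should be dropped. For example, with a threshold of 0.74 (74%), the filtered data would drop Col6, because category Bc is over-represented there (6/8 = 75%). The filtered data would look like this: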
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay
Alternatively, with a threshold of 60%, the filtered data would drop both Col4 and Col6, because category CA (in Col4) is over-represented (5/8 = 62.5%) and so is Bc (in Col6, 6/8 = 75%). The filtered data would look like this:
Col1 Col2 Col3 Col5
1 id1 A BK Ao
2 id2 Bc AB Bu
3 id3 A BsC Ai
4 id4 As BX Ayy
5 id5 As BK Ao
6 id6 Bs AsB Byu
7 id7 A BC Aiy
8 id8 A BX Ay
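Loop over the columns, compute the frequency table for each one, and check whether the largest proportion is below the threshold: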
x = 0.74
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
# Col1 Col2 Col3 Col4 Col5
# 1 id1 A BK CA Ao
# 2 id2 Bc AB XB Bu
# 3 id3 A BsC CA Ai
# 4 id4 As BX SC Ayy
# 5 id5 As BK CA Ao
# 6 id6 Bs AsB CA Byu
# 7 id7 A BC CA Aiy
# 8 id8 A BX SC Ay
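To see why each threshold drops the columns it does, it can help to look at the largest category share per column before filtering. This is only a sketch that reuses the data frame from the question; the names max_share and shares are made up for illustration.

```r
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: share of the most frequent category in one column
max_share <- function(col) max(prop.table(table(col)))

# One value per column, named after the column
shares <- sapply(data, max_share)
shares
#  Col1  Col2  Col3  Col4  Col5  Col6
# 0.125 0.500 0.250 0.625 0.250 0.750

names(data)[shares < 0.74]  # Col1 to Col5 are kept
names(data)[shares < 0.60]  # Col1, Col2, Col3 and Col5 are kept
```

Only Col6 (0.75) is at or above 0.74, while both Col4 (0.625) and Col6 are at or above 0.6, which matches the two expected results in the question.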
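Here is a base R solution that always keeps the ID column (Col1) and applies a threshold of 0.6 to the remaining columns: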
data[c(TRUE, apply(t(data[-1]),1,function(x) max(table(x)))/nrow(data) < 0.6)]
#> Col1 Col2 Col3 Col5
#> 1 id1 A BK Ao
#> 2 id2 Bc AB Bu
#> 3 id3 A BsC Ai
#> 4 id4 As BX Ayy
#> 5 id5 As BK Ao
#> 6 id6 Bs AsB Byu
#> 7 id7 A BC Aiy
#> 8 id8 A BX Ay
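The idea of protecting the ID column could be generalised into a small function. The following is a sketch under my own assumptions, not part of the answer above: the function name drop_overrepresented, its keep argument, and the toy data frame are all hypothetical.

```r
# Hypothetical helper: drop every column whose most frequent category
# makes up `threshold` or more of the rows, except columns named in `keep`.
drop_overrepresented <- function(df, threshold, keep = character()) {
  top_share <- vapply(
    df,
    function(col) max(table(col)) / length(col),
    numeric(1)
  )
  df[names(df) %in% keep | top_share < threshold]
}

# Toy data (made up): `site` is constant, `g` is 75% "x"
toy <- data.frame(
  id   = c("a", "b", "c", "d"),
  site = c("s1", "s1", "s1", "s1"),
  g    = c("x", "x", "x", "y"),
  h    = c("p", "q", "r", "s")
)

names(drop_overrepresented(toy, 0.6))                 # "id" "h"
names(drop_overrepresented(toy, 0.6, keep = "site"))  # "id" "site" "h"
```

Naming the protected columns avoids relying on the ID being the first column, which the positional c(TRUE, ...) trick assumes.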
Here is a solution using apply, any, and proportions:
thresh <- 0.74
overrepcols <- apply(data, 2, function(x) any(proportions(table(x)) > thresh))
data[,!overrepcols]
Output:
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay
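One detail worth checking is what happens when a share is exactly equal to the threshold, because the answers in this thread do not all compare the same way. A small sketch, using the Col6 values from the question and a threshold of 0.75 chosen purely for illustration; prop.table is used here as the older name for what proportions computes.

```r
col6 <- c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
p <- prop.table(table(col6))  # Bc = 0.75, Bd = 0.125, Be = 0.125
thresh <- 0.75                # illustrative value, equal to the top share

# "Keep if the largest share is strictly below the threshold"
keep_strict <- max(p) < thresh   # FALSE: the column would be dropped

# "Drop only if some share is strictly above the threshold"
keep_any <- !any(p > thresh)     # TRUE: the column would be kept
```

With thresholds such as 0.74 or 0.6 the two forms agree on this data; they only differ when a share lands exactly on the threshold, so it may be worth deciding which behaviour is intended.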
Another answer, using dplyr together with base R:
library(dplyr)
data %>%
  select_if(~ max(table(.x)) / length(.x) < 0.6)
# Col1 Col2 Col3 Col5
# 1 id1 A BK Ao
# 2 id2 Bc AB Bu
# 3 id3 A BsC Ai
# 4 id4 As BX Ayy
# 5 id5 As BK Ao
# 6 id6 Bs AsB Byu
# 7 id7 A BC Aiy
# 8 id8 A BX Ay
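The question's data has no missing values, but if yours does, note that dividing by length(.x) and using prop.table(table(x)) can give different shares, because table() leaves out NA by default. A minimal sketch with a made-up vector; the variable names are only for illustration.

```r
x <- c("a", "a", NA, NA)  # made-up example with missing values

# Share among non-missing values only (NA is excluded from the table)
share_non_missing <- max(prop.table(table(x)))                    # 1

# Share of all rows, as in max(table(.x)) / length(.x)
share_all_rows <- max(table(x)) / length(x)                       # 0.5

# Counting NA as its own category
share_na_as_level <- max(prop.table(table(x, useNA = "ifany")))   # 0.5
```

Which definition is appropriate depends on whether missing values should count toward the denominator in your use case.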
With current dplyr syntax, we can use select(where(condition)):
library(dplyr)
threshold <- 0.74
data |>
select(where(\(x) !any(proportions(table(x)) > threshold)))
Col1 Col2 Col3 Col4 Col5
1 id1 A BK CA Ao
2 id2 Bc AB XB Bu
3 id3 A BsC CA Ai
4 id4 As BX SC Ayy
5 id5 As BK CA Ao
6 id6 Bs AsB CA Byu
7 id7 A BC CA Aiy
8 id8 A BX SC Ay