Remove columns in which a single category is over-represented

Question (votes: 0, answers: 5)

I have a dataset that looks like this:

data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4","id5",  "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

data

  Col1 Col2 Col3 Col4 Col5 Col6
1  id1    A   BK   CA   Ao   Bc
2  id2   Bc   AB   XB   Bu   Bc
3  id3    A  BsC   CA   Ai   Bc
4  id4   As   BX   SC  Ayy   Bc
5  id5   As   BK   CA   Ao   Bc
6  id6   Bs  AsB   CA  Byu   Bc
7  id7    A   BC   CA  Aiy   Be
8  id8    A   BX   SC   Ay   Bd

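I need to drop every column in which a single category is over-represented. For example, if the threshold is 0.74 (74%), then filtered data drops Col6, because category Bc is over-represented there (6/8 = 75%). filtered data would then look like this: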

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay

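Alternatively, if the threshold is 60%, then filtered data drops Col4 and Col6, because category CA (in Col4) is over-represented (5/8 = 62.5%) and category Bc (in Col6) is over-represented (6/8 = 75%). filtered data would then look like this: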

  Col1 Col2 Col3 Col5
1  id1    A   BK   Ao
2  id2   Bc   AB   Bu
3  id3    A  BsC   Ai
4  id4   As   BX  Ayy
5  id5   As   BK   Ao
6  id6   Bs  AsB  Byu
7  id7    A   BC  Aiy
8  id8    A   BX   Ay
r dataframe dplyr
5 Answers

2 votes

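Loop over the columns, build a frequency table for each, and check whether its largest proportion is below the threshold: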

x = 0.74
data[ sapply(data, function(i) max(prop.table(table(i)))) < x ]
#   Col1 Col2 Col3 Col4 Col5
# 1  id1    A   BK   CA   Ao
# 2  id2   Bc   AB   XB   Bu
# 3  id3    A  BsC   CA   Ai
# 4  id4   As   BX   SC  Ayy
# 5  id5   As   BK   CA   Ao
# 6  id6   Bs  AsB   CA  Byu
# 7  id7    A   BC   CA  Aiy
# 8  id8    A   BX   SC   Ay
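
To see why a given threshold drops the columns it does, it can help to print the largest category share for each column before filtering. The sketch below rebuilds the question's data so it runs on its own; max_share, flagged_74 and flagged_60 are names made up for this illustration.

```r
# Sample data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Share of the most frequent category in each column
# (max_share is a hypothetical name, not part of the answer above)
max_share <- sapply(data, function(col) max(prop.table(table(col))))
print(max_share)
# Col1 = 0.125, Col2 = 0.5, Col3 = 0.25, Col4 = 0.625, Col5 = 0.25, Col6 = 0.75

# Columns that would be dropped at each threshold from the question
flagged_74 <- names(max_share)[max_share >= 0.74]  # "Col6"
flagged_60 <- names(max_share)[max_share >= 0.60]  # "Col4" "Col6"
```

Col1 holds unique ids, so its largest share is only 1/8 and it survives any threshold above 0.125 without special handling.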

2 votes

Here is a base R solution:

data[c(TRUE, apply(t(data[-1]),1,function(x) max(table(x)))/nrow(data) < 0.6)]

#>   Col1 Col2 Col3 Col5
#> 1  id1    A   BK   Ao
#> 2  id2   Bc   AB   Bu
#> 3  id3    A  BsC   Ai
#> 4  id4   As   BX  Ayy
#> 5  id5   As   BK   Ao
#> 6  id6   Bs  AsB  Byu
#> 7  id7    A   BC  Aiy
#> 8  id8    A   BX   Ay
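
The one-liner above hard-codes both the threshold and the assumption that the first column is an id to keep. If the same filter is needed in several places, it could be wrapped in a small function. This is only a sketch: drop_overrepresented and its always_keep argument are hypothetical names, and it assumes the columns contain no missing values.

```r
# Sample data from the question
data <- data.frame(
  Col1 = c("id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"),
  Col2 = c("A", "Bc", "A", "As", "As", "Bs", "A", "A"),
  Col3 = c("BK", "AB", "BsC", "BX", "BK", "AsB", "BC", "BX"),
  Col4 = c("CA", "XB", "CA", "SC", "CA", "CA", "CA", "SC"),
  Col5 = c("Ao", "Bu", "Ai", "Ayy", "Ao", "Byu", "Aiy", "Ay"),
  Col6 = c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")
)

# Hypothetical helper: keep a column when its most frequent category
# covers less than `threshold` of the rows, or when it is listed in
# `always_keep` (e.g. an id column). Assumes no NA values.
drop_overrepresented <- function(df, threshold, always_keep = character()) {
  max_share <- vapply(
    df,
    function(col) max(table(col)) / length(col),
    numeric(1)
  )
  keep <- max_share < threshold | names(df) %in% always_keep
  # drop = FALSE returns a data frame even if only one column remains
  df[, keep, drop = FALSE]
}

res_74 <- drop_overrepresented(data, 0.74, always_keep = "Col1")
res_60 <- drop_overrepresented(data, 0.60, always_keep = "Col1")
names(res_74)  # "Col1" "Col2" "Col3" "Col4" "Col5"
names(res_60)  # "Col1" "Col2" "Col3" "Col5"
```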

2 votes

Here is a solution using apply, any and proportions:

thresh <- 0.74

overrepcols <- apply(data, 2, function(x) any(proportions(table(x)) > thresh))

data[, !overrepcols, drop = FALSE]  # drop = FALSE keeps a data frame even if only one column remains

Output:

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay
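
One detail to watch when comparing the answers in this thread: this one drops a column only when a share is strictly greater than the threshold, while the answers that keep columns with a maximum share below the threshold also drop a column whose share equals it. The two rules agree for 0.74 and 0.6 on this data, but not when the threshold is exactly 0.75. A minimal sketch using Col6 from the question (prop.table() is used here; proportions() is a newer name for the same function, and the variable names are made up for illustration):

```r
# Col6 from the question: "Bc" appears in 6 of 8 rows
col6 <- c("Bc", "Bc", "Bc", "Bc", "Bc", "Bc", "Be", "Bd")

share  <- max(prop.table(table(col6)))  # 6/8 = 0.75
thresh <- 0.75

dropped_strict    <- share > thresh     # FALSE: the column is kept
dropped_inclusive <- share >= thresh    # TRUE: the column is dropped
```

Which comparison is right depends on whether "over the threshold" is meant to include the threshold itself, so it is worth deciding that explicitly.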

2 votes

Another answer, using dplyr and base R:

library(dplyr)  # provides %>% and select_if() (superseded by select(where()))
data %>%
  select_if(~max(table(.x)) / length(.x) < 0.6)

#    Col1 Col2 Col3 Col5
# 1  id1    A   BK   Ao
# 2  id2   Bc   AB   Bu
# 3  id3    A  BsC   Ai
# 4  id4   As   BX  Ayy
# 5  id5   As   BK   Ao
# 6  id6   Bs  AsB  Byu
# 7  id7    A   BC  Aiy
# 8  id8    A   BX   Ay

2 votes

We can use the current dplyr syntax, select(where(condition)):

library(dplyr)

threshold <- 0.74
data |> 
    select(where(\(x) !any(proportions(table(x)) > threshold)))

  Col1 Col2 Col3 Col4 Col5
1  id1    A   BK   CA   Ao
2  id2   Bc   AB   XB   Bu
3  id3    A  BsC   CA   Ai
4  id4   As   BX   SC  Ayy
5  id5   As   BK   CA   Ao
6  id6   Bs  AsB   CA  Byu
7  id7    A   BC   CA  Aiy
8  id8    A   BX   SC   Ay