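I have a data frame whose columns are groups, and the cells of each column are the species belonging to that group. I need to convert it into a binary matrix in which the columns are still the groups but the rows are the species: a cell is 1 if the species was originally in that column's group, and 0 otherwise.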
# Load the dplyr package
library(dplyr)
# Create a list of vectors with different lengths
list_of_vectors <- list(
Z1 = c("E","F","G"),
Z2 = c("A", "B", "C", "D"),
Z3 = c("H","I","J","K","L")
)
# Find the maximum length
max_length <- max(sapply(list_of_vectors, length))
# Pad the vectors with NA to make them the same length
padded_vectors <- lapply(list_of_vectors, function(x) c(x, rep(NA, max_length - length(x))))
# Create the data frame using dplyr
df <- as.data.frame(bind_cols(padded_vectors))
I want to go from this:
# data frame
Z1 Z2 Z3
1 E A H
2 F B I
3 G C J
4 NA D K
5 NA NA L
to this:
# binary matrix
Z1 Z2 Z3
E 1 0 0
F 1 0 ...
G 1 0
A 0 1
B 0 1
C 0 1
D .. 1
H 0 1
I 1
J ...
K
L
Thanks!
out <- +sapply(df, `%in%`, x = sort(unique(na.omit(unlist(df)))))
rownames(out) <- sort(unique(na.omit(unlist(df))))
out
# Z1 Z2 Z3
# A 0 1 0
# B 0 1 0
# C 0 1 0
# D 0 1 0
# E 1 0 0
# F 1 0 0
# G 1 0 0
# H 0 0 1
# I 0 0 1
# J 0 0 1
# K 0 0 1
# L 0 0 1
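To make it easier to see what the expression above does, here is a minimal sketch that unpacks it into steps. It assumes the same df as in the question; the intermediate names species, present and out_steps are mine and only serve the illustration.

```r
# Same data frame as in the question (NA pads the shorter columns)
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("A", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# 1. All distinct species across every column, NAs dropped, sorted
species <- sort(unique(na.omit(unlist(df))))

# 2. For each column, test which species occur in it.
#    sapply() binds the per-column logical vectors into a matrix.
present <- sapply(df, function(col) species %in% col)

# 3. Unary plus coerces TRUE/FALSE to 1/0 and keeps the matrix shape
out_steps <- +present
rownames(out_steps) <- species
out_steps
```

The unary `+` is only a compact way to turn the logical matrix into an integer one; writing `present * 1L` should work just as well.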
Or as a one-liner:
with(list(r = sort(unique(na.omit(unlist(df))))),
`rownames<-`(+sapply(df, `%in%`, x = r), r))
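Notes:
I added na.omit because I don't think you want to know where NA is present; take it out if you find that useful.
I added sort because I think the result makes more sense visually, but it is entirely optional.
unique is not strictly required, but without it the result would also contain rows with the same name.
Last, this is an indicator of presence, which means that if a letter is repeated within a column, we still only see 1: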
df$Z2[1] <- "B"
with(list(r = sort(unique(na.omit(unlist(df))))), `rownames<-`(+sapply(df, `%in%`, x = r), r))
# Z1 Z2 Z3
# B 0 1 0
# C 0 1 0
# D 0 1 0
# E 1 0 0
# F 1 0 0
# G 1 0 0
# H 0 0 1
# I 0 0 1
# J 0 0 1
# K 0 0 1
# L 0 0 1
If you need counts instead, then we need
with(list(r = sort(unique(na.omit(unlist(df))))),
`rownames<-`(sapply(df, function(col) colSums(outer(col, r, `==`), na.rm = TRUE)), r))
# Z1 Z2 Z3
# B 0 2 0
# C 0 1 0
# D 0 1 0
# E 1 0 0
# F 1 0 0
# G 1 0 0
# H 0 0 1
# I 0 0 1
# J 0 0 1
# K 0 0 1
# L 0 0 1
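As a small check of how the two versions relate, the sketch below thresholds the counts at zero and compares the result with the presence indicator. It assumes the modified df in which "B" appears twice in Z2; the names counts, presence and from_counts are mine.

```r
# Modified data frame: "B" is repeated in Z2, "A" no longer appears
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("B", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)
r <- sort(unique(na.omit(unlist(df))))

# Counts: how many times each species occurs in each column
counts <- sapply(df, function(col) colSums(outer(col, r, `==`), na.rm = TRUE))
rownames(counts) <- r

# Presence: whether each species occurs in each column at all
presence <- +sapply(df, function(col) r %in% col)
rownames(presence) <- r

# Any positive count means "present", so this should match presence
from_counts <- +(counts > 0)
all(from_counts == presence)
```

So if both matrices are needed, it may be enough to compute the counts once and derive the indicator from them.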
Data
df <- structure(list(Z1 = c("E", "F", "G", NA, NA), Z2 = c("A", "B", "C", "D", NA), Z3 = c("H", "I", "J", "K", "L")), row.names = c(NA, -5L), class = "data.frame")
Perhaps you can use table like below
> table(stack(df))[na.omit(unlist(df)), ]
ind
values Z1 Z2 Z3
E 1 0 0
F 1 0 0
G 1 0 0
A 0 1 0
B 0 1 0
C 0 1 0
D 0 1 0
H 0 0 1
I 0 0 1
J 0 0 1
K 0 0 1
L 0 0 1
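where na.omit(unlist(df)) supplies the row names directly, so indexing with it reorders the rows of the table into their order of appearance in df.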
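Two things may be worth keeping in mind with the table approach: table returns counts rather than a 0/1 indicator, and if a species is repeated in df, indexing with na.omit(unlist(df)) repeats its row. The sketch below is one possible way to handle both, assuming the modified df with "B" twice in Z2; the names tab, ord and m are mine.

```r
# Modified data frame: "B" is repeated in Z2
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("B", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# Cross-tabulate species (values) against group (ind); NAs are dropped
tab <- table(stack(df))

# Order of first appearance, with repeated species listed only once
ord <- unique(na.omit(unlist(df)))

# Drop the table class, reorder the rows, and reduce counts to 0/1
m <- +(unclass(tab)[ord, , drop = FALSE] > 0)
m
```

Leaving out the `> 0` step (and the unary `+`) keeps the counts instead, which corresponds to the count version shown earlier.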