Convert a data frame to a binary matrix whose row names are the cell values of the original df and whose column names are the df headers

Question (votes: 0, answers: 2)

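I have a data frame whose columns are groups; the cells in each column are the species belonging to that group. I need to convert it to a binary matrix in which the columns are still the headers (groups) but the rows are the species. A cell should be 1 if the species originally appeared in that group's column and 0 otherwise.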

# Load the dplyr package
library(dplyr)

# Create a list of vectors with different lengths
list_of_vectors <- list(
  Z1 = c("E","F","G"),
  Z2 = c("A", "B", "C", "D"),
  Z3 = c("H","I","J","K","L")
)

# Find the maximum length
max_length <- max(sapply(list_of_vectors, length))

# Pad the vectors with NA to make them the same length
padded_vectors <- lapply(list_of_vectors, function(x) c(x, rep(NA, max_length - length(x))))

# Create the data frame using dplyr
df <- as.data.frame(bind_cols(padded_vectors))

I want to go from this:

# data frame
   Z1   Z2    Z3
1   E    A     H
2   F    B     I
3   G    C     J
4   NA   D     K
5   NA   NA    L

to this:

# binary matrix
   Z1   Z2  Z3
E  1    0    0
F  1    0    ...
G  1    0
A  0    1
B  0    1
C  0    1
D  ..   1
H       0    1
I            1
J            ...
K
L

Thanks!

r dataframe matrix converters
2 Answers

3 votes
out <- +sapply(df, `%in%`, x = sort(unique(na.omit(unlist(df)))))
rownames(out) <- sort(unique(na.omit(unlist(df))))
out
#   Z1 Z2 Z3
# A  0  1  0
# B  0  1  0
# C  0  1  0
# D  0  1  0
# E  1  0  0
# F  1  0  0
# G  1  0  0
# H  0  0  1
# I  0  0  1
# J  0  0  1
# K  0  0  1
# L  0  0  1

Or as a one-liner:

with(list(r = sort(unique(na.omit(unlist(df))))), 
     `rownames<-`(+sapply(df, `%in%`, x = r), r))

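Notes:

  • I added na.omit because I don't think you want to know where NA is present. If that is useful to you, leave it out.

  • I added sort because I think the result makes more sense visually, but it is completely optional.

  • unique is not strictly required, but without it a value that appears more than once produces duplicate rows with the same name.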
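
The one-liner packs several steps together. A minimal sketch that unpacks them, assuming the same data as in the question (the intermediate names species and present are introduced here for illustration only):

```r
# Same data as in the question; NA pads the shorter columns.
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("A", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# Step 1: every distinct, non-missing cell value becomes a row label.
species <- sort(unique(na.omit(unlist(df))))

# Step 2: for each column, test which labels occur in it.
# sapply() returns a logical matrix with one column per df column.
present <- sapply(df, function(col) species %in% col)

# Step 3: unary plus coerces TRUE/FALSE to 1/0; then attach the row names.
out <- +present
rownames(out) <- species
out
```

Each row of out should contain exactly one 1 here, because every species in the sample data belongs to a single group.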

One last caveat: this is an indicator of presence, which means that if a value is repeated within a column, we still only see 1:

df$Z2[1] <- "B"
with(list(r = sort(unique(na.omit(unlist(df))))), `rownames<-`(+sapply(df, `%in%`, x = r), r))
#   Z1 Z2 Z3
# B  0  1  0
# C  0  1  0
# D  0  1  0
# E  1  0  0
# F  1  0  0
# G  1  0  0
# H  0  0  1
# I  0  0  1
# J  0  0  1
# K  0  0  1
# L  0  0  1

If you need counts instead, then we need:

with(list(r = sort(unique(na.omit(unlist(df))))), 
     `rownames<-`(sapply(df, function(col) colSums(outer(col, r, `==`), na.rm = TRUE)), r))
#   Z1 Z2 Z3
# B  0  2  0
# C  0  1  0
# D  0  1  0
# E  1  0  0
# F  1  0  0
# G  1  0  0
# H  0  0  1
# I  0  0  1
# J  0  0  1
# K  0  0  1
# L  0  0  1
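
To see why the count version works, here is a small sketch of the outer() step on its own. The vectors col and r below are hypothetical toy inputs, not taken from the question:

```r
col <- c("B", "B", "C", NA)   # hypothetical column with a repeat and an NA
r   <- c("B", "C", "D")       # row labels to count

# outer() builds a length(col) x length(r) logical matrix:
# entry [i, j] is TRUE when col[i] == r[j], and NA when col[i] is NA.
eq <- outer(col, r, `==`)

# Summing each column counts the matches; na.rm = TRUE skips the NA row.
counts <- colSums(eq, na.rm = TRUE)
names(counts) <- r
counts
```

Here "B" is counted twice, "C" once, and "D" not at all, which is the per-column behaviour that sapply() then repeats for every column of df.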

Data

df <- structure(list(Z1 = c("E", "F", "G", NA, NA), Z2 = c("A", "B", "C", "D", NA), Z3 = c("H", "I", "J", "K", "L")), row.names = c(NA, -5L), class = "data.frame")

0 votes

Perhaps you can use table, as below:

> table(stack(df))[na.omit(unlist(df)), ]
      ind
values Z1 Z2 Z3
     E  1  0  0
     F  1  0  0
     G  1  0  0
     A  0  1  0
     B  0  1  0
     C  0  1  0
     D  0  1  0
     H  0  0  1
     I  0  0  1
     J  0  0  1
     K  0  0  1
     L  0  0  1

where na.omit(unlist(df)) can be used directly as row names to index the table, which also puts the rows in their original order.
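
Note that table() returns counts rather than 0/1 indicators, and that a repeated value in the row index would repeat its row. A hedged sketch of one way to handle both, assuming the modified data from the first answer (with "B" appearing twice in Z2); the names long, row_order and binary are introduced here for illustration:

```r
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("B", "B", "C", "D", NA),   # "B" occurs twice in this column
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# Long format: one row per cell, with columns `values` and `ind`.
long <- stack(df)

# Cross-tabulate; table() drops NA values by default.
counts <- table(long)

# Row order of first appearance, with NA and repeats removed.
row_order <- unique(na.omit(unlist(df)))
counts <- counts[row_order, ]

# Convert the counts to a plain 0/1 integer matrix.
binary <- unclass(counts)
binary[] <- as.integer(binary > 0)
binary
```

With this data, counts holds 2 for "B" in Z2 while binary holds 1, so the choice between the two depends on whether multiplicity matters.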
