Set matrix values to 0 if row and column names start with the same prefix

Question

Suppose you have the following data frame:

df <- data.frame(industry = c("DEU_10T12", "DEU_13T15", "DEU_16", "DEU_17", "ITA_10T12", "ITA_13T15", "ITA_16", "ITA_17"),
DEU_10T12 = c(20, 24, 26, 20, 10, 0, NA, 1.5),DEU_13T15 = c(15, 16, 4.5, NA, 7.5, 5, 3, 0),
DEU_16 = c(1.5, 6, 4, 0, 0.5, 15, 3, 0.5),DEU_17 = c(NA, 20, 10, 2, 0, 0, 0, 7),
ITA_10T12 = c(0.5, 2, 3, 4, 10, 50, 2, 15), ITA_13T15 = c(25, 0, 4.5, NA, 17.5, 5, 13, 0.9),
ITA_16 = c(2, 3, 40, 20, 0.5, 15, 3, 1),ITA_17 = c(1, 9, 0.5, 2, 10, 20, 50, 7))

The goal is to obtain the following matrix (it should be numeric and handle NA values when summing):

df2 <- data.frame(industry = c("DEU_10T12", "DEU_13T15", "DEU_16", "DEU_17", "ITA_10T12", "ITA_13T15", "ITA_16", "ITA_17"),
DEU_10T12 = c(0, 0, 0, 0, 10, 0, NA, 1.5),DEU_13T15 = c(0, 0, 0, 0, 7.5, 5, 3, 0),
DEU_16 = c(0, 0, 0, 0, 0.5, 15, 3, 0.5),DEU_17 = c(0, 0, 0, 0, 0, 0, 0, 7),
ITA_10T12 = c(0.5, 2, 3, 4, 0, 0, 0, 0),  ITA_13T15 = c(25, 0, 4.5, NA, 0, 0, 0, 0),
ITA_16 = c(2, 3, 40, 20, 0, 0, 0, 0),ITA_17 = c(1, 9, 0.5, 2, 0, 0, 0, 0))

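The new matrix (df2, converted to numeric) should mirror the values of the original matrix (df, also numeric), except where a row name shares its first three characters with the corresponding column name. In that case, for example the row DEU_10T12 and any column starting with DEU, the value is set to zero, regardless of any existing NA.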

Here is what I tried. First, I converted df to a numeric matrix:

# Extract row and column names
row_names <- df$industry
col_names <- colnames(df)[-1]  # Exclude 'industry' column

# Create an empty matrix
Z <- matrix(NA, nrow = length(row_names), ncol = length(col_names), dimnames = list(row_names, col_names))

# Fill in the matrix with values from the data frame
for (i in 1:length(row_names)) {
  for (j in 1:length(col_names)) {
    Z[i, j] <- df[i, col_names[j]]
  }
}

# Create an empty matrix for Z_narrow
Z_narrow = matrix(0, nrow = nrow(Z), ncol = ncol(Z))
# Assign row and column names
rownames(Z_narrow) = rownames(Z)
colnames(Z_narrow) = colnames(Z)

# Function to get the indices of columns to be replaced with zeros based on the first three characters of the column name
get_zero_indices <- function(col_name, row_names) {substr(col_name, 1, 3) == substr(row_names, 1, 3)}


# Loop through each row of Z to populate Z_narrow
for (i in 1:nrow(Z)) {
  row_name <- rownames(Z)[i]
  indices_to_zero <- sapply(colnames(Z), get_zero_indices, row_names = row_name)
  Z_narrow[i, indices_to_zero] <- 0
  Z_narrow[i, !indices_to_zero] <- Z[i, !indices_to_zero]
}

This code works on this small dataset, but it crashes R when applied to a larger one. Any suggestions?

Answer 1

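You can melt the original data frame to long format, set the value to 0 wherever the first three characters match, and then cast it back to wide format: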

library(data.table)
setDT(df)
dcast(
  melt(df,id.vars = "industry")[substr(industry,1,3) == substr(variable,1,3), value:=0],
  industry~variable
)

Output:

    industry DEU_10T12 DEU_13T15 DEU_16 DEU_17 ITA_10T12 ITA_13T15 ITA_16 ITA_17
      <char>     <num>     <num>  <num>  <num>     <num>     <num>  <num>  <num>
1: DEU_10T12       0.0       0.0    0.0      0       0.5      25.0      2    1.0
2: DEU_13T15       0.0       0.0    0.0      0       2.0       0.0      3    9.0
3:    DEU_16       0.0       0.0    0.0      0       3.0       4.5     40    0.5
4:    DEU_17       0.0       0.0    0.0      0       4.0        NA     20    2.0
5: ITA_10T12      10.0       7.5    0.5      0       0.0       0.0      0    0.0
6: ITA_13T15       0.0       5.0   15.0      0       0.0       0.0      0    0.0
7:    ITA_16        NA       3.0    3.0      0       0.0       0.0      0    0.0
8:    ITA_17       1.5       0.0    0.5      7       0.0       0.0      0    0.0

Another approach, which uses no reshaping at all:

mask = apply(df, 1, \(x) c(FALSE, substr(x[1], 1, 3) == substr(names(x[2:length(x)]), 1, 3)))
df[t(mask)] <- 0

Output:

   industry DEU_10T12 DEU_13T15 DEU_16 DEU_17 ITA_10T12 ITA_13T15 ITA_16 ITA_17
1 DEU_10T12       0.0       0.0    0.0      0       0.5      25.0      2    1.0
2 DEU_13T15       0.0       0.0    0.0      0       2.0       0.0      3    9.0
3    DEU_16       0.0       0.0    0.0      0       3.0       4.5     40    0.5
4    DEU_17       0.0       0.0    0.0      0       4.0        NA     20    2.0
5 ITA_10T12      10.0       7.5    0.5      0       0.0       0.0      0    0.0
6 ITA_13T15       0.0       5.0   15.0      0       0.0       0.0      0    0.0
7    ITA_16        NA       3.0    3.0      0       0.0       0.0      0    0.0
8    ITA_17       1.5       0.0    0.5      7       0.0       0.0      0    0.0
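The same logical mask can also be built without apply(), which converts every row to character before comparing. The following is only a minimal sketch, assuming a plain data.frame with the same layout as df (an id column followed by numeric columns); toy, value_cols and same_prefix are illustrative names, not part of the original answer.

```r
# Toy data frame with the same layout as df: an id column plus one
# numeric column per industry (values here are illustrative only).
toy <- data.frame(
  industry = c("DEU_16", "DEU_17", "ITA_16", "ITA_17"),
  DEU_16 = c(4, 0, 3, 0.5),
  DEU_17 = c(10, 2, NA, 7),
  ITA_16 = c(40, NA, 3, 1),
  ITA_17 = c(0.5, 2, 50, 7)
)

value_cols <- names(toy)[-1]

# outer() compares every row prefix with every column prefix in one call,
# giving a logical matrix: one row per industry, one column per value column.
same_prefix <- outer(substr(toy$industry, 1, 3),
                     substr(value_cols, 1, 3),
                     "==")

# Index only the value columns (toy[-1]) so the industry column is untouched.
# Cells whose row and column prefixes match become 0, even if they were NA.
toy[-1][same_prefix] <- 0

toy
```

Because the comparison happens once over two short character vectors, the cost of building the mask depends on the number of rows times the number of columns rather than on a function call per row.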

Answer 2

The same approach as @langtang's, but using tidyverse functions:

library(tidyverse)

df |> 
  pivot_longer(-industry) |> 
  mutate(value = ifelse(substr(industry,1,3)==substr(name,1,3),0,value)) |> 
  pivot_wider()


  industry  DEU_10T12 DEU_13T15 DEU_16 DEU_17 ITA_10T12 ITA_13T15 ITA_16 ITA_17
  <chr>         <dbl>     <dbl>  <dbl>  <dbl>     <dbl>     <dbl>  <dbl>  <dbl>
1 DEU_10T12       0         0      0        0       0.5      25        2    1  
2 DEU_13T15       0         0      0        0       2         0        3    9  
3 DEU_16          0         0      0        0       3         4.5     40    0.5
4 DEU_17          0         0      0        0       4        NA       20    2  
5 ITA_10T12      10         7.5    0.5      0       0         0        0    0  
6 ITA_13T15       0         5     15        0       0         0        0    0  
7 ITA_16         NA         3      3        0       0         0        0    0  
8 ITA_17          1.5       0      0.5      7       0         0        0    0

Answer 3

In base R, instead of looping over individual rows and columns, find the unique prefixes and loop over those:

out <- as.matrix(df[, -1])
rownames(out) <- df[, 1]

prefixes <- out |>
  colnames() |>
  substr(1, 3) |> 
  unique()
prefixes <- paste0("^", prefixes)

for (pfx in prefixes) {
  out[grepl(pfx, rownames(out)),
      grepl(pfx, colnames(out))] <- 0
}

Result:

#> out
          DEU_10T12 DEU_13T15 DEU_16 DEU_17 ITA_10T12 ITA_13T15 ITA_16 ITA_17
DEU_10T12       0.0       0.0    0.0      0       0.5      25.0      2    1.0
DEU_13T15       0.0       0.0    0.0      0       2.0       0.0      3    9.0
DEU_16          0.0       0.0    0.0      0       3.0       4.5     40    0.5
DEU_17          0.0       0.0    0.0      0       4.0        NA     20    2.0
ITA_10T12      10.0       7.5    0.5      0       0.0       0.0      0    0.0
ITA_13T15       0.0       5.0   15.0      0       0.0       0.0      0    0.0
ITA_16           NA       3.0    3.0      0       0.0       0.0      0    0.0
ITA_17          1.5       0.0    0.5      7       0.0       0.0      0    0.0
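The question also asks that the result be numeric and that sums tolerate the remaining NA values. As a rough sketch of how that could be checked, assuming a small matrix m shaped like out (the values and names below are made up for illustration):

```r
# Illustrative matrix in the same shape as `out`: same-prefix blocks are 0,
# and a few NA values remain in the off-diagonal blocks.
m <- matrix(
  c( 0,  0,  3, NA,
     0,  0, NA,  7,
    40,  2,  0,  0,
    NA,  5,  0,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("DEU_16", "DEU_17", "ITA_16", "ITA_17"),
                  c("DEU_16", "DEU_17", "ITA_16", "ITA_17"))
)

# Confirm the storage mode is numeric rather than character.
is.numeric(m)

# na.rm = TRUE drops NA cells from each sum instead of propagating NA.
row_totals <- rowSums(m, na.rm = TRUE)
col_totals <- colSums(m, na.rm = TRUE)
grand_total <- sum(m, na.rm = TRUE)

row_totals
col_totals
grand_total
```

Without na.rm = TRUE, any row or column containing an NA would sum to NA.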
