How to match and replace a subset of column names

Question

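I am trying to replace some (but not all) of the column names in a data frame with more descriptive labels. I have a vector of long names that need to be matched against the current column names and substituted for them.

In more detail:

I have a data frame with text and numeric columns. For example: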

df<-data.frame(text1=c("nnnn","uuuu","ooo"),
               text2=c("b","t","eee"),
               a1=c(1,2,3),
               a2=c(45,43,23),
               b1=c(43,6,2),
               text3=c("gg","ll","jj"))

So it looks like this:

df
  text1 text2 a1 a2 b1 text3
1  nnnn     b  1 45 43    gg
2  uuuu     t  2 43  6    ll
3   ooo   eee  3 23  2    jj

For some of the column labels, I also have a vector of longer labels:

longnames=c("a1 age","a2 gender","b1 postcode")

Where a matching long name exists, I want it to completely replace the corresponding short name in df. So my desired output is:

  text1 text2 a1 age a2 gender b1 postcode text3
1  nnnn     b      1        45          43    gg
2  uuuu     t      2        43           6    ll
3   ooo   eee      3        23           2    jj

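Every short label that needs replacing matches the start of its long label. In other words, the short label "a2" should be replaced by the long label "a2 gender", which is the only long label that starts with "a2".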

Answer 1

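dplyr::rename can rename a subset of columns in one go, but it needs a named vector in which the names are the new column names and the values are the old ones.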

library("tidyverse")

df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)

longnames <- c("a1 age", "a2 gender", "b1 postcode")
shortnames <- str_extract(longnames, "^(\\w+)")

# named vector specifying how to rename
names(shortnames) <- longnames
shortnames
#>      a1 age   a2 gender b1 postcode 
#>        "a1"        "a2"        "b1"

df %>%
  rename(!!shortnames)
#>   text1 text2 a1 age a2 gender b1 postcode text3
#> 1  nnnn     b      1        45          43    gg
#> 2  uuuu     t      2        43           6    ll
#> 3   ooo   eee      3        23           2    jj

# In this case `!!shortnames` achieves this:

df %>%
  rename("a1 age" = "a1",
         "a2 gender" = "a2",
         "b1 postcode" = "b1")
#>   text1 text2 a1 age a2 gender b1 postcode text3
#> 1  nnnn     b      1        45          43    gg
#> 2  uuuu     t      2        43           6    ll
#> 3   ooo   eee      3        23           2    jj

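Created on 2019-03-28 by the reprex package (v0.2.1)

Specifying the new names programmatically is useful because it makes changing the column-name specification easier and cleaner. For readability, though, you may prefer to start with the explicit specification; it just takes more typing.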
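The same lookup idea can also be sketched in base R, without dplyr. This is only an illustration, and it assumes (as in the question) that each short name is everything before the first space of its long name; the names df2, idx and hit are made up for the example.

```r
df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)
longnames <- c("a1 age", "a2 gender", "b1 postcode")

# Assumption: the short name is the part of the long name before the first space
shortnames <- sub(" .*$", "", longnames)

# Position of each column name in shortnames (NA where there is no long name)
idx <- match(names(df), shortnames)
hit <- !is.na(idx)

# Replace only the matched names; unmatched columns keep their names
df2 <- df
names(df2)[hit] <- longnames[idx[hit]]
names(df2)
```

Because match() compares whole strings, a column such as text1 is never confused with a long name that merely contains similar characters.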

Answer 2

m1 = sapply(names(df), function(snm) sapply(longnames, function(lnm) grepl(snm, lnm)))
df1 = setNames(df, replace(names(df), colSums(m1) == 1, longnames[rowSums(m1) == 1]))
df1
#  text1 text2 a1 age a2 gender b1 postcode text3
#1  nnnn     b      1        45          43    gg
#2  uuuu     t      2        43           6    ll
#3   ooo   eee      3        23           2    jj

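m1 is a matrix showing which column names of df match which entries of longnames. colSums(m1) == 1 identifies the column names that have a match. rowSums(m1) == 1 identifies the long names they match.

Or use partial matching: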
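To make the shape of m1 concrete, here is a small sketch that rebuilds it on the question's data and inspects it. The anchored variant m2 is an addition for illustration, not part of the answer: grepl treats the short name as a regular expression and matches it anywhere in the long name, whereas startsWith only accepts a literal match at the beginning.

```r
df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)
longnames <- c("a1 age", "a2 gender", "b1 postcode")

# One row per long name, one column per column name of df
m1 <- sapply(names(df), function(snm) sapply(longnames, function(lnm) grepl(snm, lnm)))
dim(m1)      # 3 rows (longnames) by 6 columns (names of df)
colSums(m1)  # 1 for a1, a2 and b1; 0 for the text columns
rowSums(m1)  # 1 for every long name

# Illustrative variant: only accept a match at the start of the long name
m2 <- sapply(names(df), function(snm) startsWith(longnames, snm))
all(m1 == m2)  # the two agree on this data
```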

inds = pmatch(colnames(df), longnames)
df1 = setNames(df, replace(longnames[inds], is.na(inds), colnames(df)[is.na(inds)]))
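A short sketch of what pmatch returns here may help. It reuses the question's data; new_names is a name made up for the example, and the ifelse form is just an alternative spelling of the replace call above. One caveat worth knowing: pmatch gives NA both when nothing matches and when the partial match is ambiguous.

```r
df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)
longnames <- c("a1 age", "a2 gender", "b1 postcode")

# For each column name, the index of the long name it partially matches
inds <- pmatch(names(df), longnames)
inds  # NA NA 1 2 3 NA

# Keep the old name wherever there was no match
new_names <- ifelse(is.na(inds), names(df), longnames[inds])
df1 <- setNames(df, new_names)
names(df1)

# An ambiguous prefix does not match anything
pmatch("a", longnames)  # NA
```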

Answer 3

You can use adist, which is already vectorized:

a = which(!attr(adist(names(df), longnames, counts = TRUE), 'counts')[, , 'sub'], arr.ind = TRUE)

names(df)[a[,'row']] = longnames    #longnames[a[,'col']]

df
  text1 text2 a1 age a2 gender b1 postcode text3
1  nnnn     b      1        45          43    gg
2  uuuu     t      2        43           6    ll
3   ooo   eee      3        23           2    jj

Answer 4

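One way to achieve this is with sapply. The same can be done with a for loop using almost identical code. seq.int(colnames(df)) produces the sequence 1:ncol(df). grep then looks for the corresponding column name in longnames and returns the index of the entry it matches. The if condition checks whether that index vector has length > 0 (which it should if the column has a match), and if so the name is replaced.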

## sapply (can be replaced with lapply)
sapply(seq.int(colnames(df)), function(x) {
  index <- grep(colnames(df)[x], longnames)
  if (length(index) > 0) colnames(df)[x] <<- longnames[index]
})

or

## for loop (note the difference in <<-)
for (x in seq.int(colnames(df))) {
  index <- grep(colnames(df)[x], longnames)
  if (length(index) > 0) colnames(df)[x] <- longnames[index]
}
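If modifying df from inside the function with <<- feels uncomfortable, the same logic can be wrapped in a function that returns a renamed copy. This is a sketch rather than part of the answer: rename_by_prefix is a hypothetical helper, and it assumes each long name is the short name followed by a space, as in the question.

```r
# Hypothetical helper: returns a copy of `data` with matched names replaced
rename_by_prefix <- function(data, long) {
  new_names <- vapply(names(data), function(short) {
    # Assumption: a long name is the short name, a space, then a description
    hit <- long[startsWith(long, paste0(short, " "))]
    # Replace only when exactly one long name matches; otherwise keep the name
    if (length(hit) == 1) hit else short
  }, character(1), USE.NAMES = FALSE)
  stats::setNames(data, new_names)
}

df <- data.frame(
  text1 = c("nnnn", "uuuu", "ooo"),
  text2 = c("b", "t", "eee"),
  a1 = c(1, 2, 3),
  a2 = c(45, 43, 23),
  b1 = c(43, 6, 2),
  text3 = c("gg", "ll", "jj")
)
longnames <- c("a1 age", "a2 gender", "b1 postcode")

renamed <- rename_by_prefix(df, longnames)
names(renamed)
```

Requiring exactly one hit means an ambiguous or missing match leaves the column name unchanged instead of raising an error on assignment.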
