Is there a way to recode a vector of strings into a new vector with two values, based on which of two keywords or phrases appears in each value?

Question

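As the title says, I want to convert a vector of strings into a new vector containing only one of two values, depending on which of them appears in each string. Here is a very simple example of the kind of data frame I have: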

data <- tibble::tibble(
  w = c("Strongly disagree", "Somewhat disagree", "Disagree", "Somewhat agree", "Strongly agree", "Agree"),
  x = c("Definitely true", "Probably true", "Somewhat false", "Definitely false", "Definitely true", "Definitely false"),
  y = c("Definitely not doing enough", "Definitely doing enough", "Possibly not doing enough", "Possibly doing enough", "Definitely not doing enough", "Somewhat doing enough"),
  z = c("Very comfortable", "Comfortable", "Somewhat comfortable", "Very uncomfortable", "Somewhat uncomfortable", "Comfortable")
)

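We can see that every string in w contains either "agree" or "disagree", x has "true" or "false", y has "doing enough" or "not doing enough", and z has either "comfortable" or "uncomfortable". Is there a function that would let me create a new vector based on which of the two values is present in each column? Let me explain what I mean.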

# write up a function
some_function <- function(arguments) {
  "function text goes here"
}

# use new function to create a vector based on `w` from `data`
data %>% some_function(w)

# resulting vector would be:
# [1] "Disagree" "Disagree" "Disagree" "Agree" "Agree" "Agree"

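The closest I have got is the function below. However, it works by dropping the first word of each string. That is fine when the first word is an adjective qualifying the rest of the string, but when the string is a single word it gives me an NA.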

# write function
make_dicho <- function(df = data, var) {
  
  df %>% 
    # pick out the column (equivalent to df[[var]])
    dplyr::pull({{ var }}) %>% 
    # convert to a factor
    haven::as_factor() %>% 
    # remove the first part of the factor
    stringr::str_extract("(?<=\\s).+") %>%
    # make the first letter uppercase
    stringr::str_to_sentence()
  
}
# test this on the fake data
data %>% make_dicho(., w)
[1] "Disagree" "Disagree" NA         "Agree"    "Agree"    NA

The reason I have the df argument there is that I want to use this function inside dplyr::mutate(), like this:

data %>% mutate(new_a = make_dicho(., w))

Answer 1

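From your description, you are happy to drop the first word, unless the string consists of a single word. If a string contains no whitespace, we can assume it is only one word and leave it as it is.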

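remove_first_word  <- function(x) {
    ifelse(
        # Only strings that contain whitespace have a first word to drop
        grepl("\\s", x),
        # Drop the leading word and the whitespace that follows it
        sub("^\\S+\\s+", "", x),
        x
    )  |>
    # Make first letter upper case (the `_` placeholder needs R >= 4.2)
    gsub("^([a-z])", "\\U\\1", x = _, perl = TRUE)
}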

You can then use it inside mutate() as needed:

library(dplyr)  # provides mutate() and across()

data  |>
    mutate(
        across(w:z, remove_first_word)
    )
# # A tibble: 6 × 4
#   w        x     y                z            
#   <chr>    <chr> <chr>            <chr>        
# 1 Disagree True  Not doing enough Comfortable  
# 2 Disagree True  Doing enough     Comfortable  
# 3 Disagree False Not doing enough Comfortable  
# 4 Agree    False Doing enough     Uncomfortable
# 5 Agree    True  Not doing enough Uncomfortable
# 6 Agree    False Doing enough     Comfortable
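
Because the function takes a vector and returns a vector of the same length, it does not need a data frame argument to create a single new column, which is what the question's mutate(new_a = ...) call was aiming for. The following is a minimal base R sketch of that idea, not the answer's own code: drop_first_word, survey and new_w are hypothetical names chosen for illustration, and the helper is restated here only so the block runs on its own.

```r
# Hypothetical stand-in for the first-word removal, restated in base R
drop_first_word <- function(x) {
  # Remove the leading word plus the whitespace after it;
  # single-word strings have no match and are left unchanged
  out <- sub("^\\S+\\s+", "", x)
  # Capitalise the first letter of what remains
  paste0(toupper(substring(out, 1, 1)), substring(out, 2))
}

# Small illustrative data frame (not the question's full data)
survey <- data.frame(
  w = c("Strongly disagree", "Disagree", "Somewhat agree", "Agree")
)

# Add one new column and keep the original next to it
survey$new_w <- drop_first_word(survey$w)
survey$new_w
```

With dplyr loaded, the equivalent call would be mutate(new_w = drop_first_word(w)), since mutate() passes the column to the function as a plain vector.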

tidyverse version

In response to your comment, here is a stringr version of the original function:

remove_first_word_tidy  <- function(x) {
    dplyr::if_else(
        stringr::str_detect(x, "\\s"),
        stringr::str_replace(x, "\\w+\\s", ""),
        x
    )  |>
    stringr::str_to_title()
}

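You can also create a function that takes a data frame and a selection of columns and applies the function to them. Since you want to use the tidyverse, we can use tidy selection and purrr::map() to apply it to all the desired columns and produce a list of vectors: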

make_dicho  <- function(dat, cols) {

    out  <- dat  |>
        # Tidy selection: accepts a bare column name or a range such as y:z
        dplyr::select({{ cols }})  |>
        purrr::map(remove_first_word_tidy)

    # Return vector if only one column supplied
    if (length(out) == 1) return(out[[1]])
    # Otherwise return list of vectors
    out
}


make_dicho(data, w) 
# [1] "Disagree" "Disagree" "Disagree" "Agree"    "Agree"    "Agree"   

make_dicho(data, y:z)
# $y
# [1] "Not Doing Enough" "Doing Enough"     "Not Doing Enough" "Doing Enough"     "Not Doing Enough" "Doing Enough"    

# $z
# [1] "Comfortable"   "Comfortable"   "Comfortable"   "Uncomfortable" "Uncomfortable" "Comfortable"
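
If the leading word is not always a single qualifier, another option is to recode by checking which keyword each string contains, as the title of the question suggests. The sketch below is an assumption-laden illustration rather than part of the answer above: recode_by_keyword and its negative and positive arguments are hypothetical names, and the keywords are treated as regular expressions matched case-insensitively. The negative keyword is tested first because it usually contains the positive one ("disagree" contains "agree").

```r
# Hypothetical helper: return one of two labels per string,
# or NA when neither keyword is found
recode_by_keyword <- function(x, negative, positive) {
  # Case-insensitive search for each keyword (treated as a regex)
  has_negative <- grepl(negative, x, ignore.case = TRUE)
  has_positive <- grepl(positive, x, ignore.case = TRUE)
  # Check the negative keyword first, since "disagree" also matches "agree"
  ifelse(has_negative, negative,
         ifelse(has_positive, positive, NA_character_))
}

w <- c("Strongly disagree", "Somewhat disagree", "Disagree",
       "Somewhat agree", "Strongly agree", "Agree")
w_recoded <- recode_by_keyword(w, negative = "Disagree", positive = "Agree")
w_recoded

y <- c("Definitely not doing enough", "Possibly doing enough")
y_recoded <- recode_by_keyword(y, negative = "Not doing enough",
                               positive = "Doing enough")
y_recoded
```

The trade-off is that the two keywords must be supplied for each column, whereas the first-word approach needs no per-column configuration.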
