I have the following toy data frame:
df <- data.frame(
  product = c("apple", "banana", "cherry", "durian", "eggplant", "fuyu"),
  ingredients = c("flour|fibre|500", "sugar|500", "505|wheat|flavouring", "fibre(500)|eggs", "wholegrainrice|sesameoil", "500|fibre|500"),
  stringsAsFactors = FALSE
)
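My goal is to detect whether fibre appears in a product's ingredients, count how many times it appears, and extract the values used to record fibre in the ingredients.
For the purposes of this analysis, fibre in a product's ingredients can be represented as "fibre", "500", or "fibre(500)".
My current code is: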
library(tidyverse)
fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")
df2 <- df %>%
  mutate(
    fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
  )
This gives the following output for df2:
| product | ingredients | fibre_present | fibre_count | fibre_used |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple | flour\|fibre\|500 | TRUE | 2 | fibre, 500 |
| banana | sugar\|500 | TRUE | 1 | 500 |
| cherry | 505\|wheat\|flavouring | FALSE | 0 | |
| durian | fibre(500)\|eggs | TRUE | 2 | fibre, 500 |
| eggplant | wholegrainrice\|sesameoil | FALSE | 0 | |
| fuyu | 500\|fibre\|500 | TRUE | 3 | 500, fibre, 500 |
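The problem I have is with the "durian" product. I want "fibre(500)" to be counted as a single fibre value/instance, because it is defined in fibre_strings_to_check. But because it also matches the other fibre entries in fibre_strings_to_check ("fibre" and "500"), it is counted as two fibre values/instances.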
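The double count can be reproduced on the single durian string. The following is a minimal sketch that assumes only the stringr package (loaded as part of tidyverse) and uses the illustrative variable names `pattern_short_first` and `matches`, which are not part of the original script:

```r
library(stringr)

# Same alternatives, in the same order, as the original fibre_strings_to_check
pattern_short_first <- "fibre|500|fibre\\(500\\)"

# At each position the alternatives are tried left to right, so "fibre"
# matches before "fibre\\(500\\)" is ever attempted; the scan then resumes
# after "fibre" and finds "500" as a second, separate match.
matches <- str_extract_all("fibre(500)|eggs", pattern_short_first)[[1]]
matches
# [1] "fibre" "500"
```

Because the third alternative is never reached for this string, it has no effect on the result.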
My expected output for df2 is:
| product | ingredients | fibre_present | fibre_count | fibre_used |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple | flour\|fibre\|500 | TRUE | 2 | fibre, 500 |
| banana | sugar\|500 | TRUE | 1 | 500 |
| cherry | 505\|wheat\|flavouring | FALSE | 0 | |
| durian | fibre(500)\|eggs | TRUE | 1 | fibre(500) |
| eggplant | wholegrainrice\|sesameoil | FALSE | 0 | |
| fuyu | 500\|fibre\|500 | TRUE | 3 | 500, fibre, 500 |
How can I adjust the script so that a single value is not double counted?
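A quick fix is to reorder the vector fibre_strings_to_check so that "fibre\\(500\\)" comes before the other values. Regex alternation tries the alternatives from left to right and takes the first one that matches, so the longer, more specific pattern has to be listed first.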
# Most specific pattern first, so it wins the alternation over "fibre" and "500"
fibre_strings_to_check <- c("fibre\\(500\\)", "fibre", "500")
df2 <- df %>%
  mutate(
    fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
  )
df2
#   product              ingredients fibre_present fibre_count      fibre_used
#1    apple          flour|fibre|500          TRUE           2      fibre, 500
#2   banana                sugar|500          TRUE           1             500
#3   cherry     505|wheat|flavouring         FALSE           0
#4   durian          fibre(500)|eggs          TRUE           1      fibre(500)
#5 eggplant wholegrainrice|sesameoil         FALSE           0
#6     fuyu            500|fibre|500          TRUE           3 500, fibre, 500
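If the vector is likely to grow, reordering it by hand gets error-prone. One possible way to automate the reordering is to sort the patterns by length, longest first, before collapsing them. This is only a sketch: it assumes that a longer pattern string is also the more specific one, which holds for these three values but is a heuristic rather than a general rule (escape characters add to the length). The names `longest_first`, `fibre_pattern`, and `counts` are illustrative.

```r
library(stringr)

fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")

# Sort by pattern length, longest first, so "fibre\\(500\\)" is tried
# before the shorter "fibre" and "500" alternatives.
longest_first <- fibre_strings_to_check[
  order(nchar(fibre_strings_to_check), decreasing = TRUE)
]

# Build the combined pattern once and reuse it
fibre_pattern <- paste(longest_first, collapse = "|")

ingredients <- c("flour|fibre|500", "fibre(500)|eggs", "500|fibre|500")
counts <- str_count(ingredients, fibre_pattern)
counts
# [1] 2 1 3
```

Building the pattern once also avoids repeating the paste() call inside each of str_detect(), str_count(), and str_extract_all().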