How do I use str_extract_all to correctly extract my expected values?


I have the following toy data frame:

df <- data.frame(
  product = c("apple", "banana", "cherry", "durian", "eggplant", "fuyu"),
  ingredients = c("flour|fibre|500", "sugar|500", "505|wheat|flavouring", "fibre(500)|eggs", "wholegrainrice|sesameoil", "500|fibre|500"),
  stringsAsFactors = FALSE
)

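My goal is to detect whether fibre appears in a product's ingredients, count how many times it appears, and extract the values used to record fibre in the product's ingredients.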

For the purposes of this analysis, fibre in a product's ingredients can be written as "fibre", "500", or "fibre(500)".

My current code is:

library(tidyverse)

fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")

df2 <- df %>%
  mutate(
    fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
  )

This gives the following output for `df2`:

| product  | ingredients                 | fibre_present | fibre_count | fibre_used        |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple    | flour\|fibre\|500           | TRUE          | 2           | fibre, 500        |
| banana   | sugar\|500                  | TRUE          | 1           | 500               |
| cherry   | 505\|wheat\|flavouring      | FALSE         | 0           |                   |
| durian   | fibre(500)\|eggs            | TRUE          | 2           | fibre, 500        |
| eggplant | wholegrainrice\|sesameoil   | FALSE         | 0           |                   |
| fuyu     | 500\|fibre\|500             | TRUE          | 3           | 500, fibre, 500   |

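The problem is with the "durian" product. I want "fibre(500)" to be counted as a single fibre value/instance, since it is defined in `fibre_strings_to_check`. Instead, because it also appears to match the other fibre entries in `fibre_strings_to_check`, it is counted as two fibre values/instances.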
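
The double count can be reproduced on a single string. This is a minimal sketch that assumes only that stringr is installed; the names `pattern` and `x` are introduced here for illustration and are not part of the original script.

```r
library(stringr)

# Same alternation as in the script above, built from the same three entries
pattern <- paste(c("fibre", "500", "fibre\\(500\\)"), collapse = "|")
x <- "fibre(500)|eggs"

# "fibre" and "500" are matched separately, so the count is 2
str_count(x, pattern)
# [1] 2

str_extract_all(x, pattern)[[1]]
# [1] "fibre" "500"
```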

My expected output for `df2` is:

| product  | ingredients                 | fibre_present | fibre_count | fibre_used        |
|----------|-----------------------------|---------------|-------------|-------------------|
| apple    | flour\|fibre\|500           | TRUE          | 2           | fibre, 500        |
| banana   | sugar\|500                  | TRUE          | 1           | 500               |
| cherry   | 505\|wheat\|flavouring      | FALSE         | 0           |                   |
| durian   | fibre(500)\|eggs            | TRUE          | 1           | fibre(500)        |
| eggplant | wholegrainrice\|sesameoil   | FALSE         | 0           |                   |
| fuyu     | 500\|fibre\|500             | TRUE          | 3           | 500, fibre, 500   |

How can I adjust the script so that a single value is not double-counted?

r stringr
1 Answer

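A quick fix is to reorder the vector `fibre_strings_to_check` so that `"fibre\\(500\\)"` comes before the other values. Regex alternation tries the alternatives from left to right at each position and takes the first one that matches, so with "fibre" listed first the longer "fibre\\(500\\)" alternative is never attempted.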

fibre_strings_to_check <- c("fibre\\(500\\)", "fibre", "500")

df2 <- df %>%
  mutate(
    fibre_present = str_detect(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_count = str_count(ingredients, paste(fibre_strings_to_check, collapse = "|")),
    fibre_used = str_extract_all(ingredients, paste(fibre_strings_to_check, collapse = "|"))
  )

df2
#   product              ingredients fibre_present fibre_count      fibre_used
#1    apple          flour|fibre|500          TRUE           2      fibre, 500
#2   banana                sugar|500          TRUE           1             500
#3   cherry     505|wheat|flavouring         FALSE           0                
#4   durian          fibre(500)|eggs          TRUE           1      fibre(500)
#5 eggplant wholegrainrice|sesameoil         FALSE           0                
#6     fuyu            500|fibre|500          TRUE           3 500, fibre, 500
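
If the list of strings grows, the reordering could be done programmatically instead of by hand. The sketch below is one possible approach, not part of the original answer: it sorts the alternatives by descending length so that a longer pattern is tried before any shorter pattern it starts with. It assumes pattern length is a reasonable proxy for specificity, which holds for these three entries but is only a heuristic for arbitrary regexes. The names `ordered`, `fibre_pattern`, and the standalone `ingredients` vector are introduced here for illustration.

```r
library(stringr)

fibre_strings_to_check <- c("fibre", "500", "fibre\\(500\\)")

# Heuristic (an assumption): longer alternatives go first, so "fibre\\(500\\)"
# is attempted before its prefix "fibre"
ordered <- fibre_strings_to_check[order(nchar(fibre_strings_to_check), decreasing = TRUE)]
fibre_pattern <- paste(ordered, collapse = "|")

ingredients <- c("flour|fibre|500", "fibre(500)|eggs", "500|fibre|500")

str_count(ingredients, fibre_pattern)
# [1] 2 1 3

str_extract_all("fibre(500)|eggs", fibre_pattern)[[1]]
# [1] "fibre(500)"
```

Building the pattern once and reusing it in `str_detect`, `str_count`, and `str_extract_all` also avoids repeating the `paste()` call in each `mutate()` argument.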