Extracting/classifying quantitative information from a character variable


I have a data.frame of the following type:
id text_information
1 Increase from 10% to 60%
2 Purchase 100%
3 Increase from 5% to 45%
4 Purchase 99%

I want to process the text_information (character) variable so that I get the following output:

id share share_difference type
1 0.6 0.5 increase
2 1 NA purchase
3 0.45 0.4 increase
4 0.99 NA purchase

Any suggestions on how to do this in R?

r substring character
2 Answers
1 vote

Using regular expressions:

library(dplyr)
library(stringr)

data.frame(
  id = 1:4,
  text_information = c(
    "Increase from 10% to 60%", 
    "Purchase 100%", 
    "Increase from 5% to 45%", 
    "Purchase 99%"
  )
) %>% 
  mutate(
    share_1 = as.numeric(str_extract(text_information, "(\\d+)%", 1)),
    share_2 = as.numeric(str_extract(text_information, "(\\d+)% to (\\d+)%$", 2)),
    share = if_else(is.na(share_2), share_1, share_2) / 100,
    share_difference = (share_2 - share_1) / 100,
    type = tolower(str_extract(text_information, "(Increase|Purchase)"))
  ) %>% 
  select(id, share, share_difference, type)
#>   id share share_difference     type
#> 1  1  0.60              0.5 increase
#> 2  2  1.00               NA purchase
#> 3  3  0.45              0.4 increase
#> 4  4  0.99               NA purchase

Created on 2024-04-08 with reprex v2.1.0
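For readers without a recent stringr (the `group` argument of `str_extract()` was, as far as I know, added in stringr 1.5.0), the same regex idea can be sketched in base R. This is only a sketch: `parse_share()` is a hypothetical helper name, and it assumes that the last percentage in each string is the share, that the first one is the starting value, and that the type is the first word.

```r
# Hypothetical helper, not part of the answer above.
parse_share <- function(text_information) {
  # Collect every run of digits that is directly followed by "%".
  matches <- regmatches(
    text_information,
    gregexpr("[0-9]+(?=%)", text_information, perl = TRUE)
  )
  n <- lengths(matches)

  # First and last percentage per string; NA when nothing matched.
  first <- vapply(matches, function(m) {
    if (length(m) > 0) as.numeric(m[1]) else NA_real_
  }, numeric(1))
  last <- vapply(matches, function(m) {
    if (length(m) > 0) as.numeric(m[length(m)]) else NA_real_
  }, numeric(1))

  data.frame(
    share = last / 100,
    # A difference only makes sense when two percentages are present.
    share_difference = ifelse(n >= 2, (last - first) / 100, NA_real_),
    # Assumption: the first word of the text names the type.
    type = tolower(sub(" .*", "", text_information)),
    stringsAsFactors = FALSE
  )
}

parse_share(c("Increase from 10% to 60%", "Purchase 100%"))
```

The result can be attached to the original data with `cbind(df["id"], parse_share(df$text_information))`, where `df` is the data frame from the question.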


0 votes

Using dplyr and tidyr:

library(dplyr)
library(tidyr)

# Create the data frame
id <- c(1, 2, 3, 4)
text_information <- c("Increase from 10% to 60%", "Purchase 100%",
                      "Increase from 5% to 45%", "Purchase 99%")
df <- data.frame(id, text_information)

# Keep only digits and spaces, then split what is left into columns
df %>% 
  mutate(text_num = trimws(gsub("[^0-9 ]", "", text_information))) %>%
  separate(text_num, into = c("v1", "v2", "v3"), sep = " ", fill = "left") %>% 
  # Convert to numeric before pmax(); compared as strings, "5" would rank above "45"
  mutate(share = pmax(as.numeric(v1), as.numeric(v3), na.rm = TRUE) / 100,
         share_difference = (as.numeric(v3) - as.numeric(v1)) / 100,
         type = if_else(is.na(share_difference), "purchase", "increase")) %>% 
  select(text_information, share, share_difference, type)
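One detail worth checking in this approach: `separate()` returns character columns, and `pmax()` on character vectors compares them as strings rather than as numbers. A small sketch of the difference, using hand-written vectors that stand in for the `v1` and `v3` columns (the values are assumed from the question's data):

```r
# Stand-ins for the v1 and v3 columns produced by separate()
v1 <- c("10", NA, "5", NA)
v3 <- c("60", "100", "45", "99")

# String comparison: "5" sorts after "45", so row 3 picks the wrong value
as_text <- pmax(v1, v3, na.rm = TRUE)

# Numeric comparison: convert first, then take the element-wise maximum
as_number <- pmax(as.numeric(v1), as.numeric(v3), na.rm = TRUE)

print(as_text)
print(as_number)
```

Since `fill = "left"` always puts the final percentage in `v3`, using `as.numeric(v3) / 100` directly would be an alternative that avoids the comparison altogether.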