Extracting different percentages/numbers from a paragraph/string in R

Question

I'm new to R and am struggling to extract percentages/numbers from strings in a data frame. For example,

df <- data.frame(
  Species = c("Bidens pilosa", "Orobanche ramose"),
  Impact = c("Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%; and four to eight plants, 28%...In contrast, suppression of the weed by the crop was only 10%","Cypress was estimated to have a 28% loss annually. The annual increase of the disease in some stands in the Peloponnesus, with an initial attack of 20%, ranged from 5% to 20% ")
)

My questions are as follows:

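1) In this case I only want to extract the yield loss of the different crops, which is 10 and 28 respectively, and I'd like to skip the percentages and numbers that describe other things (e.g. 9.4%, 17.3%, 5%, etc.). Can I do this in R, or does it require some natural language processing skills?
2) If it is too hard to tell the different kinds of percentages apart, how can I extract all the percentages/numbers at once so that I can pick the right ones manually? I tried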

df %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

or

parse_number(df$Impact)

but I don't think either of them works, because they give me the numbers run together.

Thanks for your help.

Answer 1

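1) There is no clear pattern for how to extract the yield loss. In the first string, "yield loss" is mentioned twice:

Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%;

So, at least to me, it is not clear why 10 is chosen rather than 9.4.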
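
If a rule of thumb is acceptable, one possibility is to take only the first percentage in each string, which for the two example strings happens to give 10 and 28. This is purely an assumption about how the text is written, not a general rule, and `first_pct` is a hypothetical helper name. A minimal base R sketch, using shortened versions of the example strings:

```r
# Shortened versions of the two Impact strings from the question
impact <- c(
  "Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%",
  "Cypress was estimated to have a 28% loss annually. An initial attack of 20%, ranged from 5% to 20%"
)

# Hypothetical helper: the first number immediately followed by "%".
# Returns NA for strings that contain no percentage.
first_pct <- function(x) {
  m <- regexpr("\\d+(\\.\\d+)?(?=%)", x, perl = TRUE)  # lookahead needs perl = TRUE
  out <- rep(NA_real_, length(x))
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

first_pct(impact)
# [1] 10 28
```

Any text in which the figure of interest is not the first percentage would break this heuristic, so the results still need a manual check.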

2) To extract all percentages/numbers, you can use:

stringr::str_extract_all(df$Impact, "\\d+\\.?\\d?")

#[[1]]
#[1] "10"   "9.4"  "17.3" "28"   "10"  

#[[2]]
#[1] "28" "20" "5"  "20"

which is equivalent to

regmatches(df$Impact, gregexpr("\\d+\\.?\\d?", df$Impact))

in base R.
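
Since the goal is to pick the right numbers by hand, it may help to turn the extracted strings into numeric values and keep track of which row each one came from. The following is a sketch under that assumption; the `long` data frame, its column names, and the shortened example strings are illustrative choices, not part of the original answer.

```r
# Shortened stand-ins for the Species and Impact columns of the question
species <- c("Bidens pilosa", "Orobanche ramose")
impact <- c(
  "Soyabean yield loss was 10%. One plant gave a yield loss of 9.4%; two plants, 17.3%",
  "Cypress had a 28% loss annually, with attack ranging from 5% to 20%"
)

# Same pattern as above; the result is a list with one character vector per string
nums <- regmatches(impact, gregexpr("\\d+\\.?\\d?", impact))

# One row per extracted number, labelled with the species it came from
long <- data.frame(
  Species = rep(species, lengths(nums)),
  Value = as.numeric(unlist(nums))
)
long
```

Each species is repeated once per number found in its string, so the table can be filtered or inspected manually.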

\\d+ matches one or more digits.

\\.? is an optional decimal point.

\\d? is an optional single digit.
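
Because the last part of the pattern allows only one digit after the decimal point, and the decimal point is optional on its own, the pattern can split numbers with several decimals and pick up a sentence-ending period. A stricter variant that requires digits after the point, shown here as a suggestion rather than as part of the original answer, behaves differently on a made-up test string:

```r
# Made-up string: a number ending a sentence and a number with three decimals
x <- "Loss was 10. Later it reached 12.345%"

# Pattern from the answer: optional point, then at most one more digit
loose <- regmatches(x, gregexpr("\\d+\\.?\\d?", x))[[1]]
loose
# [1] "10."  "12.3" "45"

# Stricter variant: a point counts only when followed by one or more digits
strict <- regmatches(x, gregexpr("\\d+(\\.\\d+)?", x))[[1]]
strict
# [1] "10"     "12.345"
```

For the two strings in the question both patterns return the same numbers, since none of them has more than one decimal digit.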
