I have chemical sum formulas, for example C6H12ON2PS.
I would like them laid out like this:
Sum formula | C | H | O | N | P | S |
---|---|---|---|---|---|---|
C6H12ON2PS | 6 | 12 | 1 | 2 | 1 | 1 |
C6H12NP | 6 | 12 | 0 | 1 | 1 | 0 |
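My main difficulties are an element that is absent from a formula (its column should be 0) and an element that has no number after it (its column should be 1).
I'm not very good at R, as I have only just started. I'm using someone else's script that expects this format, but I only have the formulas as text.
I tried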
str_split(strsplit(as.character(Form), '(?<=.)(?=[A-Z])', perl=TRUE))
but this does not work when a letter is missing.
This may be overkill and not very efficient:
library(tidyverse)
# define sample data
x <- c("C6H12ON2PS", "C6H12NP")
# get list of elements in periodic table
elements <- PeriodicTable::symb(1:116)
# define regex to spot elements without quantity (implicitly "1")
name_regex <- elements |>
  str_flatten(collapse = "|") %>%
  str_c("(", ., ")(?!\\d)")
# add implicit quantity "1"
x <- x |>
  str_replace_all(name_regex, "\\11")
# define regex that captures both element name and quantity
regex <- str_c(elements, "(?<", elements, ">\\d*)") |>
  str_flatten(collapse = "|")
# define helper function to collapse rows (one for each match)
collapse_rows <- function(x) {
  if (all(is.na(x))) return(0)
  x |> discard(is.na) |> as.numeric()
}
# define helper function to convert search results to tibble
match_to_tibble <- function(m) {
  # drop first column (complete match)
  m <- m[, -1]
  # convert to tibble and collapse rows (one for each capturing group)
  m |>
    as_tibble() |>
    summarize(across(everything(), collapse_rows))
}
# extract quantities
x |>
  str_match_all(regex) |>
  map(match_to_tibble) |>
  bind_rows()
#> # A tibble: 2 × 116
#> H He Li Be B C N O F Ne Na Mg Al
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12 0 0 0 0 6 2 1 0 0 0 0 0
#> 2 12 0 0 0 0 6 1 0 0 0 0 0 0
#> # ℹ 103 more variables: Si <dbl>, P <dbl>, S <dbl>, Cl <dbl>, Ar <dbl>,
#> # K <dbl>, Ca <dbl>, Sc <dbl>, Ti <dbl>, V <dbl>, Cr <dbl>, Mn <dbl>,
#> # Fe <dbl>, Co <dbl>, Ni <dbl>, Cu <dbl>, Zn <dbl>, Ga <dbl>, Ge <dbl>,
#> # As <dbl>, Se <dbl>, Br <dbl>, Kr <dbl>, Rb <dbl>, Sr <dbl>, Y <dbl>,
#> # Zr <dbl>, Nb <dbl>, Mo <dbl>, Tc <dbl>, Ru <dbl>, Rh <dbl>, Pd <dbl>,
#> # Ag <dbl>, Cd <dbl>, In <dbl>, Sn <dbl>, Sb <dbl>, Te <dbl>, I <dbl>,
#> # Xe <dbl>, Cs <dbl>, Ba <dbl>, La <dbl>, Ce <dbl>, Pr <dbl>, Nd <dbl>, …
Created on 2023-11-02 with reprex v2.0.2
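
If only the six columns from the question are needed, a lighter base R variant might be enough. The sketch below is not part of the approach above. `count_elements` and its `elements` argument are hypothetical names introduced here for illustration. It assumes every symbol is one uppercase letter optionally followed by one lowercase letter, and that the formulas contain no parentheses, charges, or hydrate notation.

```r
# Hypothetical helper: count atoms per element in plain sum formulas.
# Assumes symbols look like "C" or "Cl" and that there are no parentheses.
count_elements <- function(formulas,
                           elements = c("C", "H", "O", "N", "P", "S")) {
  # Split each formula into tokens such as "C6", "H12", "O"
  tokens <- regmatches(
    formulas,
    gregexpr("[A-Z][a-z]?[0-9]*", formulas, perl = TRUE)
  )

  counts <- vapply(tokens, function(tok) {
    symbol <- sub("[0-9]+$", "", tok)   # element symbol without the digits
    n <- sub("^[A-Za-z]+", "", tok)     # digits only, "" if none were given
    n[n == ""] <- "1"                   # a missing number means one atom
    n <- as.numeric(n)
    # Sum per requested element; an absent element sums to 0
    vapply(elements, function(e) sum(n[symbol == e]), numeric(1))
  }, numeric(length(elements)))

  counts <- t(counts)                   # one row per formula
  colnames(counts) <- elements
  data.frame(formula = formulas, counts,
             check.names = FALSE, row.names = NULL)
}

count_elements(c("C6H12ON2PS", "C6H12NP"))
```

For the two sample formulas this should give the same rows as the table in the question. Elements outside `elements` are silently ignored, and a symbol that appears more than once in a formula is summed.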