How can I build a data.frame from truncated rows using regular expressions in R/Python?

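I have a data.frame in R with a single column in which some records are truncated and continue on the next row. I want to use regular expressions to split it into three columns. For example: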

dados <- c(
  "N 2022NE001264 75",
  "FRETES INTERNACIONAIS LTDA                                 7.500,00 C",
  "N 2022NE000286 84",
  "UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ                         2.856,71 C",
  "N 2022NE001297 48                                                   720,00 C",
  "N 2022NE001333 16",
  "CASTRO COMERCIO LTDA                       5.256,00 C",
  "N 2022NE001353 92",
  "CONSTRUCOES E INSTALACOES LTDA                           734,20 C",
  "N 2022NE000279 12",
  "UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ                            180,00 C",
  "N 2022NE000293 12",
  "EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L               1.716,00 C"
)

dados_df <- data.frame(V1 = dados)

Using regular expressions, I would like to turn my data.frame into this:

nota                org                                         value
N 2022NE001264 75   FRETES INTERNACIONAIS LTDA                  7500.00 C
N 2022NE000286 84   UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ    2856.71 C
N 2022NE001297 48                                                720.00 C
N 2022NE001333 16   CASTRO COMERCIO LTDA                        5256.00 C
N 2022NE001353 92   CONSTRUCOES E INSTALACOES LTDA               734.20 C
N 2022NE000279 12   UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ     180.00 C
N 2022NE000293 12   EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L 1716.00 C

This is the code I have written so far:

library(dplyr)
library(tidyr)
library(stringr)

dados_df <- dados_df %>%
  mutate(
    nota = str_extract(V1, "^N\\s\\d+NE\\d+\\s\\d+\\s*"),
    org = str_extract(V1, "(?<=\\d\\s{1,}).*(?=\\s{2,}\\d)"),
    value = str_extract(V1, "\\d+\\.?\\d*,\\d{2}\\sC$")
  )

Tags: python, r, stringr

1 Answer

Based on the example, I am making the following assumptions about your data:

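  • Each record is split across 2 lines.
  • Line1 contains the nota variable.
  • Line2 contains the org and value variables.
  • Sometimes Line2 is missing. In that case, Line1 also contains the org and value variables (in the sample, org is blank for that record).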
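
Since the question also mentions Python, here is one way these assumptions could be checked before relying on them. It is only a sketch: the nota pattern is copied from the R code in this thread, while `group_records`, `sample`, and the other names are hypothetical helpers introduced for illustration. The sample is a subset of the lines from the question.

```python
import re

# Pattern for a "Line1" (nota) row, taken from the regex used in this thread.
NOTA_START = re.compile(r"^N\s\d+NE\d+\s\d+\s*")


def group_records(lines):
    """Group raw lines into records; a new record starts at every nota line."""
    records = []
    for line in lines:
        if NOTA_START.match(line):
            records.append([line])        # Line1: start a new record
        elif records:
            records[-1].append(line)      # Line2: continuation of the record
        else:
            raise ValueError(f"continuation line before any nota line: {line!r}")
    return records


# A subset of the question's data (hypothetical variable name).
sample = [
    "N 2022NE001264 75",
    "FRETES INTERNACIONAIS LTDA                                 7.500,00 C",
    "N 2022NE001297 48                                                   720,00 C",
    "N 2022NE001333 16",
    "CASTRO COMERCIO LTDA                       5.256,00 C",
]

records = group_records(sample)
sizes = [len(record) for record in records]
print(sizes)  # [2, 1, 2]

# If the assumptions hold, every record has either 1 or 2 lines.
assumptions_hold = all(size in (1, 2) for size in sizes)
print(assumptions_hold)  # True
```

A record with three or more lines would mean an organisation name wrapped onto an extra row, which the approach in this answer does not cover.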

If these assumptions hold, we can:

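  • Create a flag variable that marks the lines matching the nota pattern
  • Use it to create an id variable that uniquely identifies each record
  • Use pivot_wider() to put the two strings of each record on one row
  • paste the two strings together
  • Use regular expressions to extract the required variables
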
dados_df <- dados_df %>%
  mutate(
    # flag which line we are reading, line1 or line2
    flag = ifelse(str_detect(V1, "^N\\s\\d+NE\\d+\\s\\d+\\s*"), "line1", "line2"),
    # create a record identifier
    id = cumsum(flag == "line1")
    ) %>%
  # pivot to get each record onto 1 line
  pivot_wider(id_cols = id, names_from = flag, values_from = V1) %>%
  # replace NA with "" to avoid problems with paste below
  replace_na(list(line2 = "")) %>%
  mutate(
    # join the 2 strings into 1 
    line = paste(line1, line2),
    # put the final \\s* inside a look-ahead group to remove trailing blanks from nota
    nota = str_extract(line, "^N\\s\\d+NE\\d+\\s\\d+(?=\\s*)"),
    # the look-behind group needed to be changed here because the nota and org are now
    # part of a single string
    org = str_trim(str_extract(line, "(?<=\\d\\s\\d{2}\\s).*(?=\\s{2,}\\d)")),
    # added a look-ahead group at the end to remove any spaces created by paste
    value = str_extract(line, "\\d+\\.?\\d*,\\d{2}\\sC(?=\\s?$)")
  ) %>%
  select(nota, org, value)

dados_df

# A tibble: 7 × 3
  nota              org                                         value     
  <chr>             <chr>                                       <chr>     
1 N 2022NE001264 75 FRETES INTERNACIONAIS LTDA                  7.500,00 C
2 N 2022NE000286 84 UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ    2.856,71 C
3 N 2022NE001297 48 ""                                          720,00 C  
4 N 2022NE001333 16 CASTRO COMERCIO LTDA                        5.256,00 C
5 N 2022NE001353 92 CONSTRUCOES E INSTALACOES LTDA              734,20 C  
6 N 2022NE000279 12 UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ    180,00 C  
7 N 2022NE000293 12 EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L 1.716,00 C
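
The question also asks about Python, and the desired output shows the values as plain numbers (7500.00 rather than 7.500,00). Below is a rough Python equivalent using only the standard library. It is not a line-by-line port of the R pipeline: instead of look-arounds, it finds the value at the end of each joined record and treats whatever precedes it as org. The names `parse`, `VALUE`, and `NOTA_START` are hypothetical, and converting to `float` is an assumption (`decimal.Decimal` may be a better fit for money).

```python
import re

# nota prefix, e.g. "N 2022NE001264 75" (same shape as the R pattern above)
NOTA_START = re.compile(r"^N\s\d+NE\d+\s\d+")
# value in Brazilian format at the end of the record, e.g. "7.500,00 C"
VALUE = re.compile(r"(\d[\d.]*,\d{2})\sC\s*$")


def parse(lines):
    """Return one dict per record with keys nota, org and value."""
    rows = []
    for line in lines:
        m = NOTA_START.match(line)
        if m:
            # Line1: keep the nota and whatever follows it on the same line
            rows.append({"nota": m.group(0), "rest": line[m.end():]})
        elif rows:
            # Line2: append to the record opened by the last nota line
            rows[-1]["rest"] += " " + line
    parsed = []
    for row in rows:
        rest = row["rest"]
        v = VALUE.search(rest)
        if v:
            org = rest[:v.start()].strip()
            # "7.500,00" -> "7500.00": drop thousands dots, comma becomes the decimal point
            value = float(v.group(1).replace(".", "").replace(",", "."))
        else:
            org, value = rest.strip(), None
        parsed.append({"nota": row["nota"], "org": org, "value": value})
    return parsed


dados = [
    "N 2022NE001264 75",
    "FRETES INTERNACIONAIS LTDA                                 7.500,00 C",
    "N 2022NE000286 84",
    "UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ                         2.856,71 C",
    "N 2022NE001297 48                                                   720,00 C",
    "N 2022NE001333 16",
    "CASTRO COMERCIO LTDA                       5.256,00 C",
    "N 2022NE001353 92",
    "CONSTRUCOES E INSTALACOES LTDA                           734,20 C",
    "N 2022NE000279 12",
    "UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ                            180,00 C",
    "N 2022NE000293 12",
    "EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L               1.716,00 C",
]

rows = parse(dados)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` if a data frame is needed. In this sketch the record without an organisation name (`N 2022NE001297 48`) gets an empty string for org.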