R: Selecting strings that share a common pattern


I have a list of strings, as follows:

> with(providers, head(Provider.Name, 30))
 [1] 1st Care (UK) Limited                 
 [2] 1st Care Limited                      
 [3] 229 Mitcham Lane Limited              
 [4] 24-7 Care Ltd                         
 [5] 3 Dimensions Care Limited             
 [6] 3 Trees Community Support Limited     
 [7] 365 Care Homes Limited                
 [8] 3A Care (Solihull) Limited            
 [9] 3L Care Limited                       
[10] 5 Star TLC Limited                    
[11] 92 Higher Drive Limited               
[12] A & I Care Home Ltd                   
[13] A & L Care Homes Limited              
[14] A & N Kachra                          
[15] A & R Care Limited                    
[16] A Better Carehome Ltd                 
[17] A.G.E. Nursing Homes Limited          
[18] A.R.M. Healthcare Limited             
[19] AAA Elderly Care Limited              
[20] AAA Medics Ltd                        
[21] Aadams Residential Care Home Limited  
[22] Abacus Quality Care Ltd               
[23] Abberdale Limited                     
[24] Abbeville RCH Limited                 
[25] Abbey Care Centre Limited             
[26] Abbey Care Direct Ltd                 
[27] Abbey Care Home Limited               
[28] Abbey Healthcare (Aaron Court) Limited
[29] Abbey Healthcare (Kendal) Limited     
[30] Abbey Healthcare (Knebworth) Ltd  

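My goal is to identify the observations that follow a similar pattern and then rename them according to that pattern. Ideally, the output would look like the following (note in particular observations 1, 2, and 25 to 30):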

> with(providers, head(Provider.Name, 30))
     [1] 1st Care Limited                 
     [2] 1st Care Limited                      
     [3] 229 Mitcham Lane Limited              
     [4] 24-7 Care Ltd                         
     [5] 3 Dimensions Care Limited             
     [6] 3 Trees Community Support Limited     
     [7] 365 Care Homes Limited                
     [8] 3A Care (Solihull) Limited            
     [9] 3L Care Limited                       
    [10] 5 Star TLC Limited                    
    [11] 92 Higher Drive Limited               
    [12] A & I Care Home Ltd                   
    [13] A & L Care Homes Limited              
    [14] A & N Kachra                          
    [15] A & R Care Limited                    
    [16] A Better Carehome Ltd                 
    [17] A.G.E. Nursing Homes Limited          
    [18] A.R.M. Healthcare Limited             
    [19] AAA Elderly Care Limited              
    [20] AAA Medics Ltd                        
    [21] Aadams Residential Care Home Limited  
    [22] Abacus Quality Care Ltd               
    [23] Abberdale Limited                     
    [24] Abbeville RCH Limited                 
    [25] Abbey Care             
    [26] Abbey Care                 
    [27] Abbey Care              
    [28] Abbey Healthcare 
    [29] Abbey Healthcare    
    [30] Abbey Healthcare  

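My question is how to write something like a "general pattern" so that I can extract the observations that effectively share the same pattern. I have tried str_extract, but I think I am missing something when writing the general pattern.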

library(stringr)
home = "[a-zA-Z]{2,}" # Intended as a general pattern for names whose first two words are the same
test = with(providers, str_extract(Provider.Name, home))
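A quick check on a few of the names listed above suggests why this attempt falls short. The pattern `[a-zA-Z]{2,}` matches any run of two or more letters, and `str_extract()` returns only the first such run in each string, so the result is a word fragment rather than a shared prefix. The vector `x` below is a small hand-picked sample, not the full `providers` data.

```r
library(stringr)

# A handful of names copied from the listing above (illustrative sample only)
x <- c("1st Care (UK) Limited", "24-7 Care Ltd",
       "A & I Care Home Ltd", "Abbey Care Centre Limited")

# str_extract() keeps only the first match in each string:
# the first run of two or more ASCII letters
first_run <- str_extract(x, "[a-zA-Z]{2,}")
print(first_run)
# "st" "Care" "Care" "Abbey"
```

Digits, single letters, and punctuation at the start of a name are skipped, which is why "1st Care" yields "st" and "24-7 Care" yields "Care".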

Does anyone know whether R has a function that can identify patterns in general? Thanks in advance.

r pattern-matching stringr
1 Answer

Here is one approach:

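# Loads the tidyverse (installing it first if needed); requires the pacman package.
# The .by argument used below needs dplyr >= 1.1.0.
pacman::p_load(tidyverse)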

df <- tibble(
    carehomes = c(
  "1st Care (UK) Limited",
  "1st Care Limited",
  "229 Mitcham Lane Limited",
  "24-7 Care Ltd",
  "3 Dimensions Care Limited",
  "3 Trees Community Support Limited",
  "365 Care Homes Limited",
  "3A Care (Solihull) Limited",
  "3L Care Limited",
  "5 Star TLC Limited",
  "92 Higher Drive Limited",
  "A & I Care Home Ltd",
  "A & L Care Homes Limited",
  "A & N Kachra",
  "A & R Care Limited",
  "A Better Carehome Ltd",
  "A.G.E. Nursing Homes Limited",
  "A.R.M. Healthcare Limited",
  "AAA Elderly Care Limited",
  "AAA Medics Ltd",
  "Aadams Residential Care Home Limited",
  "Abacus Quality Care Ltd",
  "Abberdale Limited",
  "Abbeville RCH Limited",
  "Abbey Care Centre Limited",
  "Abbey Care Direct Ltd",
  "Abbey Care Home Limited",
  "Abbey Healthcare (Aaron Court) Limited",
  "Abbey Healthcare (Kendal) Limited",
  "Abbey Healthcare (Knebworth) Ltd"
)
)

df |>
  # Create a new column with the junk at the end removed:
  # an optional " (...)" part followed by " Limited" or " Ltd"
  mutate(shortened = str_remove(carehomes, "(?: \\(.*\\)|) (?:Limited|Ltd)")) |>
  # Group by the shortened value. If the group has more than one item,
  # use the shortened version; otherwise keep the original name
  mutate(carehomes = ifelse(n() > 1, shortened, carehomes), .by = shortened) |>
  # Remove the shortened column once it is no longer needed
  select(-shortened)
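Note that this only merges names that become identical once the parenthetical and the Limited/Ltd suffix are dropped. Rows 1 and 2 and rows 28 to 30 collapse, but "Abbey Care Centre", "Abbey Care Direct" and "Abbey Care Home" remain distinct, so rows 25 to 27 keep their original names. If the intent is to group on the first two words, as the expected output in the question suggests, a minimal sketch could look like the following. The column name `prefix` and the two-word rule are assumptions made for illustration, not part of the original answer.

```r
library(dplyr)   # .by requires dplyr >= 1.1.0
library(stringr)

# Small illustrative subset of the names from the question
df <- tibble(carehomes = c(
  "1st Care (UK) Limited", "1st Care Limited", "229 Mitcham Lane Limited",
  "Abbey Care Centre Limited", "Abbey Care Direct Ltd", "Abbey Care Home Limited",
  "Abbey Healthcare (Kendal) Limited", "Abbey Healthcare (Knebworth) Ltd"
))

result <- df |>
  # Hypothetical grouping key: the first two whitespace-separated tokens.
  # Names with fewer than two tokens fall back to the full name.
  mutate(prefix = coalesce(str_extract(carehomes, "^\\S+\\s+\\S+"), carehomes)) |>
  # Replace the name with the key only when the key is shared by several rows
  mutate(carehomes = if (n() > 1) prefix else carehomes, .by = prefix) |>
  select(-prefix)

print(result$carehomes)
```

A two-token key is a blunt rule. On the full list it would also merge "A & I Care Home Ltd", "A & L Care Homes Limited", "A & N Kachra" and "A & R Care Limited" into "A &", so the key should be checked against the real data before the renamed values are relied on.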