Preprocessing a string in R using stringr functions

Question description  Votes: 0  Answers: 2

I have a string that looks like this:

clean_text
[1] "01/04/2018   Japan   -   Ghana   7:1    04/04/2018   Turkey   -   Estonia   3:2    06/04/2018   USA   -   Mexico   4:1        France   -   Nigeria   8:0     07/04/2018   Turkey   -   Estonia   3:0    08/04/2018   USA   -   Mexico   6:2     09/04/2018   France   -   Canada   1:0     10/04/2018   Cuba   -   Nicaragua   4:2    12/04/2018   Cuba   -   Nicaragua   1:2    18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       St. Kitts & Nevis   -   Dominica   1:0       Cuba   -   Barbados   7:0    19/04/2018   Haiti   -   Virgin Islands   7:0    20/04/2018   St. Lucia   -   Dominica   0:0       St. Kitts & Nevis   -   St. Vincent/Grenadines   2:0       Jamaica    -   Barbados   3:2    21/04/2018   Virgin Islands   -   Haiti   0:14    22/04/2018   Dominica   -   St. Vincent/Grenadines   3:0       St. Kitts & Nevis   -   St. Lucia   0:1       Jamaica   -   Cuba   0:1    25/04/2018   Guyana   -   Grenada   0:0       Trinidad & Tobago   -   Suriname   7:0    27/04/2018   Suriname   -   Guyana   2:2       Antigua & Barbuda   -   Curaçao   2:1       Trinidad & Tobago   -   Grenada   8:1    29/04/2018   Grenada   -   Suriname   5:6       Trinidad & Tobago   -   Guyana   3:1    "

I want to preprocess it so that I get a list like: Japan, Ghana, Turkey, Estonia, USA, and so on, i.e. the team names that are separated by "-".

I am trying this code:

pattern <- "[[:alpha:]][[:alpha:] -]*[[:alpha:]]"
matches <- str_extract_all(clean_text, pattern)[[1]]

The list it gives me is:

[1] "Japan   -   Ghana"          "Turkey   -   Estonia"
[3] "USA   -   Mexico"           "France   -   Nigeria"
[5] "Turkey   -   Estonia"       "USA   -   Mexico"
[7] "France   -   Canada"        "Cuba   -   Nicaragua"
[9] "Cuba   -   Nicaragua"       "St"
[11] "Vincent"                    "Grenadines   -   St"
[13] "Lucia"                      "St"
[15] "Kitts"                      "Nevis   -   Dominica"
[17] "Cuba   -   Barbados"        "Haiti   -   Virgin Islands"
[19] "St"                         "Lucia   -   Dominica"
[21] "St"                         "Kitts"
[23] "Nevis   -   St"             "Vincent"
[25] "Grenadines"                 "Jamaica    -   Barbados"
[27] "Virgin Islands   -   Haiti" "Dominica   -   St"
[29] "Vincent"                    "Grenadines"
[31] "St"                         "Kitts"
[33] "Nevis   -   St"             "Lucia"
[35] "Jamaica   -   Cuba"         "Guyana   -   Grenada"
[37] "Trinidad"                   "Tobago   -   Suriname"
[39] "Suriname   -   Guyana"      "Antigua"
[41] "Barbuda   -   Curaçao"      "Trinidad"
[43] "Tobago   -   Grenada"       "Grenada   -   Suriname"
[45] "Trinidad"                   "Tobago   -   Guyana"

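This is wrong because it breaks the string wherever a ".", "&" or "/" appears, while the two teams of a match stay joined around the "-". In fact I only want the string to be split where the "-" is. What changes should I make to my code?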

r string stringr data-preprocessing
2 answers
1
votes

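Perhaps a more step-by-step approach helps here: split the string after every score, remove the date from each piece, and trim the surrounding whitespace.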

library(stringr)

s <- "01/04/2018   Japan   -   Ghana   7:1    04/04/2018   Turkey   -   Estonia   3:2    06/04/2018   USA   -   Mexico   4:1        France   -   Nigeria   8:0     07/04/2018   Turkey   -   Estonia   3:0    08/04/2018   USA   -   Mexico   6:2     09/04/2018   France   -   Canada   1:0     10/04/2018   Cuba   -   Nicaragua   4:2    12/04/2018   Cuba   -   Nicaragua   1:2    18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       St. Kitts & Nevis   -   Dominica   1:0       Cuba   -   Barbados   7:0    19/04/2018   Haiti   -   Virgin Islands   7:0    20/04/2018   St. Lucia   -   Dominica   0:0       St. Kitts & Nevis   -   St. Vincent/Grenadines   2:0       Jamaica    -   Barbados   3:2    21/04/2018   Virgin Islands   -   Haiti   0:14    22/04/2018   Dominica   -   St. Vincent/Grenadines   3:0       St. Kitts & Nevis   -   St. Lucia   0:1       Jamaica   -   Cuba   0:1    25/04/2018   Guyana   -   Grenada   0:0       Trinidad & Tobago   -   Suriname   7:0    27/04/2018   Suriname   -   Guyana   2:2       Antigua & Barbuda   -   Curaçao   2:1       Trinidad & Tobago   -   Grenada   8:1    29/04/2018   Grenada   -   Suriname   5:6       Trinidad & Tobago   -   Guyana   3:1"

s |> 
  str_split_1("\\d+:\\d+") |> 
  str_remove("\\d{2}/\\d{2}/\\d{4}") |> 
  str_trim()
#>  [1] "Japan   -   Ghana"                             
#>  [2] "Turkey   -   Estonia"                          
#>  [3] "USA   -   Mexico"                              
#>  [4] "France   -   Nigeria"                          
#>  [5] "Turkey   -   Estonia"                          
#>  [6] "USA   -   Mexico"                              
#>  [7] "France   -   Canada"                           
#>  [8] "Cuba   -   Nicaragua"                          
#>  [9] "Cuba   -   Nicaragua"                          
#> [10] "St. Vincent/Grenadines   -   St. Lucia"        
#> [11] "St. Kitts & Nevis   -   Dominica"              
#> [12] "Cuba   -   Barbados"                           
#> [13] "Haiti   -   Virgin Islands"                    
#> [14] "St. Lucia   -   Dominica"                      
#> [15] "St. Kitts & Nevis   -   St. Vincent/Grenadines"
#> [16] "Jamaica    -   Barbados"                       
#> [17] "Virgin Islands   -   Haiti"                    
#> [18] "Dominica   -   St. Vincent/Grenadines"         
#> [19] "St. Kitts & Nevis   -   St. Lucia"             
#> [20] "Jamaica   -   Cuba"                            
#> [21] "Guyana   -   Grenada"                          
#> [22] "Trinidad & Tobago   -   Suriname"              
#> [23] "Suriname   -   Guyana"                         
#> [24] "Antigua & Barbuda   -   Curaçao"               
#> [25] "Trinidad & Tobago   -   Grenada"               
#> [26] "Grenada   -   Suriname"                        
#> [27] "Trinidad & Tobago   -   Guyana"                
#> [28] ""

Created on 2023-03-17 with reprex v2.0.2
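
The result above still holds one fixture per element, while the question asks for the individual team names. A minimal sketch of that last step, assuming the separator is always a hyphen with whitespace on both sides (the shortened input `s` and the names `pairs` and `teams` are only for illustration):

```r
library(stringr)

# Hypothetical shortened input with the same layout as the string in the question
s <- "01/04/2018   Japan   -   Ghana   7:1    18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       St. Kitts & Nevis   -   Dominica   1:0    "

pairs <- s |>
  str_split_1("\\d+:\\d+") |>            # cut after every score (needs stringr >= 1.5.0)
  str_remove("\\d{2}/\\d{2}/\\d{4}") |>  # drop the date where there is one
  str_trim()
pairs <- pairs[pairs != ""]              # drop the empty piece left after the last score

# Split each fixture on a hyphen that has whitespace on both sides,
# so "." "/" and "&" inside a team name are left alone
teams <- unlist(str_split(pairs, "\\s+-\\s+"))
teams
# teams holds: "Japan", "Ghana", "St. Vincent/Grenadines", "St. Lucia",
#              "St. Kitts & Nevis", "Dominica"
```

Requiring whitespace around the hyphen is a design choice: a hyphen inside a name, with no spaces around it, would not be treated as a separator.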


0
votes

Split at the scores and remove the unnecessary whitespace (here x is the input string).

strsplit(x, r'{\d+:\d+}') |> el() |>
  gsub(pat=r'{\d+/\d+/\d+|\s+}', repl=' ') |>
  trimws() |> {\(.) .[!. == '']}()
# [1] "Japan - Ghana"                              "Turkey - Estonia"                          
# [3] "USA - Mexico"                               "France - Nigeria"                          
# [5] "Turkey - Estonia"                           "USA - Mexico"                              
# [7] "France - Canada"                            "Cuba - Nicaragua"                          
# [9] "Cuba - Nicaragua"                           "St. Vincent/Grenadines - St. Lucia"        
# [11] "St. Kitts & Nevis - Dominica"               "Cuba - Barbados"                           
# [13] "Haiti - Virgin Islands"                     "St. Lucia - Dominica"                      
# [15] "St. Kitts & Nevis - St. Vincent/Grenadines" "Jamaica - Barbados"                        
# [17] "Virgin Islands - Haiti"                     "Dominica - St. Vincent/Grenadines"         
# [19] "St. Kitts & Nevis - St. Lucia"              "Jamaica - Cuba"                            
# [21] "Guyana - Grenada"                           "Trinidad & Tobago - Suriname"              
# [23] "Suriname - Guyana"                          "Antigua & Barbuda - Curaçao"               
# [25] "Trinidad & Tobago - Grenada"                "Grenada - Suriname"                        
# [27] "Trinidad & Tobago - Guyana"       
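
As for what to change in the original pattern: its character class allows only letters, spaces and "-", so every ".", "&" and "/" ends a match, while the "-" is swallowed into it. A possible fix, sketched here in base R on a shortened hypothetical input, is to allow those three characters and leave the hyphen out; the same pattern should also work with str_extract_all. It assumes team names contain no hyphens or digits.

```r
# Hypothetical shortened input with the same layout as the string in the question
x <- "18/04/2018   St. Vincent/Grenadines   -   St. Lucia   0:1       Trinidad & Tobago   -   Suriname   7:0    "

# A name starts and ends with a letter; in between, letters, spaces,
# ".", "&" and "/" are allowed. "-" is not, so a match stops at the separator.
pattern <- "[[:alpha:]][[:alpha:] .&/]*[[:alpha:]]"

# Extract every match from the single input string
teams <- regmatches(x, gregexpr(pattern, x))[[1]]
teams
# teams holds: "St. Vincent/Grenadines", "St. Lucia", "Trinidad & Tobago", "Suriname"
```

Because a match must begin with a letter, the dates and scores are skipped without any extra cleaning step.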