I have a string that looks like this:
clean_text
[1] "01/04/2018 Japan - Ghana 7:1 04/04/2018 Turkey - Estonia 3:2 06/04/2018 USA - Mexico 4:1 France - Nigeria 8:0 07/04/2018 Turkey - Estonia 3:0 08/04/2018 USA - Mexico 6:2 09/04/2018 France - Canada 1:0 10/04/2018 Cuba - Nicaragua 4:2 12/04/2018 Cuba - Nicaragua 1:2 18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0 Cuba - Barbados 7:0 19/04/2018 Haiti - Virgin Islands 7:0 20/04/2018 St. Lucia - Dominica 0:0 St. Kitts & Nevis - St. Vincent/Grenadines 2:0 Jamaica - Barbados 3:2 21/04/2018 Virgin Islands - Haiti 0:14 22/04/2018 Dominica - St. Vincent/Grenadines 3:0 St. Kitts & Nevis - St. Lucia 0:1 Jamaica - Cuba 0:1 25/04/2018 Guyana - Grenada 0:0 Trinidad & Tobago - Suriname 7:0 27/04/2018 Suriname - Guyana 2:2 Antigua & Barbuda - Curaçao 2:1 Trinidad & Tobago - Grenada 8:1 29/04/2018 Grenada - Suriname 5:6 Trinidad & Tobago - Guyana 3:1 "
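I want to preprocess it so that I get a list like: Japan, Ghana, Turkey, Estonia, USA, etc., that is, the team names that are separated by " - ".
I am trying this code: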
library(stringr)  # provides str_extract_all()
pattern <- "[[:alpha:]][[:alpha:] -]*[[:alpha:]]"
matches <- str_extract_all(clean_text, pattern)[[1]]
which gives me this list:
[1] "Japan - Ghana" "Turkey - Estonia"
[3] "USA - Mexico" "France - Nigeria"
[5] "Turkey - Estonia" "USA - Mexico"
[7] "France - Canada" "Cuba - Nicaragua"
[9] "Cuba - Nicaragua" "St"
[11] "Vincent" "Grenadines - St"
[13] "Lucia" "St"
[15] "Kitts" "Nevis - Dominica"
[17] "Cuba - Barbados" "Haiti - Virgin Islands"
[19] "St" "Lucia - Dominica"
[21] "St" "Kitts"
[23] "Nevis - St" "Vincent"
[25] "Grenadines" "Jamaica - Barbados"
[27] "Virgin Islands - Haiti" "Dominica - St"
[29] "Vincent" "Grenadines"
[31] "St" "Kitts"
[33] "Nevis - St" "Lucia"
[35] "Jamaica - Cuba" "Guyana - Grenada"
[37] "Trinidad" "Tobago - Suriname"
[39] "Suriname - Guyana" "Antigua"
[41] "Barbuda - Curaçao" "Trinidad"
[43] "Tobago - Grenada" "Grenada - Suriname"
[45] "Trinidad" "Tobago - Guyana"
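This is wrong, because it also breaks the names apart at characters such as ".", "&" and "/". I only want the string to be split where " - " is present. What changes should I make to my code?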
Perhaps a more iterative approach helps here:
library(stringr)
s <- "01/04/2018 Japan - Ghana 7:1 04/04/2018 Turkey - Estonia 3:2 06/04/2018 USA - Mexico 4:1 France - Nigeria 8:0 07/04/2018 Turkey - Estonia 3:0 08/04/2018 USA - Mexico 6:2 09/04/2018 France - Canada 1:0 10/04/2018 Cuba - Nicaragua 4:2 12/04/2018 Cuba - Nicaragua 1:2 18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0 Cuba - Barbados 7:0 19/04/2018 Haiti - Virgin Islands 7:0 20/04/2018 St. Lucia - Dominica 0:0 St. Kitts & Nevis - St. Vincent/Grenadines 2:0 Jamaica - Barbados 3:2 21/04/2018 Virgin Islands - Haiti 0:14 22/04/2018 Dominica - St. Vincent/Grenadines 3:0 St. Kitts & Nevis - St. Lucia 0:1 Jamaica - Cuba 0:1 25/04/2018 Guyana - Grenada 0:0 Trinidad & Tobago - Suriname 7:0 27/04/2018 Suriname - Guyana 2:2 Antigua & Barbuda - Curaçao 2:1 Trinidad & Tobago - Grenada 8:1 29/04/2018 Grenada - Suriname 5:6 Trinidad & Tobago - Guyana 3:1"
s |>
str_split_1("\\d+:\\d+") |>
str_remove("\\d{2}/\\d{2}/\\d{4}") |>
str_trim()
#> [1] "Japan - Ghana"
#> [2] "Turkey - Estonia"
#> [3] "USA - Mexico"
#> [4] "France - Nigeria"
#> [5] "Turkey - Estonia"
#> [6] "USA - Mexico"
#> [7] "France - Canada"
#> [8] "Cuba - Nicaragua"
#> [9] "Cuba - Nicaragua"
#> [10] "St. Vincent/Grenadines - St. Lucia"
#> [11] "St. Kitts & Nevis - Dominica"
#> [12] "Cuba - Barbados"
#> [13] "Haiti - Virgin Islands"
#> [14] "St. Lucia - Dominica"
#> [15] "St. Kitts & Nevis - St. Vincent/Grenadines"
#> [16] "Jamaica - Barbados"
#> [17] "Virgin Islands - Haiti"
#> [18] "Dominica - St. Vincent/Grenadines"
#> [19] "St. Kitts & Nevis - St. Lucia"
#> [20] "Jamaica - Cuba"
#> [21] "Guyana - Grenada"
#> [22] "Trinidad & Tobago - Suriname"
#> [23] "Suriname - Guyana"
#> [24] "Antigua & Barbuda - Curaçao"
#> [25] "Trinidad & Tobago - Grenada"
#> [26] "Grenada - Suriname"
#> [27] "Trinidad & Tobago - Guyana"
#> [28] ""
Created on 2023-03-17 with reprex v2.0.2
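The result above still holds one "Team A - Team B" string per match, plus an empty trailing element. If individual team names are wanted, as the question asks, one possible follow-up is to drop the empty string and split each pair on the literal " - ". This is only a sketch: the shortened sample string and the names `pairs` and `teams` are assumptions made for illustration.

library(stringr)

# Shortened sample of the input, used only for illustration
s <- "01/04/2018 Japan - Ghana 7:1 18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0 "

pairs <- s |>
  str_split_1("\\d+:\\d+") |>            # cut at every score such as 7:1
  str_remove("\\d{2}/\\d{2}/\\d{4}") |>  # drop the date where there is one
  str_trim()

# Remove the empty piece left after the last score
pairs <- pairs[pairs != ""]

# Split every pair on the literal " - " and flatten into one vector
teams <- unlist(str_split(pairs, fixed(" - ")))
teams

Wrapping the result in unique(teams) would keep each team only once, if duplicates are not needed.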
Split at the scores and remove the unnecessary whitespace.
# x is the input string (clean_text in the question); needs R >= 4.1 for |> and \(.)
strsplit(x, r'{\d+:\d+}') |> el() |>
gsub(pat=r'{\d+/\d+/\d+|\s+}', repl=' ') |>
trimws() |> {\(.) .[!. == '']}()
# [1] "Japan - Ghana" "Turkey - Estonia"
# [3] "USA - Mexico" "France - Nigeria"
# [5] "Turkey - Estonia" "USA - Mexico"
# [7] "France - Canada" "Cuba - Nicaragua"
# [9] "Cuba - Nicaragua" "St. Vincent/Grenadines - St. Lucia"
# [11] "St. Kitts & Nevis - Dominica" "Cuba - Barbados"
# [13] "Haiti - Virgin Islands" "St. Lucia - Dominica"
# [15] "St. Kitts & Nevis - St. Vincent/Grenadines" "Jamaica - Barbados"
# [17] "Virgin Islands - Haiti" "Dominica - St. Vincent/Grenadines"
# [19] "St. Kitts & Nevis - St. Lucia" "Jamaica - Cuba"
# [21] "Guyana - Grenada" "Trinidad & Tobago - Suriname"
# [23] "Suriname - Guyana" "Antigua & Barbuda - Curaçao"
# [25] "Trinidad & Tobago - Grenada" "Grenada - Suriname"
# [27] "Trinidad & Tobago - Guyana"
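Another option, staying closer to the pattern in the question, might be to widen the character class so that ".", "&" and "/" are allowed inside a name. The sketch below assumes these are the only extra characters that occur in team names, and it uses a shortened, ASCII-only sample string for illustration.

library(stringr)

# Shortened sample of the input, used only for illustration
s <- "18/04/2018 St. Vincent/Grenadines - St. Lucia 0:1 St. Kitts & Nevis - Dominica 1:0 Trinidad & Tobago - Suriname 7:0 "

# Start and end on a letter; in between allow letters, spaces, ".", "&", "/" and "-"
# ("&" and "-" are escaped so they are read as literal characters inside the class)
pattern <- "[[:alpha:]][[:alpha:] .\\&/\\-]*[[:alpha:]]"

matches <- str_extract_all(s, pattern)[[1]]
matches

Because digits are not part of the class, every match stops in front of the score that follows it, and dates can never start a match.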