I have a data frame that looks like this:
df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3",
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0",
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6",
" Optical instruments and apparatus .............................................................................. E ....................... 14.1",
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3",
" Silk .................................................................................................................................. A ....................... 13.2",
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1",
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ",
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
# that looks like this
string id SH3
<chr> <chr> <chr>
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3" "1 " "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0" "2 " "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6" "3 " "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1" "4 " "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3" "5 " "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 " "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1" "7 " "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1" "8 " "893 "
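I want to split the string variable into three separate variables. string has three parts, separated by runs of dots (...):
1) The first part is some text: for example, in row 1, "Thermionic, cold and photo-cathode valves, tubes, and parts"
2) The second part is a capital letter: in row 1, "E"
3) The last part is a number: in row 1, "16.3"
I want to split my strings and create three variables from them. The problem is that the number of dots differs from row to row. Does anyone know an efficient way to do this?
An efficient way to isolate the capital letter (part 2) would be enough for my purposes.
Thank you very much for your help.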
You can split on a regex that matches a literal dot, [.], repeated two or more times, {2,}:
strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "
# [3] " 16.3"
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "
# [3] " 15.0"
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "
# [3] " 14.6"
With that, you can convert the result into a data frame:
data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
# X1 X2 X3
# 1 Thermionic, cold and photo-cathode valves, tubes, and parts E 16.3
# 2 Automatic data processing machines and units thereof E 15.0
# 3 Parts of and accessories suitable for 751, 752 E 14.6
# 4 Optical instruments and apparatus E 14.1
# 5 Perfumery, cosmetics and toilet preparations E 13.3
# 6 Silk A 13.2
# 7 Undergarments, knitted or crocheted B 13.1
# 8 Articles of materials described in division 58 D 13.1
You will want to rename the columns, and most likely apply trimws and as.numeric to some of them, since strsplit does not trim the strings.
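A minimal sketch of that cleanup step. The column names description, category, and value are illustrative and not from the question, and x is a shortened two-row stand-in for df$string:

```r
# Shortened stand-in for df$string (dot runs abbreviated for readability)
x <- c(" Silk .......... A .......... 13.2",
       " Undergarments, knitted or crocheted ..... B ..... 13.1")

# Split on runs of two or more literal dots, then bind the pieces row-wise
parts <- do.call(rbind, strsplit(x, "[.]{2,}"))

# Trim whitespace from each piece and convert the last one to numeric;
# the column names here are hypothetical
out <- data.frame(
  description = trimws(parts[, 1]),
  category    = trimws(parts[, 2]),
  value       = as.numeric(trimws(parts[, 3])),
  stringsAsFactors = FALSE
)
out
```

With the real data, replacing x with df$string should give the same three columns for all eight rows.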
If all you need is the second column, you can use
trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"
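Indexing with `[[`, 2 assumes every row splits into at least two pieces, and the rbind step assumes exactly three. A hedged sketch of a guard for that assumption, using vapply in place of sapply so the result type is fixed (x is again a shortened stand-in for df$string, and category is a hypothetical name):

```r
# Shortened stand-in for df$string
x <- c(" Silk .......... A .......... 13.2",
       " Optical instruments and apparatus ..... E ..... 14.1")

pieces <- strsplit(x, "[.]{2,}")

# Stop early if any row did not split into exactly three pieces,
# e.g. because the description itself contains a run of dots
stopifnot(all(lengths(pieces) == 3))

# Extract the second piece of each row as a character vector and trim it
category <- trimws(vapply(pieces, `[[`, character(1), 2))
category
```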