将长字符串分割成三个变量

Question

我有一个数据框架，看起来像这样。

df<-structure(list(string = c(" Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3", 
" Automatic data processing machines and units thereof ............................................ E ....................... 15.0", 
" Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6", 
" Optical instruments and apparatus .............................................................................. E ....................... 14.1", 
" Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3", 
" Silk .................................................................................................................................. A ....................... 13.2", 
" Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1", 
" Articles of materials described in division 58 ............................................................. D ....................... 13.1"
), id = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ", "7 ", "8 "), SH3 = c("776 ", 
"752 ", "759 ", "871 ", "553 ", "261 ", "846 ", "893 ")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))


# that looks like this

  string                                                                                                                                                                    id    SH3   
  <chr>                                                                                                                                                                     <chr> <chr> 
1 " Thermionic, cold and photo-cathode valves, tubes, and parts .................................. E ....................... 16.3"                                          "1 "  "776 "
2 " Automatic data processing machines and units thereof ............................................ E ....................... 15.0"                                       "2 "  "752 "
3 " Parts of and accessories suitable for 751, 752 .......................................................... E ....................... 14.6"                               "3 "  "759 "
4 " Optical instruments and apparatus .............................................................................. E ....................... 14.1"                        "4 "  "871 "
5 " Perfumery, cosmetics and toilet preparations ............................................................. E ....................... 13.3"                              "5 "  "553 "
6 " Silk .................................................................................................................................. A ....................... 13.2" "6 "  "261 "
7 " Undergarments, knitted or crocheted .......................................................................... B ....................... 13.1"                          "7 "  "846 "
8 " Articles of materials described in division 58 ............................................................. D ....................... 13.1"                            "8 "  "893 "

我想把 string 变量分为三个独立的变量。string 有3个部分，由一系列的点(...)分开

1) 第一部分由一些文字组成：例如，在第1行 "热敏阀、冷阀和光阴极阀、管子和零件"

2）第二部分是大写字母：如第1行："E"

3) 最后一部分是一个数字:例如在第1行是 "16.3"。

我想拆分我的字符串，并从中创建三个变量。问题是每一行的点的数量都不同。有谁知道如何一个有效的方法来做？

一个有效的方式来隔离大写字母（第2部分）将是足够的对我的目的。

非常感谢您的帮助

Answer 1

你可以使用一个寻找点的regex，即 "点"。[.] 二段以上 {2,}:

strsplit(df$string, "[.]{2,}")[1:3]
# [[1]]
# [1] " Thermionic, cold and photo-cathode valves, tubes, and parts "
# [2] " E "                                                          
# [3] " 16.3"                                                        
# [[2]]
# [1] " Automatic data processing machines and units thereof " " E "                                                   
# [3] " 15.0"                                                 
# [[3]]
# [1] " Parts of and accessories suitable for 751, 752 " " E "                                             
# [3] " 14.6"

有了这个，你可以把它转换成一个框架。

data.frame(do.call(rbind, strsplit(df$string, "[.]{2,}")), stringsAsFactors = FALSE)
#                                                              X1  X2    X3
# 1  Thermionic, cold and photo-cathode valves, tubes, and parts   E   16.3
# 2         Automatic data processing machines and units thereof   E   15.0
# 3               Parts of and accessories suitable for 751, 752   E   14.6
# 4                            Optical instruments and apparatus   E   14.1
# 5                 Perfumery, cosmetics and toilet preparations   E   13.3
# 6                                                         Silk   A   13.2
# 7                          Undergarments, knitted or crocheted   B   13.1
# 8               Articles of materials described in division 58   D   13.1

你要重新命名，很可能 trimws 和 as.numeric 某些栏目，如 strsplit 没有修剪字符串。

如果你需要的只是第二列，那么就可以用

trimws(sapply(strsplit(df$string, "[.]{2,}"), `[[`, 2))
# [1] "E" "E" "E" "E" "E" "A" "B" "D"

将长字符串分割成三个变量

问题描述投票：1回答：1

1个回答

最新问题

将长字符串分割成三个变量

问题描述 投票：1回答：1

1个回答

最新问题

问题描述投票：1回答：1