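I am trying to extract numbers from a character variable that contains both numeric values and Arabic text, stored in the "salary" column. I found a solution for this in Python here, but I only use R.
I tried the code below, and it worked fine as long as the column contained only English text. What I'm really trying to do is create a new column, "salary_numeric", that holds all of the numeric values extracted from the "salary" column.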
df <-
  df %>% mutate(salary_numeric = as.numeric(str_split_fixed(job_posts$salary, fixed(","), 3)[, 2]))
Here is a sample of the data:
dput(df[1:30,c(22,24)])
Output:
structure(list(salary = c("﷼4,000.00", "﷼4,000.00", "﷼5,000.00",
"﷼12,000.00", "﷼5,000.00", "﷼4,500.00", "﷼100.00", " ",
" ", " ", " ", "﷼6,000.00", " ", "﷼10,000.00", "﷼5,500.00",
" ", "﷼25,688.33", " ", "﷼2,500.00", "﷼8,500.00", "﷼10,000.00",
" ", "﷼4,000.00", "﷼5,000.00", " ", " ", "﷼4,500.00", "﷼10,000.00",
" ", "﷼6,000.00"), salary_numeric = c(0, 0, 0, 0, 0, 500, NA,
NA, NA, NA, NA, 0, NA, 0, 500, NA, 688.33, NA, 500, 500, 0, NA,
0, 0, NA, NA, 500, 0, NA, 0)), row.names = c(NA, -30L), class = c("tbl_df",
"tbl", "data.frame"))
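My code works for extracting the values after the comma and after the dot (.), but for some reason I can't get the value before the comma. For example, the value "﷼25,688.33" ends up in the column as "688.33", when ideally it should be: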
25688.33
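For illustration, here is a minimal base R sketch of why the split drops the leading digits, along with one possible way to get the full number. It is not necessarily the best fix. The `salary` vector is a hypothetical stand-in for the real column, and the approach assumes the only characters worth keeping are the digits and the decimal point.

```r
# Hypothetical sample in the same format as the "salary" column
# ("\ufdfc" is the rial sign that prefixes each amount)
salary <- c("\ufdfc4,000.00", "\ufdfc25,688.33", " ", "\ufdfc100.00")

# Splitting on "," and taking the second piece keeps only what follows
# the first comma, so the thousands part is lost
second_piece <- strsplit("\ufdfc25,688.33", ",", fixed = TRUE)[[1]][2]
second_piece  # "688.33"

# Alternative: remove every character that is not a digit or a decimal
# point, then convert. Blank entries become "" and convert to NA.
salary_numeric <- as.numeric(gsub("[^0-9.]", "", salary))
salary_numeric
```

Inside the pipeline from the question, the same expression could go in the `mutate()` call, e.g. `mutate(salary_numeric = as.numeric(gsub("[^0-9.]", "", salary)))`. This assumes `salary` is a column of `df` itself.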