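I have a data table "data" with 25 columns. Some of the columns (about 15) contain numeric values that were imported as character. In those columns I want to replace certain characters, e.g. "," by ".", "<" by "", ">" by "" and so on (there can be 10 or more such combinations), because some values look like "<0,17" or "> 1,5".
Since the column names change (the same task applies to different data tables), I would like to solve it along these lines (what I coded is not correct, it only shows what I want to do).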
replace <- list ("," = ".", "<" = "", ">" = "")
affectedColumns = c("name1", "name2", "name3" ... "name 14", "name 15").
mydata %>%
mutate(affectedColumns, replace)
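Another problem is that some of the columns are numeric and some are character. Does it make sense to first convert all values in "affectedColumns" to character (as.character), then run the replacements, and then convert them all back to numeric (as.numeric)?
In the end I would like the values to have "." as the decimal mark, with no "<", ">" or spaces.
Is there a way to do this? Thanks!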
Here is a base R way.
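mydata[affectedColumns] <- lapply(mydata[affectedColumns], \(x) {
  # apply each pattern/replacement pair in turn (literal match, all occurrences)
  for (nm in names(replace)) x <- gsub(nm, replace[[nm]], x, fixed = TRUE)
  # convert the cleaned strings to numeric
  as.numeric(x)
})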
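On the question of mixed column types: gsub() coerces its input to character on its own, so a separate as.character() step should not be needed. Below is a minimal sketch of the same base R loop on made-up data. The data frame `toy`, its column names, and the extra " " entry in `replace` are assumptions for illustration, not part of the original question.

```r
# Hypothetical toy data: one character column and one already-numeric column
toy <- data.frame(
  name1 = c("<0,17", "> 1,5", "2"),
  name2 = c(0.5, 1, 1.5),
  stringsAsFactors = FALSE
)
affectedColumns <- c("name1", "name2")

# Pattern -> replacement pairs; the space entry is an added assumption
replace <- list("," = ".", "<" = "", ">" = "", " " = "")

toy[affectedColumns] <- lapply(toy[affectedColumns], function(x) {
  # gsub() converts x to character first, so numeric columns pass through
  for (nm in names(replace)) x <- gsub(nm, replace[[nm]], x, fixed = TRUE)
  as.numeric(x)
})

str(toy)
```

Because `fixed = TRUE` treats each name as a literal string, further pairs can be added to the list without worrying about regex escaping.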
You can use parse_number from the readr package to convert to numeric while dropping the greater-than/less-than signs.
library(dplyr)
library(readr)
df <- data.frame("name1" = c("1,5", "> 1,5", "<1,6"),
                 "name2" = c("1,5", "1,5", "1,5"),
                 "name3" = c("1,0", "1", "1"),
                 "name4" = c(1.5, 1, 0.5)
)
affectedColumns <- c("name1", "name2", "name3")
# all_of() selects the columns named in the character vector
new_df <- mutate(df, across(all_of(affectedColumns), .fns = ~parse_number(.x, locale = locale(decimal_mark = ","))))
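Since the column names differ between tables, one option is to select by type instead of by name. This is only a sketch and it assumes that every character column in the table is meant to become numeric. The cut-down data frame below is made up to mirror the example above.

```r
library(dplyr)
library(readr)

# Hypothetical toy data: one character column, one numeric column
df <- data.frame(
  name1 = c("1,5", "> 1,5", "<1,6"),
  name4 = c(1.5, 1, 0.5),
  stringsAsFactors = FALSE
)

# Parse every character column; numeric columns are left untouched
new_df <- df %>%
  mutate(across(where(is.character),
                ~ parse_number(.x, locale = locale(decimal_mark = ","))))

str(new_df)
```

If the table also holds genuine text columns, this selection would be too broad and an explicit name vector with all_of() is the safer choice.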
Here is a dplyr solution:
library(dplyr)
mydata %>%
  # Step 1: remove < and >, plus any whitespace that follows them:
  mutate(across(everything(),
                ~ sub("[<>]\\s*", "", .))) %>%
  # Step 2: replace the decimal dot with a comma:
  mutate(across(everything(),
                ~ sub("\\.", ",", .)))
col1 col2
1 1,2 12,701
2 3 55,77
3 5 5000
EDIT:
Here is a solution using setNames and stringr:
First define the sets of new and old values (make sure to escape regex metacharacters such as .):
replacements <- setNames(c("", "", ","), # new values
c("<", ">", "\\.")) # old values
Or, a little more economically:
replacements <- setNames(c("", ","), # new values
c("<|>", "\\.")) # old values
Now use str_replace_all to apply the changes in one go:
library(stringr)
mydata %>%
mutate(across(c(col1:col2),
~ str_replace_all(., replacements)))
Toy data:
mydata <- data.frame(
col1 = c("1.2", "3", "<5"),
col2 = c(">12.701", "55,77", "< 5000")
)
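If the end goal is numeric columns with "." as the decimal mark, as the question asks, the same str_replace_all approach can be pointed the other way and followed by as.numeric. This is a sketch under assumptions: the `to_numeric` replacement set is made up for illustration, and a comma is assumed to always be a decimal mark, never a thousands separator.

```r
library(dplyr)
library(stringr)

# Toy data as above
mydata <- data.frame(
  col1 = c("1.2", "3", "<5"),
  col2 = c(">12.701", "55,77", "< 5000"),
  stringsAsFactors = FALSE
)

# Hypothetical replacement set: drop <, > and whitespace, turn "," into "."
to_numeric <- c("[<>\\s]" = "", "," = ".")

result <- mydata %>%
  mutate(across(col1:col2,
                ~ as.numeric(str_replace_all(.x, to_numeric))))

str(result)
```

Any value that still contains other non-numeric characters after the replacements would become NA with a warning, which can serve as a hint that another pattern is missing from the set.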
structure(list(D = c(12327, 12328, 12329, 12330, 12331, 12333,
12334, 12335, 12336, 12337, 12338, 12339, 12340, 12343, 12345,
12348, 12349, 12350, 12351, 12352), E = c(12310, 12310, 12326,
12326, 12315, 12326, 0, 12324, 12324, 12334, 12334, 0, 12339,
0, 0, 12345, 12345, 0, 12343, 12343), Basiswert = c("AUDCAD",
"AUDCAD", "USDJPY", "USDJPY", "USDCAD", "USDJPY", "USDCHF", "USDCHF",
"USDCHF", "USDCHF", "USDCHF", "USDCAD", NA, "USDCAD", "CADJPY",
"CADJPY", "CADJPY", "USDCHF", "USDCAD", "USDCAD"), Einstieg = c(NA,
0.89262, NA, 139.192, NA, NA, 0.9052, NA, 0.90834, NA, 0.90816,
NA, NA, 1.362, 103.188, NA, 102.886, 0.9051, NA, 1.36504), Profit = c(33,
NA, 34, NA, 68, 68, NA, 33, NA, 33, NA, NA, NA, NA, NA, 34, NA,
NA, 33, NA), SL = c(NA, NA, NA, NA, NA, NA, 0.91134, NA, NA,
NA, NA, NA, NA, 1.3684, 102.545, NA, NA, 0.91138, NA, NA), TP = c(NA,
NA, NA, NA, NA, NA, 0.89325, NA, NA, NA, NA, NA, NA, 1.3504,
104.35, NA, NA, 0.8933, NA, NA), Trader = c(NA, NA, NA, NA, NA,
NA, "Trade by Jason\" ", NA, NA, NA, NA, NA, NA, "Trade by Jason\" ",
"Trade by Jason\" ", NA, NA, "Trade by Jason\" ", NA, NA)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -20L), groups = structure(list(
E = c(0, 12310, 12315, 12324, 12326, 12334, 12339, 12343,
12345), .rows = structure(list(c(7L, 12L, 14L, 15L, 18L),
1:2, 5L, 8:9, c(3L, 4L, 6L), 10:11, 13L, 19:20, 16:17), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -9L), .drop = TRUE))
Thank you very much for your effort and the solution. However, I am not working with the whole dataset. Please see the example above.
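Consider combining the mutate, across and case_when functions from the dplyr package. You can find them here: https://dplyr.tidyverse.org/reference/across.html and here: https://dplyr.tidyverse.org/reference/case_when.html. Otherwise, please provide a minimal reproducible example.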
Best, M.