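Convert the type of multiple columns of a data frame at once

Question  Votes: 38  Answers: 10

I seem to spend a lot of time creating a data frame from a file, a database, or something else, and then converting each column into the type I want (numeric, factor, character, etc.). Is there a way to do this in one step, possibly by supplying a vector of types?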

foo<-data.frame(x=c(1:10), 
                y=c("red", "red", "red", "blue", "blue", 
                    "blue", "yellow", "yellow", "yellow", 
                    "green"),
                z=Sys.Date()+c(1:10))

foo$x<-as.character(foo$x)
foo$y<-as.character(foo$y)
foo$z<-as.numeric(foo$z)

Instead of the last three commands, I'd like to do something like:

foo<-convert.magic(foo, c(character, character, numeric))
r type-conversion
10 Answers
35
votes

Edit: see this related question for some simplifications and extensions of this basic idea.

My comment on Brandon's answer, using switch:

convert.magic <- function(obj,types){
    for (i in 1:length(obj)){
        FUN <- switch(types[i],character = as.character,
                               numeric = as.numeric,
                               factor = as.factor)
        obj[,i] <- FUN(obj[,i])
    }
    obj
}

out <- convert.magic(foo,c('character','character','numeric'))
> str(out)
'data.frame':   10 obs. of  3 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: chr  "red" "red" "red" "blue" ...
 $ z: num  15254 15255 15256 15257 15258 ...

For truly large data frames you may want to use lapply instead of the for loop:

convert.magic1 <- function(obj,types){
    out <- lapply(1:length(obj),FUN = function(i){FUN1 <- switch(types[i],character = as.character,numeric = as.numeric,factor = as.factor); FUN1(obj[,i])})
    names(out) <- colnames(obj)
    as.data.frame(out,stringsAsFactors = FALSE)
}

When doing this, be aware of some of the intricacies of coercing data in R. For example, converting from factor to numeric often involves as.numeric(as.character(...)). Also, be mindful that data.frame() and as.data.frame() convert character to factor by default in R versions before 4.0.0.
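The factor-to-numeric caveat is easier to see on a tiny case. The following is a minimal sketch, assuming a made-up factor `f` that is not part of the original answer: calling as.numeric() directly on a factor returns its internal integer codes, while going through character (or indexing the levels) recovers the values the labels spell out.

```r
# A factor stores integer codes plus a vector of level labels.
f <- factor(c("10", "20", "10", "5"))

# as.numeric() on the factor returns the internal codes, not the labels.
codes <- as.numeric(f)

# Going through character, or indexing the converted levels, recovers the values.
via_character <- as.numeric(as.character(f))
via_levels <- as.numeric(levels(f))[f]

print(codes)
print(via_character)  # 10 20 10 5
print(via_levels)     # 10 20 10 5
```

Both recovery idioms give the same result; indexing the levels converts each distinct label only once, which may matter for long vectors with few levels.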


0
votes

Using purrr:

[code block missing from the source]

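18
votes

If you want to detect the columns' data types automatically rather than specify them manually (e.g. after data tidying), the function type.convert() may help.

type.convert() takes a character vector and tries to determine the optimal type for all of its elements (which means it has to be applied once per column).

df[] <- lapply(df, function(x) type.convert(as.character(x)))

Since I love dplyr, I prefer:

library(dplyr)
df <- df %>% mutate_all(funs(type.convert(as.character(.))))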
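As a small illustration of the per-column idea, here is a sketch on a made-up data frame (the column names `id`, `score`, and `label` are assumptions for the example). It passes `as.is = TRUE` so that non-numeric columns stay character instead of becoming factors; recent R versions expect the caller to set this argument explicitly.

```r
# Hypothetical data frame in which every column was read as character.
df <- data.frame(id = c("1", "2", "3"),
                 score = c("1.5", "2.25", "3"),
                 label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Apply type.convert() once per column; df[] keeps the data.frame structure.
# as.is = TRUE keeps non-numeric columns as character instead of factor.
df[] <- lapply(df, type.convert, as.is = TRUE)

str(df)
```

After the conversion, `id` should be integer, `score` numeric, and `label` character.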

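7
votes

I find I run into this a lot as well. It comes down to how you import the data. All of the read...() functions have some option for not converting character strings to factors, so that text strings stay character and things that look like numbers stay numeric. A problem arises when you have elements that are empty rather than NA, but na.strings = c("", ...) should take care of that as well. I'd start by taking a hard look at your import process and adjusting it accordingly.

But you could always write a function and pass the vector of type names through it:

convert.magic <- function(x, y=NA) {
  for(i in 1:length(y)) {
    if (y[i] == "numeric") {
      x[i] <- as.numeric(x[[i]])
    }
    if (y[i] == "character")
      x[i] <- as.character(x[[i]])
  }
  return(x)
}

foo <- convert.magic(foo, c("character", "character", "numeric"))

> str(foo)
'data.frame':   10 obs. of  3 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: chr  "red" "red" "red" "blue" ...
 $ z: num  15254 15255 15256 15257 15258 ...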
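Fixing the types at import time can be sketched with base R's read.csv(). The CSV content below is invented for illustration, and `text =` stands in for a real file path. colClasses takes one class name per column, which is close to the "vector of types" the question asks for, and `na.strings = ""` turns empty fields into NA.

```r
# Hypothetical CSV content; in practice this would be a file on disk.
csv <- "x,y,z
1,red,10.5
2,,11.5
3,blue,"

foo <- read.csv(text = csv,
                colClasses = c("character", "character", "numeric"),
                na.strings = "",          # empty fields become NA
                stringsAsFactors = FALSE) # keep text as character

str(foo)
```

Here `x` is read as character even though it looks numeric, and the empty `y` and `z` fields come back as NA rather than as empty strings.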

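6
votes

I know I'm quite late to answer, but using a loop together with the attributes function is a simple solution to your problem.

names <- c("x", "y", "z")
chclass <- c("character", "character", "numeric")

for (i in (1:length(names))) {
  attributes(foo[, names[i]])$class <- chclass[i]
}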
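A loop over a vector of class names can also be written with explicit conversion functions instead of assigning the class attribute. This is only a sketch under assumptions: the named vector `types` is hypothetical, and the approach works only for types that have a matching as.<type>() function (as.character, as.numeric, as.factor, and so on).

```r
foo <- data.frame(x = 1:3,
                  y = c("red", "blue", "green"),
                  z = Sys.Date() + 1:3,
                  stringsAsFactors = FALSE)

# Hypothetical lookup: one target type per column, matched by column name.
types <- c(x = "character", y = "factor", z = "numeric")

# paste0("as.", type) resolves to as.character, as.factor, as.numeric.
foo[names(types)] <- Map(function(col, type) match.fun(paste0("as.", type))(col),
                         foo[names(types)], types)

str(foo)
```

Matching by name means columns not listed in `types` are left untouched, and the order of the entries does not have to follow the column order.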

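2
votes

I just ran into something like this with the RSQLite fetch methods: the results come back as atomic data types. In my case it was a date-time stamp that was causing me frustration. I found that the setAs function is very useful for getting as to work as expected. Here's my small example case.

##data.frame conversion function
convert.magic2 <- function(df,classes){
  out <- lapply(1:length(classes),
                FUN = function(classIndex){as(df[,classIndex],classes[classIndex])})
  names(out) <- colnames(df)
  return(data.frame(out))
}

##small example case
tmp.df <- data.frame('dt'=c("2013-09-02 09:35:06", "2013-09-02 09:38:24", "2013-09-02 09:38:42", "2013-09-02 09:38:42"),
                     'v'=c('1','2','3','4'),
                     stringsAsFactors=FALSE)
classes=c('POSIXct','numeric')
str(tmp.df)
#confirm that it has character datatype columns
##  'data.frame':  4 obs. of  2 variables:
##    $ dt: chr  "2013-09-02 09:35:06" "2013-09-02 09:38:24" "2013-09-02 09:38:42" "2013-09-02 09:38:42"
##    $ v : chr  "1" "2" "3" "4"

##is the dt column coerceable to POSIXct?
canCoerce(tmp.df$dt,"POSIXct")
##  [1] FALSE

##and the convert.magic2 function fails also:
tmp.df.n <- convert.magic2(tmp.df,classes)

##  Error in as(df[, classIndex], classes[classIndex]) :
##    no method or default for coercing "character" to "POSIXct"

##a little reading reveals the setAs function
setAs('character', 'POSIXct', function(from){return(as.POSIXct(from))})

##better answer for canCoerce
canCoerce(tmp.df$dt,"POSIXct")
##  [1] TRUE

##better answer from convert.magic2
tmp.df.n <- convert.magic2(tmp.df,classes)

##column datatypes converted as I would like them!
str(tmp.df.n)

##  'data.frame':  4 obs. of  2 variables:
##    $ dt: POSIXct, format: "2013-09-02 09:35:06" "2013-09-02 09:38:24" "2013-09-02 09:38:42" "2013-09-02 09:38:42"
##    $ v : num  1 2 3 4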

1
votes

An addition to @joran's answer, in which convert.magic would not preserve numeric values in a factor-to-numeric conversion:

convert.magic <- function(obj,types){
    out <- lapply(1:length(obj),FUN = function(i){FUN1 <- switch(types[i],
    character = as.character,numeric = as.numeric,factor = as.factor); FUN1(obj[,i])})
    names(out) <- colnames(obj)
    as.data.frame(out,stringsAsFactors = FALSE)
}

foo<-data.frame(x=c(1:10),
                    y=c("red", "red", "red", "blue", "blue",
                        "blue", "yellow", "yellow", "yellow",
                        "green"),
                    z=Sys.Date()+c(1:10))

foo$x<-as.character(foo$x)
foo$y<-as.character(foo$y)
foo$z<-as.numeric(foo$z)

str(foo)
# 'data.frame': 10 obs. of  3 variables:
# $ x: chr  "1" "2" "3" "4" ...
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: num  16777 16778 16779 16780 16781 ...

foo.factors <- convert.magic(foo, rep("factor", 3))

str(foo.factors) # all factors

foo.numeric.not.preserved <- convert.magic(foo.factors, c("numeric", "character", "numeric"))

str(foo.numeric.not.preserved)
# 'data.frame': 10 obs. of  3 variables:
# $ x: num  1 3 4 5 6 7 8 9 10 2
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: num  1 2 3 4 5 6 7 8 9 10

# z comes out as 1 2 3...

The following should preserve numeric values:

## as.numeric function that preserves numeric values when converting factor to numeric
as.numeric.mod <- function(x) {
    if(is.factor(x))
      as.numeric(levels(x))[x]
    else
      as.numeric(x)
}

## The same as in @joran's answer, except for as.numeric.mod
convert.magic <- function(obj,types){
    out <- lapply(1:length(obj),FUN = function(i){FUN1 <- switch(types[i],
    character = as.character,numeric = as.numeric.mod, factor = as.factor); FUN1(obj[,i])})
    names(out) <- colnames(obj)
    as.data.frame(out,stringsAsFactors = FALSE)
}

foo.numeric <- convert.magic(foo.factors, c("numeric", "character", "numeric"))

str(foo.numeric)
# 'data.frame': 10 obs. of  3 variables:
# $ x: num  1 2 3 4 5 6 7 8 9 10
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: num  16777 16778 16779 16780 16781 ...

# z comes out with the correct numeric values

1
votes

A slightly simpler data.table solution, though it will take a few steps if you are changing to many different column types.

library(data.table)
# every column needs the same length, hence 11:20 rather than 10:20
dt <- data.table( x=c(1:10), y=c(11:20), z=c(11:20), name=letters[1:10])

dt <- dt[, lapply(.SD, as.numeric), by= name]

This changes every column other than those specified in by to numeric (or to whatever is set in lapply).
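When only some columns should change, a related data.table idiom is to name them explicitly instead of excluding the others through by. The following is a minimal sketch, assuming the data.table package is installed; the table contents and the `cols` vector are made up for the example.

```r
library(data.table)

dt <- data.table(x = as.character(1:10),
                 y = as.character(11:20),
                 name = letters[1:10])

# Hypothetical choice of columns to convert; `name` is left untouched.
cols <- c("x", "y")

# `:=` updates the listed columns by reference; .SDcols restricts .SD to them.
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

str(dt)
```

Because `:=` modifies the table in place, no reassignment to `dt` is needed, and repeating the line with a different `cols` and conversion function handles further target types.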


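1
votes

type.convert can also convert a data frame to the appropriate classes without the classes being specified:

type.convert(foo, as.is = TRUE)

If all the columns are stored as character, we can also use readr::type_convert, which automatically converts the data frame to the correct classes. Consider a modified data frame:

foo <- data.frame(x = as.character(1:10),
                  y = c("red", "red", "red", "blue", "blue", "blue", "yellow",
                     "yellow", "yellow", "green"),
                  z = as.character(Sys.Date()+c(1:10)), stringsAsFactors = FALSE)

str(foo)

#'data.frame':  10 obs. of  3 variables:
# $ x: chr  "1" "2" "3" "4" ...
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: chr  "2019-08-12" "2019-08-13" "2019-08-14" "2019-08-15" ...

readr::type_convert(foo)

Applying readr::parse_guess to each column:

foo[] <- lapply(foo, readr::parse_guess)

#'data.frame':  10 obs. of  3 variables:
# $ x: num  1 2 3 4 5 6 7 8 9 10
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: Date, format: "2019-08-12" "2019-08-13" "2019-08-14" "2019-08-15" ...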

0
votes

transform seems to be what you're describing:

foo <- transform(foo, x=as.character(x), y=as.character(y), z=as.numeric(z))