Change blank cells to "NA"

Question · Votes: 55 · Answers: 10

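Here is the link to my data.

My goal is to assign NA to all blank cells, whether the column is categorical or numeric. I am using na.strings="", but it does not assign NA to all blank cells.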

## reading the data
dat <- read.csv("data2.csv")
head(dat)
  mon hr        acc   alc sex spd axles door  reg                                 cond1 drug1
1   8 21 No Control  TRUE   F   0     2    2      Physical Impairment (Eyes, Ear, Limb)     A
2   7 20 No Control FALSE   M 900     2    2                                Inattentive     D
3   3  9 No Control FALSE   F 100     2    2 2004                                Normal     D
4   1 15 No Control FALSE   M   0     2    2      Physical Impairment (Eyes, Ear, Limb)     D
5   4 21 No Control FALSE      25    NA   NA                                                D
6   4 20 No Control    NA   F  30     2    4                Drinking Alcohol - Impaired     D
       inj1 PED_STATE st rac1
1     Fatal      <NA>  F <NA>
2  Moderate      <NA>  F <NA>
3  Moderate      <NA>  M <NA>
4 Complaint      <NA>  M <NA>
5 Complaint      <NA>  F <NA>
6  Moderate      <NA>  M <NA>


## using na.strings
dat2 <- read.csv("data2.csv", header=T, na.strings="")
head(dat2)
  mon hr        acc   alc sex spd axles door  reg                                 cond1 drug1
1   8 21 No Control  TRUE   F   0     2    2 <NA> Physical Impairment (Eyes, Ear, Limb)     A
2   7 20 No Control FALSE   M 900     2    2 <NA>                           Inattentive     D
3   3  9 No Control FALSE   F 100     2    2 2004                                Normal     D
4   1 15 No Control FALSE   M   0     2    2 <NA> Physical Impairment (Eyes, Ear, Limb)     D
5   4 21 No Control FALSE      25    NA   NA <NA>                                  <NA>     D
6   4 20 No Control    NA   F  30     2    4 <NA>           Drinking Alcohol - Impaired     D
       inj1 PED_STATE st rac1
1     Fatal        NA  F   NA
2  Moderate        NA  F   NA
3  Moderate        NA  M   NA
4 Complaint        NA  M   NA
5 Complaint        NA  F   NA
6  Moderate        NA  M   NA
r na
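10 Answers
84
votes

I assume you are talking about row 5, column "sex". It may be that the cell in data2.csv contains a space, so R does not treat it as empty.

Also, I noticed that in row 5 the columns "axles" and "door" were read from data2.csv as the string "NA". You probably want those treated as NA as well. To do this: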

dat2 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))

Edit:

I downloaded your data2.csv. Yes, there is a space in row 5, column "sex". So you want

na.strings=c(""," ","NA")
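As a rough check of the suggestion above, here is a minimal sketch that reads a small made-up CSV from a string instead of data2.csv. The variable names and sample rows are hypothetical. The second call uses strip.white = TRUE, which the answer does not mention, as another way to handle cells that contain only spaces.

```r
# Hypothetical stand-in for data2.csv: row 2 holds a single space in "sex",
# row 3 holds a truly empty "sex" field.
csv_text <- "mon,sex,spd\n8,F,0\n4, ,25\n7,,900\n3,M,100\n"

# Option 1: list every spelling of "blank" that should become NA.
dat_a <- read.csv(text = csv_text, na.strings = c("", " ", "NA"),
                  stringsAsFactors = FALSE)

# Option 2 (an assumption, not from the answer): strip surrounding
# whitespace while reading, so " " collapses to "" before the NA test.
dat_b <- read.csv(text = csv_text, na.strings = c("", "NA"),
                  strip.white = TRUE, stringsAsFactors = FALSE)

print(is.na(dat_a$sex))
print(is.na(dat_b$sex))
```

Listing the blank spellings explicitly is the more literal fix. Stripping whitespace also covers cells with two or more spaces, at the cost of trimming every unquoted character field.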

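-2
votes

Install dplyr from CRAN and load it (sub() itself is base R, so this step is optional):

library(dplyr)

file$colname <- sub("^-$", NA, file$colname)

This converts every cell in that column that holds exactly "-" to NA and returns a character vector.

If the column marks blanks differently, for example with "-", "" or 0, change the pattern to match that kind of blank cell. Keep the pattern anchored with ^ and $: an unanchored "" matches every string and would turn the whole column into NA.

For example, if blank cells show up as "" rather than "-", use this code:

file$colname <- sub("^$", NA, file$colname)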

29
votes

You can use gsub to replace several variants of empty, such as "" or a single space, with NA:

data= data.frame(cats=c('', ' ', 'meow'), dogs=c("woof", " ", NA))
apply(data, 2, function(x) gsub("^$|^ $", NA, x))
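Note that apply() returns a character matrix rather than a data frame. A minimal sketch of one way to keep the data frame, assuming all columns are character; the wider pattern "^\\s*$" (any run of whitespace) is a substitution for the answer's "^$|^ $", not part of it:

```r
data <- data.frame(cats = c("", " ", "meow"),
                   dogs = c("woof", "   ", NA),
                   stringsAsFactors = FALSE)

# Assigning into data[] keeps the data.frame structure.
# "^\\s*$" matches empty strings and strings made only of whitespace;
# gsub() returns NA for every matching element when the replacement is NA.
data[] <- lapply(data, function(x) gsub("^\\s*$", NA, x))

print(data)
```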

19
votes

A solution that is easier on the eyes, using dplyr:

require(dplyr)

## fake blank cells
iris[1,1]=""

## define a helper function
empty_as_na <- function(x){
    if("factor" %in% class(x)) x <- as.character(x) ## since ifelse wont work with factors
    ifelse(as.character(x)!="", x, NA)
}

## transform all columns
iris %>% mutate_each(funs(empty_as_na)) 

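To apply the correction to a subset of columns only, specify the columns of interest with dplyr's column-matching syntax. For example: mutate_each(funs(empty_as_na), matches("Width"), Species)

If your table contains dates, you should consider using a more type-safe version of ifelse.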
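The caveat about dates comes from ifelse() dropping attributes such as the Date class. The sketch below is one possible workaround in base R, not the answer's own code: the helper name empty_as_na_safe and the sample table are hypothetical, and it uses indexed assignment instead of ifelse so that non-character columns are returned untouched.

```r
# Hypothetical helper: only character and factor columns can hold "",
# so every other type (Date, numeric, logical) is returned unchanged.
empty_as_na_safe <- function(x) {
  if (is.factor(x)) x <- as.character(x)
  if (is.character(x)) x[!is.na(x) & x == ""] <- NA
  x
}

tbl <- data.frame(name = c("a", ""),
                  when = as.Date(c("2020-01-01", "2020-02-01")),
                  stringsAsFactors = FALSE)

# For comparison: ifelse() returns plain numbers for a Date input.
lost <- ifelse(is.na(tbl$when), NA, tbl$when)

tbl[] <- lapply(tbl, empty_as_na_safe)

print(class(lost))
print(class(tbl$when))
```

Like the original helper, this one returns factors as character vectors; wrap the result in factor() if the factor type matters.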


9
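votes

I recently ran into a similar problem, and this is what worked for me. If the variable is a plain character vector, a simple df$Var[df$Var == ""] <- NA is enough. If the variable is a factor, you need to convert it to character first, replace the "" cells with the value you want, and then convert it back to a factor. Take your sex variable as an example: I assume it is a factor, so to replace the empty cells I would do the following (note the unquoted NA; a quoted "NA" would store the literal string rather than a missing value):

df$Var <- as.character(df$Var)
df$Var[df$Var == ""] <- NA
df$Var <- as.factor(df$Var)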
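For factors there is also a shorter route that skips the round trip through character: setting a level to NA removes that level and marks its elements as missing. A minimal sketch on a made-up vector (the name sex and its values are illustrative only):

```r
sex <- factor(c("F", "", "M", ""))

# Renaming the "" level to NA drops the level and turns those cells into NA.
levels(sex)[levels(sex) == ""] <- NA

print(sex)
print(levels(sex))
```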

5
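votes

This should do the trick:

dat <- dat %>% mutate_all(na_if, "")

Note that recent dplyr releases make na_if() stricter about types, so on a data frame with numeric columns it may be necessary to restrict the call to character columns, for example dat %>% mutate(across(where(is.character), ~ na_if(.x, ""))).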

3
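votes

If you are using the haven or foreign package to read external source files, my function takes factors, character vectors and potential attributes into account. It also allows matching different, custom na.strings. To convert all columns, simply use lapply: df[] = lapply(df, blank2na, na.strings=c('','NA','na','N/A','n/a','NaN','nan'))

See the comments for more: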

#' Replaces blank-ish elements of a factor or character vector to NA
#' @description Replaces blank-ish elements of a factor or character vector to NA
#' @param x a vector of factor or character or any type
#' @param na.strings case sensitive strings that will be converted to NA. The function will do a trimws(x,'both') before conversion. If NULL, do only trimws, no conversion to NA.
#' @return Returns a vector trimws (always for factor, character) and NA converted (if matching na.strings). Attributes will also be kept ('label','labels', 'value.labels').
#' @seealso \code{\link{ez.nan2na}}
#' @export
blank2na = function(x,na.strings=c('','.','NA','na','N/A','n/a','NaN','nan')) {
    if (is.factor(x)) {
        lab = attr(x, 'label', exact = T)
        labs1 <- attr(x, 'labels', exact = T)
        labs2 <- attr(x, 'value.labels', exact = T)

        # trimws will convert factor to character
        x = trimws(x,'both')
        if (! is.null(lab)) lab = trimws(lab,'both')
        if (! is.null(labs1)) labs1 = trimws(labs1,'both')
        if (! is.null(labs2)) labs2 = trimws(labs2,'both')

        if (!is.null(na.strings)) {
            # convert to NA
            x[x %in% na.strings] = NA
            # also remember to remove na.strings from value labels 
            labs1 = labs1[! labs1 %in% na.strings]
            labs2 = labs2[! labs2 %in% na.strings]
        }

        # the levels will be reset here
        x = factor(x)

        if (! is.null(lab)) attr(x, 'label') <- lab
        if (! is.null(labs1)) attr(x, 'labels') <- labs1
        if (! is.null(labs2)) attr(x, 'value.labels') <- labs2
    } else if (is.character(x)) {
        lab = attr(x, 'label', exact = T)
        labs1 <- attr(x, 'labels', exact = T)
        labs2 <- attr(x, 'value.labels', exact = T)

        # x is already character here; trimws only strips surrounding whitespace
        x = trimws(x,'both')
        if (! is.null(lab)) lab = trimws(lab,'both')
        if (! is.null(labs1)) labs1 = trimws(labs1,'both')
        if (! is.null(labs2)) labs2 = trimws(labs2,'both')

        if (!is.null(na.strings)) {
            # convert to NA
            x[x %in% na.strings] = NA
            # also remember to remove na.strings from value labels 
            labs1 = labs1[! labs1 %in% na.strings]
            labs2 = labs2[! labs2 %in% na.strings]
        }

        if (! is.null(lab)) attr(x, 'label') <- lab
        if (! is.null(labs1)) attr(x, 'labels') <- labs1
        if (! is.null(labs2)) attr(x, 'value.labels') <- labs2
    } else {
        x = x
    }
    return(x)
}

1
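votes

While many of the options above work well, I found the coercion of non-target variables to chr problematic. Using ifelse and grepl within lapply resolves this off-target effect (in limited testing). Using slarky's regular expression in grepl: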

set.seed(42)
x1 <- sample(c("a","b"," ", "a a", NA), 10, TRUE)
x2 <- sample(c(rnorm(length(x1),0, 1), NA), length(x1), TRUE)

df <- data.frame(x1, x2, stringsAsFactors = FALSE)

The problem of coercion to character:

df2 <- lapply(df, function(x) gsub("^$|^ $", NA, x))
lapply(df2, class)

$x1
[1] "character"

$x2
[1] "character"

Solution using ifelse:

df3 <- lapply(df, function(x) ifelse(grepl("^$|^ $", x), NA, x))
lapply(df3, class)

$x1
[1] "character"

$x2
[1] "numeric"
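Both lapply() calls above return a plain list rather than a data frame. A variation on the same idea, sketched under the assumption that only character columns can contain blanks: select those columns and assign the result back into the data frame. The sample values and the name is_chr are made up for illustration.

```r
df <- data.frame(x1 = c("a", " ", "", "a a", NA),
                 x2 = c(1.5, NA, 0, -2, 3),
                 stringsAsFactors = FALSE)

# Find the character columns and rewrite only those in place.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(x) {
  x[grepl("^$|^ $", x)] <- NA  # grepl() is FALSE for NA input, so NAs are kept
  x
})

print(sapply(df, class))
```

Because the numeric column is never passed through a string function, it cannot be coerced by accident.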


1
votes

You can also use dplyr with mutate_at:

dat <- dat %>%
mutate_at(vars(colnames(.)),
        .funs = funs(ifelse(.=="", NA, as.character(.))))

To select individual columns to change:

dat <- dat %>%
mutate_at(vars(colnames(.)[names(.) %in% c("Age","Gender")]),
        .funs = funs(ifelse(.=="", NA, as.character(.))))

To select individual columns to skip:

dat <- dat %>%
mutate_at(vars(colnames(.)[!names(.) %in% c("Birthday")]),
        .funs = funs(ifelse(.=="", NA, as.character(.))))

0
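votes

Couldn't you just use

dat <- read.csv("data2.csv", na.strings=" ", header=TRUE)

This should convert all blanks to NA as the data is read in; make sure to put a space between the quotation marks.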
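Whether a single-space na.strings is enough depends on how the blanks are stored in the file. In the sketch below, on a made-up CSV string (not data2.csv), na.strings = " " catches the cell holding one space but leaves a truly empty character cell as "", so listing both spellings is the safer choice.

```r
# Hypothetical data: row 2 holds a single space, row 3 is truly empty.
csv_text <- "id,sex\n1,F\n2, \n3,\n4,M\n"

only_space <- read.csv(text = csv_text, na.strings = " ",
                       stringsAsFactors = FALSE)
both <- read.csv(text = csv_text, na.strings = c("", " "),
                 stringsAsFactors = FALSE)

print(is.na(only_space$sex))
print(is.na(both$sex))
```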
