I'm learning imputation by trying to use R and missRanger to impute missing values in variables that must be integers. However, I get the following error:
## Error: Assigned data `if (...) NULL` must be compatible with existing data.
## i Error occurred for column `beds`.
## x Can't convert from <double> to <integer> due to loss of precision.
## * Locations: 1, 2.
It seems I can't impute integer values, but I can if I convert them to doubles first.
Here is a reprex:
library(tidyverse)
library(missRanger)
# Here is a sample of the data
reprex_df
## # A tibble: 9 x 5
## beds baths garages price property_type
## <int> <int> <int> <int> <chr>
## 1 NA NA NA 770000 house
## 2 2 1 0 300000 apartment
## 3 2 2 2 735000 apartment
## 4 NA NA NA 550000 apartment
## 5 4 2 3 500000 house
## 6 2 1 0 400000 apartment
## 7 4 2 2 607000 house
## 8 3 2 2 590000 house
## 9 4 1 2 710000 house
# Try to impute missing bedrooms
imputed <- reprex_df %>%
  missRanger()
##
## Missing value imputation by random forests
##
## Variables to impute: beds, baths, garages
## Variables used to impute: beds, baths, garages, price, property_type
## iter 1:
## Error: Assigned data `if (...) NULL` must be compatible with existing data.
## i Error occurred for column `beds`.
## x Can't convert from <double> to <integer> due to loss of precision.
## * Locations: 1, 2.
# Convert integers to numerics and try again
imputed2 <- reprex_df %>%
  mutate_if(is.integer, as.numeric) %>%
  missRanger()
##
## Missing value imputation by random forests
##
## Variables to impute: beds, baths, garages
## Variables used to impute: beds, baths, garages, price, property_type
## iter 1: ...
## iter 2: ...
## iter 3: ...
## iter 4: ...
## iter 5: ...
# That works, but decimal rooms don't make sense
imputed2
## # A tibble: 9 x 5
## beds baths garages price property_type
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3.44 1.86 2.15 770000 house
## 2 2 1 0 300000 apartment
## 3 2 2 2 735000 apartment
## 4 2.77 1.83 1.84 550000 apartment
## 5 4 2 3 500000 house
## 6 2 1 0 400000 apartment
## 7 4 2 2 607000 house
## 8 3 2 2 590000 house
## 9 4 1 2 710000 house
How can I impute missing integers with missRanger?
Calling the dataset "reprex" does not make the example reproducible...
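One way the sample could have been made reproducible is to build it in code. This is only a sketch: it assumes the nine printed rows are all that is needed to trigger the error, and it uses `tribble()` from the tibble package, which the question does not mention.

```r
library(tibble)

# Rebuild the nine rows printed in the question.
# The L suffix and NA_integer_ keep the numeric columns as <int>.
reprex_df <- tribble(
  ~beds,       ~baths,      ~garages,    ~price,  ~property_type,
  NA_integer_, NA_integer_, NA_integer_, 770000L, "house",
  2L,          1L,          0L,          300000L, "apartment",
  2L,          2L,          2L,          735000L, "apartment",
  NA_integer_, NA_integer_, NA_integer_, 550000L, "apartment",
  4L,          2L,          3L,          500000L, "house",
  2L,          1L,          0L,          400000L, "apartment",
  4L,          2L,          2L,          607000L, "house",
  3L,          2L,          2L,          590000L, "house",
  4L,          1L,          2L,          710000L, "house"
)
```

Alternatively, `dput(reprex_df)` prints code that recreates an existing object exactly.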
Since missRanger cannot change how tibbles react to type conversions internally, here are two suggestions:
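1. Convert the tibble to a data.frame before calling missRanger, or (this is my favorite)
2. Use the argument pmm.k to apply predictive mean matching between iterations. This has the nice side effect of filling the gaps with realistic values: integers stay integers, and so on.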
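A minimal sketch of the first suggestion, for completeness. It uses mtcars as stand-in data rather than the question's data, and the final rounding step is an assumption of this sketch, not something missRanger does for you: without predictive mean matching the imputed values may be non-integer doubles, so they have to be rounded back by hand.

```r
library(missRanger)

# Stand-in data: a plain data.frame with one integer column and some NAs
df <- mtcars
df$cyl <- as.integer(df$cyl)
df <- generateNA(df, p = 0.1, seed = 1)

# Remember which columns were integer before imputation
int_cols <- names(df)[vapply(df, is.integer, logical(1))]

# A data.frame does not raise tibble's type-conversion error
# when a double is assigned into an integer column
imp <- missRanger(as.data.frame(df), seed = 153, verbose = 0)

# Round and restore the integer type (assumption: rounding is acceptable here)
imp[int_cols] <- lapply(imp[int_cols], function(x) as.integer(round(x)))
```

Rounding can produce values that never occur in the observed data, which is one reason to prefer pmm.k.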
The missRanger vignettes explain these concepts, see https://cran.r-project.org/web/packages/missRanger/index.html
Disclaimer: I am the package maintainer of missRanger.
library(missRanger)
library(tidyverse)
# Example data
mtcars2 <- mtcars %>%
  as_tibble() %>%
  mutate(cyl = as.integer(cyl)) %>%
  generateNA()
missRanger(mtcars2, pmm.k = 3, seed = 153)
# Gives
# # A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 21 6 160 105 3.9 2.62 16.5 0 1 4 4
# 21 6 160 110 3.9 2.88 17.0 0 1 4 4
# 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 21.4 6 258 110 3.08 3.22 19.4 1 0 3 2
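To check that the integer column really survives, something like the following sketch can be run. It repeats the setup without the pipe so that it is self-contained; the seed passed to `generateNA()` is an assumption added for repeatability. With predictive mean matching, every imputed value is taken from the observed values of the same column, so all three checks are expected to return `TRUE`.

```r
library(missRanger)
library(tibble)

# Same idea as the example above; the generateNA() seed is an assumption
mtcars2 <- as_tibble(mtcars)
mtcars2$cyl <- as.integer(mtcars2$cyl)
mtcars2 <- generateNA(mtcars2, seed = 1)

imp <- missRanger(mtcars2, pmm.k = 3, seed = 153, verbose = 0)

# The column type is preserved
type_ok <- is.integer(imp$cyl)

# Every value, imputed or not, occurs among the observed values
values_ok <- all(imp$cyl %in% na.omit(mtcars2$cyl))

# No missing values are left
complete_ok <- !anyNA(imp)

c(type_ok, values_ok, complete_ok)
```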