KNN imputer with nominal, ordinal and numerical variables


I have the following data:

# Libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics.pairwise import nan_euclidean_distances

# Data set
toy_example = pd.DataFrame(data = {"Color": ["Blue", "Red", "Green", "Blue", np.nan],
                                   "Size": ["S", "M", "L", np.nan, "S"],
                                   "Weight": [10, np.nan, 15, 12, np.nan],
                                   "Age": [2, 4, np.nan, 3, 1]})
toy_example

I want to impute the variables Color (nominal), Size (ordinal), Weight (numerical) and Age (numerical), using the KNN imputer from sklearn.impute.KNNImputer with the nan_euclidean distance metric.
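For context, here is a minimal sketch of what the nan_euclidean metric computes. It skips coordinates that are missing in either row and rescales the squared distance by (total coordinates / shared coordinates). The two rows are made-up values for illustration, not taken from toy_example.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Two illustrative rows with four features; only features 0 and 3
# are observed in both rows.
a = [3.0, np.nan, np.nan, 6.0]
b = [1.0, np.nan, 4.0, 5.0]

# Distance as computed by scikit-learn.
d_lib = nan_euclidean_distances([a], [b])[0, 0]

# Manual computation: squared differences over the shared coordinates,
# rescaled by (total coordinates / shared coordinates) = 4 / 2.
shared_sq = (3.0 - 1.0) ** 2 + (6.0 - 5.0) ** 2
d_manual = np.sqrt(4 / 2 * shared_sq)

print(d_lib, d_manual)
```

Because of the rescaling, rows that share few observed features are not automatically treated as close.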

I first need to preprocess the data, so I came up with the following two solutions.

a. One-hot encoding of the nominal variable, with NaN values encoded as their own category

# Preprocessing the data
color_encoder = OneHotEncoder()
color_encoder.fit(X=toy_example[["Color"]])
## Checking categories and names
### A na dummy is included by default
color_encoder.categories_
color_encoder.get_feature_names_out()

# Create a new DataFrame with the one-hot encoded "Color" column
color_encoded = pd.DataFrame(color_encoder.transform(toy_example[["Color"]]).toarray(),
                             columns=color_encoder.get_feature_names_out(["Color"]))
color_encoded

# Create a dictionary to map the ordinal values of the "Size" column to numerical values
size_map = {"S": 1, "M": 2, "L": 3}
size_map
toy_example["Size"] = toy_example["Size"].map(size_map)

# Concatenate encoded variables with numerical variables
preprocessed_data = pd.concat([color_encoded, toy_example[["Size", "Weight", "Age"]]], 
                              axis=1)
preprocessed_data

## Matrix of euclidean distances
matrix_nan_euclidean = nan_euclidean_distances(X=preprocessed_data)
matrix_nan_euclidean

# Perform nearest neighbors imputation
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(preprocessed_data), 
                          columns=preprocessed_data.columns)
## Problem: the NaN value of "Color" in the 5th row is not imputed
### I was expecting a 0 in Color_nan and a positive value
### in one of the columns Color_Blue, Color_Green, Color_Red
imputed_df 

As I noted in the code comments, this solution does not work for the nominal variable, because the result below leaves it unimputed:

   Color_Blue  Color_Green  Color_Red  Color_nan  Size  Weight  Age
0         1.0          0.0        0.0        0.0   1.0    10.0  2.0
1         0.0          0.0        1.0        0.0   2.0    13.5  4.0
2         0.0          1.0        0.0        0.0   3.0    15.0  2.5
3         1.0          0.0        0.0        0.0   1.5    12.0  3.0
4         0.0          0.0        0.0        1.0   1.0    12.5  1.0
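A likely explanation for this result, sketched here under the assumption that the dummy columns look like the ones above (they are written out by hand rather than produced by OneHotEncoder): KNNImputer only fills cells that are NaN, and once the missing color is encoded as Color_nan = 1 the dummy columns contain no NaN at all.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Dummy columns as in approach a, typed by hand for illustration.
dummies = pd.DataFrame({
    "Color_Blue":  [1.0, 0.0, 0.0, 1.0, 0.0],
    "Color_Green": [0.0, 0.0, 1.0, 0.0, 0.0],
    "Color_Red":   [0.0, 1.0, 0.0, 0.0, 0.0],
    "Color_nan":   [0.0, 0.0, 0.0, 0.0, 1.0],
})
numeric = pd.DataFrame({"Weight": [10, np.nan, 15, 12, np.nan],
                        "Age": [2, 4, np.nan, 3, 1]})
data = pd.concat([dummies, numeric], axis=1)

# Count the missing cells per column: the dummy columns have none.
missing_per_column = data.isna().sum()
print(missing_per_column)

# The imputer therefore returns the dummy columns unchanged.
imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(data),
                       columns=data.columns)
dummies_unchanged = imputed[dummies.columns].equals(dummies)
print(dummies_unchanged)
```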

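For the ordinal variable the values are at least imputed, although I still need to decide which rounding method to apply (classical rounding, ceiling or floor).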
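To make the rounding choice concrete, here is a small sketch that decodes the imputed Size column from the output above back to labels under each rule. The helper name decode is hypothetical, and which rule is appropriate remains a modelling decision.

```python
import numpy as np
import pandas as pd

size_map = {"S": 1, "M": 2, "L": 3}
inverse_size_map = {code: label for label, code in size_map.items()}

# Imputed Size column from the output above; row 3 received 1.5.
imputed_size = pd.Series([1.0, 2.0, 3.0, 1.5, 1.0])

def decode(values, rounding):
    """Round with the given NumPy function, clip to valid codes, map to labels."""
    codes = rounding(values).clip(1, 3).astype(int)
    return codes.map(inverse_size_map).tolist()

floor_labels = decode(imputed_size, np.floor)
ceil_labels = decode(imputed_size, np.ceil)
# np.rint rounds halves to the nearest even integer (1.5 -> 2, 2.5 -> 2).
rint_labels = decode(imputed_size, np.rint)

print(floor_labels, ceil_labels, rint_labels)
```

Only row 3 is affected here: floor gives S, while ceiling and np.rint give M.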

b. One-hot encoding of the nominal variable, where NaN values are not encoded as a category and the remaining dummy variables are set to NaN for that observation
# Preprocessing the data
color_encoder = OneHotEncoder()
color_encoder.fit(X=toy_example[["Color"]])
## Checking categories and names
### A na dummy is included by default
color_encoder.categories_
color_encoder.get_feature_names_out()

# Create a new DataFrame with the one-hot encoded "Color" column
color_encoded = pd.DataFrame(color_encoder.transform(toy_example[["Color"]]).toarray(),
                             columns=color_encoder.get_feature_names_out(["Color"]))
color_encoded
## Don't take into account the nan values as a separate category 
color_encoded = color_encoded.loc[:, "Color_Blue":"Color_Red"]
## Because I don't know in advance the values of the dummy variables
## I will replace them with NaN values which is a logical solution taking
## into account that I don't know the value of this observation in relation
## to the "Color" variable
color_encoded.iloc[4, :] = np.nan
color_encoded

# Create a dictionary to map the ordinal values of the "Size" column to numerical values
size_map = {"S": 1, "M": 2, "L": 3}
size_map
toy_example["Size"] = toy_example["Size"].map(size_map)

# Concatenate encoded variables with numerical variables
preprocessed_data = pd.concat([color_encoded, toy_example[["Size", "Weight", "Age"]]], 
                              axis=1)
preprocessed_data

## Matrix of euclidean distances
matrix_nan_euclidean = nan_euclidean_distances(X=preprocessed_data)
matrix_nan_euclidean

# Perform nearest neighbors imputation
imputer = KNNImputer(n_neighbors=2)
imputed_df = pd.DataFrame(imputer.fit_transform(preprocessed_data), 
                          columns=preprocessed_data.columns)
## Problem: for the 5th row I would need to decide how to
## round the values (classical rounding, ceiling or floor).
## However, all of these methods are inconsistent, because an
## observation cannot be Blue and Green at the same time,
## yet it must be exactly one of Blue, Green or Red
imputed_df

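As I noted in the code comments, this solution does not work for the nominal variable either, because in the result below the nominal variable takes either two values at once or none, depending on how it is rounded: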

   Color_Blue  Color_Green  Color_Red  Size  Weight  Age
0         1.0          0.0        0.0   1.0    10.0  2.0
1         0.0          0.0        1.0   2.0    13.5  4.0
2         0.0          1.0        0.0   3.0    15.0  3.5
3         1.0          0.0        0.0   1.5    12.0  3.0
4         0.5          0.5        0.0   1.0    12.5  1.0
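One possible post-processing step, shown only as a sketch and not as a principled fix, is to collapse the fractional dummies to a single category by taking the column with the largest value. The tie in row 4 (0.5 versus 0.5) is then resolved arbitrarily by column order, which illustrates the inconsistency rather than removing it.

```python
import pandas as pd

# Imputed dummy columns copied from the output above.
dummies = pd.DataFrame({
    "Color_Blue":  [1.0, 0.0, 0.0, 1.0, 0.5],
    "Color_Green": [0.0, 0.0, 1.0, 0.0, 0.5],
    "Color_Red":   [0.0, 1.0, 0.0, 0.0, 0.0],
})

# idxmax returns the first column holding the row maximum.
winner = dummies.idxmax(axis=1)
color = winner.str.replace("Color_", "", regex=False)

# Flag rows where more than one column shares the maximum.
is_tie = dummies.eq(dummies.max(axis=1), axis=0).sum(axis=1) > 1

print(color.tolist())
print(is_tie.tolist())
```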

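Given that neither a. nor b. works, does anyone know how to impute a nominal variable in a consistent way using multivariate imputation?

In other words, for a nominal variable, how can I use multivariate imputation to impute the observations of toy_example?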
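One direction that avoids averaging dummy variables, offered as a rough sketch rather than an established recipe: find the nearest neighbours with nan_euclidean distances on the numeric and ordinal columns, then fill the nominal value by majority vote among those neighbours. The helper impute_nominal_by_vote is hypothetical (it is not part of scikit-learn), the features are left unscaled, and k=3 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

toy = pd.DataFrame({
    "Color": ["Blue", "Red", "Green", "Blue", np.nan],
    "Size": [1, 2, 3, np.nan, 1],  # S/M/L already mapped to 1/2/3
    "Weight": [10, np.nan, 15, 12, np.nan],
    "Age": [2, 4, np.nan, 3, 1],
})

def impute_nominal_by_vote(df, target, features, k):
    """Hypothetical helper: fill `target` with the most frequent value
    among the k nearest rows that have `target` observed."""
    result = df[target].copy()
    distances = nan_euclidean_distances(df[features])
    donors = np.flatnonzero(df[target].notna().to_numpy())
    for row in np.flatnonzero(df[target].isna().to_numpy()):
        # Sort donors by distance to the incomplete row (NaN sorts last).
        order = np.argsort(distances[row, donors], kind="stable")
        nearest = donors[order[:k]]
        # Majority vote; mode() returns sorted values, so ties go to the first.
        result.iloc[row] = df[target].iloc[nearest].mode().iloc[0]
    return result

imputed_color = impute_nominal_by_vote(toy, "Color", ["Size", "Weight", "Age"], k=3)
print(imputed_color.tolist())
```

With this data, the three nearest donors of the 5th row are rows 0, 2 and 3 (Blue, Green, Blue), so the vote yields a single valid category.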

python scikit-learn imputation