Imputation with fancyimpute and pandas

Question    Votes: 13    Answers: 3

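I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing the median, mean, or most frequent value is not an option either (so imputation with pandas and/or scikit unfortunately doesn't do the trick).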

I came across what seems to be a neat package called fancyimpute (you can find it here). However, I have some problems with it.

Here is what I do:

# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float])

# I now run fancyimpute KNN, 
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

However, df_filled is somehow a single vector rather than a filled data frame. How do I get hold of the data frame with the imputations?

Update

I realized that fancyimpute needs a numpy array. I therefore converted df_numeric to an array using as_matrix().

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()

# I now run fancyimpute KNN, 
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))

The output is a data frame with the column labels missing. Is there any way to retrieve the labels?
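On current library versions the conversion step needs different spellings: the `np.float` alias was removed in NumPy 1.24 and `DataFrame.as_matrix()` was removed in pandas 1.0. Below is a minimal sketch of the same step, assuming a small made-up frame (the data and column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Illustrative data only; the column names are made up for this sketch.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, 5.0, 6.0],
    "label": ["x", "y", "z"],
})

# Select float columns by dtype name instead of the removed np.float alias.
df_numeric = df.select_dtypes(include=["float64"])

# to_numpy() replaces the removed as_matrix(); the labels are dropped here.
numeric_array = df_numeric.to_numpy()

print(type(numeric_array).__name__, numeric_array.shape)
print(list(df_numeric.columns))
```

Keeping `df_numeric` as a DataFrame and converting only at the call site leaves the column and index labels available for later.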

python python-3.x pandas imputation fancyimpute
3 Answers
2
votes
df=pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)

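The np.array returned by the .complete() method of the fancyimpute object (be it MICE or KNN) is fed in as the content (argument data=) of a pandas data frame whose columns and index are the same as those of the original data frame.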
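A minimal sketch of this pattern follows. To keep it runnable without fancyimpute installed, `fill_with_array` is a hypothetical stand-in that, like the solver call in the answer, takes the data and returns a bare NumPy array. It fills gaps with column means purely as a placeholder and is not the imputation the question asks for; in the answer's code the array comes from `mice.complete(d)`.

```python
import numpy as np
import pandas as pd

def fill_with_array(values):
    """Hypothetical stand-in for a fancyimpute solver call: ndarray out, no labels."""
    arr = np.array(values, dtype=float)       # copy, so the input is untouched
    col_means = np.nanmean(arr, axis=0)       # placeholder fill values only
    rows, cols = np.where(np.isnan(arr))
    arr[rows, cols] = col_means[cols]
    return arr

# Illustrative frame with a non-default index, so lost labels would be visible.
d = pd.DataFrame(
    {"height": [1.0, np.nan, 3.0], "weight": [10.0, 20.0, np.nan]},
    index=["r1", "r2", "r3"],
)

filled = fill_with_array(d)                   # bare array, labels gone

# Pass the original labels back in through the constructor.
df = pd.DataFrame(data=filled, columns=d.columns, index=d.index)
print(df)
```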


6
votes

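Add the following lines after your code (this assumes df_numeric is still a DataFrame, as in the first version of the question, not the array produced by as_matrix()):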

df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index
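Restoring the index as well as the columns matters when the frame's index is not the default 0..n-1, because pandas aligns on labels when the filled values are combined with other data. Here is a small sketch of that effect, using made-up data and a constant-filled array as a placeholder for the imputer output:

```python
import numpy as np
import pandas as pd

# Made-up frame whose index is not the default 0..n-1 (e.g. after filtering rows).
df_numeric = pd.DataFrame(
    {"a": [1.0, np.nan], "b": [np.nan, 4.0]},
    index=[10, 20],
)

# Placeholder for the array an imputer would return (NaN replaced by -1.0 here).
filled_array = np.nan_to_num(df_numeric.to_numpy(), nan=-1.0)

df_filled = pd.DataFrame(filled_array)
labels_before = (list(df_filled.index), list(df_filled.columns))

# The two lines from the answer: copy the labels back from the source frame.
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index

# With matching labels, label-aligned operations line up cell by cell.
was_missing = df_numeric.isnull()
print(labels_before)
print(df_filled[was_missing])   # keeps only the cells that were imputed
```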

4
votes

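I see the frustration with fancyimpute and pandas. Here is a fairly basic wrapper that overrides fit_transform and delegates to the parent class. It takes in and outputs a data frame, with the column names intact. These kinds of wrappers work well with pipelines.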

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100, max_rank=None, n_power_iterations=1,
                 init_fill_method="zero", min_value=None, max_value=None,
                 normalizer=None, verbose=True):

        # Note: verbose is accepted for signature compatibility, but the
        # parent class is always initialised with verbose=False.
        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value,
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters, max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value, max_value=max_value,
                                           normalizer=normalizer, verbose=False)



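    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

        # Work on a copy so the caller's frame is not modified.
        X = X.copy()

        # Columns with fewer than 10 missing values are filled with 0.0 up front.
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)

        # The parent class works on a bare array; re-attach the labels afterwards.
        z = super(SoftImputeDf, self).fit_transform(X.values)
        return pd.DataFrame(z, index=X.index, columns=X.columns)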
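The same idea can also be written without subclassing. The sketch below wraps any object that exposes `fit_transform(array)` and restores the labels afterwards. `DataFrameImputer` and `ZeroFill` are hypothetical names invented for this illustration, and `ZeroFill` merely stands in for a solver such as `SoftImpute()` so that the example runs without fancyimpute.

```python
import numpy as np
import pandas as pd

class ZeroFill:
    """Hypothetical stand-in solver: replaces NaN with 0.0, returns an ndarray."""

    def fit_transform(self, X, y=None):
        return np.nan_to_num(np.asarray(X, dtype=float), nan=0.0)

class DataFrameImputer:
    """Hypothetical wrapper: DataFrame in, DataFrame out, labels preserved."""

    def __init__(self, solver):
        self.solver = solver

    def fit_transform(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame")
        # The solver only sees the values; labels are re-attached afterwards.
        filled = self.solver.fit_transform(X.to_numpy())
        return pd.DataFrame(filled, index=X.index, columns=X.columns)

X = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 2.0]}, index=["u", "v"])
out = DataFrameImputer(ZeroFill()).fit_transform(X)
print(out)
```

Composition keeps the label handling independent of any one solver's constructor signature, at the cost of not being a drop-in subclass of the solver.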


2
votes

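I really appreciate @jander081's approach, and expanded on it a little to handle category columns. I had a problem where the categorical columns would lose their category dtype and cause errors during training, so I modified the code as follows: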

from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100, max_rank=None, n_power_iterations=1,
                 init_fill_method="zero", min_value=None, max_value=None,
                 normalizer=None, verbose=True):

        # Note: verbose is accepted for signature compatibility, but the
        # parent class is always initialised with verbose=False.
        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value,
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters, max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value, max_value=max_value,
                                           normalizer=normalizer, verbose=False)



    def fit_transform(self, X, y=None):

        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"

        # Work on a copy so the caller's frame is not modified.
        X = X.copy()

        # Columns with fewer than 10 missing values are filled with 0.0 up front.
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col] = X[col].fillna(0.0)

        z = super(SoftImputeDf, self).fit_transform(X.values)
        df = pd.DataFrame(z, index=X.index, columns=X.columns)

        # Re-apply the category dtype to the columns that had it in the input.
        cats = list(X.select_dtypes(include='category'))
        df[cats] = df[cats].astype('category')

        return df
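The dtype restoration step can be seen in isolation with plain pandas. In this sketch the data is made up, and the round trip through a bare float array stands in for the imputer call. The category dtype disappears once the frame is rebuilt from the array and is put back with `astype('category')`. Note that `astype('category')` derives the categories from the values present, which may differ from the original category set.

```python
import numpy as np
import pandas as pd

# Made-up frame: one float column and one numerically coded category column.
X = pd.DataFrame({"score": [0.5, 1.5, 2.5], "group": [0, 1, 0]})
X["group"] = X["group"].astype("category")

# Round trip through a bare float array, as happens inside fit_transform.
z = np.asarray(X.values, dtype=float)
df = pd.DataFrame(z, index=X.index, columns=X.columns)
dtype_after_round_trip = str(df["group"].dtype)

# Re-apply the category dtype to the columns that had it in the input.
cats = list(X.select_dtypes(include="category"))
df[cats] = df[cats].astype("category")

print(dtype_after_round_trip, "->", df["group"].dtype)
```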
