MissForest fit_transform takes too long to impute missing values


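I have a dataset here, but the file was too large, so I had to split it into multiple "output.csv" files. The part uploaded to this GitHub repository is about one third of the full dataset, but it should give you an idea of the data: https://github.com/chongochoo/dataset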

I am trying to impute the missing values of "suspect_age". I prepare the DataFrame for imputation, but when I call the imputer:

# Miss Forest
imputer = MissForest() #miss forest
X_imputed = imputer.fit_transform(df_fitted)
X_imputed = pd.DataFrame(X_imputed, columns = df_fitted.columns).round(1)

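The call to fit_transform() takes a very, very long time, and I never even reach the point of getting imputed values. I am not sure how to fix this, because I do not think my dataset is that complex; it is actually one of those Kaggle toy datasets.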
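One way to narrow this down might be to time the imputer on progressively larger random samples of rows and see how the runtime grows. The following is only a sketch: `time_imputer_on_samples` and `mean_fill` are hypothetical helper names, and the column-mean fill is a stand-in so the snippet runs without missingpy. With missingpy installed, the stand-in could be swapped for something like `lambda d: MissForest(n_estimators=10, max_iter=2).fit_transform(d)`, assuming those constructor parameters behave as documented.

```python
import time

import numpy as np
import pandas as pd


def time_imputer_on_samples(impute_fn, frame, sizes, seed=0):
    """Return {n_rows: seconds} for impute_fn run on random row samples."""
    timings = {}
    for n in sizes:
        n = min(n, len(frame))  # never ask for more rows than exist
        sample = frame.sample(n=n, random_state=seed)
        start = time.perf_counter()
        impute_fn(sample)
        timings[n] = time.perf_counter() - start
    return timings


def mean_fill(d):
    # Stand-in imputer (assumption): fill each column with its mean.
    return d.fillna(d.mean())


# Synthetic demo data, not the real dataset.
demo = pd.DataFrame(
    np.random.default_rng(0).normal(size=(1000, 4)), columns=list("abcd")
)
demo.loc[::7, "a"] = np.nan  # blank out every 7th value of column "a"

timings = time_imputer_on_samples(mean_fill, demo, sizes=[100, 500, 1000])
for n_rows, seconds in timings.items():
    print(f"{n_rows} rows: {seconds:.4f} s")
```

If the time on a few thousand rows is already large, extrapolating to the full DataFrame gives a rough idea of whether the call is stuck or simply slow.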

Here is the full code:

import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.neighbors._base

# Workaround: missingpy imports the old module path 'sklearn.neighbors.base',
# so alias it to the renamed private module before importing missingpy.
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
#from sklearn.neighbors import  KNeighborsClassifier
#from sklearn.impute import KNNImputer


df = pd.read_csv('C:/Users/Connie/PycharmProjects/507-FinalProject/data-preprocessed10.csv', sep=',')

#print(100*(df.isnull().sum())/len(df.index))


# Drop columns that carry no useful information for the imputation
df_fitted = df.drop(['incident_id','address','incident_url','source_url','incident_url_fields_missing',
                     'congressional_district', 'gun_stolen', 'gun_type', 'incident_characteristics',
                     'latitude', 'longitude', 'location_description', 'notes', 'participant_age',
                     'participant_age_group', 'participant_gender', 'participant_name', 'participant_relationship',
                     'participant_status', 'participant_type', 'sources', 'state_house_district',
                     'state_senate_district', 'zipcode', 'city_or_county', 'suspect_gender'],axis=1)

# Split the 'YYYY-MM-DD' date string into year, month, and day columns
# in one vectorized step instead of looping over rows with iterrows().
df_fitted[['year', 'month', 'day']] = df_fitted['date'].astype(str).str.split('-', expand=True)

df_fitted = df_fitted.drop(columns=['date'])
#df_fitted['suspect_gender'] = df_fitted.suspect_gender.map({'Male':0, 'Female':1, 'Gender Unknown':2})
# Encode the age groups as integers; values not in the mapping (including NaN) stay NaN
df_fitted['suspect_age'] = df_fitted.suspect_age.map({'Adult 18+':0, 'Teen 12-17':1, 'Child 0-11':2, '':3})
df_fitted = pd.get_dummies(df_fitted, columns=['state'], drop_first=True)


# Miss Forest
imputer = MissForest() #miss forest
X_imputed = imputer.fit_transform(df_fitted)
X_imputed = pd.DataFrame(X_imputed, columns = df_fitted.columns).round(1)


#df_transform = imputer.fit_transform(df_fitted[['suspect_age']])

print(X_imputed.head(5))
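For context on why the call can be slow: a MissForest-style imputer fits one random forest per column that has missing values, and repeats this for each iteration until it stops. A rough upper bound on that work can be estimated before calling fit_transform(). The sketch below uses a hypothetical helper, `missforest_workload`, and assumes defaults of `max_iter=10` and `n_estimators=100`; check these against the installed missingpy version.

```python
import numpy as np
import pandas as pd


def missforest_workload(frame, max_iter=10, n_estimators=100):
    """Rough upper bound on the work a MissForest-style imputer performs.

    max_iter and n_estimators are assumed defaults, not read from missingpy.
    """
    cols_with_missing = [c for c in frame.columns if frame[c].isna().any()]
    # One forest per column with missing values, per iteration.
    forest_fits = len(cols_with_missing) * max_iter
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "columns_with_missing": cols_with_missing,
        "max_forest_fits": forest_fits,
        "max_trees": forest_fits * n_estimators,
    }


# Tiny synthetic frame, not the real dataset.
demo = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0], "c": [np.nan, 0.0, 1.0]}
)
report = missforest_workload(demo)
print(report)
```

Running `missforest_workload(df_fitted)` on the real data would show how many columns (including the one-hot encoded `state` columns) each forest has to train on, and how many columns besides `suspect_age` still contain missing values.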
python scikit-learn data-science random-forest sklearn-pandas