I am using an Extra Trees Classifier to find the feature importances in a dataset of 13 columns and roughly 10 million rows. I already ran an Elliptic Envelope and an Isolation Forest on it with no trouble, and the file itself is under 10 GB. When I ran the code below in a Jupyter notebook it gave me a memory error, even with low_memory=True set. I also tried Google Colab, which has about 25 GB of RAM, and it still crashed, so I am quite confused now.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import ExtraTreesClassifier
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
# Loading First Dataframe
link = '...'
fluff, id = link.split('=')
print(id) # Verify that you have everything after '='
downloaded = drive.CreateFile({'id':id})
downloaded.GetContentFile('Final After Simple Filtering.csv')
df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)
#df = df.astype(float)
ExtraT = ExtraTreesClassifier(n_estimators = 100,bootstrap=False,n_jobs=1)
y=df['Power_kW']
del df['Power_kW']
X=df
ExtraT.fit(X,y)
feature_importance = ExtraT.feature_importances_
feature_importance_normalized = np.std([tree.feature_importances_ for tree in ExtraT.estimators_], axis = 0) # std of each feature's importance across trees
plt.bar(X.columns, feature_importance)
plt.xlabel('Feature')
plt.ylabel('Feature Importance')
plt.title('Parameters Importance')
plt.show()
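For what it's worth, note that `low_memory=True` only affects how `read_csv` parses the file, not how much RAM the final DataFrame occupies. A minimal sketch of one way to shrink the frame before fitting, by downcasting numeric columns with `pd.to_numeric` (the frame here is a small hypothetical stand-in for my real 13-column CSV):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: read_csv defaults to float64.
df = pd.DataFrame({
    'Power_kW': np.random.rand(1000) * 100,
    'Feature_1': np.random.rand(1000) * 25,
})

before = df.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest safe float type
# (float64 -> float32 here), roughly halving the frame's footprint.
for col in df.columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

after = df.memory_usage(deep=True).sum()
print(before, after)
```

On 10 million rows the same loop would cut the frame from 13 float64 columns to float32 ones before `ExtraT.fit(X, y)` is called.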
Thanks