How can I train a random forest classifier on a large dataset in Python without running into memory errors?

Question

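I have a dataset with 30 million rows. It has two columns: one holds a 0 or 1 label, and the other holds a list of 1280 features per row (181 GB in total). All I want to do is feed this dataset into a random forest algorithm, but it runs out of memory and crashes (I tried with 400 GB of RAM, and it still crashes).

After loading the dataset I have to do some manipulation on it, because it is in Hugging Face Arrow format: https://huggingface.co/docs/datasets/en/about_arrow (I suspect this is what uses so much RAM).

I know I could apply some dimensionality reduction to the dataset, but are there any changes I should make to my code to reduce RAM usage?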

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, auc
from datasets import load_dataset, Dataset

# Load dataset
df = Dataset.from_file("data.arrow")
df = pd.DataFrame(df)
X = df['embeddings'].to_numpy() # Convert Series to NumPy array
X = np.array(X.tolist()) # Convert list of arrays to a 2D NumPy array
X = X.reshape(X.shape[0], -1) # Flatten the 3D array into a 2D array
y = df['labels']

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the random forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
rf_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)

# Evaluate the classifier

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Predict class-1 probabilities; AUC should be computed from scores, not hard labels
y_pred_proba = rf_classifier.predict_proba(X_test)[:, 1]

# Calculate AUC score
auc_score = roc_auc_score(y_test, y_pred_proba)
print("AUC Score:", auc_score)

# Save the metrics to a text file
with open("metrics.txt", "w") as f:
    f.write("Accuracy: " + str(accuracy) + "\n")
    f.write("AUC Score: " + str(auc_score))

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")

# Save ROC curve plot to an image file
plt.savefig('roc_curve.png')

# Close plot to free memory
plt.close()

Answer 1

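Some ideas:

Model ensembling

Train N models (you have to choose N depending on RAM usage), each on a separate part of the data.

Then ensemble the models: run inference with each model's predict_proba(x) method and average the predictions.

This may give better, worse, or the same accuracy as a single model; if N is not very large, it should not make much difference.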
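The ensembling idea above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the helper names train_chunk_models and ensemble_predict_proba are made up for this example, the data is synthetic, and the chunks are sliced from an in-memory array only to keep the example short. With the real dataset, each chunk would be loaded from disk one at a time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def train_chunk_models(chunks, **rf_params):
    """Fit one forest per (X, y) chunk (hypothetical helper).

    `chunks` can be a generator, so only one chunk has to be in RAM at a time.
    Every chunk must contain both classes, otherwise predict_proba
    would return a single column for that model.
    """
    models = []
    for X_chunk, y_chunk in chunks:
        model = RandomForestClassifier(**rf_params)
        model.fit(X_chunk, y_chunk)
        models.append(model)
    return models


def ensemble_predict_proba(models, X):
    """Average the class-1 probabilities of all chunk models (hypothetical helper)."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in for the real data (assumption: binary labels, float features)
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X = X.astype(np.float32)
X_train, y_train = X[:2400], y[:2400]
X_test, y_test = X[2400:], y[2400:]

n_models = 4  # pick N so that one chunk plus one forest fits in RAM
chunks = zip(np.array_split(X_train, n_models), np.array_split(y_train, n_models))
models = train_chunk_models(chunks, n_estimators=25, random_state=42)

proba = ensemble_predict_proba(models, X_test)
y_pred = (proba >= 0.5).astype(int)
print("Ensemble accuracy:", (y_pred == y_test).mean())
```

Because the averaged output is a probability, it can be passed directly to roc_auc_score or roc_curve in place of a single model's predict_proba result.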

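Fork of scikit-learn

Fork scikit-learn and replace every loop over the input training data with a custom loop that loads the data from disk instead of from RAM.

This is a hard or very hard, lengthy approach, and I am not sure what problems you would run into along the way. In terms of difficulty, the only thing worse would be writing a random forest from scratch.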

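Other ideas

RAM usage can be lowered by reducing max_depth, n_estimators, max_features, etc. Note that these will affect your model's accuracy (possibly in a positive way! But to find out, you have to compare the results...)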
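As a rough illustration of the effect of these parameters, the sketch below compares the total number of tree nodes in an unconstrained forest with a constrained one. The node count is only a proxy for the size of the fitted model, not a measurement of peak RAM during training, and the data and parameter values are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def total_nodes(forest):
    """Sum the node counts of all trees in a fitted forest (hypothetical helper)."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


# Small synthetic dataset, used only for illustration
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Trees grown until the leaves are pure (default max_depth=None)
full = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Fewer and shallower trees
small = RandomForestClassifier(
    n_estimators=20, max_depth=8, random_state=42
).fit(X, y)

print("unconstrained forest nodes:", total_nodes(full))
print("constrained forest nodes:  ", total_nodes(small))
```

Comparing accuracy on a held-out set for both settings shows whether the smaller forest costs anything in practice.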

Convert the data to float32 (data.astype(np.float32)), or maybe even to int16 if you scale and convert it properly?
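For scale: a dense 30,000,000 x 1280 array takes about 307 GB as float64, 154 GB as float32, and 77 GB as int16. The sketch below shows both conversions on a small random array. The int16 scaling scheme (mapping the largest absolute value to 32767) is just one possible choice made for this example, and, as far as I know, scikit-learn's tree-based estimators cast dense input to float32 when fitting, so int16 would mainly help with storage and loading; treat that as an assumption to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
X64 = rng.normal(size=(1000, 1280))  # NumPy defaults to float64

# Halve the memory footprint
X32 = X64.astype(np.float32)

# Quarter the footprint: scale into the int16 range, then round.
# Assumed scheme: largest absolute value maps to 32767.
scale = 32767.0 / np.abs(X64).max()
X16 = np.round(X64 * scale).astype(np.int16)

print("float64 bytes:", X64.nbytes)
print("float32 bytes:", X32.nbytes)
print("int16 bytes:  ", X16.nbytes)

# Worst-case quantization error after mapping back to the original scale
max_err = np.abs(X16 / scale - X64).max()
print("max int16 round-trip error:", max_err)
```

The same scale factor has to be stored and applied to any data passed to the model at prediction time.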

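Maybe you will find this useful: if there are samples that are very close to each other (I leave the distance metric to you), compute the average of those samples and replace them with that average. Also give the averaged sample sample_weight = number_of_averaged_samples.

Remove low-importance features: train a model on a random part of the dataset and look at the model's feature_importances_. Then skip those features the next time you load the dataset.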
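The feature-removal idea can be sketched like this. The subset size and the number of features to keep (keep_n) are arbitrary values chosen for the example, and the data is synthetic; with the real dataset, only the probe subset would need to be loaded with all 1280 features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 40 features, only a few of them informative
X, y = make_classification(
    n_samples=5000, n_features=40, n_informative=5, random_state=0
)

# Train a probe model on a random part of the data
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=1000, replace=False)
probe = RandomForestClassifier(n_estimators=50, random_state=42)
probe.fit(X[subset], y[subset])

# Keep the indices of the most important features (keep_n is a hypothetical choice)
keep_n = 10
ranked = np.argsort(probe.feature_importances_)[::-1]
keep_idx = np.sort(ranked[:keep_n])
print("kept feature indices:", keep_idx)

# On the next load, read only these columns
X_reduced = X[:, keep_idx]
print("reduced shape:", X_reduced.shape)
```

Saving keep_idx to disk (for example with np.save) lets later runs select the same columns without retraining the probe model.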
