Accuracy score: true changes contained in the predictions. Speeding up the code

Question

I am running a multi-label prediction model. As a performance measure, I am checking whether the model's top predictions contain the true cases where y=1.

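For example, if the model's top predictions for a data point are yellow (90%), green (80%), and red (75%), and the truth is green and red, I count this as a "correct" prediction, whereas a measure such as (exact) accuracy would count it as incorrect.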
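To make the definition concrete, here is a minimal sketch of the check for that single data point. It is only an illustration, not the implementation in question: the fourth label ("blue" at 10%) and the 0.5 threshold used for the exact-match comparison are assumptions added for the example.

```python
import numpy as np

# Labels and predicted probabilities for one data point.
# "blue" (10%) is a made-up extra label so that the top 3 is a real selection.
labels = np.array(["yellow", "green", "red", "blue"])
probs = np.array([0.90, 0.80, 0.75, 0.10])
truth = np.array([0, 1, 1, 0])  # the truth is green and red

n_top = 3
top_idx = np.argsort(-probs)[:n_top]      # indices of the n_top highest probabilities
top_labels = set(labels[top_idx])         # yellow, green, red
true_labels = set(labels[truth == 1])     # green, red

# "Correct" under the coverage measure: every true label is among the top predictions
covered = true_labels.issubset(top_labels)

# Exact-match accuracy, assuming a 0.5 decision threshold (an assumption of this sketch)
exact = np.array_equal((probs >= 0.5).astype(int), truth)

print(f"covered by top {n_top}: {covered}, exact match: {exact}")
```

The coverage check passes because green and red are both in the top 3, while the exact match fails because yellow is also predicted.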

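Below is my implementation, with a practical example using large X and y matrices (with many columns).

However, my implementation is too slow (about 2 minutes on my laptop), and I would really appreciate tips for speeding it up, or a completely different solution. Thanks!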

from scipy.sparse import random
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time

np.random.seed(14)

## Generate sparse X, and y
X = random(100_000, 1000, density=0.01, format='csr')
y = pd.DataFrame(np.random.choice([0, 1], size=(100_000, 10)))
# Define no change as 0 in all rows
y['no_change'] = np.where(y.sum(axis=1) == 0, 1, 0)

dt = DecisionTreeClassifier(max_depth=15)
dt.fit(X, y)

# Print precise accuracy -- truth must precisely match prediction
print(f"Accuracy score (precise): {accuracy_score(y_true=y, y_pred=dt.predict(X=X)):.1%}")

# Get top n predictions based on probability (in case of equality keep all)
def top_n_preds(row, n_top):
    topcols = row[row > 0].nlargest(n=n_top, keep='all')
    top_colnames = topcols.index.tolist()
    return top_colnames

start = time.time()
# Retrieve probabilities of predictions
pred_probs = np.asarray(dt.predict_proba(X=X))
pred_probs = pd.DataFrame(pred_probs[:, :, 1].T, columns=y.columns)

# Find top 5 predictions
pred_probs['top_preds'] = pred_probs.apply(top_n_preds, axis=1, n_top=5)
# List all real changes in y
pred_probs['real_changes'] = y.apply(lambda row: row[row == 1].index.tolist(), axis=1)
# Check if real changes are contained in top 5 predictions
pred_probs['preds_cover_reality'] = pred_probs.apply(lambda row: set(row['real_changes']).issubset(set(row['top_preds'])), axis=1)

print(f"Accuracy present in top n_top predictions: {pred_probs['preds_cover_reality'].sum() / y.shape[0]:.1%}")
print(f"Time elapsed: {(time.time()-start)/60:.1f} minutes")

Answer 1

apply() is not that fast. Avoid apply() wherever possible and use numpy operations instead:

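import time
import numpy as np
from scipy.sparse import random
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

np.random.seed(14)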
X = random(100_000, 1000, density=0.01, format='csr')
y = np.random.choice([0, 1], size=(100_000, 10))
y_no_change = np.where(y.sum(axis=1) == 0, 1, 0)
y = np.c_[y, y_no_change]

dt = DecisionTreeClassifier(max_depth=15)
dt.fit(X, y)

print(f"Accuracy score (precise): {accuracy_score(y_true=y, y_pred=dt.predict(X=X)):.1%}")

start = time.time()

# Retrieve probabilities: for multi-output targets, predict_proba returns
# one (n_samples, 2) array per label; keep the class-1 column of each
pred_probs = np.array([p[:, 1] for p in dt.predict_proba(X)]).T

# Find top 5 predictions
top_5_idx = np.argpartition(-pred_probs, 5, axis=1)[:, :5]

# Prepare the ground truth
true_labels_idx = np.argwhere(y == 1)

# Check if real changes are contained in top 5 predictions
counts = 0
for i in range(y.shape[0]):
    truth_set = set(true_labels_idx[true_labels_idx[:, 0] == i][:, 1])
    if truth_set.issubset(top_5_idx[i]):
        counts += 1

print(f"Accuracy present in top 5 predictions: {counts / y.shape[0]:.1%}")
print(f"Time elapsed: {(time.time() - start)/60:.1f} minutes")
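Note that the per-row loop above rescans the whole index array on every iteration, so it can still be slow for 100,000 rows. One possible alternative is to do the subset check with boolean masks only. The sketch below is an assumption-laden illustration, not part of the original answer: the function name top_n_coverage and the keep_ties flag are hypothetical. With keep_ties=True it tries to mirror the question's nlargest(keep='all') on positive probabilities; with keep_ties=False it picks exactly n_top columns via argpartition, as the answer does.

```python
import numpy as np

def top_n_coverage(probs, y_true, n_top=5, keep_ties=True):
    """Fraction of rows whose true labels all lie within the top n_top predictions."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    n_top = min(n_top, probs.shape[1])

    if keep_ties:
        # n_top-th largest probability of each row, kept as a column vector
        kth = np.partition(probs, -n_top, axis=1)[:, -n_top][:, None]
        # Keep everything tied with the n_top-th value, but only positive probabilities
        in_top = (probs >= kth) & (probs > 0)
    else:
        # Exactly n_top columns per row; ties are broken arbitrarily
        top_idx = np.argpartition(-probs, n_top - 1, axis=1)[:, :n_top]
        in_top = np.zeros(probs.shape, dtype=bool)
        np.put_along_axis(in_top, top_idx, True, axis=1)

    # A row is covered when no true label falls outside its top predictions
    covered = ~(y_true & ~in_top).any(axis=1)
    return covered.mean()

# Small made-up example: 3 data points, 4 labels
demo_probs = np.array([[0.9, 0.8, 0.75, 0.1],
                       [0.2, 0.6, 0.6, 0.6],
                       [0.1, 0.5, 0.4, 0.0]])
demo_y = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 0, 0, 0]])

print(f"With ties kept: {top_n_coverage(demo_probs, demo_y, n_top=2):.1%}")
print(f"Exactly n_top:  {top_n_coverage(demo_probs, demo_y, n_top=2, keep_ties=False):.1%}")
```

With the pred_probs and y arrays from the answer, the call would be top_n_coverage(pred_probs, y, n_top=5). Rows with no true label at all count as covered, which matches the empty-set behaviour of issubset.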
