Finding the threshold that returns the highest precision


I have data like this:

(26.5625,0)
(29.5625,0)
(30.390625,0)
(18.640625,0)
(27.984375,0)
(26.984375,0)
(25.703125,0)
(25.78125,0)
(32.09375,0)
(25.59375,0)
(27.703125,0)
(30.828125,0)
(23.578125,0)
(21.890625,0)
(25.734375,0)
(24.65625,0)
(27.46875,0)
(31.640625,0)
(26.53125,0)
(25.078125,0)
(30.65625,0)
(24.515625,0)
(25.21875,0)
(21.78125,0)
(28.984375,0)
(29.765625,0)
(27.171875,1)
(30.46875,1)
(35.3125,1)
(27.90625,1)
(34.9375,1)
(33.4375,1)
(30.90625,1)
(31.671875,1)
(32.40625,1)
(26.078125,1)
(31.171875,1)
(36.21875,1)
(35.0625,1)
(35.65625,1)
(36.65625,1)
(37.96875,1)
(31.953125,1)
(33.15625,1)
(37.34375,1)

The corresponding label ordering and precision are:

ordered_labels: [1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

average precision: 0.7338

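I am trying to find the threshold (for example, 27.0) that returns the highest precision (0.7338 in this case). I tried logistic regression, but the thresholds it returns are values like 0.7 rather than numbers like 27.0. Should I use linear regression or a support vector machine for this kind of data?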

My output (code below):

Precision: [0.33333333 0.         0.         1.        ]
Recall: [1. 0. 0. 0.]
Threshold: [0.13154558 0.7006058  0.72969373]

This is the code I am using:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from ast import literal_eval

# Load the (score, label) tuples from the data file
scores_labels_path = 'data.txt'
X, y = [], []
with open(scores_labels_path) as file:
    for line in file:
        line = literal_eval(line.rstrip())
        X.append(line[0])
        y.append(line[1])

X = np.array(X).reshape(-1, 1)
y = np.array(y)
# X1, y1 = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=7)
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_scores = lr.predict_proba(X_test)
precision, recall, threshold = precision_recall_curve(y_test, y_scores[:, 1])

print("Precision: {}".format(precision))
print("Recall: {}".format(recall))
print("Threshold: {}".format(threshold))
python numpy machine-learning scikit-learn linear-regression
1 Answer

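The problem is that you are using logistic regression, which is a probabilistic model. The thresholds you get are not values of your input feature; they are the probabilities output by the logistic regression model.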
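To make the relationship concrete, here is a minimal sketch of how a probability threshold from a single-feature logistic regression maps back to a feature value. It inverts the sigmoid, p = 1 / (1 + exp(-(coef * x + intercept))). The function name `probability_to_feature_threshold` and the coefficient values below are hypothetical, chosen only for illustration; they are not fitted to the question's data.

```python
import math


def probability_to_feature_threshold(p, coef, intercept):
    """Return the feature value x at which a one-feature logistic model
    predicts probability p, i.e. solve p = sigmoid(coef * x + intercept)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if coef == 0:
        raise ValueError("coef must be non-zero to invert the model")
    logit = math.log(p / (1.0 - p))  # inverse of the sigmoid
    return (logit - intercept) / coef


# Hypothetical parameters, for illustration only (not fitted to the data)
coef, intercept = 0.5, -15.0
feature_threshold = probability_to_feature_threshold(0.5, coef, intercept)
print("Feature threshold: {}".format(feature_threshold))
```

With a fitted scikit-learn `LogisticRegression` on one feature, the slope and intercept would come from `lr.coef_[0][0]` and `lr.intercept_[0]`.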

This can be done without a machine learning model:

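  1. Sort the data by feature value.
  2. For each unique value, compute the precision obtained by classifying all points with feature value <= the current value as class 0 and all points with feature value > the current value as class 1.
  3. The feature value that gives the highest precision is your threshold.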
Implementation:

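import numpy as np
import pandas as pd
from ast import literal_eval

# Each line of data.txt is a tuple such as "(26.5625,0)", so parse it with
# literal_eval (as in the question); pd.read_csv would keep the parentheses.
with open('data.txt') as file:
    rows = [literal_eval(line.strip()) for line in file if line.strip()]
data = pd.DataFrame(rows, columns=['feature', 'label'])
data = data.sort_values('feature')

precisions = []
thresholds = data['feature'].unique()
for threshold in thresholds:
    predicted_labels = np.where(data['feature'] <= threshold, 0, 1)
    tp = np.sum((predicted_labels == 1) & (data['label'] == 1))
    fp = np.sum((predicted_labels == 1) & (data['label'] == 0))
    # Nothing is predicted as class 1 at the largest value; count that as 0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    precisions.append(precision)

max_precision_index = np.argmax(precisions)
best_threshold = thresholds[max_precision_index]
print("Best threshold: {}".format(best_threshold))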
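As an alternative sketch, scikit-learn's `precision_recall_curve` can be given the raw feature values as scores, in which case the thresholds it returns are in feature units (such as 27.0) rather than probabilities. Note that it predicts class 1 when the score is >= the threshold, which differs slightly from the strict > used in the steps above. The helper name `best_precision_threshold` is hypothetical, and the sample below is a small subset of the question's data used only for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def best_precision_threshold(scores, labels):
    """Return (threshold, precision) for the highest precision.

    Thresholds are in the same units as `scores`; a point is predicted
    as class 1 when its score is >= the threshold.
    """
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # precision has one extra trailing entry with no matching threshold; skip it
    best = int(np.argmax(precision[:-1]))
    return float(thresholds[best]), float(precision[best])


# A few (score, label) pairs copied from the question's data
scores = np.array([26.5625, 29.5625, 18.640625, 32.09375,
                   27.171875, 30.46875, 35.3125, 36.21875])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

threshold, precision = best_precision_threshold(scores, labels)
print("Best threshold: {}, precision: {}".format(threshold, precision))
```

Because `np.argmax` returns the first maximum and the thresholds are in increasing order, ties are resolved in favour of the lowest threshold that reaches the best precision.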
    