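I am trying to oversample an imbalanced dataset and compute a confusion matrix for a data science club. Here is a link to the dataset: https://www.kaggle.com/datasets/ealaxi/paysim1/data.
Here is the code: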
import pandas as pd
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
# Read the dataset
data = pd.read_csv("Imbalanced_Data_Set.csv")
# Separate features and target variable
X = data.drop(columns=['type', 'nameOrig', 'nameDest', 'isFraud'])
y = data['isFraud']
def robust_scale(x):
    # Center on the median and divide by the interquartile range (IQR)
    median_val = np.median(x)
    iqr = np.percentile(x, 75) - np.percentile(x, 25)
    scaled_data = (x - median_val) / iqr
    return scaled_data
# Assuming X is a DataFrame with 'step' and 'amount' columns
X['step'] = robust_scale(X['step'])
X['amount'] = robust_scale(X['amount'])
# Apply oversampling
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
# Train a classifier (Example: Logistic Regression)
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Create confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot confusion matrix
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
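I oversampled the dataset with RandomOverSampler, and I also scaled the amount and time (step) columns.
I am not sure whether this confusion matrix is correct, because I would expect the TP and FN counts to be at the extremes.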
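To check which cell is which: as far as I understand, scikit-learn lays out a binary confusion matrix as [[TN, FP], [FN, TP]], with rows as actual labels and columns as predicted labels. A minimal sketch with made-up toy labels (not the PaySim data; `y_true` and `y_hat` are illustrative names):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 1 = fraud, 0 = legitimate
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# labels=[0, 1] fixes the order: rows are actual, columns are predicted
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# For binary labels the flattened matrix reads TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With this layout, the bottom row of the heatmap holds FN (left) and TP (right).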
I also need to fit a logistic regression.
This is what I have so far:
Any help would be greatly appreciated.